Lesson 7c: Feature Engineering#
Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data. Identifying data engineering needs can take significant effort and requires you to spend substantial time understanding your data…or as Leo Breiman said, “live with your data before you plunge into modeling” (Breiman 2001).
In this lesson, we introduce a few fundamental feature engineering tasks that allow you to:
improve the performance of your numerical features,
include non-numeric features in your modeling,
chain preprocessing and model training steps together in a scikit-learn pipeline.
Learning objectives#
By the end of this lesson you’ll be able to:
Standardize numeric features
Pre-process nominal and ordinal features
Create a scikit-learn pipeline to chain together feature engineering and model training steps
Combine numeric and non-numeric feature engineering steps
Basic prerequisites#
Let’s go ahead and import a couple of required libraries and our data.
Note
We will import additional libraries and functions as we proceed, at the point where they are used, to help you connect specific steps to the particular modules and functions involved.
import pandas as pd
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')
# import data
adult_census = pd.read_csv('../data/adult-census.csv')
# separate feature & target data
target = adult_census['class']
features = adult_census.drop(columns='class')
Selection based on data types#
Typically, data types fall into two categories:
Numeric: a quantity represented by a real or integer number.
Categorical: a discrete value, typically represented by string labels (but not only) taken from a finite list of possible choices.
features.dtypes
age int64
workclass object
education object
education-num int64
marital-status object
occupation object
relationship object
race object
sex object
capital-gain int64
capital-loss int64
hours-per-week int64
native-country object
dtype: object
Note
Do not take dtype output at face value! It is possible for categorical data to be represented by numbers (e.g. education-num), and object dtypes can represent data that would be better represented as continuous numbers (e.g. dates).
Bottom line: always understand how your data represents your features!
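For example, one quick, hedged sketch of how you might confirm that education-num is simply a numeric recoding of education (the exact check is illustrative, not part of the lesson's code):
# check whether 'education-num' is just a numeric recoding of 'education'
features[['education', 'education-num']].drop_duplicates().sort_values('education-num')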
We can separate categorical and numerical variables using their data types. There are a few ways we can do this. Here, we make use of the make_column_selector helper to select the corresponding columns.
from sklearn.compose import make_column_selector as selector
# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)
# results in a list containing relevant column names
numerical_columns
['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
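As an aside, the same selection can be done with pandas alone via select_dtypes; a minimal sketch (the *_alt names are just for illustration):
# equivalent column selection using pandas directly
numerical_columns_alt = features.select_dtypes(exclude=object).columns.tolist()
categorical_columns_alt = features.select_dtypes(include=object).columns.tolist()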
Preprocessing numerical data#
Scikit-learn works “out of the box” with numeric features. However, some algorithms make assumptions about the distribution of our features.
We see that our numeric features span across different ranges:
numerical_features = features[numerical_columns]
numerical_features.describe()
| | age | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
count | 48842.000000 | 48842.000000 | 48842.000000 | 48842.000000 | 48842.000000 |
mean | 38.643585 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
std | 13.710510 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
min | 17.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 28.000000 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 37.000000 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 48.000000 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 90.000000 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
Normalizing our features so that they have mean = 0 and standard deviation = 1 helps ensure our features align with algorithm assumptions.
Tip
Here are a couple of reasons for scaling features:
Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features to make each feature contribute approximately equally to the distance computations.
Many models such as logistic regression use a numerical solver (based on gradient descent) to find their optimal parameters. This solver converges faster when the features are scaled.
Whether or not a machine learning model requires normalization of the features depends on the model family. Linear models such as logistic regression generally benefit from scaling the features, while other models such as tree-based models (e.g. decision trees, random forests) do not need such preprocessing (but will not suffer from it).
We can apply such normalization using a scikit-learn transformer called StandardScaler.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(numerical_features)
StandardScaler()
The fit
method for transformers is similar to the fit
method for
predictors. The main difference is that the former has a single argument (the
feature matrix), whereas the latter has two arguments (the feature matrix and the
target).
Fig. 18 transformer.fit
method representation.#
In this case, the algorithm needs to compute the mean and standard deviation for each feature and store them into some NumPy arrays. Here, these statistics are the model states.
Note
The fact that the model states of this scaler are arrays of means and
standard deviations is specific to the StandardScaler
. Other
scikit-learn transformers will compute different statistics and store them
as model states, in the same fashion.
We can inspect the computed means and standard deviations.
scaler.mean_
array([ 38.64358544, 10.07808853, 1079.06762622, 87.50231358,
40.42238238])
scaler.scale_
array([1.37103696e+01, 2.57094644e+00, 7.45194277e+03, 4.03000427e+02,
1.23913172e+01])
Tip
Scikit-learn convention: if an attribute is learned from the data, its name
ends with an underscore (i.e. _
), as in mean_
and scale_
for the
StandardScaler
.
Once we have called the fit
method, we can perform the data transformation by
calling the method transform
.
numerical_features_scaled = scaler.transform(numerical_features)
numerical_features_scaled
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
[-0.04694151, -0.41933527, -0.14480353, -0.2171271 , 0.77292975],
[-0.77631645, 0.74755018, -0.14480353, -0.2171271 , -0.03408696],
...,
[ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
[-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
[ 0.97418341, -0.41933527, 1.87131501, -0.2171271 , -0.03408696]])
Let’s illustrate the internal mechanism of the transform method and put it in perspective with what we already saw with predictors.
Fig. 19 transformer.transform
method representation.#
The transform
method for transformers is similar to the predict
method
for predictors. It uses a predefined function, called a transformation
function, and uses the model states and the input data. However, instead of
outputting predictions, the job of the transform
method is to output a
transformed version of the input data.
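To make this concrete, here is a small sketch (assuming NumPy is imported as np) verifying that transform simply applies the stored statistics:
import numpy as np
# manually apply the transformation function: (x - mean_) / scale_
manual_scaled = (numerical_features - scaler.mean_) / scaler.scale_
# confirm it matches the output of scaler.transform()
np.allclose(manual_scaled, numerical_features_scaled)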
Finally, the method fit_transform
is a shorthand method to call
successively fit
and then transform
.
Fig. 20 transformer.fit_transform
method representation.#
# fitting and transforming in one step
scaler.fit_transform(numerical_features)
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
[-0.04694151, -0.41933527, -0.14480353, -0.2171271 , 0.77292975],
[-0.77631645, 0.74755018, -0.14480353, -0.2171271 , -0.03408696],
...,
[ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
[-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
[ 0.97418341, -0.41933527, 1.87131501, -0.2171271 , -0.03408696]])
Notice that the mean of all the columns is close to 0 and the standard deviation in all cases is close to 1:
numerical_features = pd.DataFrame(
numerical_features_scaled,
columns=numerical_columns
)
numerical_features.describe()
| | age | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|
count | 4.884200e+04 | 4.884200e+04 | 4.884200e+04 | 4.884200e+04 | 4.884200e+04 |
mean | 2.281092e-16 | -9.208746e-17 | 1.047440e-17 | -1.018345e-17 | 4.466169e-17 |
std | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 | 1.000010e+00 |
min | -1.578629e+00 | -3.531030e+00 | -1.448035e-01 | -2.171271e-01 | -3.181452e+00 |
25% | -7.763164e-01 | -4.193353e-01 | -1.448035e-01 | -2.171271e-01 | -3.408696e-02 |
50% | -1.198790e-01 | -3.037346e-02 | -1.448035e-01 | -2.171271e-01 | -3.408696e-02 |
75% | 6.824334e-01 | 7.475502e-01 | -1.448035e-01 | -2.171271e-01 | 3.694214e-01 |
max | 3.745808e+00 | 2.303397e+00 | 1.327438e+01 | 1.059179e+01 | 4.727312e+00 |
Model pipelines#
We can easily combine sequential operations with a scikit-learn
Pipeline
, which chains together operations and is used as any other
classifier or regressor. The helper function make_pipeline
will create a
Pipeline
: it takes as arguments the successive transformations to perform,
followed by the classifier or regressor model.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
model = make_pipeline(StandardScaler(), LogisticRegression())
model
Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
Let’s divide our data into train and test sets and then apply and score our logistic regression model:
from sklearn.model_selection import train_test_split
# split our data into train & test
X_train, X_test, y_train, y_test = train_test_split(numerical_features, target, random_state=123)
# fit our pipeline model
model.fit(X_train, y_train)
# score our model on the test data
model.score(X_test, y_test)
0.8136106788960773
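A nice side effect of the pipeline is that each fitted step remains accessible by the name shown in the diagram above; for example (a sketch):
# inspect the fitted scaler inside the pipeline
model.named_steps['standardscaler'].mean_
# and the fitted classifier's coefficients
model.named_steps['logisticregression'].coef_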
Preprocessing categorical data#
Unfortunately, scikit-learn does not accept categorical features in their raw form. Consequently, we need to transform them into numerical representations.
The following presents typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.
Encoding ordinal categories#
The most intuitive strategy is to encode each category with a different
number. The OrdinalEncoder will transform the data in such a manner.
We will start by encoding a single column to understand how the encoding
works.
from sklearn.preprocessing import OrdinalEncoder
# let's illustrate with the 'education' feature
education_column = features[["education"]]
encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[ 1.],
[11.],
[ 7.],
...,
[11.],
[11.],
[11.]])
We see that each category in "education"
has been replaced by a numeric
value. We could check the mapping between the categories and the numerical
values by checking the fitted attribute categories_
.
encoder.categories_
[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
' HS-grad', ' Masters', ' Preschool', ' Prof-school',
' Some-college'], dtype=object)]
Note
OrdinalEncoder transforms each category value into its corresponding index in encoder.categories_.
However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3… for instance).
By default, OrdinalEncoder
uses a lexicographical strategy to map string
category labels to integers. This strategy is arbitrary and often
meaningless. For instance, suppose the dataset has a categorical variable
named "size"
with categories such as “S”, “M”, “L”, “XL”. We would like the
integer representation to respect the meaning of the sizes by mapping them to
increasing integers such as 0, 1, 2, 3.
However, the lexicographical strategy used by default would map the labels
“S”, “M”, “L”, “XL” to 2, 1, 0, 3, by following the alphabetical order.
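To illustrate with a hypothetical "size" column (not part of our dataset), the default alphabetical mapping would look like this sketch:
# hypothetical data illustrating the default lexicographical mapping
sizes = pd.DataFrame({'size': ['S', 'M', 'L', 'XL']})
size_encoder = OrdinalEncoder()
size_encoder.fit_transform(sizes)  # maps S -> 2, M -> 1, L -> 0, XL -> 3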
The OrdinalEncoder
class accepts a categories
argument to
pass categories in the expected ordering explicitly (categories[i]
holds the categories expected in the ith column).
ed_levels = [' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th',
' 12th', ' HS-grad', ' Prof-school', ' Some-college', ' Assoc-acdm',
' Assoc-voc', ' Bachelors', ' Masters', ' Doctorate']
encoder = OrdinalEncoder(categories=[ed_levels])
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[ 6.],
[ 8.],
[11.],
...,
[ 8.],
[ 8.],
[ 8.]])
encoder.categories_
[array([' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th',
' 11th', ' 12th', ' HS-grad', ' Prof-school', ' Some-college',
' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Masters',
' Doctorate'], dtype=object)]
If a categorical variable does not carry any meaningful order information then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (discussed next).
Encoding nominal categories#
OneHotEncoder is an alternative encoder that creates a new binary column for each categorical level.
We will start by encoding a single feature (e.g. "education"
) to illustrate
how the encoding works.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]])
Note
sparse_output=False
is used in the OneHotEncoder
for didactic purposes, namely
easier visualization of the data.
Sparse matrices are efficient data structures when most of your matrix elements are zero. They won’t be covered in detail in this workshop; if you want more details about them, see the SciPy documentation on sparse matrices.
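As an aside, a small sketch of what the default sparse output looks like and how to convert it to a dense array if needed:
# by default, OneHotEncoder returns a SciPy sparse matrix
sparse_encoder = OneHotEncoder()
sparse_encoded = sparse_encoder.fit_transform(education_column)
# convert to a dense NumPy array when that is easier to inspect
sparse_encoded.toarray()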
Viewing this as a data frame provides a more intuitive illustration:
feature_names = encoder.get_feature_names_out(input_features=["education"])
pd.DataFrame(education_encoded, columns=feature_names)
| | education_ 10th | education_ 11th | education_ 12th | education_ 1st-4th | education_ 5th-6th | education_ 7th-8th | education_ 9th | education_ Assoc-acdm | education_ Assoc-voc | education_ Bachelors | education_ Doctorate | education_ HS-grad | education_ Masters | education_ Preschool | education_ Prof-school | education_ Some-college |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
48837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48838 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48839 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48840 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48841 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
48842 rows × 16 columns
As we can see, each category (unique value) became a column; the encoding returned, for each sample, a 1 to specify which category it belongs to.
Let’s apply this encoding to all the categorical features:
# get all categorical features
categorical_features = features[categorical_columns]
# one-hot encode all features
categorical_features_encoded = encoder.fit_transform(categorical_features)
# view as a data frame
columns_encoded = encoder.get_feature_names_out(categorical_features.columns)
pd.DataFrame(categorical_features_encoded, columns=columns_encoded).head()
| | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | workclass_ Without-pay | education_ 10th | ... | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 102 columns
Warning
One-hot encoding can significantly increase the number of features in our data. In this case we went from 8 features to 102! If you have a data set with many categorical variables and those categorical variables in turn have many unique levels, the number of features can explode. In these cases you may want to explore ordinal encoding or some other alternative.
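A quick way to gauge this before encoding is to count the unique levels per categorical feature; the counts below should sum to the 102 encoded columns seen above (a sketch):
# number of unique levels per categorical feature
categorical_features.nunique()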
Choosing an encoding strategy#
Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).
Tip
In general OneHotEncoder
is the encoding strategy used when the
downstream models are linear models while OrdinalEncoder
is often a
good strategy with tree-based models.
Using an OrdinalEncoder
will output ordinal categories. This means
that there is an order in the resulting categories (e.g. 0 < 1 < 2
). The
impact of violating this ordering assumption is really dependent on the
downstream models. Linear models will be impacted by misordered categories
while tree-based models will not.
You can still use an OrdinalEncoder
with linear models but you need to be
sure that:
the original categories (before encoding) have an ordering;
the encoded categories follow the same ordering as the original categories.
One-hot encoding categorical variables with high cardinality can cause
computational inefficiency in tree-based models. Because of this, it is not recommended
to use OneHotEncoder
in such cases even if the original categories do not
have a given order.
Using numerical and categorical variables together#
Now let’s look at how to combine some of these tasks so we can preprocess both numeric and categorical data.
First, let’s get our train & test data established:
# drop the duplicated column `"education-num"` as stated in the data exploration notebook
features = features.drop(columns='education-num')
# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)
# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)
Scikit-learn provides a ColumnTransformer
class which will send specific
columns to a specific transformer, making it easy to fit a single predictive
model on a dataset that combines both kinds of variables together.
We first define how the columns will be handled depending on their data type:
one-hot encoding will be applied to the categorical columns,
standard scaling will be applied to the numerical columns.
We then create our ColumnTransformer
by specifying three values:
the preprocessor name,
the transformer, and
the columns.
First, let’s create the preprocessors for the numerical and categorical parts.
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
Tip
We can use the handle_unknown parameter to ignore rare categories that may show up in test data but were not present in the training data.
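To see what handle_unknown='ignore' does, here is a toy sketch with hypothetical data: a category that was never seen during fit is encoded as an all-zero row rather than raising an error.
# toy example (hypothetical data) to illustrate handle_unknown='ignore'
toy_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
toy_encoder.fit(pd.DataFrame({'color': ['red', 'blue']}))
# 'green' was not seen during fit, so it becomes an all-zero row
toy_encoder.transform(pd.DataFrame({'color': ['green']}))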
Now, we create the transformer and associate each of these preprocessors with their respective columns.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
('one-hot-encoder', categorical_preprocessor, categorical_columns),
('standard_scaler', numerical_preprocessor, numerical_columns)
])
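As an aside, scikit-learn also provides a make_column_transformer helper that generates the step names automatically; an equivalent sketch (preprocessor_alt is just an illustrative name):
from sklearn.compose import make_column_transformer
# equivalent construction with auto-generated step names
preprocessor_alt = make_column_transformer(
    (categorical_preprocessor, categorical_columns),
    (numerical_preprocessor, numerical_columns)
)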
We can take a minute to represent graphically the structure of a
ColumnTransformer
:
Fig. 21 ColumnTransformer
representation.#
A ColumnTransformer does the following:
1. It splits the columns of the original dataset based on the column names or indices provided. We will obtain as many subsets as the number of transformers passed into the ColumnTransformer.
2. It transforms each subset. A specific transformer is applied to each subset: it will internally call fit_transform or transform. The output of this step is a set of transformed datasets.
3. It then concatenates the transformed datasets into a single dataset.
The important thing is that ColumnTransformer
is like any other
scikit-learn transformer. In particular it can be combined with a classifier
in a Pipeline
:
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model
Pipeline(steps=[('columntransformer', ColumnTransformer(transformers=[('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'), ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']), ('standard_scaler', StandardScaler(), ['age', 'capital-gain', 'capital-loss', 'hours-per-week'])])), ('logisticregression', LogisticRegression(max_iter=500))])
Warning
Including non-scaled data can cause some algorithms to iterate longer in order to converge. Since our one-hot encoded features are not scaled, it’s often recommended to increase the number of allowed iterations for linear models.
# fit our model
_ = model.fit(X_train, y_train)
# score on test set
model.score(X_test, y_test)
0.8502170174432888
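Because the preprocessing lives inside the pipeline, we can also predict directly on raw, unprocessed data; the pipeline applies the same transformations before handing the result to the classifier. For example:
# predictions on raw test data; preprocessing happens inside the pipeline
model.predict(X_test)[:5]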
Wrapping up#
Unfortunately, we only have time to scratch the surface of feature engineering in this lesson. However, it should provide you with a strong foundation for applying the more common feature preprocessing tasks.
Tip
Scikit-learn provides many feature engineering options. Learn more here: https://scikit-learn.org/stable/modules/preprocessing.html
Exercises#
Questions:
Using the ames_clean.csv data:
1. Numeric features:
   - Select GrLivArea and YearBuilt for features and SalePrice for the target
   - Create a train-test split and use random_state=123
   - Create a model pipeline that applies StandardScaler() to the features and then applies a LinearRegression model.
   - What is the score for this model based on the test data?
2. Adding a categorical feature:
   - Select GrLivArea, YearBuilt, and Neighborhood for features and SalePrice for the target
   - Create a train-test split and use random_state=123
   - Create a ColumnTransformer object that applies StandardScaler() to the numeric features and OneHotEncoder to the categorical feature.
   - Create a model pipeline that combines the ColumnTransformer object with a LinearRegression model.
   - What is the score for this model based on the test data?
Computing environment#
%load_ext watermark
%watermark -v -p jupyterlab,pandas,sklearn
Python implementation: CPython
Python version : 3.12.4
IPython version : 8.26.0
jupyterlab: 4.2.3
pandas : 2.2.2
sklearn : 1.5.1