Lesson 7c: Feature Engineering#

Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data. Identifying data engineering needs can take significant effort and requires you to spend substantial time understanding your data…or, as Leo Breiman said, “live with your data before you plunge into modeling” (Breiman 2001).

In this lesson, we introduce a few fundamental feature engineering tasks that allow you to:

  • improve the performance of models built on your numerical features,

  • include non-numeric features in your modeling,

  • chain feature engineering and model training steps together in a scikit-learn pipeline.

Learning objectives#

By the end of this lesson you’ll be able to:

  • Standardize numeric features

  • Pre-process nominal and ordinal features

  • Create a scikit-learn pipeline to chain together feature engineering and model training steps

  • Combine numeric and non-numeric feature engineering steps

Basic prerequisites#

Let’s go ahead and import a couple of required libraries along with our data.

Note

We will import additional libraries and functions as we proceed, at the time we use them, so that you can connect specific steps to the particular modules and functions involved.

import pandas as pd

# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')

# import data
adult_census = pd.read_csv('../data/adult-census.csv')

# separate feature & target data
target = adult_census['class']
features = adult_census.drop(columns='class')

Selection based on data types#

Typically, data types fall into two categories:

  • Numeric: a quantity represented by a real or integer number.

  • Categorical: a discrete value, typically represented by string labels (but not only) taken from a finite list of possible choices.

features.dtypes
age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

Note

Do not take dtype output at face value! It is possible to have categorical data represented by numbers (i.e. education_num. And object dtypes can represent data that would be better represented as continuous numbers (i.e. dates).

Bottom line, always understand how your data is representing your features!
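A quick way to catch this is to count the distinct values of a suspicious numeric column. In this dataset, education-num holds one integer code per education level, so it is really an already-encoded categorical feature (a minimal check, not a required step):

# a numeric dtype does not guarantee a continuous feature:
# 'education-num' only holds one integer code per education level
features['education-num'].nunique()
features['education-num'].value_counts().head()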

We can separate categorical and numerical variables using their data types.

There are a few ways we can do this. Here, we make use of the make_column_selector helper to select the corresponding columns.

from sklearn.compose import make_column_selector as selector

# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)

# results in a list containing relevant column names
numerical_columns
['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
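If you prefer to stay in pandas, select_dtypes gives an equivalent result (shown only as an optional alternative to make_column_selector):

# pandas-only alternative to make_column_selector
numerical_columns_alt = features.select_dtypes(exclude=object).columns.tolist()
categorical_columns_alt = features.select_dtypes(include=object).columns.tolist()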

Preprocessing numerical data#

Scikit-learn works “out of the box” with numeric features. However, some algorithms make assumptions about the distribution of our features.

We see that our numeric features span across different ranges:

numerical_features = features[numerical_columns]
numerical_features.describe()
age education-num capital-gain capital-loss hours-per-week
count 48842.000000 48842.000000 48842.000000 48842.000000 48842.000000
mean 38.643585 10.078089 1079.067626 87.502314 40.422382
std 13.710510 2.570973 7452.019058 403.004552 12.391444
min 17.000000 1.000000 0.000000 0.000000 1.000000
25% 28.000000 9.000000 0.000000 0.000000 40.000000
50% 37.000000 10.000000 0.000000 0.000000 40.000000
75% 48.000000 12.000000 0.000000 0.000000 45.000000
max 90.000000 16.000000 99999.000000 4356.000000 99.000000

Normalizing our features so that they have mean = 0 and standard deviation = 1 helps ensure our features align with these algorithm assumptions.

Tip

Here are a couple reasons for scaling features:

  • Models that rely on the distance between a pair of samples, for instance k-nearest neighbors, should be trained on normalized features to make each feature contribute approximately equally to the distance computations.

  • Many models such as logistic regression use a numerical solver (based on gradient descent) to find their optimal parameters. This solver converges faster when the features are scaled.

Whether or not a machine learning model requires normalization of the features depends on the model family. Linear models such as logistic regression generally benefit from scaling the features, while other models such as tree-based models (e.g. decision trees, random forests) do not need such preprocessing (but will not suffer from it).

We can apply such normalization using a scikit-learn transformer called StandardScaler.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(numerical_features)
StandardScaler()

The fit method for transformers is similar to the fit method for predictors. The main difference is that the former has a single argument (the feature matrix), whereas the latter has two arguments (the feature matrix and the target).

transformer.fit method

Fig. 18 transformer.fit method representation.#

In this case, the algorithm needs to compute the mean and standard deviation for each feature and store them into some NumPy arrays. Here, these statistics are the model states.

Note

The fact that the model states of this scaler are arrays of means and standard deviations is specific to the StandardScaler. Other scikit-learn transformers will compute different statistics and store them as model states, in the same fashion.

We can inspect the computed means and standard deviations.

scaler.mean_
array([  38.64358544,   10.07808853, 1079.06762622,   87.50231358,
         40.42238238])
scaler.scale_
array([1.37103696e+01, 2.57094644e+00, 7.45194277e+03, 4.03000427e+02,
       1.23913172e+01])

Tip

Scikit-learn convention: if an attribute is learned from the data, its name ends with an underscore (i.e. _), as in mean_ and scale_ for the StandardScaler.

Once we have called the fit method, we can perform the data transformation by calling the method transform.

numerical_features_scaled = scaler.transform(numerical_features)
numerical_features_scaled
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
       [-0.04694151, -0.41933527, -0.14480353, -0.2171271 ,  0.77292975],
       [-0.77631645,  0.74755018, -0.14480353, -0.2171271 , -0.03408696],
       ...,
       [ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
       [-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
       [ 0.97418341, -0.41933527,  1.87131501, -0.2171271 , -0.03408696]])

Let’s illustrate the internal mechanism of the transform method and put it in perspective with what we already saw with predictors.

transformer.transform method

Fig. 19 transformer.transform method representation.#

The transform method for transformers is similar to the predict method for predictors. It applies a predefined transformation function to the input data using the model states. However, instead of outputting predictions, the job of the transform method is to output a transformed version of the input data.
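To make the mechanics concrete, we can reproduce the StandardScaler transformation by hand from the fitted states (a quick sanity check, not something you would normally do):

import numpy as np

# transform() subtracts the learned mean and divides by the learned
# standard deviation, feature by feature
manual_scaled = (numerical_features - scaler.mean_) / scaler.scale_
np.allclose(manual_scaled, numerical_features_scaled)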

Finally, the method fit_transform is a shorthand method to call successively fit and then transform.

transformer.fit_transform method

Fig. 20 transformer.fit_transform method representation.#

# fitting and transforming in one step
scaler.fit_transform(numerical_features)
array([[-0.99512893, -1.19725891, -0.14480353, -0.2171271 , -0.03408696],
       [-0.04694151, -0.41933527, -0.14480353, -0.2171271 ,  0.77292975],
       [-0.77631645,  0.74755018, -0.14480353, -0.2171271 , -0.03408696],
       ...,
       [ 1.41180837, -0.41933527, -0.14480353, -0.2171271 , -0.03408696],
       [-1.21394141, -0.41933527, -0.14480353, -0.2171271 , -1.64812038],
       [ 0.97418341, -0.41933527,  1.87131501, -0.2171271 , -0.03408696]])

Notice that the mean of all the columns is close to 0 and the standard deviation in all cases is close to 1:

numerical_features = pd.DataFrame(
    numerical_features_scaled,
    columns=numerical_columns
)

numerical_features.describe()
age education-num capital-gain capital-loss hours-per-week
count 4.884200e+04 4.884200e+04 4.884200e+04 4.884200e+04 4.884200e+04
mean 2.281092e-16 -9.208746e-17 1.047440e-17 -1.018345e-17 4.466169e-17
std 1.000010e+00 1.000010e+00 1.000010e+00 1.000010e+00 1.000010e+00
min -1.578629e+00 -3.531030e+00 -1.448035e-01 -2.171271e-01 -3.181452e+00
25% -7.763164e-01 -4.193353e-01 -1.448035e-01 -2.171271e-01 -3.408696e-02
50% -1.198790e-01 -3.037346e-02 -1.448035e-01 -2.171271e-01 -3.408696e-02
75% 6.824334e-01 7.475502e-01 -1.448035e-01 -2.171271e-01 3.694214e-01
max 3.745808e+00 2.303397e+00 1.327438e+01 1.059179e+01 4.727312e+00

Model pipelines#

We can easily combine sequential operations with a scikit-learn Pipeline, which chains together operations and can be used like any other classifier or regressor. The helper function make_pipeline will create a Pipeline: it takes as arguments the successive transformations to perform, followed by the classifier or regressor model.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
model
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])

Let’s divide our data into train and test sets and then apply and score our logistic regression model:

from sklearn.model_selection import train_test_split

# split our data into train & test
X_train, X_test, y_train, y_test = train_test_split(numerical_features, target, random_state=123)

# fit our pipeline model
model.fit(X_train, y_train)

# score our model on the test data
model.score(X_test, y_test)
0.8136106788960773
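Because the pipeline stores each step, the fitted scaler and logistic regression remain inspectable after training (the step names below are the lowercased class names that make_pipeline assigns, as shown in the pipeline output above):

# peek at the fitted states of the individual steps
model.named_steps['standardscaler'].mean_       # learned feature means
model.named_steps['logisticregression'].coef_   # learned coefficients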

Preprocessing categorical data#

Unfortunately, Scikit-learn does not accept categorical features in their raw form. Consequently, we need to transform them into numerical representations.

The following presents typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.

Encoding ordinal categories#

The most intuitive strategy is to encode each category with a different number. The OrdinalEncoder will transform the data in such a manner. We will start by encoding a single column to understand how the encoding works.

from sklearn.preprocessing import OrdinalEncoder

# let's illustrate with the 'education' feature
education_column = features[["education"]]

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[ 1.],
       [11.],
       [ 7.],
       ...,
       [11.],
       [11.],
       [11.]])

We see that each category in "education" has been replaced by a numeric value. We could check the mapping between the categories and the numerical values by checking the fitted attribute categories_.

encoder.categories_
[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]

Note

OrdinalEncoder transforms each category value into its corresponding index in encoder.categories_.

However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3… for instance).

By default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. This strategy is arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named "size" with categories such as “S”, “M”, “L”, “XL”. We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3, by following the alphabetical order.
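Here is a tiny, hypothetical "size" column that makes the default behavior explicit (a minimal sketch; the size data is made up for illustration):

# a toy 'size' column to show the default lexicographical ordering
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL"]})

# alphabetical order is used by default: L -> 0, M -> 1, S -> 2, XL -> 3
OrdinalEncoder().fit_transform(sizes)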

The OrdinalEncoder class accepts a categories argument to pass categories in the expected ordering explicitly (categories[i] holds the categories expected in the ith column).

ed_levels = [' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th', ' 11th',
             ' 12th', ' HS-grad', ' Prof-school', ' Some-college', ' Assoc-acdm',
             ' Assoc-voc', ' Bachelors', ' Masters', ' Doctorate']

encoder = OrdinalEncoder(categories=[ed_levels])
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[ 6.],
       [ 8.],
       [11.],
       ...,
       [ 8.],
       [ 8.],
       [ 8.]])
encoder.categories_
[array([' Preschool', ' 1st-4th', ' 5th-6th', ' 7th-8th', ' 9th', ' 10th',
        ' 11th', ' 12th', ' HS-grad', ' Prof-school', ' Some-college',
        ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Masters',
        ' Doctorate'], dtype=object)]

If a categorical variable does not carry any meaningful order information then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (discussed next).

Encoding nominal categories#

OneHotEncoder is an alternative encoder that converts each categorical level into a new binary column.

We will start by encoding a single feature (e.g. "education") to illustrate how the encoding works.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
education_encoded = encoder.fit_transform(education_column)
education_encoded
array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Note

sparse_output=False is used in the OneHotEncoder for didactic purposes, namely easier visualization of the data.

Sparse matrices are efficient data structures when most of your matrix elements are zero. They won’t be covered in detail in this workshop. If you want more details about them, you can look at this.
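For reference, with the default sparse_output=True the encoder returns a SciPy sparse matrix rather than a NumPy array (a quick check if you are curious; not needed for the rest of the lesson):

# the default output is a SciPy sparse matrix because most entries are zero
sparse_encoded = OneHotEncoder().fit_transform(education_column)
type(sparse_encoded)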

Viewing this as a data frame provides a more intuitive illustration:

feature_names = encoder.get_feature_names_out(input_features=["education"])
pd.DataFrame(education_encoded, columns=feature_names)
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th education_ 7th-8th education_ 9th education_ Assoc-acdm education_ Assoc-voc education_ Bachelors education_ Doctorate education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48837 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
48838 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
48839 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
48840 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
48841 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

48842 rows × 16 columns

As we can see, each category (unique value) became a column; the encoding returned, for each sample, a 1 to specify which category it belongs to.

Let’s apply this encoding to all the categorical features:

# get all categorical features
categorical_features = features[categorical_columns]

# one-hot encode all features
categorical_features_encoded = encoder.fit_transform(categorical_features)

# view as a data frame
columns_encoded = encoder.get_feature_names_out(categorical_features.columns)
pd.DataFrame(categorical_features_encoded, columns=columns_encoded).head()
workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay education_ 10th ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 102 columns

Warning

One-hot encoding can significantly increase the number of features in our data. In this case we went from 8 features to 102! If you have a data set with many categorical variables and those categorical variables in turn have many unique levels, the number of features can explode. In these cases you may want to explore ordinal encoding or some other alternative.
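A quick way to anticipate this blow-up is to count the unique levels of each categorical column before encoding (a small sketch; the sum roughly equals the number of one-hot columns you will get):

# number of unique levels per categorical feature
categorical_features.nunique().sort_values(ascending=False)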

Choosing an encoding strategy#

Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).

Tip

In general, OneHotEncoder is the encoding strategy to use when the downstream models are linear models, while OrdinalEncoder is often a good strategy with tree-based models.

Using an OrdinalEncoder will output ordinal categories. This means that there is an order in the resulting categories (e.g. 0 < 1 < 2). The impact of violating this ordering assumption is really dependent on the downstream models. Linear models will be impacted by misordered categories while tree-based models will not.

You can still use an OrdinalEncoder with linear models but you need to be sure that:

  • the original categories (before encoding) have an ordering;

  • the encoded categories follow the same ordering as the original categories.

One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use OneHotEncoder in such cases even if the original categories do not have a given order.
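As a sketch of that advice, an ordinal encoding feeding a tree-based model could look like the pipeline below (RandomForestClassifier is used purely for illustration and is not fit in this lesson; for a mix of numeric and categorical columns you would combine this with the column-wise approach shown in the next section):

from sklearn.ensemble import RandomForestClassifier

# integer codes are usually harmless for tree-based models; categories
# unseen during training are mapped to a sentinel value of -1
tree_model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    RandomForestClassifier(n_estimators=100),
)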

Using numerical and categorical variables together#

Now let’s look at how to combine some of these tasks so we can preprocess both numeric and categorical data.

First, let’s get our train & test data established:

# drop the duplicated column `"education-num"` as stated in the data exploration notebook
features = features.drop(columns='education-num')

# create selector object based on data type
numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

# get columns of interest
numerical_columns = numerical_columns_selector(features)
categorical_columns = categorical_columns_selector(features)

# split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)

Scikit-learn provides a ColumnTransformer class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together.

We first define the columns depending on their data type:

  • one-hot encoding will be applied to the categorical columns.

  • numerical scaling (standardization) will be applied to the numerical columns.

We then create our ColumnTransformer by specifying three values:

  1. the preprocessor name,

  2. the transformer, and

  3. the columns.

First, let’s create the preprocessors for the numerical and categorical parts.

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

Tip

We can use the handle_unknown parameter to ignore categories that show up in the test data but were not present in the training data, rather than raising an error.
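To see what this means in practice, here is a toy illustration with a hypothetical color column (the column and values are made up; only the handle_unknown behavior matters):

# a category unseen during fit is encoded as a row of zeros instead of
# raising an error when handle_unknown="ignore"
colors_train = pd.DataFrame({"color": ["red", "blue", "red"]})
colors_test = pd.DataFrame({"color": ["green"]})   # never seen during fit

enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(colors_train)
enc.transform(colors_test)   # -> array([[0., 0.]])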

Now, we create the transformer and associate each of these preprocessors with their respective columns.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard_scaler', numerical_preprocessor, numerical_columns)
])

We can take a minute to represent graphically the structure of a ColumnTransformer:

ColumnTransformer

Fig. 21 ColumnTransformer representation.#

A ColumnTransformer does the following:

  • It splits the columns of the original dataset based on the column names or indices provided. We will obtain as many subsets as the number of transformers passed into the ColumnTransformer.

  • It transforms each subset. A specific transformer is applied to each subset: it will internally call fit_transform or transform. The output of this step is a set of transformed datasets.

  • It then concatenates the transformed datasets into a single dataset.
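If you want to look at the concatenated result on its own, the ColumnTransformer can also be fit and applied directly, outside of any pipeline (just for inspection; the pipeline below handles this for us):

# fit on the training data and inspect the combined output:
# the one-hot columns followed by the scaled numeric columns
preprocessed = preprocessor.fit_transform(X_train)
preprocessed.shape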

The important thing is that ColumnTransformer is like any other scikit-learn transformer. In particular it can be combined with a classifier in a Pipeline:

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'sex',
                                                   'native-country']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['age', 'capital-gain',
                                                   'capital-loss',
                                                   'hours-per-week'])])),
                ('logisticregression', LogisticRegression(max_iter=500))])

Warning

Including non-scaled data can cause some algorithms to iterate longer in order to converge. Since our one-hot encoded categorical features are not scaled, it’s often recommended to increase the number of allowed iterations for linear models (hence max_iter=500 above).

# fit our model
_ = model.fit(X_train, y_train)

# score on test set
model.score(X_test, y_test)
0.8502170174432888
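Because the pipeline behaves like a single estimator, we could also cross-validate the entire preprocessing + modeling chain rather than relying on one train/test split (an optional sketch using cross_validate; the preprocessing is re-fit on each training fold, so there is no leakage):

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, features, target, cv=5)
cv_results['test_score'].mean()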

Wrapping up#

Unfortunately, we only have time to scratch the surface of feature engineering in this lesson. However, it should provide you with a strong foundation for applying the more common feature preprocessing tasks.

Tip

Scikit-learn provides many feature engineering options. Learn more here: https://scikit-learn.org/stable/modules/preprocessing.html

Exercises#

Questions:

Using the ames_clean.csv data:

  • Numeric features:

    • Select GrLivArea and YearBuilt for features and SalePrice for the target

    • Create a train-test split and use random_state=123

    • Create a model pipeline that applies StandardScaler() to the features and then applies a LinearRegression model.

    • What is the score for this model based on the test data?

  • Adding a categorical feature:

    • Select GrLivArea, YearBuilt, and Neighborhood for features and SalePrice for the target

    • Create a train-test split and use random_state=123

    • Create a ColumnTransformer object that applies StandardScaler() to the numeric features and OneHotEncoder to the categorical feature.

    • Create a model pipeline that combines the ColumnTransformer object with a LinearRegression model.

    • What is the score for this model based on the test data?

Computing environment#

%load_ext watermark
%watermark -v -p jupyterlab,pandas,sklearn
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.26.0

jupyterlab: 4.2.3
pandas    : 2.2.2
sklearn   : 1.5.1