# Chapter 3 Feature & Target Engineering

Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data. The time spent identifying data engineering needs can be significant and requires you to deeply understand your data…or, as Leo Breiman said, “live with your data before you plunge into modeling” (Breiman and others 2001, 201). Although this book primarily focuses on applying machine learning algorithms, feature engineering can make or break an algorithm’s predictive ability and deserves your continued focus and education.

We will not cover all the potential ways of implementing feature engineering; however, we’ll cover several fundamental preprocessing tasks that can significantly improve modeling performance. Moreover, different models have different sensitivities to the types of target and feature values they are given, and we will try to highlight some of these concerns. For more in-depth coverage of feature engineering, please refer to Kuhn and Johnson (2019) and Zheng and Casari (2018).

## 3.1 Prerequisites

This chapter leverages the following packages:

# Helper packages
library(dplyr)    # for data manipulation
library(ggplot2)  # for awesome graphics

# Feature engineering packages
library(caret)    # for various ML tasks
library(recipes)  # for feature engineering tasks

We’ll also continue working with the ames_train data set created in Section 2.7:
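If you are picking the book up at this chapter, a minimal sketch of that split follows; it assumes the stratified 70/30 sampling approach with rsample described in Section 2.7, and the seed is a placeholder:

# Ames housing data and a stratified train/test split (as in Section 2.7)
library(rsample)

ames <- AmesHousing::make_ames()

set.seed(123)  # for reproducibility
split      <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test  <- testing(split)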

## 3.2 Target engineering

Although not always a requirement, transforming the response variable can lead to predictive improvement, especially with parametric models (which require that certain assumptions about the model be met). For instance, ordinary linear regression models assume that the prediction errors (and hence the response) are normally distributed. This is usually fine, except when the prediction target has heavy tails (i.e., outliers) or is skewed in one direction or the other. In these cases, the normality assumption likely does not hold. For example, as we saw in the data splitting section (2.2), the response variable for the Ames housing data (Sale_Price) is right (or positively) skewed as illustrated in Figure 3.1 (ranging from $12,789 to $755,000). A simple linear model, say $$\text{Sale\_Price} = \beta_0 + \beta_1 \text{Year\_Built} + \epsilon$$, often assumes the error term $$\epsilon$$ (and hence Sale_Price) is normally distributed; fortunately, a simple log (or similar) transformation of the response can often help alleviate this concern as Figure 3.1 illustrates.
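If you want to reproduce a rough version of the Figure 3.1 comparison yourself, the following sketch (assuming the ames_train object created above) plots the raw and log-transformed response:

# Distribution of the raw response
ggplot(ames_train, aes(Sale_Price)) +
  geom_histogram(bins = 50) +
  ggtitle("Raw Sale_Price: right skewed")

# Distribution of the log-transformed response
ggplot(ames_train, aes(log(Sale_Price))) +
  geom_histogram(bins = 50) +
  ggtitle("log(Sale_Price): approximately normal")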

Furthermore, using a log (or similar) transformation to minimize response skewness can also help frame the business problem. For example, the House Prices: Advanced Regression Techniques Kaggle competition, which used the Ames housing data, focused on predicting a log-transformed Sale_Price because “…taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.” This is an alternative to using the root mean squared logarithmic error (RMSLE) loss function discussed in Section 2.6.
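To see why the two approaches are interchangeable, note that the RMSE of predictions made on a log-transformed response is the same quantity as the RMSLE of the back-transformed, dollar-scale predictions (ignoring the +1 offset some RMSLE definitions add to handle zeros). A small sketch with hypothetical numbers:

# Hypothetical sale prices and model predictions made on the log scale
actual   <- c(105000, 250000, 520000)   # made-up actual prices
log_pred <- c(11.60, 12.40, 13.25)      # made-up predictions of log(Sale_Price)

# RMSE computed on the log-transformed response
rmse_log <- sqrt(mean((log_pred - log(actual))^2))

# RMSLE computed on the back-transformed (dollar-scale) predictions
rmsle <- sqrt(mean((log(exp(log_pred)) - log(actual))^2))

all.equal(rmse_log, rmsle)  # TRUE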

There are two main approaches to help correct for positively skewed target variables:

Option 1: normalize with a log transformation. This will transform most right-skewed distributions to be approximately normal. One way to do this is to simply log transform the training and test sets in a manual, single-step manner similar to the sketch below:
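A minimal version of that manual step might look like the following (assuming the ames_train and ames_test objects from Section 3.1):

# Log transform the response in both the training and test sets
ames_train$Sale_Price <- log(ames_train$Sale_Price)
ames_test$Sale_Price  <- log(ames_test$Sale_Price)

Keep in mind that any predictions from a model trained this way are on the log scale and must be exponentiated (exp()) to get back to dollars. A tidier alternative is to declare the same transformation as a recipes step (e.g., step_log(all_outcomes())) so it is applied consistently to the training data and any new data.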

### References

Box, George EP, and David R Cox. 1964. “An Analysis of Transformations.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 211–52.

Breiman, Leo, and others. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3). Institute of Mathematical Statistics: 199–231.

Carroll, Raymond J, and David Ruppert. 1981. “On Prediction and the Power Transformation Family.” Biometrika 68 (3). Oxford University Press: 609–15.

Chollet, François, and Joseph J Allaire. 2018. Deep Learning with R. Manning Publications Company.

Gower, John C. 1971. “A General Coefficient of Similarity and Some of Its Properties.” Biometrics. JSTOR, 857–71.

Granitto, Pablo M, Cesare Furlanello, Franco Biasioli, and Flavia Gasperi. 2006. “Recursive Feature Elimination with Random Forest for PTR-MS Analysis of Agroindustrial Products.” Chemometrics and Intelligent Laboratory Systems 83 (2). Elsevier: 83–90.

Guo, Cheng, and Felix Berkhahn. 2016. “Entity Embeddings of Categorical Variables.” arXiv Preprint arXiv:1604.06737.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.

Kuhn, Max, and Kjell Johnson. 2019. Feature Engineering and Selection: A Practical Approach for Predictive Models. Chapman & Hall/CRC.

Kursa, Miron B, Witold R Rudnicki, and others. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (11): 1–13.

Little, Roderick JA, and Donald B Rubin. 2014. Statistical Analysis with Missing Data. Vol. 333. John Wiley & Sons.

Maldonado, Sebastián, and Richard Weber. 2009. “A Wrapper Method for Feature Selection Using Support Vector Machines.” Information Sciences 179 (13). Elsevier: 2208–17.

Micci-Barreca, Daniele. 2001. “A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems.” ACM SIGKDD Explorations Newsletter 3 (1). ACM: 27–32.

Saeys, Yvan, Iñaki Inza, and Pedro Larrañaga. 2007. “A Review of Feature Selection Techniques in Bioinformatics.” Bioinformatics 23 (19). Oxford University Press: 2507–17.

Shah, Anoop D, Jonathan W Bartlett, James Carpenter, Owen Nicholas, and Harry Hemingway. 2014. “Comparison of Random Forest and Parametric Imputation Models for Imputing Missing Data Using Mice: A Caliber Study.” American Journal of Epidemiology 179 (6). Oxford University Press: 764–74.

Stekhoven, Daniel J. 2015. “MissForest: Nonparametric Missing Value Imputation Using Random Forest.” Astrophysics Source Code Library.

Tierney, Nicholas. 2019. Visdat: Preliminary Visualisation of Data. https://CRAN.R-project.org/package=visdat.

West, Brady T, Kathleen B Welch, and Andrzej T Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. Chapman & Hall/CRC.

Zheng, Alice, and Amanda Casari. 2018. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O’Reilly Media, Inc.

Zumel, Nina, and John Mount. 2016. “Vtreat: A data.frame Processor for Predictive Modeling.” arXiv Preprint arXiv:1611.09477.

1. Little and Rubin (2014) discuss two different kinds of missingness at random; however, we combine them for simplicity as their nuanced differences are rarely distinguished in practice.

2. If your data set is large, deleting observations that have values missing at random rarely impacts predictive performance. However, as your data sets get smaller, preserving observations is critical and alternative solutions should be explored.

3. For example, standardizing numeric features will include the imputed numeric values in the calculation and one-hot encoding will include the imputed categorical value.

4. See Kuhn and Johnson (2013) Section 19.1 for data set generation.

5. In fact, most of the feature engineering capabilities found in recipes can also be found in caret::preProcess().