2 Lesson 1a: Intro to machine learning

Machine learning (ML) continues to grow in importance for many organizations across nearly all domains. Some example applications of machine learning in practice include:

Predicting the likelihood of a patient returning to the hospital (readmission) within 30 days of discharge.
Segmenting customers based on common attributes or purchasing behavior for targeted marketing.
Predicting coupon redemption rates for a given marketing campaign.
Predicting customer churn so an organization can perform preventative intervention.
And many more!

In essence, these tasks all seek to learn from data. To address each scenario, we can use a given set of features to train an algorithm and extract insights. These algorithms, or learners, can be classified according to the amount and type of supervision needed during training.

2.1 Learning objectives

This lesson will introduce you to some fundamental concepts around ML and this class. By the end of this lesson you will:

Be able to explain the difference between supervised and unsupervised learning.
Know when a problem is considered a regression or classification problem.
Be able to import and explore the data sets we’ll use through various examples.

2.2 Supervised learning

A predictive model is used for tasks that involve the prediction of a given output (or target) using other variables (or features) in the data set. The learning algorithm in a predictive model attempts to discover and model the relationships among the target variable (the variable being predicted) and the other features (aka predictor variables). Examples of predictive modeling include:

using customer attributes to predict the probability of the customer churning in the next 6 weeks;
using home attributes to predict the sales price;
using employee attributes to predict the likelihood of attrition;
using patient attributes and symptoms to predict the risk of readmission;
using production attributes to predict time to market.

Each of these examples has a defined learning task; they each intend to use attributes ( $X$ ) to predict an outcome measurement ( $Y$ ).

Throughout this course we’ll use various terms interchangeably for

$X$ : “predictor variable”, “independent variable”, “attribute”, “feature”, “predictor”
$Y$ : “target variable”, “dependent variable”, “response”, “outcome measurement”

The predictive modeling examples above describe what is known as supervised learning. The supervision refers to the fact that the target values provide a supervisory role, which indicates to the learner the task it needs to learn. Specifically, given a set of data, the learning algorithm attempts to optimize a function (the algorithmic steps) to find the combination of feature values that results in a predicted value that is as close to the actual target output as possible.

In supervised learning, the training data you feed the algorithm includes the target values. Consequently, the solutions can be used to help supervise the training process to find the optimal algorithm parameters.
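
One common way to write this idea down is as a minimization problem. The notation here is an assumption for illustration and is not taken from the course: $f$ is a candidate prediction function, $L$ is a loss function that measures how far a prediction is from the actual target, and $(x_i, y_i)$, $i = 1, \dots, n$, are the training observations.

$$
\hat{f} = \underset{f}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big)
$$

For example, with squared-error loss, $L(y, f(x)) = (y - f(x))^2$, the learner looks for the function whose predictions have the smallest average squared distance from the observed targets.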

Most supervised learning problems can be bucketed into one of two categories, regression or classification, which we discuss next.

2.2.1 Regression problems

When the objective of our supervised learning is to predict a numeric outcome, we refer to this as a regression problem (not to be confused with linear regression modeling). Regression problems revolve around predicting output that falls on a continuum. In the examples above, predicting home sales prices and predicting time to market are both regression problems because the output is numeric and continuous. This means that, given the combination of predictor values, the response value could fall anywhere along some continuous spectrum (e.g., the predicted sales price of a particular home could be between $80,000 and $755,000). The figure below illustrates average home sales prices as a function of two home features: year built and total square footage. Depending on the combination of these two features, the expected home sales price could fall anywhere along a plane.

Figure 2.1: Average home sales price as a function of year built and total square footage.
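
The idea of a numeric prediction that varies with two features can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the course's housing data: the data frame `homes`, its column names, and the coefficients used to generate prices are all hypothetical, and `lm()` is only one of many learners that could be used for a regression problem.

```r
# Simulated (hypothetical) home data; not the course's housing data
set.seed(123)
n <- 200
homes <- data.frame(
  year_built = sample(1950:2010, n, replace = TRUE),
  sq_ft      = round(runif(n, 800, 3500))
)

# Assumed relationship plus noise, used only to generate a numeric target
homes$sale_price <- -1500000 + 800 * homes$year_built + 90 * homes$sq_ft +
  rnorm(n, sd = 15000)

# Fit a linear model of price on the two features
fit <- lm(sale_price ~ year_built + sq_ft, data = homes)

# Predictions for two new homes are numeric values on a continuum
new_homes <- data.frame(year_built = c(1965, 2005), sq_ft = c(1200, 2800))
pred <- predict(fit, newdata = new_homes)
pred
```

Because the fitted model is linear in two features, its predictions lie on a plane, which is the kind of surface the figure above describes.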

2.2.2 Classification problems

When the objective of our supervised learning is to predict a categorical outcome, we refer to this as a classification problem. Classification problems most commonly revolve around predicting a binary or multinomial response measure such as:

Did a customer redeem a coupon (coded as yes/no or 1/0)?
Did a customer churn (coded as yes/no or 1/0)?
Did a customer click on our online ad (coded as yes/no or 1/0)?
Classifying customer reviews:
- Binary: positive vs. negative.
- Multinomial: extremely negative to extremely positive on a 0–5 Likert scale.

Figure 2.2: Classification problem modeling ‘Yes’/‘No’ response based on three features.

However, when we apply machine learning models to classification problems, rather than predict a particular class (e.g., “yes” or “no”), we often want to predict the probability of each class (e.g., yes: 0.65, no: 0.35). By default, the class with the highest predicted probability becomes the predicted class. Consequently, even though we are solving a classification problem, we are still predicting a numeric output (a probability). However, the essence of the problem still makes it a classification problem.
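
The step from predicted probabilities to a predicted class can be sketched as follows. This is a minimal example on simulated coupon data, assuming a logistic regression fit with base R's `glm()`; the data frame `customers`, its columns, and the generating coefficients are hypothetical and chosen only for illustration.

```r
# Simulated (hypothetical) coupon data with a yes/no outcome coded as 1/0
set.seed(123)
n <- 500
customers <- data.frame(
  discount  = runif(n, 0, 50),   # coupon discount in percent
  past_uses = rpois(n, 2)        # number of coupons redeemed previously
)
p_true <- plogis(-3 + 0.08 * customers$discount + 0.4 * customers$past_uses)
customers$redeemed <- rbinom(n, size = 1, prob = p_true)

# Logistic regression: one possible learner for a binary classification problem
fit <- glm(redeemed ~ discount + past_uses, data = customers, family = binomial)

# Predict the probability of "yes" for two new customers
new_customers <- data.frame(discount = c(5, 45), past_uses = c(0, 4))
prob_yes <- predict(fit, newdata = new_customers, type = "response")

# With two classes, "highest probability wins" is a 0.5 cutoff
pred_class <- ifelse(prob_yes >= 0.5, "yes", "no")

data.frame(new_customers, prob_yes, prob_no = 1 - prob_yes, pred_class)
```

The model output is numeric (a probability), but the quantity of interest is still a class, which is why the task remains a classification problem.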

Although some machine learning algorithms can be applied only to regression problems or only to classification problems, many of the supervised learning algorithms we cover in this class can be applied to both. These algorithms have become some of the most popular in machine learning applications in recent years.

2.2.3 Knowledge check

Identify the features, response variable, and the type of supervised model required for the following tasks:

There is an online retailer that wants to predict whether you will click on a certain featured product given your demographics, the current products in your online basket, and the time since your previous purchase.
A bank wants to use a customer’s historical data, such as the number of loans they’ve had, the time it took to pay off those loans, previous loan defaults, and the number of new loans within the past two years, along with the customer’s income and level of education, to determine if they should issue a new loan for a car.
If the bank above does issue a new loan, they want to use the same information to determine the interest rate of the new loan issued.
To better plan incoming and outgoing flights, an airline wants to use flight information such as scheduled flight time, day/month of year, number of passengers, airport departing from, airport arriving to, distance to travel, and weather warnings to determine if a flight will be delayed.
What if the above airline wants to use the same information to predict the number of minutes a flight will arrive late or early?

2.3 Unsupervised learning

Unsupervised learning, in contrast to supervised learning, includes a set of statistical tools to better understand and describe your data, but performs the analysis without a target variable. In essence, unsupervised learning is concerned with identifying groups in a data set. The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction); however, the motive in each case is quite different.

The goal of clustering is to segment observations into similar groups based on the observed variables; for example, to divide consumers into different homogeneous groups, a process known as market segmentation. In dimension reduction, we are often concerned with reducing the number of variables in a data set. For example, classical linear regression models break down in the presence of highly correlated features. Some dimension reduction techniques can be used to reduce the feature set to a potentially smaller set of uncorrelated variables. Such a reduced feature set is often used as input to downstream supervised learning models (e.g., principal component regression).
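
Both ideas can be sketched with base R. This is a minimal illustration, assuming the copy of the iris measurements that ships with R (`datasets::iris`) rather than a course data file; the choice of three clusters is also an assumption made only for the example. Note that no target variable is used anywhere.

```r
# Four numeric measurements only; the species column is deliberately left out
features <- scale(datasets::iris[, 1:4])   # standardize each column

# Clustering: group the rows (observations) into similar groups
set.seed(123)
clusters <- kmeans(features, centers = 3, nstart = 25)
table(clusters$cluster)                    # number of observations per cluster

# Dimension reduction: re-express the columns (variables) as principal components
pca <- prcomp(features)
summary(pca)$importance["Cumulative Proportion", ]

# The original features are correlated; the component scores are not
round(cor(features), 2)
round(cor(pca$x), 2)
```

The first few columns of `pca$x` could then serve as a smaller, uncorrelated feature set for a downstream supervised model.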

Unsupervised learning is often performed as part of an exploratory data analysis (EDA). However, the exercise tends to be more subjective, and there is no simple goal for the analysis, such as prediction of a response. Furthermore, it can be hard to assess the quality of results obtained from unsupervised learning methods. The reason for this is simple. If we fit a predictive model using a supervised learning technique (e.g., linear regression), then it is possible to check our work by seeing how well our model predicts the response $Y$ on observations not used in fitting the model. However, in unsupervised learning, there is no way to check our work because we don’t know the true answer: the problem is unsupervised!
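
The "check our work" step for a supervised model can be sketched as follows. This is a minimal base R example on simulated data; the variable names, the 75/25 split, and the use of root mean squared error as the measure are assumptions made for illustration.

```r
# Simulated data with a known target y
set.seed(123)
n <- 100
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 2 * dat$x + rnorm(n)   # assumed relationship plus noise

# Hold out 25 observations that the model never sees during fitting
test_rows <- sample(n, size = 25)
train <- dat[-test_rows, ]
test  <- dat[test_rows, ]

# Fit on the training rows only, then predict the held-out rows
fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows
rmse <- sqrt(mean((test$y - pred)^2))
rmse
```

This comparison is possible only because the held-out rows contain the actual values of `y`; a clustering result has no equivalent column to compare against.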

Despite its subjectivity, the importance of unsupervised learning should not be overlooked and such techniques are often used in organizations to:

Divide consumers into different homogeneous groups so that tailored marketing strategies can be developed and deployed for each segment.
Identify groups of online shoppers with similar browsing and purchase histories, as well as items that are of particular interest to the shoppers within each group. Then an individual shopper can be preferentially shown the items in which he or she is particularly likely to be interested, based on the purchase histories of similar shoppers.
Identify products that have similar purchasing behavior so that managers can manage them as product groups.

These questions, and many more, can be addressed with unsupervised learning. Moreover, the outputs of unsupervised learning models can be used as inputs to downstream supervised learning models.

2.3.1 Knowledge check

Identify the type of unsupervised model required for the following tasks:

Say you have a YouTube channel. You may have a lot of data about the subscribers of your channel. What if you want to use that data to detect groups of similar subscribers?
Say you’d like to group Ohio counties together based on the demographics of their residents.
A retailer has collected hundreds of attributes about all their customers; however, many of those features are highly correlated. They’d like to reduce the number of features down by combining all those highly correlated features into groups.

2.4 Machine Learning in R

Historically, the R ecosystem has provided a wide variety of ML algorithm implementations. This has its benefits; however, it also has drawbacks, as it requires users to learn many different formula interfaces and syntax nuances.

More recently, development of a group of packages called tidymodels has helped to make implementation easier. The tidymodels collection allows you to perform discrete parts of the ML workflow with discrete packages:

rsample for data splitting and resampling
recipes for data pre-processing and feature engineering
parsnip for applying algorithms
tune for hyperparameter tuning
yardstick for measuring model performance
and several others!

Throughout this course you’ll be exposed to several of these packages. Just like the tidyverse package, when you install tidymodels you are actually installing several packages that exist in the tidymodels ecosystem, as discussed above.

Go ahead and make sure you have the following packages installed.

# common data wrangling and visualization
install.packages("tidyverse")
install.packages("vip")
install.packages("here")

# modeling
install.packages("tidymodels")

packageVersion("tidymodels")
## [1] '1.2.0'

library(tidymodels)
## ── Attaching packages ───────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.6     ✔ rsample      1.2.1
## ✔ dials        1.2.1     ✔ tibble       3.2.1
## ✔ dplyr        1.1.4     ✔ tidyr        1.3.1
## ✔ infer        1.0.7     ✔ tune         1.2.1
## ✔ modeldata    1.4.0     ✔ workflows    1.1.4
## ✔ parsnip      1.2.1     ✔ workflowsets 1.1.0
## ✔ purrr        1.0.2     ✔ yardstick    1.3.1
## ✔ recipes      1.1.0
## ── Conflicts ──────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks plotly::filter(), stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
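
To give a feel for how the packages listed earlier divide the work, here is a minimal sketch of one pass through the workflow. It assumes tidymodels is installed and uses R's built-in `mtcars` data rather than a course data set; the model formula, the 75/25 split, and the normalization step are arbitrary choices made for illustration.

```r
library(tidymodels)

# rsample: split the data into training and test sets
set.seed(123)
split <- initial_split(mtcars, prop = 0.75)
train <- training(split)
test  <- testing(split)

# recipes: declare the outcome, the features, and a pre-processing step
rec <- recipe(mpg ~ wt + hp, data = train) |>
  step_normalize(all_numeric_predictors())

# parsnip: specify the algorithm and the engine that implements it
model <- linear_reg() |>
  set_engine("lm")

# workflows: bundle the recipe and model, then fit on the training data
wf_fit <- workflow() |>
  add_recipe(rec) |>
  add_model(model) |>
  fit(data = train)

# yardstick: measure performance on the held-out test data
results <- predict(wf_fit, new_data = test) |>
  bind_cols(test)
perf <- rmse(results, truth = mpg, estimate = .pred)
perf
```

Each step is handled by a different package, but the objects pass from one to the next with a consistent interface, which is the main convenience tidymodels offers over learning each modeling package's own syntax.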

2.4.1 Knowledge check

Check out the Tidymodels website: https://www.tidymodels.org/. Identify which packages can be used for:

Efficiently splitting your data
Optimizing hyperparameters
Measuring the effectiveness of your model
Working with correlation matrices

2.5 The data sets

The data sets chosen for this course allow us to illustrate the different features of the presented machine learning algorithms. Since the goal of this course is to demonstrate how to implement ML workflows, we make the assumption that you have already spent significant time wrangling, cleaning and getting to know your data via exploratory data analysis. This would allow you to perform many necessary tasks prior to the ML tasks outlined in this course such as:

Feature selection (i.e., removing unnecessary variables and retaining only those variables you wish to include in your modeling process).
Recoding variable names and values so that they are meaningful and more interpretable.
Tidying data so that each column is a discrete variable and each row is an individual observation.
Recoding, removing, or some other approach to handling missing values.

Consequently, the exemplar data sets we use throughout this course have, for the most part, gone through the necessary cleaning processes. These are fairly common data sets that provide good benchmarks to compare and illustrate ML workflows. Although some of these data sets are available in R, we will import them from .csv files to ensure consistency over time.

2.5.1 Boston housing

The Boston Housing data set is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. Originally published in Harrison Jr and Rubinfeld (1978), it contains 13 attributes to predict the median property value.

problem type: supervised regression
response variable: cmedv, the corrected median value of owner-occupied homes in USD 1000’s (e.g., 21.8, 24.5)
features: 15 in the imported file (the 13 original attributes plus lon and lat)
observations: 506
objective: use property attributes to predict the median value of owner-occupied homes

# data file path
library(here)
data_path <- here("data")

# access data
boston <- readr::read_csv(here(data_path, "boston.csv"))

# initial dimension
dim(boston)
## [1] 506  16

# features
dplyr::select(boston, -cmedv)
## # A tibble: 506 × 15
##      lon   lat    crim    zn indus  chas   nox    rm   age   dis   rad
##    <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 -71.0  42.3 0.00632  18    2.31     0 0.538  6.58  65.2  4.09     1
##  2 -71.0  42.3 0.0273    0    7.07     0 0.469  6.42  78.9  4.97     2
##  3 -70.9  42.3 0.0273    0    7.07     0 0.469  7.18  61.1  4.97     2
##  4 -70.9  42.3 0.0324    0    2.18     0 0.458  7.00  45.8  6.06     3
##  5 -70.9  42.3 0.0690    0    2.18     0 0.458  7.15  54.2  6.06     3
##  6 -70.9  42.3 0.0298    0    2.18     0 0.458  6.43  58.7  6.06     3
##  7 -70.9  42.3 0.0883   12.5  7.87     0 0.524  6.01  66.6  5.56     5
##  8 -70.9  42.3 0.145    12.5  7.87     0 0.524  6.17  96.1  5.95     5
##  9 -70.9  42.3 0.211    12.5  7.87     0 0.524  5.63 100    6.08     5
## 10 -70.9  42.3 0.170    12.5  7.87     0 0.524  6.00  85.9  6.59     5
## # ℹ 496 more rows
## # ℹ 4 more variables: tax <dbl>, ptratio <dbl>, b <dbl>, lstat <dbl>

# response variable
head(boston$cmedv)
## [1] 24.0 21.6 34.7 33.4 36.2 28.7

2.5.2 Pima Indians Diabetes

A population of women who were at least 21 years old, of Pima Indian heritage and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases and published in Smith et al. (1988); the data set contains 8 attributes to predict the presence of diabetes.

problem type: supervised binary classification
response variable: diabetes positive or negative response (i.e. “pos”, “neg”)
features: 8
observations: 768
objective: use biological attributes to predict the presence of diabetes

# access data
pima <- readr::read_csv(here(data_path, "pima.csv"))

# initial dimension
dim(pima)
## [1] 768   9

# features
dplyr::select(pima, -diabetes)
## # A tibble: 768 × 8
##    pregnant glucose pressure triceps insulin  mass pedigree   age
##       <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl>
##  1        6     148       72      35       0  33.6    0.627    50
##  2        1      85       66      29       0  26.6    0.351    31
##  3        8     183       64       0       0  23.3    0.672    32
##  4        1      89       66      23      94  28.1    0.167    21
##  5        0     137       40      35     168  43.1    2.29     33
##  6        5     116       74       0       0  25.6    0.201    30
##  7        3      78       50      32      88  31      0.248    26
##  8       10     115        0       0       0  35.3    0.134    29
##  9        2     197       70      45     543  30.5    0.158    53
## 10        8     125       96       0       0   0      0.232    54
## # ℹ 758 more rows

# response variable
head(pima$diabetes)
## [1] "pos" "neg" "pos" "neg" "pos" "neg"
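
Because `readr::read_csv()` imports text columns as character vectors, the response above is a character vector. Many R modeling functions, including parsnip's classification models, expect a factor outcome, so a conversion step is usually needed before modeling. The sketch below uses only the six values printed above, typed in by hand; the level order (`"neg"` first) is an assumption, not something the course specifies.

```r
# The six response values shown above, entered manually for illustration
diabetes <- c("pos", "neg", "pos", "neg", "pos", "neg")

# Convert to a factor with an explicit, assumed level order
diabetes_fct <- factor(diabetes, levels = c("neg", "pos"))

levels(diabetes_fct)   # the two classes the model will predict
table(diabetes_fct)    # class counts in this small sample
```

On the full data set the same conversion could be applied to the `diabetes` column of `pima`.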

2.5.3 Iris flowers

The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper (R. A. Fisher 1936). It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.

problem type: supervised multinomial classification
response variable: species (i.e. “setosa”, “virginica”, “versicolor”)
features: 4
observations: 150
objective: use flower attributes (sepal and petal measurements) to predict the species of flower

# access data
iris <- readr::read_csv(here(data_path, "iris.csv"))

# initial dimension
dim(iris)
## [1] 150   5

# features
dplyr::select(iris, -Species)
## # A tibble: 150 × 4
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
##           <dbl>       <dbl>        <dbl>       <dbl>
##  1          5.1         3.5          1.4         0.2
##  2          4.9         3            1.4         0.2
##  3          4.7         3.2          1.3         0.2
##  4          4.6         3.1          1.5         0.2
##  5          5           3.6          1.4         0.2
##  6          5.4         3.9          1.7         0.4
##  7          4.6         3.4          1.4         0.3
##  8          5           3.4          1.5         0.2
##  9          4.4         2.9          1.4         0.2
## 10          4.9         3.1          1.5         0.1
## # ℹ 140 more rows

# response variable
head(iris$Species)
## [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"

2.5.4 Ames housing

The Ames housing data set is an alternative to the Boston housing data set and provides a more comprehensive set of home features to predict sales price. More information can be found in De Cock (2011).

problem type: supervised regression
response variable: Sale_Price (e.g., $195,000, $215,000)
features: 80
observations: 2,930
objective: use property attributes to predict the sale price of a home

# access data
ames <- readr::read_csv(here(data_path, "ames.csv"))

# initial dimension
dim(ames)
## [1] 2930   81

# features
dplyr::select(ames, -Sale_Price)
## # A tibble: 2,930 × 80
##    MS_SubClass   MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
##    <chr>         <chr>            <dbl>    <dbl> <chr>  <chr> <chr>    
##  1 One_Story_19… Resident…          141    31770 Pave   No_A… Slightly…
##  2 One_Story_19… Resident…           80    11622 Pave   No_A… Regular  
##  3 One_Story_19… Resident…           81    14267 Pave   No_A… Slightly…
##  4 One_Story_19… Resident…           93    11160 Pave   No_A… Regular  
##  5 Two_Story_19… Resident…           74    13830 Pave   No_A… Slightly…
##  6 Two_Story_19… Resident…           78     9978 Pave   No_A… Slightly…
##  7 One_Story_PU… Resident…           41     4920 Pave   No_A… Regular  
##  8 One_Story_PU… Resident…           43     5005 Pave   No_A… Slightly…
##  9 One_Story_PU… Resident…           39     5389 Pave   No_A… Slightly…
## 10 Two_Story_19… Resident…           60     7500 Pave   No_A… Regular  
## # ℹ 2,920 more rows
## # ℹ 73 more variables: Land_Contour <chr>, Utilities <chr>,
## #   Lot_Config <chr>, Land_Slope <chr>, Neighborhood <chr>,
## #   Condition_1 <chr>, Condition_2 <chr>, Bldg_Type <chr>,
## #   House_Style <chr>, Overall_Qual <chr>, Overall_Cond <chr>,
## #   Year_Built <dbl>, Year_Remod_Add <dbl>, Roof_Style <chr>,
## #   Roof_Matl <chr>, Exterior_1st <chr>, Exterior_2nd <chr>, …

# response variable
head(ames$Sale_Price)
## [1] 215000 105000 172000 244000 189900 195500

2.5.5 Attrition

The employee attrition data set was originally provided by IBM Watson Analytics Lab and is a fictional data set created by IBM data scientists to explore what employee attributes influence attrition.

problem type: supervised binary classification
response variable: Attrition (i.e., “Yes”, “No”)
features: 30
observations: 1,470
objective: use employee attributes to predict if they will attrit (leave the company)

# access data
attrition <- readr::read_csv(here(data_path, "attrition.csv"))

# initial dimension
dim(attrition)
## [1] 1470   31

# features
dplyr::select(attrition, -Attrition)
## # A tibble: 1,470 × 30
##      Age BusinessTravel DailyRate Department DistanceFromHome Education
##    <dbl> <chr>              <dbl> <chr>                 <dbl> <chr>    
##  1    41 Travel_Rarely       1102 Sales                     1 College  
##  2    49 Travel_Freque…       279 Research_…                8 Below_Co…
##  3    37 Travel_Rarely       1373 Research_…                2 College  
##  4    33 Travel_Freque…      1392 Research_…                3 Master   
##  5    27 Travel_Rarely        591 Research_…                2 Below_Co…
##  6    32 Travel_Freque…      1005 Research_…                2 College  
##  7    59 Travel_Rarely       1324 Research_…                3 Bachelor 
##  8    30 Travel_Rarely       1358 Research_…               24 Below_Co…
##  9    38 Travel_Freque…       216 Research_…               23 Bachelor 
## 10    36 Travel_Rarely       1299 Research_…               27 Bachelor 
## # ℹ 1,460 more rows
## # ℹ 24 more variables: EducationField <chr>,
## #   EnvironmentSatisfaction <chr>, Gender <chr>, HourlyRate <dbl>,
## #   JobInvolvement <chr>, JobLevel <dbl>, JobRole <chr>,
## #   JobSatisfaction <chr>, MaritalStatus <chr>, MonthlyIncome <dbl>,
## #   MonthlyRate <dbl>, NumCompaniesWorked <dbl>, OverTime <chr>,
## #   PercentSalaryHike <dbl>, PerformanceRating <chr>, …

# response variable
head(attrition$Attrition)
## [1] "Yes" "No"  "Yes" "No"  "No"  "No"

2.5.6 Hitters

This data set was originally taken from the StatLib library, which is maintained at Carnegie Mellon University. The idea was to illustrate if and how major league baseball players’ batting performance could predict their salary. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York. Note that the data do contain each player’s name, but this is not a valid feature and should be removed during analysis.

problem type: supervised regression
response variable: Salary
features: 19
observations: 322
objective: use baseball players’ batting attributes to predict their salary

# access data
hitters <- readr::read_csv(here(data_path, "hitters.csv"))

# initial dimension
dim(hitters)
## [1] 322  21

# features
dplyr::select(hitters, -Salary, -Player)
## # A tibble: 322 × 19
##    AtBat  Hits HmRun  Runs   RBI Walks Years CAtBat CHits CHmRun CRuns
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1   293    66     1    30    29    14     1    293    66      1    30
##  2   315    81     7    24    38    39    14   3449   835     69   321
##  3   479   130    18    66    72    76     3   1624   457     63   224
##  4   496   141    20    65    78    37    11   5628  1575    225   828
##  5   321    87    10    39    42    30     2    396   101     12    48
##  6   594   169     4    74    51    35    11   4408  1133     19   501
##  7   185    37     1    23     8    21     2    214    42      1    30
##  8   298    73     0    24    24     7     3    509   108      0    41
##  9   323    81     6    26    32     8     2    341    86      6    32
## 10   401    92    17    49    66    65    13   5206  1332    253   784
## # ℹ 312 more rows
## # ℹ 8 more variables: CRBI <dbl>, CWalks <dbl>, League <chr>,
## #   Division <chr>, PutOuts <dbl>, Assists <dbl>, Errors <dbl>,
## #   NewLeague <chr>

# response variable
head(hitters$Salary)
## [1]    NA 475.0 480.0 500.0  91.5 750.0
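
The first value printed above is `NA`, so at least one player has no recorded salary. A supervised learner needs a known target value for each training observation, so rows like this have to be handled before modeling. The sketch below uses only the six values shown above, typed in by hand; dropping the missing value is just one possible approach and is an assumption here, not the course's prescribed method.

```r
# The six salary values shown above, entered manually for illustration
salary <- c(NA, 475.0, 480.0, 500.0, 91.5, 750.0)

# Count the missing target values
sum(is.na(salary))

# Summaries need na.rm = TRUE, otherwise the result is NA
mean(salary, na.rm = TRUE)

# One simple option: keep only observations with a known target
salary_complete <- salary[!is.na(salary)]
length(salary_complete)
```

On the full data set, the equivalent step could be `tidyr::drop_na(hitters, Salary)`, which removes the rows whose `Salary` is missing.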

2.6 What You’ll Learn Next

The lessons that follow are designed to help you understand the individual sub-tasks of an ML project. The focus is to have an intuitive understanding of each discrete sub-task and algorithm. Once you understand when, where, and why these sub-tasks are performed, you will be able to transfer this knowledge to other projects. The lessons will:

Provide an overview of the ML modeling process:
- data splitting
- model fitting
- model validation and tuning
- performance measurement
- feature engineering
Cover common supervised learners:
- linear regression
- regularized regression
- K-nearest neighbors
- decision trees
- bagging & random forests
- gradient boosting
Cover common unsupervised learners:
- K-means clustering
- Principal component analysis
Along the way you’ll learn about:
- each algorithm’s hyperparameters
- model interpretation
- feature importance
- and more!

2.7 Exercises

Identify four real-life applications of supervised and unsupervised problems.
- Explain what makes these problems supervised versus unsupervised.
- For each problem identify the target variable (if applicable) and potential features.
Identify and contrast a regression problem with a classification problem.
- What is the target variable in each problem and why would being able to accurately predict this target be beneficial to society?
- What are potential features and where could you collect this information?
- What is determining if the problem is a regression or a classification problem?
Identify three open source data sets suitable for machine learning (e.g., https://bit.ly/35wKu5c).
- Explain the type of machine learning models that could be constructed from the data (e.g., supervised versus unsupervised and regression versus classification).
- What are the dimensions of the data?
- Is there a code book that explains who collected the data, why it was originally collected, and what each variable represents?
- If the data set is suitable for supervised learning, which variable(s) could be considered as a useful target? Which variable(s) could be considered as features?
Identify examples of misuse of machine learning in society. What was the ethical concern?

References

De Cock, Dean. 2011. “Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project.” Journal of Statistics Education 19 (3).

Fisher, Ronald A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88.

Harrison Jr, David, and Daniel L Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5 (1): 81–102.