2 Lesson 1a: Intro to machine learning
Machine learning (ML) continues to grow in importance for many organizations across nearly all domains. Some example applications of machine learning in practice include:
- Predicting the likelihood of a patient returning to the hospital (readmission) within 30 days of discharge.
- Segmenting customers based on common attributes or purchasing behavior for targeted marketing.
- Predicting coupon redemption rates for a given marketing campaign.
- Predicting customer churn so an organization can perform preventative intervention.
- And many more!
In essence, these tasks all seek to learn from data. To address each scenario, we can use a given set of features to train an algorithm and extract insights. These algorithms, or learners, can be classified according to the amount and type of supervision needed during training.
2.1 Learning objectives
This lesson will introduce you to some fundamental concepts around ML and this class. By the end of this lesson you will:
- Be able to explain the difference between supervised and unsupervised learning.
- Know when a problem is considered a regression or classification problem.
- Be able to import and explore the data sets we’ll use through various examples.
2.2 Supervised learning
A predictive model is used for tasks that involve the prediction of a given output (or target) using other variables (or features) in the data set. The learning algorithm in a predictive model attempts to discover and model the relationships among the target variable (the variable being predicted) and the other features (aka predictor variables). Examples of predictive modeling include:
- using customer attributes to predict the probability of the customer churning in the next 6 weeks;
- using home attributes to predict the sales price;
- using employee attributes to predict the likelihood of attrition;
- using patient attributes and symptoms to predict the risk of readmission;
- using production attributes to predict time to market.
Each of these examples has a defined learning task; they each intend to use attributes (\(X\)) to predict an outcome measurement (\(Y\)).
Throughout this course we’ll use various terms interchangeably for
- \(X\): “predictor variable”, “independent variable”, “attribute”, “feature”, “predictor”
- \(Y\): “target variable”, “dependent variable”, “response”, “outcome measurement”
The predictive modeling examples above describe what is known as supervised learning. The supervision refers to the fact that the target values provide a supervisory role, indicating to the learner the task it needs to learn. Specifically, given a set of data, the learning algorithm attempts to optimize a function (the algorithmic steps) that maps the feature values to predicted values that are as close to the actual target values as possible.
In supervised learning, the training data you feed the algorithm includes the target values. Consequently, the solutions can be used to help supervise the training process to find the optimal algorithm parameters.
Most supervised learning problems can be bucketed into one of two categories, regression or classification, which we discuss next.
2.2.1 Regression problems
When the objective of our supervised learning is to predict a numeric outcome, we refer to this as a regression problem (not to be confused with linear regression modeling). Regression problems revolve around predicting output that falls on a continuum. In the examples above, predicting home sales prices and time to market reflect a regression problem because the output is numeric and continuous. This means, given the combination of predictor values, the response value could fall anywhere along some continuous spectrum (e.g., the predicted sales price of a particular home could be between $80,000 and $755,000). The figure below illustrates average home sales prices as a function of two home features: year built and total square footage. Depending on the combination of these two features, the expected home sales price could fall anywhere along a plane.
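To make this concrete, here is a minimal sketch of a regression fit; the data are simulated (purely hypothetical numbers that mimic the year built/square footage example), so the focus is on the numeric, continuous nature of the predictions rather than the values themselves.
# simulate hypothetical home data (values are illustrative only)
set.seed(123)
homes <- data.frame(
  year_built = sample(1900:2010, 100, replace = TRUE),
  sqft       = round(runif(100, 800, 4000))
)
homes$sale_price <- 50000 + 40 * homes$sqft +
  300 * (homes$year_built - 1900) + rnorm(100, sd = 20000)
# fit a simple linear regression: a numeric response modeled by two features
fit <- lm(sale_price ~ year_built + sqft, data = homes)
# predictions fall along a continuum of possible sale prices
head(predict(fit, homes))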
2.2.2 Classification problems
When the objective of our supervised learning is to predict a categorical outcome, we refer to this as a classification problem. Classification problems most commonly revolve around predicting a binary or multinomial response measure such as:
- Did a customer redeem a coupon (coded as yes/no or 1/0)?
- Did a customer churn (coded as yes/no or 1/0)?
- Did a customer click on our online ad (coded as yes/no or 1/0)?
- Classifying customer reviews:
- Binary: positive vs. negative.
- Multinomial: extremely negative to extremely positive on a 0–5 Likert scale.
However, when we apply machine learning models to classification problems, rather than predict a particular class (i.e., “yes” or “no”), we often want to predict the probability of each class (e.g., yes: 0.65, no: 0.35). By default, the class with the highest predicted probability becomes the predicted class. Consequently, even though we are solving a classification problem, the model still produces a numeric output (a probability); the nature of the problem, however, still makes it a classification problem.
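As a minimal sketch (using simulated, purely hypothetical churn data), the code below fits a logistic regression with base R’s glm(), predicts class probabilities, and then converts each probability to a predicted class by choosing the most probable label:
# simulate hypothetical churn data (illustrative only)
set.seed(123)
churn_df <- data.frame(
  tenure  = runif(200, 0, 60),
  charges = runif(200, 20, 120)
)
churn_df$churn <- rbinom(200, 1, plogis(1 - 0.05 * churn_df$tenure))
# logistic regression predicts a probability for each observation
glm_fit <- glm(churn ~ tenure + charges, data = churn_df, family = binomial)
prob <- predict(glm_fit, type = "response")
# the class with the highest probability becomes the predicted class
pred_class <- ifelse(prob > 0.5, "yes", "no")
head(data.frame(prob = round(prob, 2), class = pred_class))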
Some machine learning algorithms can be applied to regression problems but not classification, and vice versa; however, many of the supervised learning algorithms we cover in this class can be applied to both, and these have become some of the most popular machine learning algorithms in recent years.
2.2.3 Knowledge check
Identify the features, response variable, and the type of supervised model required for the following tasks:
- An online retailer wants to predict whether you will click on a featured product given your demographics, the current products in your online basket, and the time since your previous purchase.
- A bank wants to use a customer’s historical data, such as the number of loans they’ve had, the time it took to pay off those loans, previous loan defaults, and the number of new loans within the past two years, along with the customer’s income and level of education, to determine whether to issue a new car loan.
- If the bank above does issue a new loan, they want to use the same information to determine the interest rate of the new loan issued.
- To better plan incoming and outgoing flights, an airline wants to use flight information such as scheduled flight time, day/month of year, number of passengers, airport departing from, airport arriving to, distance to travel, and weather warnings to determine if a flight will be delayed.
- What if the above airline wants to use the same information to predict the number of minutes a flight will arrive late or early?
2.3 Unsupervised learning
Unsupervised learning, in contrast to supervised learning, includes a set of statistical tools to better understand and describe your data, but performs the analysis without a target variable. In essence, unsupervised learning is concerned with identifying groups in a data set. The groups may be defined by the rows (i.e., clustering) or the columns (i.e., dimension reduction); however, the motive in each case is quite different.
The goal of clustering is to segment observations into similar groups based on the observed variables; for example, to divide consumers into different homogeneous groups, a process known as market segmentation. In dimension reduction, we are often concerned with reducing the number of variables in a data set. For example, classical linear regression models break down in the presence of highly correlated features. Some dimension reduction techniques can be used to reduce the feature set to a potentially smaller set of uncorrelated variables. Such a reduced feature set is often used as input to downstream supervised learning models (e.g., principal component regression).
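As a quick, hedged illustration of both ideas with base R functions (using the built-in mtcars data purely as a stand-in):
# clustering: segment rows (observations) into similar groups
km <- kmeans(scale(mtcars), centers = 3, nstart = 25)
table(km$cluster)
# dimension reduction: collapse correlated columns into fewer components
pca <- prcomp(mtcars, scale. = TRUE)
summary(pca)        # variance captured by each principal component
head(pca$x[, 1:2])  # first two components for each observation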
Unsupervised learning is often performed as part of an exploratory data analysis (EDA). However, the exercise tends to be more subjective, and there is no simple goal for the analysis, such as prediction of a response. Furthermore, it can be hard to assess the quality of results obtained from unsupervised learning methods. The reason for this is simple. If we fit a predictive model using a supervised learning technique (e.g., linear regression), then it is possible to check our work by seeing how well our model predicts the response \(Y\) on observations not used in fitting the model. However, in unsupervised learning, there is no way to check our work because we don’t know the true answer—the problem is unsupervised!
Despite its subjectivity, the importance of unsupervised learning should not be overlooked and such techniques are often used in organizations to:
- Divide consumers into different homogeneous groups so that tailored marketing strategies can be developed and deployed for each segment.
- Identify groups of online shoppers with similar browsing and purchase histories, as well as items that are of particular interest to the shoppers within each group. Then an individual shopper can be preferentially shown the items in which he or she is particularly likely to be interested, based on the purchase histories of similar shoppers.
- Identify products that have similar purchasing behavior so that managers can manage them as product groups.
These questions, and many more, can be addressed with unsupervised learning. Moreover, the outputs of unsupervised learning models can be used as inputs to downstream supervised learning models.
2.3.1 Knowledge check
Identify the type of unsupervised model required for the following tasks:
- Say you have a YouTube channel. You may have a lot of data about the subscribers of your channel. What if you want to use that data to detect groups of similar subscribers?
- Say you’d like to group Ohio counties together based on the demographics of their residents.
- A retailer has collected hundreds of attributes about all their customers; however, many of those features are highly correlated. They’d like to reduce the number of features down by combining all those highly correlated features into groups.
2.4 Machine learning in R
Historically, the R ecosystem has provided a wide variety of ML algorithm implementations spread across many packages. This has its benefits; however, it also has drawbacks, as it requires users to learn many different formula interfaces and syntax nuances.
More recently, development of a collection of packages called tidymodels has helped make implementation easier. The tidymodels collection allows you to perform discrete parts of the ML workflow with discrete packages:
- rsample for data splitting and resampling
- recipes for data pre-processing and feature engineering
- parsnip for applying algorithms
- tune for hyperparameter tuning
- yardstick for measuring model performance
- and several others!
Throughout this course you’ll be exposed to several of these packages. Go ahead and make sure you have the following packages installed.
Just like the tidyverse package, when you install tidymodels you are actually installing several packages that exist in the tidymodels ecosystem as discussed above.
# common data wrangling and visualization
install.packages("tidyverse")
install.packages("vip")
install.packages("here")
# modeling
install.packages("tidymodels")
packageVersion("tidymodels")
## [1] '1.2.0'
library(tidymodels)
## ── Attaching packages ───────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.2.1 ✔ tibble 3.2.1
## ✔ dplyr 1.1.4 ✔ tidyr 1.3.1
## ✔ infer 1.0.7 ✔ tune 1.2.1
## ✔ modeldata 1.4.0 ✔ workflows 1.1.4
## ✔ parsnip 1.2.1 ✔ workflowsets 1.1.0
## ✔ purrr 1.0.2 ✔ yardstick 1.3.1
## ✔ recipes 1.1.0
## ── Conflicts ──────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks plotly::filter(), stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
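To give a feel for how these packages fit together, here is a minimal, hedged sketch of a workflow using the built-in mtcars data (not one of our course data sets); we cover each of these steps properly in later lessons.
# rsample: split the data into training and test sets
set.seed(123)
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)
# parsnip: specify and fit a model
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ ., data = car_train)
# yardstick: measure performance on the holdout data
predict(lm_fit, car_test) %>%
  dplyr::bind_cols(car_test) %>%
  rmse(truth = mpg, estimate = .pred)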
2.4.1 Knowledge check
Check out the tidymodels website: https://www.tidymodels.org/. Identify which packages can be used for:
- Efficiently splitting your data
- Optimizing hyperparameters
- Measuring the effectiveness of your model
- Working with correlation matrices
2.5 The data sets
The data sets chosen for this course allow us to illustrate the different features of the presented machine learning algorithms. Since the goal of this course is to demonstrate how to implement ML workflows, we assume that you have already spent significant time wrangling, cleaning, and getting to know your data via exploratory data analysis. In practice, this means performing tasks such as the following before the ML tasks outlined in this course (a brief code sketch of these steps follows below):
- Feature selection (i.e., removing unnecessary variables and retaining only those variables you wish to include in your modeling process).
- Recoding variable names and values so that they are meaningful and more interpretable.
- Tidying data so that each column is a discrete variable and each row is an individual observation.
- Recoding, removing, or some other approach to handling missing values.
Consequently, the exemplar data sets we use throughout this book have, for the most part, gone through the necessary cleaning processes. As mentioned above, these data sets are fairly common data sets that provide good benchmarks to compare and illustrate ML workflows. Although some of these data sets are available in R, we will import these data sets from a .csv file to ensure consistency over time.
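For reference, here is a minimal sketch of what those pre-modeling steps might look like with the tidyverse; the raw_df object and its column names are hypothetical and exist only for illustration (this assumes the packages loaded above).
# hypothetical raw data; columns are illustrative only
raw_df <- tibble::tibble(
  cust_id = 1:4,
  age     = c(34, 51, NA, 42),
  income  = c(52000, 88000, 61000, 47000),
  churned = c(1, 0, 0, 1)
)
cleaned_df <- raw_df %>%
  dplyr::select(cust_id, age, income, churned) %>%                 # keep only relevant variables
  dplyr::rename(customer_id = cust_id) %>%                         # meaningful variable names
  dplyr::mutate(churned = ifelse(churned == 1, "yes", "no")) %>%   # recode values
  tidyr::drop_na()                                                 # drop rows with missing values
cleaned_df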
2.5.1 Boston housing
The Boston housing data set is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. Originally published in Harrison Jr. and Rubinfeld (1978), it contains 13 property and neighborhood attributes used to predict the median home value; the version we import below also includes longitude and latitude coordinates.
- problem type: supervised regression
- response variable: cmedv (corrected median value of owner-occupied homes in USD 1000s, e.g., 21.8, 24.5)
- features: 15 (the original 13 attributes plus longitude and latitude)
- observations: 506
- objective: use property attributes to predict the median value of owner-occupied homes
# data file path
library(here)
data_path <- here("data")
# access data
boston <- readr::read_csv(here(data_path, "boston.csv"))
# initial dimension
dim(boston)
## [1] 506 16
# features
dplyr::select(boston, -cmedv)
## # A tibble: 506 × 15
## lon lat crim zn indus chas nox rm age dis rad
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -71.0 42.3 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1
## 2 -71.0 42.3 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2
## 3 -70.9 42.3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2
## 4 -70.9 42.3 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3
## 5 -70.9 42.3 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3
## 6 -70.9 42.3 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3
## 7 -70.9 42.3 0.0883 12.5 7.87 0 0.524 6.01 66.6 5.56 5
## 8 -70.9 42.3 0.145 12.5 7.87 0 0.524 6.17 96.1 5.95 5
## 9 -70.9 42.3 0.211 12.5 7.87 0 0.524 5.63 100 6.08 5
## 10 -70.9 42.3 0.170 12.5 7.87 0 0.524 6.00 85.9 6.59 5
## # ℹ 496 more rows
## # ℹ 4 more variables: tax <dbl>, ptratio <dbl>, b <dbl>, lstat <dbl>
# response variable
head(boston$cmedv)
## [1] 24.0 21.6 34.7 33.4 36.2 28.7
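Since this is a regression problem, a quick look at the distribution of the response is often worthwhile before modeling; a minimal sketch with ggplot2 (installed as part of the tidyverse):
# distribution of the response variable
library(ggplot2)
ggplot(boston, aes(x = cmedv)) +
  geom_histogram(bins = 30) +
  labs(x = "Median home value ($1000s)", y = "Count")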
2.5.2 Pima Indians Diabetes
A population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases and published in Smith et al. (1988); the data set contains 8 attributes to predict the presence of diabetes.
- problem type: supervised binary classification
- response variable: diabetes (positive or negative response, i.e., “pos”, “neg”)
- features: 8
- observations: 768
- objective: use biological attributes to predict the presence of diabetes
# access data
pima <- readr::read_csv(here(data_path, "pima.csv"))
# initial dimension
dim(pima)
## [1] 768 9
# features
dplyr::select(pima, -diabetes)
## # A tibble: 768 × 8
## pregnant glucose pressure triceps insulin mass pedigree age
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 148 72 35 0 33.6 0.627 50
## 2 1 85 66 29 0 26.6 0.351 31
## 3 8 183 64 0 0 23.3 0.672 32
## 4 1 89 66 23 94 28.1 0.167 21
## 5 0 137 40 35 168 43.1 2.29 33
## 6 5 116 74 0 0 25.6 0.201 30
## 7 3 78 50 32 88 31 0.248 26
## 8 10 115 0 0 0 35.3 0.134 29
## 9 2 197 70 45 543 30.5 0.158 53
## 10 8 125 96 0 0 0 0.232 54
## # ℹ 758 more rows
# response variable
head(pima$diabetes)
## [1] "pos" "neg" "pos" "neg" "pos" "neg"
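For classification problems it is also worth checking how balanced the response classes are; a quick sketch:
# class balance of the response
dplyr::count(pima, diabetes)
# or as proportions
prop.table(table(pima$diabetes))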
2.5.3 Iris flowers
The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper (R. A. Fisher 1936). It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
- problem type: supervised multinomial classification
- response variable: Species (i.e., “setosa”, “virginica”, “versicolor”)
- features: 4
- observations: 150
- objective: use sepal and petal measurements to predict the species of flower
# access data
iris <- readr::read_csv(here(data_path, "iris.csv"))
# initial dimension
dim(iris)
## [1] 150 5
# features
dplyr::select(iris, -Species)
## # A tibble: 150 × 4
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## <dbl> <dbl> <dbl> <dbl>
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## 8 5 3.4 1.5 0.2
## 9 4.4 2.9 1.4 0.2
## 10 4.9 3.1 1.5 0.1
## # ℹ 140 more rows
# response variable
head(iris$Species)
## [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
2.5.4 Ames housing
The Ames housing data set is an alternative to the Boston housing data set and provides a more comprehensive set of home features to predict sales price. More information can be found in De Cock (2011).
- problem type: supervised regression
- response variable: Sale_Price (e.g., $195,000, $215,000)
- features: 80
- observations: 2,930
- objective: use property attributes to predict the sale price of a home
# access data
ames <- readr::read_csv(here(data_path, "ames.csv"))
# initial dimension
dim(ames)
## [1] 2930 81
# features
dplyr::select(ames, -Sale_Price)
## # A tibble: 2,930 × 80
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 One_Story_19… Resident… 141 31770 Pave No_A… Slightly…
## 2 One_Story_19… Resident… 80 11622 Pave No_A… Regular
## 3 One_Story_19… Resident… 81 14267 Pave No_A… Slightly…
## 4 One_Story_19… Resident… 93 11160 Pave No_A… Regular
## 5 Two_Story_19… Resident… 74 13830 Pave No_A… Slightly…
## 6 Two_Story_19… Resident… 78 9978 Pave No_A… Slightly…
## 7 One_Story_PU… Resident… 41 4920 Pave No_A… Regular
## 8 One_Story_PU… Resident… 43 5005 Pave No_A… Slightly…
## 9 One_Story_PU… Resident… 39 5389 Pave No_A… Slightly…
## 10 Two_Story_19… Resident… 60 7500 Pave No_A… Regular
## # ℹ 2,920 more rows
## # ℹ 73 more variables: Land_Contour <chr>, Utilities <chr>,
## # Lot_Config <chr>, Land_Slope <chr>, Neighborhood <chr>,
## # Condition_1 <chr>, Condition_2 <chr>, Bldg_Type <chr>,
## # House_Style <chr>, Overall_Qual <chr>, Overall_Cond <chr>,
## # Year_Built <dbl>, Year_Remod_Add <dbl>, Roof_Style <chr>,
## # Roof_Matl <chr>, Exterior_1st <chr>, Exterior_2nd <chr>, …
# response variable
head(ames$Sale_Price)
## [1] 215000 105000 172000 244000 189900 195500
2.5.5 Attrition
The employee attrition data set was originally provided by IBM Watson Analytics Lab and is a fictional data set created by IBM data scientists to explore what employee attributes influence attrition.
- problem type: supervised binomial classification
- response variable: Attrition (i.e., “Yes”, “No”)
- features: 30
- observations: 1,470
- objective: use employee attributes to predict if they will attrit (leave the company)
# access data
attrition <- readr::read_csv(here(data_path, "attrition.csv"))
# initial dimension
dim(attrition)
## [1] 1470 31
# features
dplyr::select(attrition, -Attrition)
## # A tibble: 1,470 × 30
## Age BusinessTravel DailyRate Department DistanceFromHome Education
## <dbl> <chr> <dbl> <chr> <dbl> <chr>
## 1 41 Travel_Rarely 1102 Sales 1 College
## 2 49 Travel_Freque… 279 Research_… 8 Below_Co…
## 3 37 Travel_Rarely 1373 Research_… 2 College
## 4 33 Travel_Freque… 1392 Research_… 3 Master
## 5 27 Travel_Rarely 591 Research_… 2 Below_Co…
## 6 32 Travel_Freque… 1005 Research_… 2 College
## 7 59 Travel_Rarely 1324 Research_… 3 Bachelor
## 8 30 Travel_Rarely 1358 Research_… 24 Below_Co…
## 9 38 Travel_Freque… 216 Research_… 23 Bachelor
## 10 36 Travel_Rarely 1299 Research_… 27 Bachelor
## # ℹ 1,460 more rows
## # ℹ 24 more variables: EducationField <chr>,
## # EnvironmentSatisfaction <chr>, Gender <chr>, HourlyRate <dbl>,
## # JobInvolvement <chr>, JobLevel <dbl>, JobRole <chr>,
## # JobSatisfaction <chr>, MaritalStatus <chr>, MonthlyIncome <dbl>,
## # MonthlyRate <dbl>, NumCompaniesWorked <dbl>, OverTime <chr>,
## # PercentSalaryHike <dbl>, PerformanceRating <chr>, …
# response variable
head(attrition$Attrition)
## [1] "Yes" "No" "Yes" "No" "No" "No"
2.5.6 Hitters
This data set was originally taken from the StatLib library maintained at Carnegie Mellon University. The idea is to illustrate whether, and how well, a major league baseball player’s batting performance can predict their salary. The salary data were originally from Sports Illustrated, April 20, 1987. The 1986 and career statistics were obtained from The 1987 Baseball Encyclopedia Update published by Collier Books, Macmillan Publishing Company, New York. Note that the data contain the player’s name; this is not a valid feature and should be removed during analysis.
- problem type: supervised regression
- response variable: Salary
- features: 19
- observations: 322
- objective: use a baseball player’s batting attributes to predict their salary
# access data
hitters <- readr::read_csv(here(data_path, "hitters.csv"))
# initial dimension
dim(hitters)
## [1] 322 21
# features
dplyr::select(hitters, -Salary, -Player)
## # A tibble: 322 × 19
## AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 293 66 1 30 29 14 1 293 66 1 30
## 2 315 81 7 24 38 39 14 3449 835 69 321
## 3 479 130 18 66 72 76 3 1624 457 63 224
## 4 496 141 20 65 78 37 11 5628 1575 225 828
## 5 321 87 10 39 42 30 2 396 101 12 48
## 6 594 169 4 74 51 35 11 4408 1133 19 501
## 7 185 37 1 23 8 21 2 214 42 1 30
## 8 298 73 0 24 24 7 3 509 108 0 41
## 9 323 81 6 26 32 8 2 341 86 6 32
## 10 401 92 17 49 66 65 13 5206 1332 253 784
## # ℹ 312 more rows
## # ℹ 8 more variables: CRBI <dbl>, CWalks <dbl>, League <chr>,
## # Division <chr>, PutOuts <dbl>, Assists <dbl>, Errors <dbl>,
## # NewLeague <chr>
# response variable
head(hitters$Salary)
## [1] NA 475.0 480.0 500.0 91.5 750.0
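Notice the NA in the first Salary value: the response contains missing values that must be handled before modeling. One simple option, sketched below, is to drop those rows.
# how many players are missing a salary?
sum(is.na(hitters$Salary))
# drop rows with a missing response
hitters_complete <- tidyr::drop_na(hitters, Salary)
dim(hitters_complete)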
2.6 What you’ll learn next
The lessons that follow are designed to help you understand the individual sub-tasks of an ML project. The focus is to have an intuitive understanding of each discrete sub-task and algorithm. Once you understand when, where, and why these sub-tasks are performed you will be able to transfer this knowledge to other projects. The concepts you will learn include:
- An overview of the ML modeling process:
- data splitting
- model fitting
- model validation and tuning
- performance measurement
- feature engineering
- Common supervised learners:
- linear regression
- regularized regression
- K-nearest neighbors
- decision trees
- bagging & random forests
- gradient boosting
- Common unsupervised learners:
- K-means clustering
- Principal component analysis
- Along the way you’ll learn about:
- each algorithm’s hyperparameters
- model interpretation
- feature importance
- and more!
2.7 Exercises
- Identify four real-life applications of supervised and unsupervised problems.
- Explain what makes these problems supervised versus unsupervised.
- For each problem identify the target variable (if applicable) and potential features.
- Identify and contrast a regression problem with a classification problem.
- What is the target variable in each problem and why would being able to accurately predict this target be beneficial to society?
- What are potential features and where could you collect this information?
- What is determining if the problem is a regression or a classification problem?
- Identify three open source data sets suitable for machine learning (e.g., https://bit.ly/35wKu5c).
- Explain the type of machine learning models that could be constructed from the data (e.g., supervised versus unsupervised and regression versus classification).
- What are the dimensions of the data?
- Is there a code book that explains who collected the data, why it was originally collected, and what each variable represents?
- If the data set is suitable for supervised learning, which variable(s) could be considered as a useful target? Which variable(s) could be considered as features?
- Identify examples of misuse of machine learning in society. What was the ethical concern?