Decision Trees
Will a customer redeem a coupon?
If Loyal Customer = Yes and Household Income >= $150K and Shopping Mode = Store, then Coupon Redemption = Yes
Most common decision tree algorithms:
Regression tree
Classification tree
Objective: Minimize dissimilarity in the terminal nodes
Say we have data generated from an underlying "truth" function:
```
## Model formula:
## y ~ x
## 
## Fitted party:
## [1] root
## |   [2] x >= 3.07863: -0.665 (n = 255, err = 95.5)
## |   [3] x < 3.07863: 0.640 (n = 245, err = 75.9)
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2
```
```
## Model formula:
## y ~ x
## 
## Fitted party:
## [1] root
## |   [2] x >= 3.07863
## |   |   [3] x >= 3.65785
## |   |   |   [4] x < 5.53399: -0.948 (n = 149, err = 40.0)
## |   |   |   [5] x >= 5.53399: -0.316 (n = 60, err = 15.6)
## |   |   [6] x < 3.65785
## |   |   |   [7] x < 3.20455: -0.476 (n = 10, err = 0.9)
## |   |   |   [8] x >= 3.20455: -0.130 (n = 36, err = 9.0)
## |   [9] x < 3.07863
## |   |   [10] x < 0.52255
## |   |   |   [11] x < 0.28331: 0.142 (n = 23, err = 4.8)
## |   |   |   [12] x >= 0.28331: 0.390 (n = 19, err = 5.1)
## |   |   [13] x >= 0.52255
## |   |   |   [14] x >= 2.26018: 0.440 (n = 65, err = 13.7)
## |   |   |   [15] x < 2.26018: 0.852 (n = 138, err = 36.6)
## 
## Number of inner nodes:    7
## Number of terminal nodes: 8
```
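The slides do not show the data-generating or model-fitting code, but here is a minimal sketch of how trees like the ones printed above could be produced; the sine-shaped "truth" function, noise level, and seed are assumptions for illustration.

```r
library(rpart)     # CART-style regression trees
library(partykit)  # prints trees in the "party" format shown above

set.seed(123)
n  <- 500                           # 500 observations, matching n = 255 + 245 above
x  <- runif(n, 0, 6)
y  <- sin(x) + rnorm(n, sd = 0.3)   # assumed noisy sine "truth" function
df <- data.frame(x, y)

# shallow tree: a single split (two terminal nodes)
fit_shallow <- rpart(y ~ x, data = df, maxdepth = 1)

# deeper tree: up to three levels of splits (up to eight terminal nodes)
fit_deep <- rpart(y ~ x, data = df, maxdepth = 3, cp = 0)

as.party(fit_shallow)   # print in the format shown above
as.party(fit_deep)
```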
Must balance the depth and complexity of the tree to generalize to unseen data
2 main options:
Early stopping
Pruning
Trees have a tendency to overfit
Limit tree depth: Stop splitting after a certain depth
Minimum node "size": Do not split an intermediate node that contains too few data points
Deep trees overfit
Grow a very large tree
Prune it back using a cost complexity penalty: the cost complexity parameter (α) times the number of terminal nodes (|T|). The optimal subtree solves:

minimize: loss function + α|T|
Penalize depth to generalize
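A minimal sketch of the grow-then-prune workflow with rpart, reusing the simulated df from the sketch above (rpart calls the cost complexity parameter α "cp"):

```r
library(rpart)

# 1. grow a very large tree: cp = 0 applies no complexity penalty while growing
fit_big <- rpart(y ~ x, data = df, cp = 0, minsplit = 2)

# 2. inspect the cross-validated error for each value of cp (alpha)
printcp(fit_big)

# 3. prune back to the subtree whose cp minimizes the cross-validated error
best_cp    <- fit_big$cptable[which.min(fit_big$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_big, cp = best_cp)
```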
Monotonic transformations (e.g., log, exp, sqrt): not required to meet algorithm assumptions as in many parametric models; they only shift the optimal split points (see the sketch after this list).
Removing outliers: unnecessary, since each decision is a single binary split and outliers will not bias that split.
One-hot encoding: unnecessary, and it actually forces artificial relationships between categorical levels. Also, by increasing p, we reduce the probability that influential levels and variable interactions will be identified.
Missing values: unnecessary, as most algorithms will 1) create a new "missing" class for categorical variables, 2) auto-impute for continuous variables, or 3) use surrogate splits.
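As a quick illustration of the first point (monotonic transformations only shift the split points), here is a small sketch; the simulated data are an assumption.

```r
library(rpart)

set.seed(123)
x <- runif(200, 1, 100)
y <- log(x) + rnorm(200, sd = 0.1)

fit_raw <- rpart(y ~ x, data = data.frame(x = x,      y = y))
fit_log <- rpart(y ~ x, data = data.frame(x = log(x), y = y))

# the split points differ (they live on different scales)...
fit_raw$splits[1, "index"]
fit_log$splits[1, "index"]

# ...but the resulting partitions, and therefore the fitted values, should be identical
all.equal(predict(fit_raw), predict(fit_log))
```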
Once we have a final model, we can find the most influential variables based on those that have the largest reduction in our loss function:
```
##    Variable Importance
## 1        rm 23825.9224
## 2     lstat 15047.9426
## 3       dis  5385.2076
## 4     indus  5313.9748
## 5       tax  4205.2067
## 6   ptratio  4202.2984
## 7       nox  4166.1230
## 8       age  3969.2913
## 9      crim  2753.2843
## 10       zn  1604.5566
## 11      rad  1007.6588
## 12    black   408.1277
```
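A sketch of how a table like the one above could be produced; judging by the variable names (rm, lstat, crim, ...), the underlying data are assumed to be the Boston housing data from MASS.

```r
library(rpart)
library(MASS)

# regression tree for median home value
fit_boston <- rpart(medv ~ ., data = Boston)

# total reduction in SSE attributable to each variable, summed over its splits
vi <- sort(fit_boston$variable.importance, decreasing = TRUE)
data.frame(Variable = names(vi), Importance = unname(vi))
```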
Small trees are easy to interpret
Trees scale well to large N (fast!!)
Can handle data of all types (i.e., requires little, if any, preprocessing)
Automatic variable selection
Can handle missing data
Completely nonparametric
Large trees can be difficult to interpret
All splits depend on previous splits (good at capturing interactions, but poor at representing additive structure)
Tree predictions are step functions (a consequence of the binary splits)
Single trees typically have poor predictive accuracy
Single trees have high variance (easy to overfit to training data)
Bagging
Single pruned trees are poor predictors
Single deep trees are noisy
Bagging uses this high variance to our advantage
Sample records with replacement (aka "bootstrap" the training data)
Fit an overgrown tree to each resampled data set
Average predictions
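To make the three steps concrete, here is a minimal hand-rolled bagging sketch with rpart. The ames_train / ames_test objects are borrowed from the implementation section later in the deck; any regression data set would do.

```r
library(rpart)

set.seed(123)
n_trees <- 100
preds   <- matrix(NA, nrow = nrow(ames_test), ncol = n_trees)

for (b in seq_len(n_trees)) {
  # 1. bootstrap the training data
  boot_idx <- sample(nrow(ames_train), replace = TRUE)

  # 2. fit an overgrown (unpruned) tree to the resampled data
  fit_b <- rpart(Sale_Price ~ ., data = ames_train[boot_idx, ], cp = 0)

  # 3. store this tree's predictions
  preds[, b] <- predict(fit_b, newdata = ames_test)
}

# average the predictions across all bootstrapped trees
bagged_pred <- rowMeans(preds)
sqrt(mean((ames_test$Sale_Price - bagged_pred)^2))   # bagged test RMSE
```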
As we add more trees...
our average prediction error reduces
Wisdom of the crowd in action
Bagging results in tree correlation...
which prevents bagging from optimally reducing the variance of the predictions
Random Forests
Bagging produces many correlated trees
Follow a similar bagging process but...
each time a split is to be performed, the search for the split variable is limited to a random subset of m of the p variables
Bagging introduces randomness into the rows of the data
Random forest introduces randomness into the rows and columns of the data
Random Forests produce many unique trees
Combined, this provides a more diverse set of trees that almost always lowers our prediction error.
For large enough N, on average, 63.21% of the original records end up in any bootstrap sample
Roughly 36.79% of the observations are not used in the construction of a particular tree
These observations are considered out-of-bag (OOB) and can be used for efficient assessment of model performance (unstructured, but free, cross-validation)
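A quick back-of-the-envelope check of the 63.21% figure: the chance a given record appears in a bootstrap sample of size N is 1 - (1 - 1/N)^N, which approaches 1 - e^(-1) ≈ 0.6321 as N grows.

```r
N       <- c(10, 100, 1000, 1e6)
in_bag  <- 1 - (1 - 1/N)^N    # expected fraction of records in the bootstrap sample
out_bag <- 1 - in_bag         # expected out-of-bag (OOB) fraction
round(data.frame(N, in_bag, out_bag), 4)

1 - exp(-1)   # limiting value: 0.6321206
```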
Pro tip:
Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance:
Number of trees (a)
mtry
Node size (related: required split size, max number of nodes, max depth)
Sampling scheme
Split rule

The number of trees and mtry typically have the largest impact on predictive accuracy. Node size and the sampling scheme tend to have marginal impact on predictive accuracy but are still worth exploring; they can also increase computational efficiency. The split rule is generally used to increase computational efficiency.

(a) Technically, the number of trees is not a real tuning parameter, but it is important to grow a sufficient number of trees for the error estimate to stabilize.
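As a rough guide (not from the original slides), here is how these knobs map onto ranger() arguments; the values are illustrative starting points only, and ames_train refers to the Ames training data created in the implementation section below.

```r
library(ranger)

fit <- ranger(
  formula         = Sale_Price ~ .,
  data            = ames_train,
  num.trees       = 500,           # number of trees
  mtry            = 26,            # variables considered at each split
  min.node.size   = 5,             # node size (tree complexity)
  sample.fraction = 1,             # sampling scheme: fraction of rows per tree...
  replace         = TRUE,          # ...with replacement (the classic bootstrap)
  splitrule       = "variance",    # split rule ("variance", "extratrees", "maxstat")
  seed            = 123
)
```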
We have two approaches for model specific variable importance with random forests:
Impurity: for each variable, sum the reduction in the loss function (e.g., SSE or Gini impurity) attributable to its splits, accumulated across all trees in the forest.
Permutation: for each variable, randomly permute (shuffle) its values and measure how much the OOB prediction error increases; the bigger the increase, the more important the variable.
The two tend to produce similar results but with slight differences in rank order:
(Two variable importance plots: impurity-based vs. permutation-based)
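A minimal sketch of computing both measures with ranger, assuming the ames_train data from the implementation section; vip() can plot either result.

```r
library(ranger)
library(vip)

fit_imp <- ranger(
  Sale_Price ~ ., data = ames_train,
  importance = "impurity", seed = 123
)

fit_perm <- ranger(
  Sale_Price ~ ., data = ames_train,
  importance = "permutation", seed = 123
)

# compare the top-10 rankings
head(sort(fit_imp$variable.importance,  decreasing = TRUE), 10)
head(sort(fit_perm$variable.importance, decreasing = TRUE), 10)

# or plot them
vip(fit_imp,  num_features = 10)
vip(fit_perm, num_features = 10)
```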
Implementation
Random Forest
```r
# general EDA
library(dplyr)
library(ggplot2)

# machine learning
library(ranger)
library(h2o)

library(rsample)  # data splitting
library(vip)      # visualize feature importance
library(pdp)      # visualize feature effects
```
Data
```r
# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
set.seed(8451)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7, strata = "Sale_Price")
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)
```
Key ranger() arguments for our first model:

formula: formula specification
data: training data
num.trees: number of trees in the forest
mtry: number of randomly selected predictor variables considered at each split. The default is floor(√(number of features)); however, for regression problems a better starting point is floor(number of features / 3) = floor(80 / 3) = 26
respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to "order" (see ESL, Section 9.2.4 for details).
seed: because this is a random algorithm, set the seed to get reproducible results
```r
# number of features
features <- setdiff(names(ames_train), "Sale_Price")

# perform basic random forest model
fit_default <- ranger(
  formula   = Sale_Price ~ ., 
  data      = ames_train, 
  num.trees = length(features) * 10,
  mtry      = floor(length(features) / 3),
  respect.unordered.factors = 'order',
  verbose   = FALSE,
  seed      = 123
)
```
```r
# look at results
fit_default
## Ranger result
## 
## Call:
##  ranger(formula = Sale_Price ~ ., data = ames_train, num.trees = length(features) * 10, mtry = floor(length(features)/3), respect.unordered.factors = "order", verbose = FALSE, seed = 123) 
## 
## Type:                             Regression 
## Number of trees:                  800 
## Sample size:                      2054 
## Number of independent variables:  80 
## Mtry:                             26 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       620208087 
## R squared (OOB):                  0.8957654

# compute RMSE (RMSE = square root of MSE)
sqrt(fit_default$prediction.error)
## [1] 24903.98
```
Default results are based on OOB errors
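As a sanity check (a sketch, not shown in the slides), the OOB RMSE can be compared against a held-out test RMSE using the ames_test split created earlier:

```r
pred_test <- predict(fit_default, data = ames_test)$predictions

sqrt(mean((ames_test$Sale_Price - pred_test)^2))   # test RMSE
sqrt(fit_default$prediction.error)                 # OOB RMSE
```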
What we do next should be driven by attributes of our data:
```r
ames_train %>%
  summarise_if(is.factor, n_distinct) %>%
  gather() %>%
  arrange(desc(value))
## # A tibble: 46 x 2
##    key          value
##    <chr>        <int>
##  1 Neighborhood    27
##  2 Exterior_1st    16
##  3 Exterior_2nd    16
##  4 MS_SubClass     15
##  5 Overall_Qual    10
##  6 Sale_Type       10
##  7 Condition_1      9
##  8 Overall_Cond     9
##  9 House_Style      8
## 10 Functional       8
## # ... with 36 more rows
```
```r
cor_matrix <- ames_train %>%
  mutate_if(is.factor, as.numeric) %>%
  cor()

# feature correlation
data_frame(
  row  = rownames(cor_matrix)[row(cor_matrix)[upper.tri(cor_matrix)]],
  col  = colnames(cor_matrix)[col(cor_matrix)[upper.tri(cor_matrix)]],
  corr = cor_matrix[upper.tri(cor_matrix)]
  ) %>%
  arrange(desc(abs(corr)))
## # A tibble: 3,240 x 3
##    row            col            corr
##    <chr>          <chr>         <dbl>
##  1 BsmtFin_Type_1 BsmtFin_SF_1   1   
##  2 Garage_Cars    Garage_Area    0.888
##  3 Exterior_1st   Exterior_2nd   0.856
##  4 Gr_Liv_Area    TotRms_AbvGrd  0.802
##  5 Overall_Qual   Sale_Price     0.800
##  6 Total_Bsmt_SF  First_Flr_SF   0.789
##  7 MS_SubClass    Bldg_Type      0.719
##  8 House_Style    Second_Flr_SF  0.713
##  9 BsmtFin_Type_2 BsmtFin_SF_2  -0.702
## 10 Gr_Liv_Area    Sale_Price     0.694
## # ... with 3,230 more rows

# target correlation
data_frame(
  row  = rownames(cor_matrix)[row(cor_matrix)[upper.tri(cor_matrix)]],
  col  = colnames(cor_matrix)[col(cor_matrix)[upper.tri(cor_matrix)]],
  corr = cor_matrix[upper.tri(cor_matrix)]
  ) %>%
  filter(col == "Sale_Price") %>%
  arrange(desc(abs(corr)))
## # A tibble: 78 x 3
##    row           col          corr
##    <chr>         <chr>       <dbl>
##  1 Overall_Qual  Sale_Price  0.800
##  2 Gr_Liv_Area   Sale_Price  0.694
##  3 Exter_Qual    Sale_Price -0.662
##  4 Garage_Cars   Sale_Price  0.655
##  5 Garage_Area   Sale_Price  0.652
##  6 Total_Bsmt_SF Sale_Price  0.630
##  7 Kitchen_Qual  Sale_Price -0.625
##  8 First_Flr_SF  Sale_Price  0.617
##  9 Bsmt_Qual     Sale_Price -0.575
## 10 Year_Built    Sale_Price  0.571
## # ... with 68 more rows
```
But before we tune, do we have enough trees?
```r
# number of features
n_features <- ncol(ames_train) - 1

# tuning grid
tuning_grid <- expand.grid(
  trees = seq(10, 1000, by = 20),
  rmse  = NA
)

for(i in seq_len(nrow(tuning_grid))) {
  fit <- ranger(
    formula   = Sale_Price ~ ., 
    data      = ames_train, 
    num.trees = tuning_grid$trees[i],
    mtry      = floor(n_features / 3),
    respect.unordered.factors = 'order',
    verbose   = FALSE,
    seed      = 123
  )
  
  tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}
```
```r
ggplot(tuning_grid, aes(trees, rmse)) +
  geom_line(size = 1)
```
Tuning grid
```r
hyper_grid <- expand.grid(
  mtry            = floor(n_features * c(.05, .15, .25, .333, .4)),
  min.node.size   = c(1, 3, 5),
  replace         = c(TRUE, FALSE),
  sample.fraction = c(.5, .63, .8),
  rmse            = NA
)

# number of hyperparameter combinations
nrow(hyper_grid)
## [1] 90

head(hyper_grid)
##   mtry min.node.size replace sample.fraction rmse
## 1    4             1    TRUE             0.5   NA
## 2   12             1    TRUE             0.5   NA
## 3   20             1    TRUE             0.5   NA
## 4   26             1    TRUE             0.5   NA
## 5   32             1    TRUE             0.5   NA
## 6    4             3    TRUE             0.5   NA
```
Grid search execution
```r
for(i in seq_len(nrow(hyper_grid))) {
  # fit model for ith hyperparameter combination
  fit <- ranger(
    formula         = Sale_Price ~ ., 
    data            = ames_train, 
    num.trees       = 1000,
    mtry            = hyper_grid$mtry[i],
    min.node.size   = hyper_grid$min.node.size[i],
    replace         = hyper_grid$replace[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    verbose         = FALSE,
    seed            = 123,
    respect.unordered.factors = 'order'
  )
  
  # export OOB error
  hyper_grid$rmse[i] <- sqrt(fit$prediction.error)
}
```
Our top 10 models:

```r
default_rmse <- sqrt(fit_default$prediction.error)

hyper_grid %>%
  arrange(rmse) %>%
  mutate(perc_gain = (default_rmse - rmse) / default_rmse * 100) %>%
  head(10)
##    mtry min.node.size replace sample.fraction     rmse perc_gain
## 1    20             1   FALSE            0.80 24474.19 1.7257766
## 2    20             5   FALSE            0.80 24485.64 1.6798126
## 3    20             3   FALSE            0.80 24555.24 1.4003421
## 4    26             3   FALSE            0.80 24612.76 1.1693799
## 5    20             1   FALSE            0.63 24613.27 1.1673219
## 6    26             1   FALSE            0.80 24615.42 1.1586911
## 7    26             5   FALSE            0.80 24617.94 1.1485760
## 8    20             3   FALSE            0.63 24642.72 1.0490463
## 9    12             1   FALSE            0.80 24659.98 0.9797534
## 10   12             3   FALSE            0.80 24702.53 0.8089133
```

I would follow this up with an additional grid search that focuses on the best-performing region above (mtry around 20, sampling without replacement, larger sample fractions). Caution: using too high of a sampling fraction without replacement runs the risk of overfitting to your training data!
Once you find your optimal model, re-run it with the importance parameter set so variable importance can be extracted:

```r
fit_final <- ranger(
  formula         = Sale_Price ~ ., 
  data            = ames_train, 
  num.trees       = 2000,
  mtry            = 20,
  min.node.size   = 1,
  sample.fraction = .8,
  replace         = FALSE,
  importance      = 'permutation',
  respect.unordered.factors = 'order',
  verbose         = FALSE,
  seed            = 123
)
```
```r
vip(fit_final, num_features = 15)
```
Partial dependence plots (PDPs), Individual Conditional Expectation (ICE) curves, and other approaches allow us to understand how important variables influence our model's predictions:
PDP: Overall Home Quality
```r
fit_final %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train)) %>%
  autoplot()
```
ICE: Overall Home Quality
```r
fit_final %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train), ice = TRUE) %>%
  autoplot(alpha = 0.05, center = TRUE)
```
PDP: Above Ground SqFt
```r
fit_final %>%
  partial(pred.var = "Gr_Liv_Area", train = as.data.frame(ames_train)) %>%
  autoplot()
```
ICE: Above Ground SqFt
```r
fit_final %>%
  partial(pred.var = "Gr_Liv_Area", train = as.data.frame(ames_train), ice = TRUE) %>%
  autoplot(alpha = 0.05, center = TRUE)
```
Interaction between two influential variables:
```r
fit_final %>%
  partial(
    pred.var = c("Gr_Liv_Area", "Year_Built"),
    train = as.data.frame(ames_train)
    ) %>%
  plotPartial(
    zlab = "Sale_Price",
    levelplot = FALSE,
    drape = TRUE,
    colorkey = FALSE,
    screen = list(z = 50, x = -60)
    )
```
Read more about machine learning interpretation here
"Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem."
--- Leo Breiman
Questions?