Decision Trees, Bagging, & Random Forests

with an example implementation in R

Brad Boehmke

2018-12-05

Slides: bit.ly/random-forests-training


Introduction


About me

  • bradleyboehmke.github.io
  • @bradleyboehmke
  • bradleyboehmke@gmail.com

Family

  • Dayton, OH
  • Kate, Alivia (9), Jules (6)

Professional

  • 84.51° - Data Science Enabler

Academic

  • University of Cincinnati
  • Air Force Institute of Technology

R Ecosystem


Decision Trees


Image credit: giphy

Basic Idea

Will a customer redeem a coupon?


A ruleset model

if Loyal Customer = Yes and Household income >= $150K and Shopping mode = store then coupon redemption = Yes
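
The same rule written as a short R sketch (illustrative only: the customers data frame and the column names are hypothetical stand-ins for the features above):

library(dplyr)

# flag coupon redemption according to the rule above (hypothetical data/columns)
customers %>%
  mutate(
    coupon_redemption = if_else(
      loyal_customer == "Yes" &
        household_income >= 150000 &
        shopping_mode == "store",
      "Yes", "No"
    )
  )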


Terminology


Growing the tree

Algorithms

  • ID3 (Iterative Dichotomiser 3)
  • C4.5 (successor of ID3)
  • CART (Classification And Regression Tree)
  • CHAID (CHi-squared Automatic Interaction Detector)
  • MARS (Multivariate Adaptive Regression Splines)
  • Conditional Inference Trees
  • and more...

CART Features

  • Classification and regression trees
  • Continuous and discrete features
  • Partitioning
    • Greedy top-down
    • Strictly binary splits (tends to produce tall/deep trees)
    • Variance reduction in regression trees
    • Gini impurity in classification trees
  • Cost complexity pruning
  • (Breiman, 1984)


Most common decision tree algorithm


Best Binary Partitioning

Regression tree

Classification tree


Objective: Minimize dissimilarity in terminal nodes


Best Binary Partitioning


  • Numeric feature: Numeric split to minimize loss function




  • Binary feature: Category split to minimize loss function




  • Multiclass feature: Order feature classes based on mean target variable (regression) or class proportion (classification) and choose split to minimize loss function (See ESL, section 9.2.4 for details).
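
To make the split search above concrete, here is a minimal R sketch of an exhaustive search over candidate split points for a single numeric feature in a regression tree, scoring each candidate by the sum of squared errors (the function name and the simulated x/y data are illustrative, not from the original slides):

# return the split point that minimizes SSE in the two resulting nodes
best_numeric_split <- function(x, y) {
  xs <- sort(unique(x))
  splits <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between values
  sse <- sapply(splits, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  splits[which.min(sse)]
}

# example usage with simulated data
set.seed(123)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.3)
best_numeric_split(x, y)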


How deep to grow a tree?

Say we have the given data generated from the underlying "truth" function




Depth = 1 (decision stump)

##
## Model formula:
## y ~ x
##
## Fitted party:
## [1] root
## | [2] x >= 3.07863: -0.665 (n = 255, err = 95.5)
## | [3] x < 3.07863: 0.640 (n = 245, err = 75.9)
##
## Number of inner nodes: 1
## Number of terminal nodes: 2
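
A minimal sketch of how a stump like this might be produced with rpart and partykit; the data here are simulated from a noisy sine curve, which is an assumption since the actual "truth" function from the previous slide is not shown:

library(rpart)
library(partykit)

set.seed(123)
df <- data.frame(x = runif(500, 0, 2 * pi))
df$y <- sin(df$x) + rnorm(500, sd = 0.3)

# depth = 1: allow only a single split (a decision stump)
stump <- rpart(y ~ x, data = df, control = rpart.control(maxdepth = 1))
as.party(stump)   # prints in the "Fitted party" format shown above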


Depth = 3

##
## Model formula:
## y ~ x
##
## Fitted party:
## [1] root
## | [2] x >= 3.07863
## | | [3] x >= 3.65785
## | | | [4] x < 5.53399: -0.948 (n = 149, err = 40.0)
## | | | [5] x >= 5.53399: -0.316 (n = 60, err = 15.6)
## | | [6] x < 3.65785
## | | | [7] x < 3.20455: -0.476 (n = 10, err = 0.9)
## | | | [8] x >= 3.20455: -0.130 (n = 36, err = 9.0)
## | [9] x < 3.07863
## | | [10] x < 0.52255
## | | | [11] x < 0.28331: 0.142 (n = 23, err = 4.8)
## | | | [12] x >= 0.28331: 0.390 (n = 19, err = 5.1)
## | | [13] x >= 0.52255
## | | | [14] x >= 2.26018: 0.440 (n = 65, err = 13.7)
## | | | [15] x < 2.26018: 0.852 (n = 138, err = 36.6)
##
## Number of inner nodes: 7
## Number of terminal nodes: 8


Depth = 20 (complex tree)


Two Predictor Decision Boundaries

Classification problem: Iris data

Classification tree


Minimize overfitting

Must balance the depth and complexity of the tree to generalize to unseen data

2 main options:

  • Early stopping

    • Restrict tree depth
    • Restrict node size
  • Pruning

Trees have a tendency to overfit


Minimize overfitting: Early stopping

Limit tree depth: Stop splitting after a certain depth

Minimum node “size”: Do not split an intermediate node that contains too few data points
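
A minimal sketch of both early-stopping controls using rpart (the parameter values are illustrative, and train_df with target y is a hypothetical placeholder):

library(rpart)

shallow_tree <- rpart(
  y ~ ., data = train_df,
  control = rpart.control(
    maxdepth  = 3,   # restrict tree depth
    minsplit  = 20,  # a node needs at least 20 observations to be split
    minbucket = 10   # every terminal node must contain at least 10 observations
  )
)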


Minimize overfitting: Pruning

  1. Grow a very large tree

Deep trees overfit


Minimize overfitting: Pruning

  1. Grow a very large tree

  2. Prune it back using a cost complexity penalty, a parameter ( α ) times the number of terminal nodes ( |T| ), to find an optimal subtree:

    • Very similar to the lasso penalty in regularized regression
    • Large α = small tree
    • Small α = large tree
    • Find the optimal α with cross-validation

minimize: loss function + α|T|

Penalize depth to generalize
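
A minimal sketch of cost complexity pruning with rpart, where the cp parameter plays the role of α and rpart's built-in cross-validation results (the cptable) are used to pick it (train_df with target y is a hypothetical placeholder):

library(rpart)

# 1. grow a very large tree (cp = 0: no complexity penalty while growing)
deep_tree <- rpart(y ~ ., data = train_df,
                   control = rpart.control(cp = 0, xval = 10))

# 2. pick the alpha (cp) with the lowest cross-validated error ...
cp_table <- deep_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# ... and prune back to the corresponding subtree
pruned_tree <- prune(deep_tree, cp = best_cp)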


Feature/Target Pre-processing Considerations


  • Monotonic transformations (e.g., log, exp, sqrt): Not required to meet algorithm assumptions as in many parametric models; they only shift the optimal split points.

  • Removing outliers: Unnecessary, since each split is a single binary partition and outliers will not bias that split.

  • One-hot encoding: Unnecessary, and it actually forces artificial relationships between categorical levels. Also, by increasing p, we reduce the probability that influential levels and variable interactions will be identified.

  • Missing values: Pre-processing is unnecessary, as most algorithms will 1) create a new "missing" class for categorical variables, 2) auto-impute for continuous variables, or 3) use surrogate splits


Variable importance

Once we have a final model, we can find the most influential variables based on those that have the largest reduction in our loss function:

## Variable Importance
## 1 rm 23825.9224
## 2 lstat 15047.9426
## 3 dis 5385.2076
## 4 indus 5313.9748
## 5 tax 4205.2067
## 6 ptratio 4202.2984
## 7 nox 4166.1230
## 8 age 3969.2913
## 9 crim 2753.2843
## 10 zn 1604.5566
## 11 rad 1007.6588
## 12 black 408.1277
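
The variable names above suggest the Boston housing data, so here is a hedged sketch of how such a table could be produced from a single CART fit (assuming MASS::Boston with medv as the target):

library(rpart)

boston_tree <- rpart(medv ~ ., data = MASS::Boston)

# rpart records the total loss reduction attributed to each variable
vi <- boston_tree$variable.importance
data.frame(Variable = names(vi), Importance = unname(vi))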



Strengths & Weaknesses

Strengths

  • Small trees are easy to interpret

  • Trees scale well to large N (fast!!)

  • Can handle data of all types (i.e., requires little, if any, preprocessing)

  • Automatic variable selection

  • Can handle missing data

  • Completely nonparametric

Weaknesses

  • Large trees can be difficult to interpret

  • All splits depend on previous splits (good at capturing interactions; poor at representing purely additive models)

  • Trees are step functions (i.e., binary splits)

  • Single trees typically have poor predictive accuracy

  • Single trees have high variance (easy to overfit to training data)


Bagging


Image credit: unsplash

The problem with single trees

Single pruned trees are poor predictors

Single deep trees are noisy

Bagging uses this high variance to our advantage


Bootstrap Aggregating: wisdom of the crowd

  1. Sample records with replacement (aka "bootstrap" the training data)

  2. Fit an overgrown tree to each resampled data set

  3. Average predictions
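
A minimal sketch of these three steps done by hand with rpart (illustrative only; packages such as ipred and randomForest implement bagging for you, and train_df with target y and test_df are hypothetical placeholders):

library(rpart)

bag_predict <- function(train_df, test_df, n_trees = 100) {
  preds <- matrix(NA, nrow = nrow(test_df), ncol = n_trees)
  for (b in seq_len(n_trees)) {
    # 1. bootstrap the training data
    boot_idx <- sample(nrow(train_df), replace = TRUE)
    # 2. fit an overgrown (unpruned) tree to the resample
    fit <- rpart(y ~ ., data = train_df[boot_idx, ],
                 control = rpart.control(cp = 0, minsplit = 2))
    preds[, b] <- predict(fit, newdata = test_df)
  }
  # 3. average the predictions across all trees
  rowMeans(preds)
}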


Bootstrap Aggregating: wisdom of the crowd

As we add more trees...

our average prediction error decreases

Wisdom of the crowd in action


However, a problem remains

Bagging results in tree correlation...

which prevents bagging from optimally reducing the variance of the predictions


Random Forests


Image credit: unsplash

Idea

Split-variable randomization

  • Follow a similar bagging process but...

Bagging produces many correlated trees


Idea

Split-variable randomization

  • Follow a similar bagging process but...

  • each time a split is to be performed, the search for the split variable is limited to a random subset of m of the p variables

    • regression trees: m = p/3
    • classification trees: m = √p
    • m is commonly referred to as mtry
      • Bagging introduces randomness into the rows of the data

      • Random forest introduces randomness into the rows and columns of the data

Random Forests produce many unique trees


Bagging vs Random Forest

Split-variable randomization

  • Follow a similar bagging process but...

  • each time a split is to be performed, the search for the split variable is limited to a random subset of m of the p variables

  • Bagging introduces randomness into the rows of the data

  • Random forest introduces randomness into the rows and columns of the data

Combined, this provides a more diverse set of trees that almost always lowers our prediction error.
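
A minimal sketch of this difference with ranger: bagging is simply a random forest whose mtry equals the number of predictors (this assumes the ames_train data set created in the Implementation section later in the deck):

library(ranger)

p <- ncol(ames_train) - 1

# bagging: all p variables are split candidates at every node
fit_bag <- ranger(Sale_Price ~ ., data = ames_train, mtry = p,
                  respect.unordered.factors = "order", seed = 123)

# random forest: only a random subset of m = p/3 variables per split
fit_rf <- ranger(Sale_Price ~ ., data = ames_train, mtry = floor(p / 3),
                 respect.unordered.factors = "order", seed = 123)

# compare OOB RMSE
sqrt(c(bagging = fit_bag$prediction.error, rf = fit_rf$prediction.error))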


Out-of-bag

  • For large enough N, on average, 63.21% (≈ 1 − 1/e) of the original records end up in any bootstrap sample

  • Roughly 36.79% of the observations are not used in the construction of a particular tree

  • These observations are considered out-of-bag (OOB) and can be used for efficient assessment of model performance (unstructured, but free, cross-validation)

Pro tip:

  • When N is small, OOB is less reliable than validation
  • As N increases, OOB is far more efficient than k-fold CV
  • When the number of trees is about 3x the number needed for the random forest to stabilize, the OOB error estimate is approximately equivalent to the leave-one-out cross-validation error.
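
A quick empirical check of the 63.21% figure, which is the limit 1 − (1 − 1/N)^N → 1 − 1/e ≈ 0.632:

# share of the original records that appear in a single bootstrap sample
set.seed(123)
N <- 100000
mean(seq_len(N) %in% sample(N, replace = TRUE))
## ~ 0.632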


Tuning

Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance.

  • Number of trees
  • mtry

    • Typically have the largest impact on predictive accuracy.

  • Node size
  • Sampling scheme

    • Tend to have a marginal impact on predictive accuracy but are still worth exploring. Can also increase computational efficiency.

  • Split rule

    • Generally used to increase computational efficiency


Tuning

Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance.

  • Number of trees (a)

    • Why: stabilize the error
    • Rule of thumb: start with p × 10 trees and adjust as necessary
    • Caveats:
      • smaller mtry and sample size values and/or larger node size values result in less correlated trees, therefore requiring more trees to converge
      • more trees provide more robust/stable error & variable importance measures
    • Impact on computation time: increases linearly with the number of trees

(a) Technically, the number of trees is not a real tuning parameter, but it is important to have a sufficient number of trees for the error estimate to stabilize.


Tuning

Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance.

  • mtry

    • Why: balance low tree correlation with reasonable predictive strength
    • Rule of thumb:
      • Regression default: p/3
      • Classification default: √p
      • start with 5 values evenly spaced across the range from 2 to p (include the default)
    • Caveats:
      • few relevant predictors: higher mtry values tend to perform better (more likely to select the relevant features)
      • many relevant predictors: lower mtry values tend to perform better
    • Impact on computation time: increases approximately linearly with higher mtry values.


Tuning

Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance.

  • Node size

    • Why: balance tree complexity
    • Rule of thumb:
      • Regression default: 5
      • Classification default: 1
      • start with 3 values (1, 5, 10)
    • Caveats:
      • many noisy predictors: a larger node size may improve performance
      • if higher mtry values are performing best: try increasing node size
    • Impact on computation time: increases approximately exponentially as node size decreases.
      • for very large data sets: increase node size to decrease compute time


Tuning

Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance.

  • Node size / Required split size / Max number of nodes / Max depth

    • Alternative parameters exist that can control tree complexity; however, most preferred random forest packages (ranger, H2O) focus on node size.
    • See (Probst et al., 2018) for short discussion.

Tuning

Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance.

  • Sampling scheme

    • Why: balance low tree correlation with reasonable predictive strength
    • Rule of thumb:
      • default value is 100% with replacement
      • assess 3-4 values ranging from 25%-100%
    • Caveats:
      • if you have dominating features: decrease the sample size to minimize tree correlation
      • if you have many categorical features with varying number of levels: try sampling without replacement
    • Impact on computation time:
      • for very large data sets: decrease the sample size to decrease compute time


Tuning

Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance.

  • Split rule

    • Why: Balance tree correlation and run time
    • Rule of thumb:
      • Regression default: variance
      • Classification default: Gini / cross-entropy
    • Caveats:
      • Default split rules favor variables with many possible splits (continuous & categorical w/many levels)
      • Try extra random tree splitting (Geurts et al., 2006) if:
        • many categorical variables with few levels
        • need to reduce run time
    • Impact on computation time: Completely random split rule minimizes compute time since optimal split is not assessed; splits are made at random

Variable Importance

We have two approaches for model specific variable importance with random forests:

Impurity

  1. At each split in each tree, compute the improvement in the split-criterion
  2. Average the improvement made by each variable across all the trees in which the variable is used
  3. The variables with the largest average decrease in the split criterion (e.g., MSE for regression) are considered most important.


Notes:

  • more trees lead to more stable vi estimates
  • smaller mtry values lead to more equal vi estimates across all variables
  • bias towards variables with many categories or numeric values

Permutation

  1. For each tree, the OOB sample is passed down the tree and the prediction accuracy is recorded.
  2. Then the values for each variable (one at a time) are randomly permuted and the accuracy is again computed.
  3. The decrease in accuracy that results from randomly “shaking up” the variable values is averaged over all the trees for each variable.
  4. The variables with the largest average decrease in accuracy are considered most important.

Notes:

  • more trees lead to more stable vi estimates
  • smaller mtry values lead to more equal vi estimates across all variables
  • categorical variables with many levels can have high variance vi estimates
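
A minimal sketch of computing both measures with ranger (this assumes the ames_train data set created in the Implementation section later in the deck):

library(ranger)

fit_imp  <- ranger(Sale_Price ~ ., data = ames_train, importance = "impurity",
                   respect.unordered.factors = "order", seed = 123)
fit_perm <- ranger(Sale_Price ~ ., data = ames_train, importance = "permutation",
                   respect.unordered.factors = "order", seed = 123)

# top variables under each measure
head(sort(importance(fit_imp),  decreasing = TRUE))
head(sort(importance(fit_perm), decreasing = TRUE))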

Variable Importance

The two tend to produce similar results but with slight differences in rank order:

Impurity

Permutation















Implementation


Image credit: unsplash

Prereqs

Random Forest

  • h2o: n>>p
  • ranger: p>>n     (for this presentation I will demo ranger)
# general EDA
library(dplyr)
library(ggplot2)
# machine learning
library(ranger)
library(h2o)
library(rsample) # data splitting
library(vip) # visualize feature importance
library(pdp) # visualize feature effects

Data

# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
set.seed(8451)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7, strata = "Sale_Price")
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

Initial Implementation - training

  • formula: formula specification

  • data: training data

  • num.trees: number of trees in the forest

  • mtry: the number of randomly selected predictor variables considered at each split. The default is floor(sqrt(number of features)); however, for regression problems the preferred starting value is floor(number of features / 3) = floor(80 / 3) = 26

  • respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to "order" (See ESL, section 9.2.4 for details).

  • seed: because this is a random algorithm, you will set the seed to get reproducible results

# number of features
features <- setdiff(names(ames_train), "Sale_Price")

# perform basic random forest model
fit_default <- ranger(
  formula = Sale_Price ~ .,
  data = ames_train,
  num.trees = length(features) * 10,
  mtry = floor(length(features) / 3),
  respect.unordered.factors = 'order',
  verbose = FALSE,
  seed = 123
)

Initial Implementation - results

# look at results
fit_default
## Ranger result
##
## Call:
## ranger(formula = Sale_Price ~ ., data = ames_train, num.trees = length(features) * 10, mtry = floor(length(features)/3), respect.unordered.factors = "order", verbose = FALSE, seed = 123)
##
## Type: Regression
## Number of trees: 800
## Sample size: 2054
## Number of independent variables: 80
## Mtry: 26
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 620208087
## R squared (OOB): 0.8957654
# compute RMSE (RMSE = square root of MSE)
sqrt(fit_default$prediction.error)
## [1] 24903.98

Default results are based on OOB errors
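
For comparison, a hedged sketch of scoring the held-out test set (not shown on the original slide), using the ames_test split created earlier:

# test-set RMSE for the default fit
pred_test <- predict(fit_default, data = ames_test)
sqrt(mean((pred_test$predictions - ames_test$Sale_Price)^2))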


Characteristics to Consider

What we do next should be driven by attributes of our data:

  • Half our variables are numeric
  • Half are categorical variables with moderate number of levels
  • Likely will favor variance split rule
  • May benefit from sampling w/o replacement
library(tidyr) # needed for gather()
ames_train %>%
  summarise_if(is.factor, n_distinct) %>%
  gather() %>%
  arrange(desc(value))
## # A tibble: 46 x 2
## key value
## <chr> <int>
## 1 Neighborhood 27
## 2 Exterior_1st 16
## 3 Exterior_2nd 16
## 4 MS_SubClass 15
## 5 Overall_Qual 10
## 6 Sale_Type 10
## 7 Condition_1 9
## 8 Overall_Cond 9
## 9 House_Style 8
## 10 Functional 8
## # ... with 36 more rows
  • We have highly correlated data (both between features and with the target)
  • May favor a lower mtry and a lower node size to help decorrelate the trees

cor_matrix <- ames_train %>%
  mutate_if(is.factor, as.numeric) %>%
  cor()

# feature correlation
data_frame(
  row = rownames(cor_matrix)[row(cor_matrix)[upper.tri(cor_matrix)]],
  col = colnames(cor_matrix)[col(cor_matrix)[upper.tri(cor_matrix)]],
  corr = cor_matrix[upper.tri(cor_matrix)]
) %>%
  arrange(desc(abs(corr)))
## # A tibble: 3,240 x 3
## row col corr
## <chr> <chr> <dbl>
## 1 BsmtFin_Type_1 BsmtFin_SF_1 1
## 2 Garage_Cars Garage_Area 0.888
## 3 Exterior_1st Exterior_2nd 0.856
## 4 Gr_Liv_Area TotRms_AbvGrd 0.802
## 5 Overall_Qual Sale_Price 0.800
## 6 Total_Bsmt_SF First_Flr_SF 0.789
## 7 MS_SubClass Bldg_Type 0.719
## 8 House_Style Second_Flr_SF 0.713
## 9 BsmtFin_Type_2 BsmtFin_SF_2 -0.702
## 10 Gr_Liv_Area Sale_Price 0.694
## # ... with 3,230 more rows
# target correlation
data_frame(
  row = rownames(cor_matrix)[row(cor_matrix)[upper.tri(cor_matrix)]],
  col = colnames(cor_matrix)[col(cor_matrix)[upper.tri(cor_matrix)]],
  corr = cor_matrix[upper.tri(cor_matrix)]
) %>%
  filter(col == "Sale_Price") %>%
  arrange(desc(abs(corr)))
## # A tibble: 78 x 3
## row col corr
## <chr> <chr> <dbl>
## 1 Overall_Qual Sale_Price 0.800
## 2 Gr_Liv_Area Sale_Price 0.694
## 3 Exter_Qual Sale_Price -0.662
## 4 Garage_Cars Sale_Price 0.655
## 5 Garage_Area Sale_Price 0.652
## 6 Total_Bsmt_SF Sale_Price 0.630
## 7 Kitchen_Qual Sale_Price -0.625
## 8 First_Flr_SF Sale_Price 0.617
## 9 Bsmt_Qual Sale_Price -0.575
## 10 Year_Built Sale_Price 0.571
## # ... with 68 more rows

Tuning

But before we tune, do we have enough trees?

  • Some pkgs provide OOB error for each tree
  • ranger only provides overall OOB
# number of features
n_features <- ncol(ames_train) - 1

# tuning grid
tuning_grid <- expand.grid(
  trees = seq(10, 1000, by = 20),
  rmse = NA
)

for(i in seq_len(nrow(tuning_grid))) {
  fit <- ranger(
    formula = Sale_Price ~ .,
    data = ames_train,
    num.trees = tuning_grid$trees[i],
    mtry = floor(n_features / 3),
    respect.unordered.factors = 'order',
    verbose = FALSE,
    seed = 123
  )
  tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}

  • using p × 10 = 800 trees is sufficient
  • may increase if we decrease mtry or sample size

ggplot(tuning_grid, aes(trees, rmse)) +
  geom_line(size = 1)


Tuning

Tuning grid

  • lower end of mtry range due to correlation
  • lower end of node size range due to correlation
  • sampling w/o replacement due to categorical features

hyper_grid <- expand.grid(
  mtry = floor(n_features * c(.05, .15, .25, .333, .4)),
  min.node.size = c(1, 3, 5),
  replace = c(TRUE, FALSE),
  sample.fraction = c(.5, .63, .8),
  rmse = NA
)

# number of hyperparameter combinations
nrow(hyper_grid)
## [1] 90
head(hyper_grid)
## mtry min.node.size replace sample.fraction rmse
## 1 4 1 TRUE 0.5 NA
## 2 12 1 TRUE 0.5 NA
## 3 20 1 TRUE 0.5 NA
## 4 26 1 TRUE 0.5 NA
## 5 32 1 TRUE 0.5 NA
## 6 4 3 TRUE 0.5 NA

Grid search execution

  • This search grid took ~2.5 minutes
  • caret provides grid search
  • For larger data, use H2O's random grid search with early stopping
for(i in seq_len(nrow(hyper_grid))) {
  # fit model for ith hyperparameter combination
  fit <- ranger(
    formula = Sale_Price ~ .,
    data = ames_train,
    num.trees = 1000,
    mtry = hyper_grid$mtry[i],
    min.node.size = hyper_grid$min.node.size[i],
    replace = hyper_grid$replace[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    verbose = FALSE,
    seed = 123,
    respect.unordered.factors = 'order'
  )
  # export OOB error
  hyper_grid$rmse[i] <- sqrt(fit$prediction.error)
}

Tuning results

Our top 10 models:

  • have ~1% or higher performance improvement over the default model
  • sample w/o replacement
  • primarily include higher sampling
  • primarily use mtry = 20 or 26
  • node size appears non-influential

I would follow this up with an additional grid search that focuses on:

  • mtry values around 15, 18, 21, 24
  • sample fraction around 63%, 70%, 75%, 80%

Using too high a sampling fraction without replacement runs the risk of overfitting to your training data!

default_rmse <- sqrt(fit_default$prediction.error)

hyper_grid %>%
  arrange(rmse) %>%
  mutate(perc_gain = (default_rmse - rmse) / default_rmse * 100) %>%
  head(10)
## mtry min.node.size replace sample.fraction rmse perc_gain
## 1 20 1 FALSE 0.80 24474.19 1.7257766
## 2 20 5 FALSE 0.80 24485.64 1.6798126
## 3 20 3 FALSE 0.80 24555.24 1.4003421
## 4 26 3 FALSE 0.80 24612.76 1.1693799
## 5 20 1 FALSE 0.63 24613.27 1.1673219
## 6 26 1 FALSE 0.80 24615.42 1.1586911
## 7 26 5 FALSE 0.80 24617.94 1.1485760
## 8 20 3 FALSE 0.63 24642.72 1.0490463
## 9 12 1 FALSE 0.80 24659.98 0.9797534
## 10 12 3 FALSE 0.80 24702.53 0.8089133

Feature Importance

Once you find your optimal model:

  • re-run with the respective hyperparameters
  • include importance parameter
  • crank up the # of trees to ensure stable vi estimates
fit_final <- ranger(
  formula = Sale_Price ~ .,
  data = ames_train,
  num.trees = 2000,
  mtry = 20,
  min.node.size = 1,
  sample.fraction = .8,
  replace = FALSE,
  importance = 'permutation',
  respect.unordered.factors = 'order',
  verbose = FALSE,
  seed = 123
)

vip(fit_final, num_features = 15)


Feature Effects

Partial dependence plots (PDPs), Individual Conditional Expectation (ICE) curves, and other approaches allow us to understand how important variables influence our model's predictions:

PDP: Overall Home Quality

fit_final %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train)) %>%
  autoplot()

ICE: Overall Home Quality

fit_final %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train), ice = TRUE) %>%
  autoplot(alpha = 0.05, center = TRUE)


Feature Effects

Partial dependence plots (PDPs), Individual Conditional Expectation (ICE) curves, and other approaches allow us to understand how important variables influence our model's predictions:

PDP: Above Ground SqFt

fit_final %>%
  partial(pred.var = "Gr_Liv_Area", train = as.data.frame(ames_train)) %>%
  autoplot()

ICE: Above Ground SqFt

fit_final %>%
  partial(pred.var = "Gr_Liv_Area", train = as.data.frame(ames_train), ice = TRUE) %>%
  autoplot(alpha = 0.05, center = TRUE)


Feature Effects

Interaction between two influential variables:

fit_final %>%
  partial(
    pred.var = c("Gr_Liv_Area", "Year_Built"),
    train = as.data.frame(ames_train)
  ) %>%
  plotPartial(
    zlab = "Sale_Price",
    levelplot = FALSE,
    drape = TRUE,
    colorkey = FALSE,
    screen = list(z = 50, x = -60)
  )

Read more about machine learning interpretation here


Random Forest Summary

Strengths

  • Competitive performance.
  • Remarkably good "out-of-the box" (very little tuning required).
  • Built-in validation set (don't need to sacrifice data for extra validation).
  • Typically does not overfit.
  • Robust to outliers.
  • Handles missing data (imputation not required).
  • Provides automatic feature selection.
  • Minimal preprocessing required.

Weaknesses

  • Although accurate, often cannot compete with the accuracy of advanced boosting algorithms.
  • Can become slow on large data sets.
  • Less interpretable (although this is easily addressed with various tools such as variable importance, partial dependence plots, LIME, etc.).

Random Forest Summary



"Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem."

--- Leo Breiman


Learning More














Questions?
