Decision Trees
Will a customer redeem a coupon?
If Loyal Customer = Yes and Household Income >= $150K and Shopping Mode = Store, then Coupon Redemption = Yes
Most common decision tree algorithms:
Regression tree
Classification tree
Objective: Minimize dissimilarity in the terminal nodes
Say we have data generated from an underlying "truth" function:
```
## Model formula:
## y ~ x
## 
## Fitted party:
## [1] root
## |   [2] x >= 3.07863: -0.665 (n = 255, err = 95.5)
## |   [3] x < 3.07863: 0.640 (n = 245, err = 75.9)
## 
## Number of inner nodes:    1
## Number of terminal nodes: 2
```
```
## Model formula:
## y ~ x
## 
## Fitted party:
## [1] root
## |   [2] x >= 3.07863
## |   |   [3] x >= 3.65785
## |   |   |   [4] x < 5.53399: -0.948 (n = 149, err = 40.0)
## |   |   |   [5] x >= 5.53399: -0.316 (n = 60, err = 15.6)
## |   |   [6] x < 3.65785
## |   |   |   [7] x < 3.20455: -0.476 (n = 10, err = 0.9)
## |   |   |   [8] x >= 3.20455: -0.130 (n = 36, err = 9.0)
## |   [9] x < 3.07863
## |   |   [10] x < 0.52255
## |   |   |   [11] x < 0.28331: 0.142 (n = 23, err = 4.8)
## |   |   |   [12] x >= 0.28331: 0.390 (n = 19, err = 5.1)
## |   |   [13] x >= 0.52255
## |   |   |   [14] x >= 2.26018: 0.440 (n = 65, err = 13.7)
## |   |   |   [15] x < 2.26018: 0.852 (n = 138, err = 36.6)
## 
## Number of inner nodes:    7
## Number of terminal nodes: 8
```
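The slides do not show the data-generating or model-fitting code, but here is a minimal sketch of how trees like the ones printed above could be produced; the sine-shaped "truth" function, noise level, and seed are assumptions for illustration.

```r
library(rpart)     # CART-style regression trees
library(partykit)  # prints trees in the "party" format shown above

set.seed(123)
n  <- 500                           # 500 observations, matching n = 255 + 245 above
x  <- runif(n, 0, 6)
y  <- sin(x) + rnorm(n, sd = 0.3)   # assumed noisy sine "truth" function
df <- data.frame(x, y)

# shallow tree: a single split (two terminal nodes)
fit_shallow <- rpart(y ~ x, data = df, maxdepth = 1)

# deeper tree: up to three levels of splits (up to eight terminal nodes)
fit_deep <- rpart(y ~ x, data = df, maxdepth = 3, cp = 0)

as.party(fit_shallow)   # print in the format shown above
as.party(fit_deep)
```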
Must balance the depth and complexity of the tree to generalize to unseen data
2 main options:
Early stopping
Pruning
Trees have a tendency to overfit
Limit tree depth: Stop splitting after a certain depth
Minimum node "size": Do not split an intermediate node that contains too few data points
Deep trees overfit
Grow a very large tree
Prune it back using a cost complexity penalty: the cost complexity parameter (α) times the number of terminal nodes (|T|). The optimal subtree solves:

minimize: loss function + α|T|
Penalize depth to generalize
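A minimal sketch of the grow-then-prune workflow with rpart, reusing the simulated df from the sketch above (rpart calls the cost complexity parameter α "cp"):

```r
library(rpart)

# 1. grow a very large tree: cp = 0 applies no complexity penalty while growing
fit_big <- rpart(y ~ x, data = df, cp = 0, minsplit = 2)

# 2. inspect the cross-validated error for each value of cp (alpha)
printcp(fit_big)

# 3. prune back to the subtree whose cp minimizes the cross-validated error
best_cp    <- fit_big$cptable[which.min(fit_big$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_big, cp = best_cp)
```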
Monotonic transformations (e.g., log, exp, sqrt): not required to meet algorithm assumptions as in many parametric models; they only shift the optimal split points (see the sketch after this list).
Removing outliers: unnecessary, since each decision is a single binary split and outliers will not bias that split.
One-hot encoding: unnecessary, and it actually forces artificial relationships between categorical levels. Also, by increasing p, we reduce the probability that influential levels and variable interactions will be identified.
Missing values: unnecessary, as most algorithms will 1) create a new "missing" class for categorical variables, 2) auto-impute for continuous variables, or 3) use surrogate splits.
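As a quick illustration of the first point (monotonic transformations only shift the split points), here is a small sketch; the simulated data are an assumption.

```r
library(rpart)

set.seed(123)
x <- runif(200, 1, 100)
y <- log(x) + rnorm(200, sd = 0.1)

fit_raw <- rpart(y ~ x, data = data.frame(x = x,      y = y))
fit_log <- rpart(y ~ x, data = data.frame(x = log(x), y = y))

# the split points differ (they live on different scales)...
fit_raw$splits[1, "index"]
fit_log$splits[1, "index"]

# ...but the resulting partitions, and therefore the fitted values, should be identical
all.equal(predict(fit_raw), predict(fit_log))
```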
Once we have a final model, we can find the most influential variables based on those that have the largest reduction in our loss function:
```
##    Variable Importance
## 1        rm 23825.9224
## 2     lstat 15047.9426
## 3       dis  5385.2076
## 4     indus  5313.9748
## 5       tax  4205.2067
## 6   ptratio  4202.2984
## 7       nox  4166.1230
## 8       age  3969.2913
## 9      crim  2753.2843
## 10       zn  1604.5566
## 11      rad  1007.6588
## 12    black   408.1277
```
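A sketch of how a table like the one above could be produced; judging by the variable names (rm, lstat, crim, ...), the underlying data are assumed to be the Boston housing data from MASS.

```r
library(rpart)
library(MASS)

# regression tree for median home value
fit_boston <- rpart(medv ~ ., data = Boston)

# total reduction in SSE attributable to each variable, summed over its splits
vi <- sort(fit_boston$variable.importance, decreasing = TRUE)
data.frame(Variable = names(vi), Importance = unname(vi))
```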
Small trees are easy to interpret
Trees scale well to large N (fast!!)
Can handle data of all types (i.e., requires little, if any, preprocessing)
Automatic variable selection
Can handle missing data
Completely nonparametric
Large trees can be difficult to interpret
All splits depend on previous splits (good at capturing interactions, but poor at representing additive structure)
Tree predictions are step functions (a consequence of the binary splits)
Single trees typically have poor predictive accuracy
Single trees have high variance (easy to overfit to training data)
Bagging
Single pruned trees are poor predictors
Single deep trees are noisy
Bagging uses this high variance to our advantage
Sample records with replacement (aka "bootstrap" the training data)
Fit an overgrown tree to each resampled data set
Average predictions
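To make the three steps concrete, here is a minimal hand-rolled bagging sketch with rpart. The ames_train / ames_test objects are borrowed from the implementation section later in the deck; any regression data set would do.

```r
library(rpart)

set.seed(123)
n_trees <- 100
preds   <- matrix(NA, nrow = nrow(ames_test), ncol = n_trees)

for (b in seq_len(n_trees)) {
  # 1. bootstrap the training data
  boot_idx <- sample(nrow(ames_train), replace = TRUE)

  # 2. fit an overgrown (unpruned) tree to the resampled data
  fit_b <- rpart(Sale_Price ~ ., data = ames_train[boot_idx, ], cp = 0)

  # 3. store this tree's predictions
  preds[, b] <- predict(fit_b, newdata = ames_test)
}

# average the predictions across all bootstrapped trees
bagged_pred <- rowMeans(preds)
sqrt(mean((ames_test$Sale_Price - bagged_pred)^2))   # bagged test RMSE
```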
As we add more trees...
our average prediction error reduces
Wisdom of the crowd in action
Bagging results in tree correlation...
which prevents bagging from optimally reducing the variance of the predictions
Random Forests
Bagging produces many correlated trees
Follow a similar bagging process but...
each time a split is to be performed, the search for the split variable is limited to a random subset of m of the p variables
Bagging introduces randomness into the rows of the data
Random forest introduces randomness into the rows and columns of the data
Random Forests produce many unique trees
Combined, this provides a more diverse set of trees that almost always lowers our prediction error.
For large enough N, on average, 63.21% of the original records end up in any bootstrap sample
Roughly 36.79% of the observations are not used in the construction of a particular tree
These observations are considered out-of-bag (OOB) and can be used for efficient assessment of model performance (unstructured, but free, cross-validation)
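A quick back-of-the-envelope check of the 63.21% figure: the chance a given record appears in a bootstrap sample of size N is 1 - (1 - 1/N)^N, which approaches 1 - e^(-1) ≈ 0.6321 as N grows.

```r
N       <- c(10, 100, 1000, 1e6)
in_bag  <- 1 - (1 - 1/N)^N    # expected fraction of records in the bootstrap sample
out_bag <- 1 - in_bag         # expected out-of-bag (OOB) fraction
round(data.frame(N, in_bag, out_bag), 4)

1 - exp(-1)   # limiting value: 0.6321206
```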
Pro tip:
Random forests provide good "out-of-the-box" performance, but there are a few parameters we can tune to increase performance:
Number of trees (a)
mtry
Node size (related: required split size, max number of nodes, max depth)
Sampling scheme
Split rule

The number of trees and mtry typically have the largest impact on predictive accuracy. Node size and the sampling scheme tend to have marginal impact on predictive accuracy but are still worth exploring; they can also increase computational efficiency. The split rule is generally used to increase computational efficiency.

(a) Technically, the number of trees is not a real tuning parameter, but it is important to grow a sufficient number of trees for the error estimate to stabilize.
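As a rough guide (not from the original slides), here is how these knobs map onto ranger() arguments; the values are illustrative starting points only, and ames_train refers to the Ames training data created in the implementation section below.

```r
library(ranger)

fit <- ranger(
  formula         = Sale_Price ~ .,
  data            = ames_train,
  num.trees       = 500,           # number of trees
  mtry            = 26,            # variables considered at each split
  min.node.size   = 5,             # node size (tree complexity)
  sample.fraction = 1,             # sampling scheme: fraction of rows per tree...
  replace         = TRUE,          # ...with replacement (the classic bootstrap)
  splitrule       = "variance",    # split rule ("variance", "extratrees", "maxstat")
  seed            = 123
)
```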
We have two approaches for model specific variable importance with random forests:
Impurity: for each variable, sum the reduction in the loss function (e.g., SSE or Gini impurity) attributable to its splits, accumulated across all trees in the forest.
Permutation: for each variable, randomly permute (shuffle) its values and measure how much the OOB prediction error increases; the bigger the increase, the more important the variable.
The two tend to produce similar results but with slight differences in rank order:
(Two variable importance plots: impurity-based vs. permutation-based)
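A minimal sketch of computing both measures with ranger, assuming the ames_train data from the implementation section; vip() can plot either result.

```r
library(ranger)
library(vip)

fit_imp <- ranger(
  Sale_Price ~ ., data = ames_train,
  importance = "impurity", seed = 123
)

fit_perm <- ranger(
  Sale_Price ~ ., data = ames_train,
  importance = "permutation", seed = 123
)

# compare the top-10 rankings
head(sort(fit_imp$variable.importance,  decreasing = TRUE), 10)
head(sort(fit_perm$variable.importance, decreasing = TRUE), 10)

# or plot them
vip(fit_imp,  num_features = 10)
vip(fit_perm, num_features = 10)
```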
Implementation
Random Forest
```r
# general EDA
library(dplyr)
library(ggplot2)

# machine learning
library(ranger)
library(h2o)

library(rsample)  # data splitting
library(vip)      # visualize feature importance
library(pdp)      # visualize feature effects
```
Data
```r
# Create training (70%) and test (30%) sets for the AmesHousing::make_ames() data.
# Use set.seed for reproducibility
set.seed(8451)
ames_split <- initial_split(AmesHousing::make_ames(), prop = .7, strata = "Sale_Price")
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)
```
Key ranger() arguments for our first model:

formula: formula specification
data: training data
num.trees: number of trees in the forest
mtry: number of randomly selected predictor variables considered at each split. The default is floor(√(number of features)); however, for regression problems a better starting point is floor(number of features / 3) = floor(80 / 3) = 26
respect.unordered.factors: specifies how to treat unordered factor variables. We recommend setting this to "order" (see ESL, Section 9.2.4 for details).
seed: because this is a random algorithm, set the seed to get reproducible results
```r
# number of features
features <- setdiff(names(ames_train), "Sale_Price")

# perform basic random forest model
fit_default <- ranger(
  formula   = Sale_Price ~ ., 
  data      = ames_train, 
  num.trees = length(features) * 10,
  mtry      = floor(length(features) / 3),
  respect.unordered.factors = 'order',
  verbose   = FALSE,
  seed      = 123
)
```
```r
# look at results
fit_default
## Ranger result
## 
## Call:
##  ranger(formula = Sale_Price ~ ., data = ames_train, num.trees = length(features) * 10, mtry = floor(length(features)/3), respect.unordered.factors = "order", verbose = FALSE, seed = 123) 
## 
## Type:                             Regression 
## Number of trees:                  800 
## Sample size:                      2054 
## Number of independent variables:  80 
## Mtry:                             26 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       620208087 
## R squared (OOB):                  0.8957654

# compute RMSE (RMSE = square root of MSE)
sqrt(fit_default$prediction.error)
## [1] 24903.98
```
Default results are based on OOB errors
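As a sanity check (a sketch, not shown in the slides), the OOB RMSE can be compared against a held-out test RMSE using the ames_test split created earlier:

```r
pred_test <- predict(fit_default, data = ames_test)$predictions

sqrt(mean((ames_test$Sale_Price - pred_test)^2))   # test RMSE
sqrt(fit_default$prediction.error)                 # OOB RMSE
```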
What we do next should be driven by attributes of our data:
```r
ames_train %>%
  summarise_if(is.factor, n_distinct) %>%
  gather() %>%
  arrange(desc(value))
## # A tibble: 46 x 2
##    key          value
##    <chr>        <int>
##  1 Neighborhood    27
##  2 Exterior_1st    16
##  3 Exterior_2nd    16
##  4 MS_SubClass     15
##  5 Overall_Qual    10
##  6 Sale_Type       10
##  7 Condition_1      9
##  8 Overall_Cond     9
##  9 House_Style      8
## 10 Functional       8
## # ... with 36 more rows
```
```r
cor_matrix <- ames_train %>%
  mutate_if(is.factor, as.numeric) %>%
  cor()

# feature correlation
data_frame(
  row  = rownames(cor_matrix)[row(cor_matrix)[upper.tri(cor_matrix)]],
  col  = colnames(cor_matrix)[col(cor_matrix)[upper.tri(cor_matrix)]],
  corr = cor_matrix[upper.tri(cor_matrix)]
  ) %>%
  arrange(desc(abs(corr)))
## # A tibble: 3,240 x 3
##    row            col            corr
##    <chr>          <chr>         <dbl>
##  1 BsmtFin_Type_1 BsmtFin_SF_1   1   
##  2 Garage_Cars    Garage_Area    0.888
##  3 Exterior_1st   Exterior_2nd   0.856
##  4 Gr_Liv_Area    TotRms_AbvGrd  0.802
##  5 Overall_Qual   Sale_Price     0.800
##  6 Total_Bsmt_SF  First_Flr_SF   0.789
##  7 MS_SubClass    Bldg_Type      0.719
##  8 House_Style    Second_Flr_SF  0.713
##  9 BsmtFin_Type_2 BsmtFin_SF_2  -0.702
## 10 Gr_Liv_Area    Sale_Price     0.694
## # ... with 3,230 more rows

# target correlation
data_frame(
  row  = rownames(cor_matrix)[row(cor_matrix)[upper.tri(cor_matrix)]],
  col  = colnames(cor_matrix)[col(cor_matrix)[upper.tri(cor_matrix)]],
  corr = cor_matrix[upper.tri(cor_matrix)]
  ) %>%
  filter(col == "Sale_Price") %>%
  arrange(desc(abs(corr)))
## # A tibble: 78 x 3
##    row           col          corr
##    <chr>         <chr>       <dbl>
##  1 Overall_Qual  Sale_Price  0.800
##  2 Gr_Liv_Area   Sale_Price  0.694
##  3 Exter_Qual    Sale_Price -0.662
##  4 Garage_Cars   Sale_Price  0.655
##  5 Garage_Area   Sale_Price  0.652
##  6 Total_Bsmt_SF Sale_Price  0.630
##  7 Kitchen_Qual  Sale_Price -0.625
##  8 First_Flr_SF  Sale_Price  0.617
##  9 Bsmt_Qual     Sale_Price -0.575
## 10 Year_Built    Sale_Price  0.571
## # ... with 68 more rows
```
But before we tune, do we have enough trees?
```r
# number of features
n_features <- ncol(ames_train) - 1

# tuning grid
tuning_grid <- expand.grid(
  trees = seq(10, 1000, by = 20),
  rmse  = NA
)

for(i in seq_len(nrow(tuning_grid))) {
  fit <- ranger(
    formula   = Sale_Price ~ ., 
    data      = ames_train, 
    num.trees = tuning_grid$trees[i],
    mtry      = floor(n_features / 3),
    respect.unordered.factors = 'order',
    verbose   = FALSE,
    seed      = 123
  )
  
  tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}
```
```r
ggplot(tuning_grid, aes(trees, rmse)) +
  geom_line(size = 1)
```
Tuning grid
```r
hyper_grid <- expand.grid(
  mtry            = floor(n_features * c(.05, .15, .25, .333, .4)),
  min.node.size   = c(1, 3, 5),
  replace         = c(TRUE, FALSE),
  sample.fraction = c(.5, .63, .8),
  rmse            = NA
)

# number of hyperparameter combinations
nrow(hyper_grid)
## [1] 90

head(hyper_grid)
##   mtry min.node.size replace sample.fraction rmse
## 1    4             1    TRUE             0.5   NA
## 2   12             1    TRUE             0.5   NA
## 3   20             1    TRUE             0.5   NA
## 4   26             1    TRUE             0.5   NA
## 5   32             1    TRUE             0.5   NA
## 6    4             3    TRUE             0.5   NA
```
Grid search execution
```r
for(i in seq_len(nrow(hyper_grid))) {
  # fit model for ith hyperparameter combination
  fit <- ranger(
    formula         = Sale_Price ~ ., 
    data            = ames_train, 
    num.trees       = 1000,
    mtry            = hyper_grid$mtry[i],
    min.node.size   = hyper_grid$min.node.size[i],
    replace         = hyper_grid$replace[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    verbose         = FALSE,
    seed            = 123,
    respect.unordered.factors = 'order'
  )
  
  # export OOB error
  hyper_grid$rmse[i] <- sqrt(fit$prediction.error)
}
```
Our top 10 models:

```r
default_rmse <- sqrt(fit_default$prediction.error)

hyper_grid %>%
  arrange(rmse) %>%
  mutate(perc_gain = (default_rmse - rmse) / default_rmse * 100) %>%
  head(10)
##    mtry min.node.size replace sample.fraction     rmse perc_gain
## 1    20             1   FALSE            0.80 24474.19 1.7257766
## 2    20             5   FALSE            0.80 24485.64 1.6798126
## 3    20             3   FALSE            0.80 24555.24 1.4003421
## 4    26             3   FALSE            0.80 24612.76 1.1693799
## 5    20             1   FALSE            0.63 24613.27 1.1673219
## 6    26             1   FALSE            0.80 24615.42 1.1586911
## 7    26             5   FALSE            0.80 24617.94 1.1485760
## 8    20             3   FALSE            0.63 24642.72 1.0490463
## 9    12             1   FALSE            0.80 24659.98 0.9797534
## 10   12             3   FALSE            0.80 24702.53 0.8089133
```

I would follow this up with an additional grid search that focuses on the best-performing region above (mtry around 20, sampling without replacement, larger sample fractions). Caution: using too high of a sampling fraction without replacement runs the risk of overfitting to your training data!
Once you find your optimal model, re-run it with the importance parameter set so variable importance can be extracted:

```r
fit_final <- ranger(
  formula         = Sale_Price ~ ., 
  data            = ames_train, 
  num.trees       = 2000,
  mtry            = 20,
  min.node.size   = 1,
  sample.fraction = .8,
  replace         = FALSE,
  importance      = 'permutation',
  respect.unordered.factors = 'order',
  verbose         = FALSE,
  seed            = 123
)
```
```r
vip(fit_final, num_features = 15)
```
Partial dependence plots (PDPs), Individual Conditional Expectation (ICE) curves, and other approaches allow us to understand how important variables influence our model's predictions:
PDP: Overall Home Quality
```r
fit_final %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train)) %>%
  autoplot()
```
ICE: Overall Home Quality
```r
fit_final %>%
  partial(pred.var = "Overall_Qual", train = as.data.frame(ames_train), ice = TRUE) %>%
  autoplot(alpha = 0.05, center = TRUE)
```
PDP: Above Ground SqFt
```r
fit_final %>%
  partial(pred.var = "Gr_Liv_Area", train = as.data.frame(ames_train)) %>%
  autoplot()
```
ICE: Above Ground SqFt
```r
fit_final %>%
  partial(pred.var = "Gr_Liv_Area", train = as.data.frame(ames_train), ice = TRUE) %>%
  autoplot(alpha = 0.05, center = TRUE)
```
Interaction between two influential variables:
```r
fit_final %>%
  partial(
    pred.var = c("Gr_Liv_Area", "Year_Built"),
    train = as.data.frame(ames_train)
    ) %>%
  plotPartial(
    zlab = "Sale_Price",
    levelplot = FALSE,
    drape = TRUE,
    colorkey = FALSE,
    screen = list(z = 50, x = -60)
    )
```
Read more about machine learning interpretation here
"Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem."
--- Leo Breiman
Questions?