6 Lesson 2b: Model evaluation & selection
The last couple of lessons gave you a good introduction to building predictive models using the tidymodels framework. This lesson goes deeper into model evaluation & selection. We'll discuss how to incorporate cross-validation procedures to give you a more robust assessment of model performance. We'll also discuss hyperparameter tuning, the bias-variance tradeoff, and how to implement a tuning strategy to find a model that maximizes generalizability.
6.1 Learning objectives
By the end of this lesson you will be able to:
- Perform cross-validation procedures for more robust model performance assessment.
- Execute hyperparameter tuning to find optimal model parameter settings.
6.2 Prerequisites
For this lesson we'll use several packages provided via tidymodels along with the Ames housing data. However, rather than the CSV file used previously, this time we'll use the Ames housing data provided by the AmesHousing package.
Take a minute to check out the Ames data from AmesHousing::make_ames(). It is very similar to the Ames data provided in the CSV. Note that the various quality/condition variables (e.g., Overall_Qual) are already encoded as ordinal factors.
library(tidymodels)
ames <- AmesHousing::make_ames()
Let’s go ahead and create our train-test split:
# create train/test split
set.seed(123) # for reproducibility
split <- initial_split(ames, prop = 0.7)
train <- training(split)
test  <- testing(split)
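If you'd like a quick sanity check on the split (an optional aside, not part of the original workflow), you can confirm that roughly 70% of the rows landed in the training set:
# roughly 70% of the 2,930 rows should land in train, the rest in test
dim(train)
dim(test)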
6.3 Resampling & cross-validation
In the previous lessons we split our data into training and testing sets and we assessed the performance of our model on the test set. Unfortunately, there are a few pitfalls to this approach:
- If our dataset is small, a single test set may not provide realistic expectations of our model’s performance on unseen data.
- A single test set does not provide us any insight on variability of our model’s performance.
- Using our test set to drive our model building process can bias our results via data leakage.
Resampling methods provide an alternative approach by allowing us to repeatedly fit a model of interest to parts of the training data and test its performance on other parts of the training data.
Figure 6.1: Illustration of resampling.
This allows us to train and validate our model entirely on the training data and not touch the test data until we have selected a final “optimal” model.
The two most commonly used resampling methods are k-fold cross-validation and bootstrap sampling. This lesson focuses on k-fold cross-validation.
6.4 K-fold cross-validation
Cross-validation consists of repeating the train/validation split procedure so that the training and validation sets are different each time. Generalization performance metrics are collected for each repetition and then aggregated. As a result, we get an estimate of the variability of the model's generalization performance.
k-fold cross-validation (aka k-fold CV) is a resampling method that randomly divides the training data into k groups (aka folds) of approximately equal size.

Figure 6.2: Illustration of k-fold sampling across a data sets index.
The model is fit on \(k-1\) folds and then the remaining fold is used to compute model performance. This procedure is repeated k times; each time, a different fold is treated as the validation set. Consequently, with k-fold CV, every observation in the training data will be held out one time to be included in the assessment/validation set. This process results in k estimates of the generalization error (say \(\epsilon_1, \epsilon_2, \dots, \epsilon_k\)). Thus, the k-fold CV estimate is computed by averaging the k test errors, providing us with an approximation of the error we might expect on unseen data.
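Written out, the k-fold CV estimate is simply the average of the per-fold errors:
\[
\overline{\epsilon} = \frac{1}{k} \sum_{i=1}^{k} \epsilon_i
\]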

Figure 6.3: Illustration of a 5-fold cross validation procedure.
In practice, one typically uses k = 5 or k = 10. There is no formal rule as to the size of k; however, as k gets larger, the difference between the estimated performance and the true performance we would expect on unseen data decreases (at the cost of additional computation, since more models must be fit).
To implement k-fold CV we first make a resampling object. In this example we create a 10-fold resampling object.
kfolds <- vfold_cv(train, v = 10)
We can now create our random forest model object and a workflow object as we did in the previous lesson. To fit our model across our 10 folds we just use fit_resamples().
# create our random forest model object
rf_mod <- rand_forest() %>%
  set_mode('regression')

# add model object and our formula spec to a workflow object
rf_wflow <- workflow() %>%
  add_model(rf_mod) %>%
  add_formula(Sale_Price ~ .)

# fit our model across the 10-fold CV
rf_fit_cv <- rf_wflow %>%
  fit_resamples(kfolds)
We can then get our average 10-fold cross-validation error with collect_metrics():
collect_metrics(rf_fit_cv)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 25978. 10 898. Preprocessor1_Model1
## 2 rsq standard 0.902 10 0.00787 Preprocessor1_Model1
If we want to see the model evaluation metric (i.e. RMSE) for each fold we just need to unnest the rf_fit_cv object.
We have not discussed nested data frames, but you can read about them here.
rf_fit_cv %>%
  unnest(.metrics) %>%
  filter(.metric == 'rmse')
## # A tibble: 10 × 7
## splits id .metric .estimator .estimate .config .notes
## <list> <chr> <chr> <chr> <dbl> <chr> <list>
## 1 <split [1845/206]> Fold01 rmse standard 26210. Preprocessor1_Model1 <tibble>
## 2 <split [1846/205]> Fold02 rmse standard 22089. Preprocessor1_Model1 <tibble>
## 3 <split [1846/205]> Fold03 rmse standard 30512. Preprocessor1_Model1 <tibble>
## 4 <split [1846/205]> Fold04 rmse standard 26106. Preprocessor1_Model1 <tibble>
## 5 <split [1846/205]> Fold05 rmse standard 25095. Preprocessor1_Model1 <tibble>
## 6 <split [1846/205]> Fold06 rmse standard 24337. Preprocessor1_Model1 <tibble>
## 7 <split [1846/205]> Fold07 rmse standard 23586. Preprocessor1_Model1 <tibble>
## 8 <split [1846/205]> Fold08 rmse standard 30650. Preprocessor1_Model1 <tibble>
## 9 <split [1846/205]> Fold09 rmse standard 27256. Preprocessor1_Model1 <tibble>
## 10 <split [1846/205]> Fold10 rmse standard 23935. Preprocessor1_Model1 <tibble>
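If you want a visual sense of this fold-to-fold variability, one option (a small sketch, not part of the original lesson) is to plot the per-fold RMSE values; ggplot2 is already attached by tidymodels:
# plot per-fold RMSE to visualize the variability across the 10 folds
rf_fit_cv %>%
  unnest(.metrics) %>%
  filter(.metric == 'rmse') %>%
  ggplot(aes(x = id, y = .estimate)) +
  geom_point(size = 2) +
  labs(x = NULL, y = 'RMSE')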
6.5 Hyperparameter tuning
Say you have the below relationship between a response variable and some predictor variable x
(gray dots). Given two different models fit to this relationship (blue line), which model do you prefer?

Figure 6.4: Between model A and B, which do you think is better?
The image above illustrates the fact that prediction errors can be decomposed into two main subcomponents we care about:
- error due to “bias”
- error due to “variance”
Understanding how different sources of error lead to bias and variance helps us improve the data fitting process resulting in more accurate models.
6.5.1 Bias
Error due to bias is the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. It measures how far off in general a model’s predictions are from the correct value, which provides a sense of how well a model can conform to the underlying structure of the data.
The left image below illustrates an example where a polynomial model does not capture the underlying relationship well. Linear models are classical examples of high bias models as they are less flexible and rarely capture non-linear, non-monotonic relationships.
We also need to think about bias-variance in relation to resampling. Models with high bias are rarely affected by the noise introduced by resampling. If a model has high bias, its resampling performance will be consistent, as illustrated by the right plot below.

Figure 6.5: A biased polynomial model fit to a single data set does not capture the underlying non-linear, non-monotonic data structure (left). Models fit to 25 bootstrapped replicates of the data are undeterred by the noise and generate similar, yet still biased, predictions (right).
6.5.2 Variance
Error due to variance is the variability of a model prediction for a given data point.
Many models (e.g., k-nearest neighbor, decision trees, gradient boosting machines) are very adaptable and offer extreme flexibility in the patterns they can fit. However, this flexibility comes at a cost: these models run the risk of overfitting to the training data. Although you may achieve very good performance on your training data, the model will not automatically generalize well to unseen data.

Figure 6.6: A high variance k-nearest neighbor model fit to a single data set captures the underlying non-linear, non-monotonic data structure well but also overfits to individual data points (left). Models fit to 25 bootstrapped replicates of the data are easily swayed by the noise and generate highly variable predictions (right).
Since high variance models are more prone to overfitting, using resampling procedures is critical to reduce this risk. Moreover, many algorithms that are capable of achieving high generalization performance have lots of hyperparameters that control the level of model complexity (i.e., the tradeoff between bias and variance).
Many high performing models (e.g., random forests, gradient boosting machines, deep learning) are very flexible in the patterns they can conform to due to the many hyperparameters they have. However, this also means they are prone to overfitting (i.e., they can have high variance error).
6.5.3 Hyperparameters
Hyperparameters (aka tuning parameters) are the "knobs to twiddle" to control the complexity of machine learning algorithms and, therefore, the bias-variance trade-off. Not all algorithms have hyperparameters (e.g., ordinary least squares); however, most have at least one or more.
Some models have very few hyperparameters. For example, k-nearest neighbor models have a single hyperparameter (k) that determines the predicted value to be made based on the k nearest observations in the training data to the one being predicted. If k is small (e.g., \(k = 3\)), the model will make a prediction for a given observation based on the average of the response values for the 3 observations in the training data most similar to the observation being predicted. This often results in highly variable predicted values because we are basing the prediction (in this case, an average) on a very small subset of the training data. As k gets bigger, we base our predictions on an average of a larger subset of the training data, which naturally reduces the variance in our predicted values (remember this for later, averaging often helps to reduce variance!).
The plot below illustrates this point. Smaller k values (e.g., 2, 5, or 10) lead to high variance (but lower bias) and larger values (e.g., 150) lead to high bias (but lower variance). The optimal k value might exist somewhere between 20–50, but how do we know which value of k to use?

Figure 6.7: k-nearest neighbor model with differing values for k.
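For reference, here is how such models could be specified with parsnip (a sketch, not code from the lesson; the object names are ours and it assumes the kknn engine package is installed):
# a small k (flexible, higher variance) vs. a large k (smoother, higher bias)
knn_small_k <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("regression")

knn_large_k <- nearest_neighbor(neighbors = 150) %>%
  set_engine("kknn") %>%
  set_mode("regression")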
Some algorithms such as ordinary least squares don’t have any hyperparameters. Some algorithms such as k-nearest neighbor have one or two hyperparameters. And some algorithms such as gradient boosted machines (GBMs) and deep learning models have many.
Hyperparameter tuning is the process of screening hyperparameter values (or combinations of hyperparameter values) to find a model that balances bias & variance so that the model generalizes well to unseen data.
Let’s illustrate by using decision trees. One of the key hyperparameters in decision trees is the depth of the tree.
This lesson does not dig into the decision tree algorithm but if you want to better understand the hyperparameters for decision trees you can read about them here
Say we wanted to assess what happens when we grow the decision tree 5 levels deep. We could do this manually:
# create our decision tree model object
dt_mod <- decision_tree(tree_depth = 5) %>%
  set_mode('regression')

# add model object and our formula spec to a workflow object
dt_wflow <- workflow() %>%
  add_model(dt_mod) %>%
  add_formula(Sale_Price ~ .)

# fit our model across the 10-fold CV
dt_fit_cv <- dt_wflow %>%
  fit_resamples(kfolds)

# assess results
collect_metrics(dt_fit_cv)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 39175. 10 1382. Preprocessor1_Model1
## 2 rsq standard 0.758 10 0.0156 Preprocessor1_Model1
But what if we wanted to assess and compare different tree_depth values? Moreover, decision trees have another key hyperparameter, cost_complexity. What if we wanted to assess a few values of that hyperparameter in combination with tree_depth? Adjusting these values manually would be painstakingly burdensome.
Again, don't worry if you have no idea what these hyperparameters mean. Just realize we want to toggle these values to try to find an optimal model.
6.5.4 Full cartesian grid search
For this we could use a full cartesian grid search. A full cartesian grid search takes the values provided for each hyperparameter and assesses every combination.
First, let’s rebuild our decision tree model object; however, this time we’ll create a model specification that identifies which hyperparameters we plan to tune.
dt_mod <- decision_tree(
  cost_complexity = tune(),  # <-- these are hyperparameters we want to tune
  tree_depth = tune()        # <-- these are hyperparameters we want to tune
) %>%
  set_engine("rpart") %>%
  set_mode("regression")
Think of tune() here as a placeholder. After the tuning process, we will select a single numeric value for each of these hyperparameters. For now, we specify our parsnip model object and identify the hyperparameters we will tune().
Next, we create our tuning grid of hyperparameter values we want to assess. The function grid_regular()
is from the dials package. It chooses sensible values to try for each hyperparameter; here, we asked for 3 of each. Since we have two to tune, grid_regular()
returns \(3 \times 3 = 9\) different possible tuning combinations to try.
A full cartesian grid search can explode as you add more hyperparameters and values to assess. When this happens it's best to start using grid_random() or grid_latin_hypercube() to reduce combinatorial explosion and computation time.
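For example, a random grid of 20 candidate combinations could be generated like so (a sketch for reference only; the rest of this lesson sticks with the regular grid created below):
# randomly sample 20 hyperparameter combinations rather than a full grid
set.seed(123)
dt_random_grid <- grid_random(
  cost_complexity(),
  tree_depth(),
  size = 20
)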
dt_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  levels = 3
)

dt_grid
## # A tibble: 9 × 2
## cost_complexity tree_depth
## <dbl> <int>
## 1 0.0000000001 1
## 2 0.00000316 1
## 3 0.1 1
## 4 0.0000000001 8
## 5 0.00000316 8
## 6 0.1 8
## 7 0.0000000001 15
## 8 0.00000316 15
## 9 0.1 15
Now that we have our tuning grid and model object defined we can:
- Create our k-fold CV object (5-fold in this example),
- Add our model to a workflow object and specify the formula,
- Apply tune_grid() to execute our hyperparameter tuning.
# 5-fold instead of 10-fold to reduce computation time
kfolds <- vfold_cv(train, v = 5)

# add model object and our formula spec to a workflow object
dt_wflow <- workflow() %>%
  add_model(dt_mod) %>%
  add_formula(Sale_Price ~ .)

# fit our model across the 5-fold CV
dt_grid_search <- dt_wflow %>%
  tune_grid(
    resamples = kfolds,
    grid = dt_grid
  )
We can check out the hyperparameter combinations that resulted in the best model performance with show_best(). Here we look at the top 5 models and we can see that the top 4 all tend to perform very similarly. It appears that deeper trees with a smaller cost complexity factor tend to perform best on this data set.
dt_grid_search %>%
  show_best(metric = 'rmse')
## # A tibble: 5 × 8
## cost_complexity tree_depth .metric .estimator mean n std_err .config
## <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0000000001 8 rmse standard 36638. 5 1515. Preprocessor1_Model4
## 2 0.00000316 8 rmse standard 36638. 5 1515. Preprocessor1_Model5
## 3 0.0000000001 15 rmse standard 36703. 5 1587. Preprocessor1_Model7
## 4 0.00000316 15 rmse standard 36703. 5 1587. Preprocessor1_Model8
## 5 0.1 8 rmse standard 51944. 5 1834. Preprocessor1_Model6
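If you prefer a visual summary of the grid search, tune also provides an autoplot() method for tuning results (an optional aside, not shown in the original lesson):
# plot the resampled performance metrics across the hyperparameter grid
autoplot(dt_grid_search)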
6.6 Finalizing our model
If we are satisfied with our results and want to use the hyperparameter values from our best decision tree model, we can select them with select_best():
# select best model based on RMSE metric
best_tree <- dt_grid_search %>%
  select_best(metric = 'rmse')

best_tree
## # A tibble: 1 × 3
## cost_complexity tree_depth .config
## <dbl> <int> <chr>
## 1 0.0000000001 8 Preprocessor1_Model4
We can then update (or “finalize”) our workflow object dt_wflow
with the values from select_best()
.
final_wflow <- dt_wflow %>%
  finalize_workflow(best_tree)
Finally, let’s fit this final model to the training data and use our test data to estimate the model performance we expect to see with new data. We can use the function last_fit()
with our finalized model; this function fits the finalized model on the full training data set and evaluates the finalized model on the testing data.
We pass the initial train-test split object we created at the beginning of this lesson to last_fit().
final_fit <- final_wflow %>%
  last_fit(split)

final_fit %>%
  collect_predictions() %>%
  rmse(truth = Sale_Price, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 34609.
As we can see, our test RMSE (34,609) is less than our CV RMSE (roughly 36,600). This indicates that we did not overfit during our tuning procedure, which is a good thing.
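As an aside, last_fit() also stores the test-set metrics directly, so an equivalent way to retrieve them (a small sketch) is:
# collect_metrics() on a last_fit() result returns the test-set metrics
collect_metrics(final_fit)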
Perhaps we would also like to understand what variables are important in this final model. We can use the vip package to estimate variable importance based on the model’s structure.
Don’t worry, we’ll talk about vip more later on and just how this measure of importance is computed.
library(vip)
final_fit %>%
  extract_fit_parsnip() %>%
  vip()
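If you later want to use this finalized model to score new data, one option (a sketch, not part of the lesson; final_model is our own object name) is to extract the fitted workflow from the last_fit() result and predict with it:
# extract the fitted workflow and generate predictions for new observations
final_model <- extract_workflow(final_fit)
predict(final_model, new_data = head(test))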
6.7 Exercises
Import the dataset blood_transfusion.csv:
- The column “Class” contains the target variable. Investigate this variable. Is this a regression or classification problem?
- Why is it relevant to add a preprocessing step to standardize the features? What step_xxx() function would you use to do so?
- Perform a k-nearest neighbor model on this data with neighbors = 10. Be sure to add a preprocessing step to standardize the features.
- Perform a 5-fold cross validation with the above model workflow. What is your average CV score?
- Now perform hyperparameter tuning to understand the effect of the parameter neighbors on the model score. Assess 10 values for neighbors in the range of 1-100. Again, perform a 5-fold cross validation. Which hyperparameter value performed the best and what was the CV score?
6.8 Additional resources
This module provided a very high-level introduction to predictive modeling with tidymodels. To build on this knowledge you can find many great resources at https://www.tidymodels.org/.