Advanced XGBoost Hyperparameter Tuning on Databricks


September 9, 2021








Brad Boehmke

Repo: bit.ly/xgb_db_tuning

Slides: bradleyboehmke.github.io/xgboost_databricks_tuning/index.html

Tutorial Docs: bit.ly/db_notebook_docs


Agenda

  1. Tutorial objective
  2. Why this topic?
  3. Where it fits in current offerings
  4. Assumptions made
  5. Tutorial
  6. My approach
  7. Q&A


Tutorial objective

Advanced XGBoost Hyperparameter Tuning on Databricks

Objective:

  • Best practices for tuning XGBoost hyperparameters
  • Using tightly coupled Databricks tooling for...

       - Effective and efficient scaling of XGBoost hyperparameter search (Hyperopt)
       - Tracking and organizing grid search performance (MLflow)

Tooling used


Why this topic?

  • Demand (insider trading information)
  • Leverage expertise to shorten learning curve
  • Interest
  • Disparate information
  • Centralization & tighter coupling


Where it fits


Assumptions made

This tutorial makes the assumption that the reader:

  • Time: Plans to spend ~ 10min reading the tutorial (< 3,000 words)
  • Language: Comfortable using Python for basic data wrangling tasks, writing functions, and applying context managers
  • ML: Understands the basics of the GBM/XGBoost algorithm and is familiar with the idea of hyperparameter tuning
  • Environment: Has access to a Databricks ML runtime cluster to reproduce results (~ 20 min compute time)

Target audience: A data scientist transitioning to Databricks who has experience with machine learning in general and, more specifically, some experience with XGBoost. The reader's objective is to better understand which XGBoost hyperparameters to focus on, what values to explore, and how best to implement the search process within Databricks.


(Abridged) Tutorial


Intro

Gradient boosting machines (GBMs), and more specifically the XGBoost variant, are extremely popular machine learning algorithms that have proven successful across many domains and, when appropriately tuned, are among the leading algorithmic methods for tabular data.

The objective of this tutorial is to illustrate:

  • Best practices for tuning XGBoost hyperparameters
  • Leveraging Hyperopt for an effective and efficient XGBoost grid search
  • Using MLflow for tracking and organizing grid search performance



Note: These slides accompany a full-length tutorial guide (Tutorial Docs: bit.ly/db_notebook_docs).


Assumptions

  1. Language: Comfortable using Python for basic data wrangling tasks, writing functions, and applying context managers
  2. ML: Understands the basics of the GBM/XGBoost algorithm and is familiar with the idea of hyperparameter tuning
  3. Environment: Has access to a Databricks ML runtime cluster to reproduce results (~ 20 min compute time)


Prerequisites: Packages

# helper packages
import pandas as pd
import numpy as np
import time
import warnings

# modeling
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import xgboost as xgb

# hyperparameter tuning
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from hyperopt.pyll import scope

# model/grid search tracking
import mlflow


Prerequisites: Data

We use the well-known wine quality dataset provided by the UCI Machine Learning Repository.

Our objective in this modeling task is to use wine characteristics to predict whether a wine is high quality (quality >= 7) or low quality (quality < 7).

# 1. read in data
white_wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";")
red_wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")

# 2. create indicator column for red vs. white wine
red_wine['is_red'] = 1
white_wine['is_red'] = 0

# 3. combine the red and white wine data sets
data = pd.concat([red_wine, white_wine], axis=0)

# 4. remove spaces from column names
data.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)

# 5. convert "quality" column to 0 vs. 1 to make this a classification problem
data["quality"] = (data["quality"] >= 7).astype(int)


Prerequisites: Data

data.head(10)

Wine data

# split data into train (75%) and test (25%) sets
train, test = train_test_split(data, random_state=123)
X_train = train.drop(columns="quality")
X_test = test.drop(columns="quality")
y_train = train["quality"]
y_test = test["quality"]

# create XGBoost DMatrix objects
train = xgb.DMatrix(data=X_train, label=y_train)
test = xgb.DMatrix(data=X_test, label=y_test)


XGBoost hyperparameters

There are many 😳 but the most common ones can be categorized as...

  • Boosting hyperparameters: controls how we proceed down the gradient descent process (aka gradient boosting)
  • Tree hyperparameters: controls how we build our base learner decision trees
  • Stochastic hyperparameters: controls how we randomly subsample our training data during the model building process
  • Regularization hyperparameters: controls model complexity to guard against overfitting
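
For orientation, here is the same grouping expressed as an XGBoost parameter dictionary (the values below are illustrative placeholders, not tuned recommendations):

# hypothetical parameter dictionary organized by category
example_params = {
    # boosting
    'learning_rate': 0.1,
    # tree
    'max_depth': 6,
    'min_child_weight': 1,
    # stochastic
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    # regularization
    'gamma': 0,
    'alpha': 0,    # L1 penalty
    'lambda': 1,   # L2 penalty
}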


Boosting hyperparameters

  • Number of trees: XGBoost allows us to apply early stopping. We simply need enough trees to minimize loss.
  • learning_rate: Determines the contribution of each tree to the final outcome and controls how quickly the algorithm learns. Recommendation: Search across values ranging from 0.0001 to 1 on a log scale (e.g. 0.0001, 0.001, 0.01, 0.1, 1).

Learning rate
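
A minimal sketch of both points, assuming the train and test DMatrix objects from the prerequisites: early stopping lets XGBoost choose the number of trees, while np.logspace generates the log-scaled learning rate candidates.

# log-scaled learning rate candidates: 0.0001, 0.001, 0.01, 0.1, 1
learning_rates = np.logspace(-4, 0, num=5)

# with early stopping we request a generous number of boosting rounds and
# stop once the test AUC has not improved for 50 consecutive rounds
booster = xgb.train(
    params={'objective': 'binary:logistic', 'eval_metric': 'auc', 'learning_rate': 0.1},
    dtrain=train,
    num_boost_round=5000,
    evals=[(test, "test")],
    early_stopping_rounds=50,
    verbose_eval=False
)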


Tree hyperparameters

  • max_depth: Explicitly controls the depth of the individual trees. Recommendation: Uniformly search across values ranging from 1-10 but be willing to increase the high value range for larger datasets.
  • min_child_weight: Implicitly controls the complexity of each tree by requiring the minimum number of instances (measured by the Hessian within XGBoost) to be greater than a certain value for further partitioning to occur. Recommendation: Uniformly search across values ranging from near zero to 20 but be willing to increase the high value range for larger datasets.

Shallow vs. deep trees


Stochastic hyperparameters

Stochastic behavior helps to reduce tree correlation and also reduces the chances of getting stuck in local minima, plateaus, and other irregular terrain of the loss function.

  • subsample: Subsampling rows before creating each tree. Useful when there are dominating features in your dataset. Recommendation: Uniformly search across values ranging from 0.5-1.0.
  • colsample_bytree: Subsampling of columns before creating each tree (i.e. mtry in random forests). Useful for large datasets or when multicollinearity exists. Recommendation: Uniformly search across values ranging from 0.5-1.0.
  • colsample_bylevel & colsample_bynode: Additional procedures for sampling columns as you build a tree. Useful for datasets with many highly correlated features. Recommendation: Uniformly search across values ranging from 0.5-1.0.


Regularization hyperparameters

  • gamma: Controls the complexity of a given tree by growing the tree to the max depth but then pruning the tree to find and remove splits that do not meet the specified gamma. Recommendation: Search across values ranging from 0 to some large number on a log scale (e.g. 0, 1, 10, 100, 1000, etc.).
  • alpha: Provides an L1 regularization to the loss function, which is similar to the Lasso penalty commonly used for regularized regression. Recommendation: Search across values ranging from 0 to some large number on a log scale (e.g. 0, 1, 10, 100, 1000, etc.).
  • lambda: Provides an L2 regularization to the loss function, which is similar to the Ridge penalty commonly used for regularized regression. Recommendation: Search across values ranging from 0 to some large number on a log scale (e.g. 0, 1, 10, 100, 1000, etc.).

Regularization


Hyperopt for hyperparameter search

Several approaches you can use for performing a hyperparameter grid search:

  • full cartesian grid search
  • random grid search
  • Bayesian optimization

Why hyperopt:

  • Open source
  • Bayesian optimizer – smart searches over hyperparameters (using a Tree of Parzen Estimators), not grid or random search
  • Integrates with Apache Spark for parallel hyperparameter search
  • Integrates with MLflow for automatic tracking of the search results
  • Included in the Databricks ML runtime
  • Maximally flexible: can optimize literally any Python model with any hyperparameters
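
A sketch of the Spark integration (the abridged fmin calls later in these slides keep it commented out): constructing a SparkTrials object and passing it to fmin via trials= distributes the individual trials across the cluster's workers.

# hypothetical example: run up to 4 trials concurrently on Spark workers
spark_trials = SparkTrials(parallelism=4)

# later: fmin(..., trials=spark_trials) evaluates hyperparameter
# combinations in parallel instead of sequentially on the driver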


Hyperparameter search space

A best practice strategy for a hyperopt workflow is as follows:

  1. Choose what hyperparameters are reasonable to optimize
  2. Define broad ranges for each of the hyperparameters (including the default where applicable)
  3. Run a small number of trials
  4. Observe the results in an MLflow parallel coordinate plot and select the runs with lowest loss
  5. Move the range towards those higher/lower values when the best runs’ hyperparameter values are pushed against one end of a range
  6. Determine whether certain hyperparameter values cause fitting to take a long time (and avoid those values)
  7. Re-run with more trials
  8. Repeat until the best runs are comfortably within the given search bounds and none are taking excessive time


Hyperparameter search space


search_space = {
    'learning_rate': hp.loguniform('learning_rate', -7, 0),
    'max_depth': scope.int(hp.uniform('max_depth', 1, 100)),
    'min_child_weight': hp.loguniform('min_child_weight', -2, 3),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
    'gamma': hp.loguniform('gamma', -10, 10),
    'alpha': hp.loguniform('alpha', -10, 10),
    'lambda': hp.loguniform('lambda', -10, 10),
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'seed': 123,
}
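
Note that hp.loguniform(label, low, high) draws values distributed as exp(uniform(low, high)), so learning_rate above spans roughly exp(-7) ≈ 0.001 up to exp(0) = 1. An optional sanity check is to draw a random configuration from the space before running the search:

# draw one random configuration to eyeball the ranges
# (fmin handles the actual sampling during the search itself)
from hyperopt.pyll import stochastic
stochastic.sample(search_space)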


Defining the model training process

Hyperopt and MLflow work great together!

We just need to define an objective function that returns a status and a loss.

def train_model(params):

    # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
    mlflow.xgboost.autolog(silent=True)

    # However, we can log additional information by using an MLFlow tracking context manager
    with mlflow.start_run(nested=True):

        # Train model and record run time
        start_time = time.time()
        booster = xgb.train(params=params, dtrain=train, num_boost_round=5000, evals=[(test, "test")], early_stopping_rounds=50, verbose_eval=False)
        run_time = time.time() - start_time
        mlflow.log_metric('runtime', run_time)

        # Record AUC as primary loss for Hyperopt to minimize
        predictions_test = booster.predict(test)
        auc_score = roc_auc_score(y_test, predictions_test)

        # Set the loss to -1*auc_score so fmin maximizes the auc_score
        return {'status': STATUS_OK, 'loss': -auc_score, 'booster': booster.attributes()}


Executing the search

To execute the search we use fmin and supply it our model training (objective) function along with the hyperparameter search space. fmin can use different algorithms to search across the hyperparameter space (e.g. random, Bayesian); however, we suggest using the Tree of Parzen Estimators (tpe.suggest), which performs a smart Bayesian optimization search.

#spark_trials = SparkTrials(parallelism=4)

# runs initial search to assess 25 hyperparameter combinations
with mlflow.start_run(run_name='initial_search'):
    best_params = fmin(
      fn=train_model,
      space=search_space,
      algo=tpe.suggest,
      max_evals=25,
      rstate=np.random.RandomState(123),
      #trials=spark_trials
    )

best_params
Out[15]: {
    'alpha': 1.2573697498285759,
    'colsample_bytree': 0.6246623099667723,
    'gamma': 0.4299177395431556,
    'lambda': 0.6655776343087407,
    'learning_rate': 0.10108159135348746,
    'max_depth': 8.571533913539605,
    'min_child_weight': 1.3053934392357864,
    'subsample': 0.6654338738457878
    }
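
Note that fmin returns the raw values drawn from the search space (e.g. max_depth comes back as a float and the constant entries such as objective are omitted). One way to recover the concrete parameter values, sketched here with hyperopt's space_eval:

# map the raw fmin output back through the search space definition
# (applies scope.int to max_depth and re-attaches the constant entries)
from hyperopt import space_eval
tuned_params = space_eval(search_space, best_params)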


Assess results

MLflow experiment


MLflow experiment logs


Assess results

Parallel coordinates plot


Assess results

Model performance vs runtime


Alternative early stopping procedures

with mlflow.start_run(run_name='xgb_timeout'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        timeout=60*5, # stop the grid search after 5 * 60 seconds == 5 minutes
        #trials=spark_trials,
        rstate=np.random.RandomState(123)
    )

with mlflow.start_run(run_name='xgb_loss_threshold'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        loss_threshold=-0.92, # stop the grid search once we've reached an AUC of 0.92 or higher
        timeout=60*10,        # stop after 10 minutes regardless of whether we reach an AUC of 0.92
        #trials=spark_trials,
        rstate=np.random.RandomState(123)
    )


Summary

Tuning an XGBoost algorithm is no small feat.

This tutorial outlined

  • the primary hyperparameters that tend to impact model performance along with recommended values to explore for each hyperparameter,
  • how to use Hyperopt for an intelligent, Bayesian optimization approach to explore the search space,
  • how to use MLflow to log and organize the hyperparameter exploration within Databricks.


My approach

1 - Assess the tutorial landscape that currently exists (1-2 hrs)

2 - Decide on a topic & delivery format

3 - Research unknown components (hyperopt & mlflow) (1-2 hrs)

4 - Tinker with code for discovery purposes (1-2 hrs)

5 - Remove unnecessary code and fill in with prose context (2 hrs)

6 - Independent reviews & incorporate feedback (30 min)


Questions?
