September 9, 2021
Brad Boehmke
Repo: bit.ly/xgb_db_tuning
Slides: bradleyboehmke.github.io/xgboost_databricks_tuning/index.html
Tutorial Docs: bit.ly/db_notebook_docs
Advanced XGBoost Hyperparameter Tuning on Databricks
Objective:
This tutorial makes the assumption that the reader:
Target audience: Data scientists transitioning to Databricks who have experience with machine learning in general and, more specifically, some experience with XGBoost. The reader's objective is to better understand which XGBoost hyperparameters to focus on, what values to explore, and how best to implement the search process within Databricks.
Gradient boosting machines (GBMs), and more specifically the XGBoost variant, are extremely popular machine learning algorithms that have proven successful across many domains and, when appropriately tuned, are among the leading methods for tabular data.
The objective of this tutorial is to illustrate:
Note: These slides accompany a full-length tutorial guide (see the Tutorial Docs link above).
# helper packages
import pandas as pd
import numpy as np
import time
import warnings
# modeling
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import xgboost as xgb
# hyperparameter tuning
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from hyperopt.pyll import scope
# model/grid search tracking
import mlflow
We use the well-known wine quality dataset provided by the UCI Machine Learning Repository.
Our objective in this modeling task is to use wine characteristics to predict whether a wine is high quality (>= 7) or low quality (< 7).
# 1. read in data
white_wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";")
red_wine = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
# 2. create indicator column for red vs. white wine
red_wine['is_red'] = 1
white_wine['is_red'] = 0
# 3. combine the red and white wine data sets
data = pd.concat([red_wine, white_wine], axis=0)
# 4. remove spaces from column names
data.rename(columns=lambda x: x.replace(' ', '_'), inplace=True)
# 5. convert "quality" column to 0 vs. 1 to make this a classification problem
data["quality"] = (data["quality"] >= 7).astype(int)
data.head(10)
# split data into train (75%) and test (25%) sets
train, test = train_test_split(data, random_state=123)
X_train = train.drop(columns="quality")
X_test = test.drop(columns="quality")
y_train = train["quality"]
y_test = test["quality"]
# create XGBoost DMatrix objects
train = xgb.DMatrix(data=X_train, label=y_train)
test = xgb.DMatrix(data=X_test, label=y_test)
There are many XGBoost hyperparameters 😳 but the most common ones can be categorized as...
Stochastic behavior helps to reduce tree correlation and also helps reduce the chances of getting stuck in local minima, plateaus, and other irregular terrain of the loss function.
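To make the stochastic hyperparameters concrete, the sketch below mimics what subsample and colsample_bytree control: each tree only sees a random fraction of the rows and columns. This is a simplified illustration in plain Python, not XGBoost's internal implementation; the function name sample_rows_and_columns and the toy sizes are assumptions made up for this example.

```python
import random

def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the row and column indices a single tree would be allowed to see."""
    n_row_sample = max(1, int(n_rows * subsample))
    n_col_sample = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_row_sample))
    cols = sorted(rng.sample(range(n_cols), n_col_sample))
    return rows, cols

# Hypothetical toy problem: 10 rows, 12 features, three trees
rng = random.Random(123)
tree_views = [
    sample_rows_and_columns(n_rows=10, n_cols=12, subsample=0.5,
                            colsample_bytree=0.5, rng=rng)
    for _ in range(3)
]
for i, (rows, cols) in enumerate(tree_views):
    print(f"tree {i}: rows={rows} cols={cols}")
```

Because every tree is built from a different view of the data, the trees are less correlated with one another, which is the effect described above.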
Several approaches you can use for performing a hyperparameter grid search:
Why hyperopt:
A best practice strategy for a hyperopt workflow is as follows:
search_space = {
'learning_rate': hp.loguniform('learning_rate', -7, 0),
'max_depth': scope.int(hp.uniform('max_depth', 1, 100)),
'min_child_weight': hp.loguniform('min_child_weight', -2, 3),
'subsample': hp.uniform('subsample', 0.5, 1),
'colsample_bytree': hp.uniform('colsample_bytree', 0.5, 1),
'gamma': hp.loguniform('gamma', -10, 10),
'alpha': hp.loguniform('alpha', -10, 10),
'lambda': hp.loguniform('lambda', -10, 10),
'objective': 'binary:logistic',
'eval_metric': 'auc',
'seed': 123,
}
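A detail that is easy to miss: hp.loguniform(label, low, high) draws exp(uniform(low, high)), so the bounds in the search space are on the natural-log scale. The sketch below uses only the standard library to convert those bounds back to the raw scale and show the range each hyperparameter can actually take. The names log_bounds and raw_ranges are introduced here for illustration only.

```python
import math

# (low, high) arguments passed to hp.loguniform in the search space above
log_bounds = {
    "learning_rate": (-7, 0),
    "min_child_weight": (-2, 3),
    "gamma": (-10, 10),
    "alpha": (-10, 10),
    "lambda": (-10, 10),
}

# exp() maps the log-scale bounds back to the scale XGBoost receives
raw_ranges = {
    name: (math.exp(low), math.exp(high))
    for name, (low, high) in log_bounds.items()
}

for name, (low, high) in raw_ranges.items():
    print(f"{name:>16}: {low:.6g} to {high:.6g}")
```

For example, learning_rate is searched between roughly 0.0009 and 1, with as much sampling effort spent between 0.001 and 0.01 as between 0.1 and 1.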
Hyperopt and MLflow work great together!
We just need to define a function that returns a status and a loss.
def train_model(params):
    # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
    mlflow.xgboost.autolog(silent=True)
    # However, we can log additional information by using an MLFlow tracking context manager
    with mlflow.start_run(nested=True):
        # Train model and record run time
        start_time = time.time()
        booster = xgb.train(params=params, dtrain=train, num_boost_round=5000, evals=[(test, "test")], early_stopping_rounds=50, verbose_eval=False)
        run_time = time.time() - start_time
        mlflow.log_metric('runtime', run_time)
        # Record AUC as primary loss for Hyperopt to minimize
        predictions_test = booster.predict(test)
        auc_score = roc_auc_score(y_test, predictions_test)
        # Set the loss to -1*auc_score so fmin maximizes the auc_score
        return {'status': STATUS_OK, 'loss': -auc_score, 'booster': booster.attributes()}
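Because fmin always minimizes, a metric where larger is better has to be negated before it is returned as the loss. The sketch below demonstrates that sign convention with a small hand-rolled pairwise AUC, used as a dependency-free stand-in for roc_auc_score. The helper pairwise_auc, the candidate names, and the toy scores are all hypothetical.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
candidates = {
    "model_a": [0.1, 0.4, 0.35, 0.8],  # one positive ranked below a negative
    "model_b": [0.1, 0.2, 0.7, 0.9],   # every positive ranked above every negative
}

# Negate the AUC so that the smallest loss corresponds to the largest AUC
losses = {name: -pairwise_auc(y_true, scores) for name, scores in candidates.items()}
best = min(losses, key=losses.get)
print(losses, best)
```

The candidate with the lowest loss is the one with the highest AUC, which is exactly what returning -auc_score achieves in train_model.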
To execute the search we use fmin and supply it our model training (objective) function along with the hyperparameter search space. fmin can use different algorithms to search across the hyperparameter search space (e.g. random, Bayesian); however, we suggest using the Tree of Parzen Estimators (tpe.suggest), which performs a smart Bayesian optimization search.
#spark_trials = SparkTrials(parallelism=4)
# runs initial search to assess 25 hyperparameter combinations
with mlflow.start_run(run_name='initial_search'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        max_evals=25,
        # note: newer hyperopt releases expect a NumPy Generator here, e.g. np.random.default_rng(123)
        rstate=np.random.RandomState(123),
        #trials=spark_trials
    )
best_params
Out[15]: {
'alpha': 1.2573697498285759,
'colsample_bytree': 0.6246623099667723,
'gamma': 0.4299177395431556,
'lambda': 0.6655776343087407,
'learning_rate': 0.10108159135348746,
'max_depth': 8.571533913539605,
'min_child_weight': 1.3053934392357864,
'subsample': 0.6654338738457878
}
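Note that fmin returns the raw sampled values. That is why max_depth appears as a float in the output above even though the search space wraps it in scope.int, and why the fixed entries (objective, eval_metric, seed) are missing. A minimal sketch of rebuilding a complete parameter dictionary for a final model follows. The values are copied from the output above, the names fixed_params and final_params are introduced for illustration, and hyperopt's space_eval(search_space, best_params) is an alternative that applies the search space's own transformations.

```python
# Raw result returned by fmin (copied from the output above)
best_params = {
    "alpha": 1.2573697498285759,
    "colsample_bytree": 0.6246623099667723,
    "gamma": 0.4299177395431556,
    "lambda": 0.6655776343087407,
    "learning_rate": 0.10108159135348746,
    "max_depth": 8.571533913539605,
    "min_child_weight": 1.3053934392357864,
    "subsample": 0.6654338738457878,
}

# Constant entries of the search space that fmin does not report back
fixed_params = {"objective": "binary:logistic", "eval_metric": "auc", "seed": 123}

final_params = {**best_params, **fixed_params}
# Apply the same int() truncation that scope.int performs during the search
final_params["max_depth"] = int(final_params["max_depth"])
print(final_params["max_depth"])
```

The resulting final_params dictionary has the same shape as the params argument that train_model passes to xgb.train.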
with mlflow.start_run(run_name='xgb_timeout'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        timeout=60*5, # stop the grid search after 5 * 60 seconds == 5 minutes
        #trials=spark_trials,
        rstate=np.random.RandomState(123)
    )
with mlflow.start_run(run_name='xgb_loss_threshold'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        loss_threshold=-0.92, # stop the grid search once we've reached an AUC of 0.92 or higher
        timeout=60*10, # stop after 10 minutes regardless of whether we reach an AUC of 0.92
        #trials=spark_trials,
        rstate=np.random.RandomState(123)
    )
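The interplay of loss_threshold and timeout can be sketched as a simple loop: the search ends as soon as either condition is met. The code below is an illustration of that stopping logic only, not hyperopt's implementation. The function run_search, its clock argument, and the list of trial losses are hypothetical.

```python
import time

def run_search(losses, loss_threshold=None, timeout=None, clock=time.monotonic):
    """Consume trial losses in order, stopping on a loss threshold or a timeout."""
    start = clock()
    best = float("inf")
    evaluated = 0
    for loss in losses:
        # Stop before starting a new trial once the time budget is used up
        if timeout is not None and clock() - start >= timeout:
            break
        best = min(best, loss)
        evaluated += 1
        # Stop as soon as the best loss is at or below the threshold
        if loss_threshold is not None and best <= loss_threshold:
            break
    return best, evaluated

# Hypothetical losses (negative AUC) from four consecutive trials
trial_losses = [-0.88, -0.90, -0.93, -0.95]
best_loss, n_trials = run_search(trial_losses, loss_threshold=-0.92, timeout=60 * 10)
print(best_loss, n_trials)
```

With a threshold of -0.92 the search stops after the third trial, so the fourth and better trial is never run. That is the trade-off of stopping at "good enough".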
Tuning an XGBoost algorithm is no small feat.
This tutorial outlined
XGBoost
Hyperopt
MLflow
1 - Assess the tutorial landscape that currently exists (1-2 hrs)
2 - Decide on a topic & delivery format
3 - Research unknown components (hyperopt & mlflow) (1-2 hrs)
4 - Tinker with code for discovery purposes (1-2 hrs)
5 - Remove unnecessary code and fill in with prose context (2 hrs)
6 - Independent reviews & incorporate feedback (30 min)