
Tune Learning Rate for Gradient Boosting with XGBoost in Python


A problem with gradient boosted decision trees is that they are quick to learn and can overfit the training data.

One effective way to slow down learning in the gradient boosting model is to use a learning rate, also called shrinkage (or eta in XGBoost documentation).

In this post you will discover the effect of the learning rate in gradient boosting and how to tune it on your machine learning problem using the XGBoost library in python.

After reading this post you will know:

The effect the learning rate has on the gradient boosting model.
How to tune the learning rate on your machine learning problem.
How to tune the trade-off between the number of boosted trees and the learning rate on your problem.

Let’s get started.


Photo by Robert Hertel, some rights reserved.


Slow Learning in Gradient Boosting with a Learning Rate

Gradient boosting involves creating and adding trees to the model sequentially.

New trees are created to correct the residual errors in the predictions from the existing sequence of trees.

The effect is that the model can quickly fit, then overfit the training dataset.

A technique to slow down the learning in the gradient boosting model is to apply a weighting factor for the corrections by new trees when added to the model.

This weighting is called the shrinkage factor or the learning rate, depending on the literature or the tool.

Naive gradient boosting is the same as gradient boosting with shrinkage where the shrinkage factor is set to 1.0. Setting values less than 1.0 has the effect of making less corrections for each tree added to the model. This in turn results in more trees that must be added to the model.
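
To make the role of the shrinkage factor concrete, here is a minimal sketch of a squared-error boosting loop using plain scikit-learn regression trees. It is illustrative only, not how XGBoost is implemented internally, and the function name and parameters are made up for the example.

# Minimal illustrative boosting loop showing where shrinkage is applied.
# Not XGBoost internals; squared-error boosting with scikit-learn trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(X, y, n_trees=100, learning_rate=0.1):
    pred = np.full(len(y), y.mean())              # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                      # errors left by the current model
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += learning_rate * tree.predict(X)   # shrink each new tree's correction
        trees.append(tree)
    return trees, pred

With learning_rate=1.0 this collapses to the naive case described above; smaller values make each correction more conservative, so more trees are needed to reach the same training fit.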

It is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.

Let’s investigate the effect of the learning rate on a standard machine learning dataset.

Problem Description: Otto Dataset

In this tutorial we will use the Otto Group Product Classification Challenge dataset.

This dataset is available for free from Kaggle (you will need to sign up for a Kaggle account to be able to download it). You can download the training dataset train.csv.zip from the Data page and place the unzipped train.csv file into your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 9 product categories (e.g. fashion, electronics, etc.). Input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 9 categories, and models are evaluated using the multiclass logarithmic loss (also called cross entropy).

This competition was completed in May 2015 and this dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem and the fact that little data preparation is required (other than encoding the string class variables as integers).
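
Before running any models, it is worth a quick sanity check of the downloaded file. The snippet below is only an illustrative check: it assumes train.csv from Kaggle sits in the working directory and uses the column names from that file ('id' and 'target').

# Quick, illustrative sanity check of the Otto training file.
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder

data = read_csv('train.csv')
print(data.shape)                # rows x columns: id, the 93 feature counts, target
print(data['target'].nunique())  # number of product categories

# encode the string class labels as integers for use with XGBoost
label_encoded_y = LabelEncoder().fit_transform(data['target'])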

Tuning Learning Rate in XGBoost

When creating gradient boosting models with XGBoost using the scikit-learn wrapper, the learning_rate parameter can be set to control the weighting of new trees added to the model.
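
For example, a single model with an explicit shrinkage value can be configured directly through the wrapper (the 0.1 below is only an illustrative value; the number of trees is left at its default of 100):

# Configure the scikit-learn wrapper with an explicit learning rate
# (0.1 is an example value, not a recommendation).
from xgboost import XGBClassifier

model = XGBClassifier(learning_rate=0.1)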

We can use the grid search capability in scikit-learn to evaluate the effect on logarithmic loss of training a gradient boosting model with different learning rate values.

We will hold the number of trees constant at the default of 100 and evaluate a suite of standard values for the learning rate on the Otto dataset.

learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]

There are 6 variations of learning rate to be tested, and each variation will be evaluated using 10-fold cross validation, meaning that there is a total of 6×10 or 60 XGBoost models to be trained and evaluated.

The log loss for each learning rate will be printed as well as the value that resulted in the best performance.

# XGBoost on Otto dataset, tune learning_rate
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
# load data
data = read_csv('train.csv')
dataset = data.values
# split data into X and y (cast features to float; the raw array is
# object-typed because the string target column is included)
X = dataset[:, 0:94].astype(float)
y = dataset[:, 94]
# encode string class values as integers
label_encoded_y = LabelEncoder().fit_transform(y)
# grid search over a suite of learning rate values
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
result = grid_search.fit(X, label_encoded_y)
# summarize results
print("Best: %f using %s" % (result.best_score_, result.best_params_))
means = result.cv_results_['mean_test_score']
stdevs = result.cv_results_['std_test_score']
params = result.cv_results_['params']
for mean, stdev, param in zip(means, stdevs, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
# plot learning rate against (negated) log loss
pyplot.errorbar(learning_rate, means, yerr=stdevs)
pyplot.title("XGBoost learning_rate vs Log Loss")
pyplot.xlabel('learning_rate')
pyplot.ylabel('Log Loss')
pyplot.savefig('learning_rate.png')

Running this example prints the best result as well as the log loss for each of the evaluated learning rates.

Best: -0.001156 using {'learning_rate': 0.2}
-2.155497 (0.000081) with: {'learning_rate': 0.0001}
-1.841069 (0.000716) with: {'learning_rate': 0.001}
-0.597299 (0.000822) with: {'learning_rate': 0.01}
-0.001239 (0.001730) with: {'learning_rate': 0.1}
-0.001156 (0.001684) with: {'learning_rate': 0.2}
-0.001158 (0.001666) with: {'learning_rate': 0.3}

Interestingly, we can see that the best learning rate was 0.2.

This is a high learning rate, and it suggests that perhaps the default number of trees (100) is too low and needs to be increased.

We can also plot the effect of the learning rate on the (inverted) log loss scores, although the log10-like spread of the chosen learning_rate values means that most are squashed down the left-hand side of the plot near zero.


Plot: Tune Learning Rate in XGBoost (learning_rate vs. log loss)

Next, we will look at varying the number of trees whilst varying the learning rate.

Tuning Learning Rate and the Number of Trees in XGBoost

Smaller learning rates generally require more trees to be added to the model.
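
As a rough sketch of how such a joint search could be set up with the same scikit-learn tools used above (the particular grids of values below are illustrative, not necessarily the ones evaluated in this tutorial):

# Illustrative joint grid over the number of trees and the learning rate.
# The value grids are examples only; X and label_encoded_y are prepared as above.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

model = XGBClassifier()
param_grid = {
    'n_estimators': [100, 200, 300, 400, 500],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1],
}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
result = grid_search.fit(X, label_encoded_y)
print("Best: %f using %s" % (result.best_score_, result.best_params_))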
