Hands-On Gradient Boosting with XGBoost and scikit-learn

By Corey Wade
Overview of this book

XGBoost is an industry-proven, open-source software library that provides a gradient boosting framework for scaling billions of data points quickly and efficiently. The book introduces machine learning and XGBoost in scikit-learn before building up to the theory behind gradient boosting. You’ll cover decision trees and analyze bagging in the machine learning context, learning hyperparameters that extend to XGBoost along the way. You’ll build gradient boosting models from scratch and extend gradient boosting to big data while recognizing speed limitations using timers. Details in XGBoost are explored with a focus on speed enhancements and deriving parameters mathematically. With the help of detailed case studies, you’ll practice building and fine-tuning XGBoost classifiers and regressors using scikit-learn and the original Python API. You'll leverage XGBoost hyperparameters to improve scores, correct missing values, scale imbalanced datasets, and fine-tune alternative base learners. Finally, you'll apply advanced XGBoost techniques like building non-correlated ensembles, stacking models, and preparing models for industry deployment using sparse matrices, customized transformers, and pipelines. By the end of the book, you’ll be able to build high-performing machine learning models using XGBoost with minimal errors and maximum speed.

Predicting regression

Machine learning algorithms aim to predict the values of one output column using data from one or more input columns. The predictions rely on mathematical equations determined by the general class of machine learning problems being addressed. Most supervised learning problems are classified as regression or classification. In this section, machine learning is introduced in the context of regression.

Predicting bike rentals

In the bike rentals dataset, df_bikes['cnt'] is the number of bike rentals on a given day. Predicting this column would be of great use to a bike rental company. Our problem is to predict the correct number of bike rentals on a given day based on data such as whether the day is a holiday or a working day, the forecasted temperature, humidity, windspeed, and so on.

According to the dataset, df_bikes['cnt'] is the sum of df_bikes['casual'] and df_bikes['registered']. If df_bikes['registered'] and df_bikes['casual'] were included as input columns, predictions would always be 100% accurate since these columns would always sum to the correct result. Although perfect predictions are ideal in theory, it makes no sense to include input columns that would be unknown in reality.
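You can confirm this relationship directly (an optional check, not part of the original code, assuming df_bikes has already been loaded and cleaned):

print((df_bikes['casual'] + df_bikes['registered'] == df_bikes['cnt']).all())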

All current columns may be used to predict df_bikes['cnt'] except for 'casual' and 'registered', as explained previously. Drop the 'casual' and 'registered' columns using the .drop method as follows:

df_bikes = df_bikes.drop(['casual', 'registered'], axis=1)

The dataset is now ready.

Saving data for future use

The bike rentals dataset will be used multiple times in this book. Instead of running this notebook each time to perform data wrangling, you can export the clean dataset to a CSV file for future use: 

df_bikes.to_csv('bike_rentals_cleaned.csv', index=False)

The index=False parameter prevents the DataFrame index from being written as an additional column.
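Later notebooks can then load the cleaned dataset directly. For example, assuming the CSV file sits in the working directory:

import pandas as pd
df_bikes = pd.read_csv('bike_rentals_cleaned.csv')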

Declaring predictor and target columns

Machine learning works by performing mathematical operations on each of the predictor columns (input columns) to determine the target column (output column).

It's standard to denote the predictor columns with a capital X and the target column with a lowercase y. Since our target column is the last column, the data may be split into predictor and target columns by slicing with index notation:

X = df_bikes.iloc[:,:-1]
y = df_bikes.iloc[:,-1]

In iloc, the index before the comma selects rows and the index after the comma selects columns. The first colon, :, means that all rows are included. After the comma, :-1 means start at the first column and go up to the last column without including it. In the second statement, -1 after the comma selects the last column only.
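As an optional sanity check (not part of the original code), you can confirm the split by inspecting the shapes; X should have one fewer column than df_bikes, and y should be a single column of values:

print(X.shape, y.shape)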

Understanding regression

Predicting the number of bike rentals, in reality, could result in any non-negative integer. When the target column takes values from an unbounded numerical range like this, the machine learning problem is classified as regression.

The most common regression algorithm is linear regression. Linear regression multiplies each predictor column by a coefficient (also called a weight) and adds the results, plus an intercept, to predict the target column. The coefficients are chosen to minimize the error on the training data (gradient descent is one way to do this, although scikit-learn's LinearRegression solves it directly with least squares). The predictions of linear regression may be any real number.
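To make the weighted-sum idea concrete, here is a minimal sketch with made-up numbers (the coefficients, intercept, and input row are hypothetical, not taken from the bike rentals data):

import numpy as np

# Hypothetical coefficients (weights) and intercept for three predictor columns
coefficients = np.array([2.0, -1.5, 0.5])
intercept = 10.0

# One row of input values
row = np.array([3.0, 4.0, 8.0])

# Prediction = intercept + sum of (coefficient * value)
prediction = intercept + np.dot(coefficients, row)
print(prediction)  # 10 + 6.0 - 6.0 + 4.0 = 14.0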

Before running linear regression, we must split the data into a training set and a test set. The model is fit on the training set, using the target column to minimize the error. After the model is built, it's scored against the test set.

The importance of holding out a test set to score the model cannot be overstated. In the world of big data, it's common to overfit the data to the training set because there are so many data points to train on. Overfitting is generally bad because the model adjusts itself too closely to outliers, unusual instances, and temporary trends. Strong machine learning models strike a nice balance between generalizing well to new data and accurately picking up on the nuances of the data at hand, a concept explored in detail in Chapter 2, Decision Trees in Depth.

Accessing scikit-learn

All machine learning models here are handled through scikit-learn. Scikit-learn's range, ease of use, and computational power place it among the most widely used machine learning libraries in the world.

Import train_test_split and LinearRegression from scikit-learn as follows:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Next, split the data into the training set and test set:

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

Note the random_state=2 parameter. Whenever you see random_state=2, this means that you are choosing the seed of a pseudo-random number generator to ensure reproducible results.
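By default, train_test_split holds back 25% of the rows for the test set. If you prefer to make the split explicit, you can pass test_size yourself (equivalent here to the call above):

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2)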

Silencing warnings

Before building your first machine learning model, silence all warnings. Scikit-learn includes warnings to notify users of future changes. In general, it's not advisable to silence warnings, but since this code has been tested, silencing them saves space in your Jupyter Notebook.

Warnings may be silenced as follows:

import warnings
warnings.filterwarnings('ignore')

It's time to build your first model.

Modeling linear regression

A linear regression model may be built with the following steps:

  1. Initialize a machine learning model:

    lin_reg = LinearRegression()
  2. Fit the model on the training set. This is where the machine learning model is built. Note that X_train holds the predictor columns and y_train is the target column.

    lin_reg.fit(X_train, y_train)
  3. Make predictions for the test set. The predictions of X_test, the predictor columns in the test set, are stored as y_pred using the .predict method on lin_reg:

    y_pred = lin_reg.predict(X_test)
  4. Compare the predictions with the test set. Scoring the model requires a basis of comparison. The standard for linear regression is the root mean squared error (RMSE). The RMSE requires two pieces: mean_squared_error, the mean of the squared differences between predicted and actual values, and the square root, to keep the units the same as the target. mean_squared_error may be imported from scikit-learn, and the square root may be taken with Numerical Python, popularly known as NumPy, the fast numerical library that pandas is built on.

  5. Import mean_squared_error and NumPy, and then compute the mean squared error and take the square root:

    from sklearn.metrics import mean_squared_error
    import numpy as np
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
  6. Print your results:

    print("RMSE: %0.2f" % (rmse))

    The outcome is as follows:

    RMSE: 898.21

    Here is a screenshot of all the code to build your first machine learning model:

Figure 1.10 – Code to build your machine learning model
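The screenshot itself is not reproduced here; for reference, the code from steps 1 to 6 gathered into one cell is as follows:

from sklearn.metrics import mean_squared_error
import numpy as np

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("RMSE: %0.2f" % (rmse))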

It's hard to know whether an error of 898 rentals is good or bad without knowing the expected range of rentals per day.

The .describe() method may be used on the df_bikes['cnt'] column to obtain the range and more:

df_bikes['cnt'].describe()

Here is the output:

count     731.000000
mean     4504.348837
std      1937.211452
min        22.000000
25%      3152.000000
50%      4548.000000
75%      5956.000000
max      8714.000000
Name: cnt, dtype: float64

With a range of 22 to 8714, a mean of 4504, and a standard deviation of 1937, an RMSE of 898 isn't bad, but it's not great either.
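For additional context (an optional check, not in the original text), a naive baseline that predicts the mean rental count for every day gives an RMSE close to the standard deviation of roughly 1937, so linear regression's 898 is a clear improvement:

import numpy as np
from sklearn.metrics import mean_squared_error

# Naive baseline: predict the training-set mean for every test row
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print("Baseline RMSE: %0.2f" % baseline_rmse)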

XGBoost

Linear regression is one of many algorithms that may be used to solve regression problems. It's possible that other regression algorithms will produce better results. The general strategy is to experiment with different regressors to compare scores. Throughout this book, you will experiment with a wide range of regressors, including decision trees, random forests, gradient boosting, and the focus of this book, XGBoost.

A comprehensive introduction to XGBoost will be provided later in this book. For now, note that XGBoost includes a regressor, called XGBRegressor, that may be used on any regression dataset, including the bike rentals dataset that has just been scored. Let's now use the XGBRegressor to compare results on the bike rentals dataset with linear regression.

You should have already installed XGBoost in the preface. If you have not done so, install XGBoost now (for example, with pip install xgboost).

XGBRegressor

After XGBoost has been installed, the XGBoost regressor may be imported as follows:

from xgboost import XGBRegressor

The general steps for building XGBRegressor are the same as with LinearRegression. The only difference is to initialize XGBRegressor instead of LinearRegression:

  1. Initialize a machine learning model:

    xg_reg = XGBRegressor()
  2. Fit the model on the training set. If you get some warnings from XGBoost here, don't worry:

    xg_reg.fit(X_train, y_train)
  3. Make predictions for the test set:

    y_pred = xg_reg.predict(X_test)
  4. Compare the predictions with the test set:

    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
  5. Print your results:

    print("RMSE: %0.2f" % (rmse))

    The output is as follows:

    RMSE: 705.11

XGBRegressor performs substantially better!

The reasons why XGBoost often outperforms other algorithms will be explored in Chapter 5, XGBoost Unveiled.

Cross-validation

One test score is not reliable because splitting the data into different training and test sets would give different results. In effect, splitting the data into a training set and a test set is arbitrary, and a different random_state will give a different RMSE.

One way to address the score discrepancies between different splits is k-fold cross-validation. The idea is to split the data multiple times into different training sets and test sets, and then to take the mean of the scores. The number of splits, called folds, is denoted by k. It's standard to use k = 3, 4, 5, or 10 splits.

Here is a visual description of cross-validation:

Figure 1.11 – Cross-validation

(Redrawn from https://commons.wikimedia.org/wiki/File:K-fold_cross_validation_EN.svg)

Cross-validation works by fitting a machine learning model on the first training set and scoring it against the first test set. A different training set and test set are provided for the second split, resulting in a new machine learning model with its own score. A third split results in a new model and scores it against another test set.

There is going to be overlap in the training sets, but not the test sets.

Choosing the number of folds is flexible and depends on the data. Five folds is standard because 20% of the data is held back as the test set each time. With 10 folds, only 10% of the data is held back; on the other hand, 90% of the data is available for training, and the mean score is less vulnerable to outliers. For a smaller dataset, three folds may work better.

At the end, there will be k different scores evaluating the model against k different test sets. Taking the mean score of the k folds gives a more reliable score than any single fold.
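To make the mechanism concrete, here is a minimal sketch (not from the book's code) of the loop that k-fold cross-validation performs with five folds; the cross_val_score function introduced next wraps this loop into a single call:

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

kf = KFold(n_splits=5)
fold_rmses = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh model on this fold's training rows
    model = LinearRegression()
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    # Score it against this fold's test rows
    preds = model.predict(X.iloc[test_idx])
    fold_rmses.append(np.sqrt(mean_squared_error(y.iloc[test_idx], preds)))
print('Mean RMSE over 5 folds: %0.2f' % np.mean(fold_rmses))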

cross_val_score is a convenient way to implement cross-validation. cross_val_score takes a machine learning algorithm as input, along with the predictor and target columns, with optional additional parameters that include a scoring metric and the desired number of folds.

Cross-validation with linear regression

Let's use cross-validation with LinearRegression.

First, import cross_val_score from sklearn.model_selection:

from sklearn.model_selection import cross_val_score

Now use cross-validation to build and score a machine learning model in the following steps:

  1. Initialize a machine learning model:

    model = LinearRegression()
  2. Implement cross_val_score with the model, X, y, scoring='neg_mean_squared_error', and the number of folds, cv=10, as input:

    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)

    Tip

    Why scoring='neg_mean_squared_error'? Scikit-learn is designed to select the highest score when training models. This works well for accuracy, but not for errors when the lowest is best. By taking the negative of each mean squared error, the lowest ends up being the highest. This is compensated for later with rmse = np.sqrt(-scores), so the final results are positive.

  3. Find the RMSE by taking the square root of the negative scores:

    rmse = np.sqrt(-scores)
  4. Display the results:

    print('Reg rmse:', np.round(rmse, 2))
    print('RMSE mean: %0.2f' % (rmse.mean()))

    The output is as follows:

    Reg rmse: [ 504.01  840.55 1140.88  728.39  640.2   969.95 
    1133.45 1252.85 1084.64  1425.33]
    RMSE mean: 972.02

Linear regression has a mean RMSE of 972.02 across the 10 folds, compared with the 898.21 obtained from the single train-test split. The point here is not whether the score is better or worse. The point is that it's a better estimate of how linear regression will perform on unseen data.

Using cross-validation is always recommended for a better estimate of the score.

About the print function

When running your own machine learning code, explicit print calls are often unnecessary because Jupyter displays the last expression in a cell automatically, but they are helpful when you want to print multiple lines and format the output as shown here.

Cross-validation with XGBoost

Now let's use cross-validation with XGBRegressor. The steps are the same, except for initializing the model:

  1. Initialize a machine learning model:

    model = XGBRegressor()
  2. Implement cross_val_score with the model, X, y, scoring, and the number of folds, cv, as input:

    scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)
  3. Find the RMSE by taking the square root of the negative scores:

    rmse = np.sqrt(-scores)
  4. Print the results:

    print('Reg rmse:', np.round(rmse, 2))
    print('RMSE mean: %0.2f' % (rmse.mean()))

    The output is as follows:

    Reg rmse: [ 717.65  692.8   520.7   737.68  835.96 1006.24  991.34  747.61  891.99 1731.13]
    RMSE mean: 887.31

XGBRegressor wins again, besting linear regression by about 10%.