Hands-On Gradient Boosting with XGBoost and scikit-learn

By: Corey Wade

Overview of this book

XGBoost is an industry-proven, open-source software library that provides a gradient boosting framework for scaling billions of data points quickly and efficiently. The book introduces machine learning and XGBoost in scikit-learn before building up to the theory behind gradient boosting. You’ll cover decision trees and analyze bagging in the machine learning context, learning hyperparameters that extend to XGBoost along the way. You’ll build gradient boosting models from scratch and extend gradient boosting to big data while recognizing speed limitations using timers. Details in XGBoost are explored with a focus on speed enhancements and deriving parameters mathematically. With the help of detailed case studies, you’ll practice building and fine-tuning XGBoost classifiers and regressors using scikit-learn and the original Python API. You'll leverage XGBoost hyperparameters to improve scores, correct missing values, scale imbalanced datasets, and fine-tune alternative base learners. Finally, you'll apply advanced XGBoost techniques like building non-correlated ensembles, stacking models, and preparing models for industry deployment using sparse matrices, customized transformers, and pipelines. By the end of the book, you’ll be able to build high-performing machine learning models using XGBoost with minimal errors and maximum speed.
Table of Contents (15 chapters)

Section 1: Bagging and Boosting
Section 2: XGBoost
Section 3: Advanced XGBoost

Setting up your coding environment

The following table summarizes the essential software used in this book.

Here are instructions for installing this software on your system.

Anaconda

Anaconda is the recommended way to install the data science libraries that you will need in this book, along with Jupyter Notebooks, scikit-learn (sklearn), and Python, all in one step.

Here are the steps to install Anaconda on your computer as of 2020:

  1. Go to https://www.anaconda.com/products/individual.

  2. Click Download on the following screen. This does not start the download yet; instead, it presents you with a variety of options (see step 3):

    Figure 0.1 – Preparing to download Anaconda

  3. Select your installer. The 64-Bit Graphical Installer is recommended for Windows and Mac. Make sure that you select from the top two rows under Python 3.7 since Python 3.7 is used throughout this book:

    Figure 0.2 – Anaconda Installers

  4. After your download begins, continue with the prompts on your computer to complete the installation:

    Warning for Mac users

    If you run into the error "You cannot install Anaconda3 in this location", do not panic. Just click on the highlighted row, Install for me only, and the Continue button will become available.

Figure 0.3 – Warning for Mac Users – Just click Install for me only then Continue

Using Jupyter notebooks

Now that you have Anaconda installed, you may open a Jupyter notebook to use Python 3.7. Here are the steps to open a Jupyter notebook:

  1. Click on Anaconda-Navigator on your computer.

  2. Click Launch under Jupyter Notebook as shown in the following screenshot:

    Figure 0.4 – Anaconda home screen

    This should open a Jupyter notebook in a browser window. While Jupyter notebooks appear in web browsers for convenience, they are run on your personal computer, not online. Google Colab notebooks are an acceptable online alternative, but in this book, Jupyter notebooks are used exclusively.

  3. Select Python 3 from the New tab present on the right side of your Jupyter notebook as shown in the following screenshot:

Figure 0.5 – Jupyter notebook home screen

This should bring you to the following screen:

Figure 0.6 – Inside a Jupyter notebook

Congratulations! You are now ready to run Python code! Just type anything in the cell, such as print('hello xgboost!'), and press Shift + Enter to run the code.
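
As a quick sanity check, you may also confirm which Python interpreter your notebook is running. The following is a minimal sketch rather than an official step. It uses only the standard library and assumes that any Python 3 version at or above 3.7 is acceptable for following along:

```python
import sys

# Path of the Python interpreter behind this notebook kernel.
# With Anaconda, this path is expected to point inside your Anaconda folder.
print("Interpreter:", sys.executable)

# Major and minor version of the running Python.
major, minor = sys.version_info[:2]
print(f"Python {major}.{minor}")

# The book uses Python 3.7; treating newer 3.x versions as fine is an assumption.
if (major, minor) < (3, 7):
    print("This kernel is older than Python 3.7; consider updating Anaconda.")
```

If the interpreter path does not point to your Anaconda installation, the notebook may be using a different Python than the one you just installed.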

Troubleshooting Jupyter notebooks

If you have trouble running or installing Jupyter notebooks, please visit Jupyter's official troubleshooting guide: https://jupyter-notebook.readthedocs.io/en/stable/troubleshooting.html.

XGBoost

At the time of writing, XGBoost is not yet included in Anaconda so it must be installed separately.

Here are the steps for installing XGBoost on your computer:

  1. Go to https://anaconda.org/conda-forge/xgboost. Here is what you should see:

    Figure 0.7 – Anaconda recommendations to install XGBoost

  2. Copy the first line of code in the preceding screenshot, as shown here:

    Figure 0.8 – Package installation

  3. Open the Terminal on your computer.

    If you do not know where your Terminal is located, search for Terminal on a Mac or Windows Terminal on Windows.

  4. Paste the following code into your Terminal, press Enter, and follow any prompts:

    conda install -c conda-forge xgboost

  5. Verify that the installation has worked by opening a new Jupyter notebook as outlined in the previous section. Then enter import xgboost and press Shift + Enter. You should see the following:

Figure 0.9 – Successful import of XGBoost in a Jupyter notebook

If you got no errors, congratulations! You now have all the necessary technical requirements to run code in this book.
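
If you prefer a friendlier message than a raw traceback, the import check can be wrapped in a small helper. This is a sketch only: check_import is a hypothetical helper name introduced for illustration, not part of XGBoost or Anaconda, and it handles only the case where the package cannot be found.

```python
import importlib


def check_import(module_name):
    """Return the module's version string, or None if it cannot be imported.

    check_import is a hypothetical helper written for this example.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Some modules do not expose __version__, so fall back to "unknown".
    return getattr(module, "__version__", "unknown")


version = check_import("xgboost")
if version is None:
    print("xgboost is not installed in this environment.")
else:
    print("xgboost", version, "imported successfully.")
```

Other kinds of errors raised during import are deliberately not caught, so that their full messages remain visible for troubleshooting.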

Tip

If you received errors trying to set up your coding environment, please go back through the previous steps, or consider reviewing the Anaconda error documentation presented here: https://docs.anaconda.com/anaconda/user-guide/troubleshooting/. Previous users of Anaconda should update Anaconda by entering conda update conda in the Terminal. If you have trouble installing XGBoost, see the official documentation at https://xgboost.readthedocs.io/en/latest/build.html.

Versions

Here is code that you may run in a Jupyter notebook to see what versions of the following software you are using:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import sklearn; print("Scikit-Learn", sklearn.__version__)
import xgboost; print("XGBoost", xgboost.__version__)

Here are the versions used to generate code in this book:

Darwin-19.6.0-x86_64-i386-64bit
Python 3.7.7 (default, Mar 26 2020, 10:32:53) 
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.19.1
SciPy 1.5.2
Scikit-Learn 0.23.2
XGBoost 1.2.0

It's okay if your versions differ from ours. Software is updated all the time, and you may obtain better results by using newer versions when they are released. If you are using older versions, however, we recommend updating through Anaconda by running conda update conda in the terminal. You may also run conda update xgboost if you previously installed an older version of XGBoost from conda-forge as outlined in the previous section.
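
To see at a glance whether your libraries are older or newer than the ones listed above, a comparison along the following lines may help. It is a rough sketch: BOOK_VERSIONS, version_tuple, and compare_to_book are hypothetical names introduced here, and the version parsing is deliberately simple (it keeps only the leading numeric parts, so a release such as 2.0.0rc1 is treated as 2.0.0).

```python
import importlib

# Library versions reported for the book's environment in the listing above.
BOOK_VERSIONS = {
    "numpy": "1.19.1",
    "scipy": "1.5.2",
    "sklearn": "0.23.2",
    "xgboost": "1.2.0",
}


def version_tuple(version):
    """Turn '1.19.1' into (1, 19, 1), keeping only leading numeric parts."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric piece, such as 'post0'
        parts.append(int(digits))
    return tuple(parts)


def compare_to_book(installed, book):
    """Return 'older', 'same', or 'newer' for installed relative to book."""
    a, b = version_tuple(installed), version_tuple(book)
    # Pad with zeros so that '1.2' and '1.2.0' compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    if a < b:
        return "older"
    if a > b:
        return "newer"
    return "same"


for name, book_version in BOOK_VERSIONS.items():
    try:
        module = importlib.import_module(name)
    except ImportError:
        print(f"{name}: not installed")
        continue
    installed = module.__version__
    status = compare_to_book(installed, book_version)
    print(f"{name}: {installed} ({status}; book used {book_version})")
```

Any library reported as older is a candidate for updating before you continue.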

Accessing code files

If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!