Book Image

The Kaggle Workbook

By : Konrad Banachewicz, Luca Massaron
5 (1)
Book Image

The Kaggle Workbook

5 (1)
By: Konrad Banachewicz, Luca Massaron

Overview of this book

More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they’ve come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist. In this book, you’ll get up close and personal with four extensive case studies based on past Kaggle competitions. You’ll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering. You can use this workbook as a supplement alongside The Kaggle Book or on its own alongside resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.
Table of Contents (7 chapters)

Building a LightGBM submission

Our exercise starts by working out a solution based on LightGBM. You can find the code already set for execution using Kaggle Notebooks at this address: Although we made the code readily available, we instead suggest you type or copy the code directly from the book and execute it cell by cell; understanding what each line of code does and personalizing the solution can make it perform even better.

When using LightGBM you don’t have to, and should not, turn on any of the GPU or TPU accelerators. GPU acceleration could be helpful only if you have installed the GPU version of LightGBM. You can find working hints on how to install such a GPU-accelerated version on Kaggle Notebooks using this example:

We start by importing key packages (NumPy, pandas, and Optuna for hyperparameter optimization, LightGBM, and some utility functions). We also define a configuration class and instantiate it. We will discuss the parameters defined in the configuration class during the exploration of the code as we progress. What is important to remark here is that by using a class containing all your parameters it will be easier for you to modify them in a consistent way along with the code. In the heat of competition, it is easy to forget to update a parameter that is referred to in multiple places in the code, and it is always difficult to set the parameters when they are dispersed among cells and functions. A configuration class can save you a lot of effort and spare you mistakes along the way:

import numpy as np
import pandas as pd
import optuna
import lightgbm as lgb
from path import Path
from sklearn.model_selection import StratifiedKFold
class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    optuna_lgb = False
    n_estimators = 1500
    early_stopping_round = 150
    cv_folds = 5
    random_state = 0
    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'learning_rate': 0.01,
              'max_bin': 25,
              'num_leaves': 31,
              'min_child_samples': 1500,
              'colsample_bytree': 0.7,
              'subsample_freq': 1,
              'subsample': 0.7,
              'reg_alpha': 1.0,
              'reg_lambda': 1.0,
              'verbosity': 0,
              'random_state': 0}
config = Config()

The next step requires importing the training, test, and sample submission datasets. We do this using the pandas read_csv function. We also set the index of the uploaded DataFrames to the identifier (the id column) of each data example.

Since features that belong to similar groupings are tagged (using ind, reg, car, and calc tags in their labels) and also binary and categorical features are easy to locate (they use the bin and cat tags, respectively, in their labels), we can enumerate them and record them in lists:

train = pd.read_csv(config.input_path / 'train.csv', index_col='id')
test = pd.read_csv(config.input_path / 'test.csv', index_col='id')
submission = pd.read_csv(config.input_path / 'sample_submission.csv', index_col='id')
calc_features = [feat for feat in train.columns if "_calc" in feat]
cat_features = [feat for feat in train.columns if "_cat" in feat]

Then, we just extract the target (a binary target of 0s and 1s) and remove it from the training dataset:

target = train["target"]
train = train.drop("target", axis="columns")

At this point, as pointed out by Michael Jahrer, we can drop the calc features. This idea has recurred a lot during the competition (, especially in notebooks, because it could be empirically verified that dropping them improved both the local cross-validation score and the public leaderboard score (as a general rule, it’s important to keep track of both during feature selection). In addition, they also performed poorly in gradient boosting models (their importance is always below the average).

We can argue that, since they are engineered features, they do not contain new information in respect of their original features, but they just add noise to any model trained that comprises them:

train = train.drop(calc_features, axis="columns")
test = test.drop(calc_features, axis="columns")

Exercise 3

Based on the suggestions provided in The Kaggle Book on page 220 (Using feature importance to evaluate your work), as an exercise:

  1. Code your own feature selection notebook for this competition.
  2. Check what features should be kept and what should be discarded.

Exercise Notes (write down any notes or workings that will help you):

Categorical features are instead one-hot encoded. Because the same labels are present in the training and test datasets (the result of a careful train/test split between the two arranged by the Porto Seguro team), instead of the usual scikit-learn OneHotEncoder ( we are going to use the pandas get_dummies function ( Since the pandas function may produce different encodings if the features and their levels differ from train to test set, we assert a check on the one-hot encoding, resulting in the same for both:

train = pd.get_dummies(train, columns=cat_features)
test = pd.get_dummies(test, columns=cat_features)

One-hot encoding the categorical features completes the data processing stage. We proceed to define our evaluation metric, the normalized Gini coefficient, as previously discussed. We will use the extremely fast Gini computation code proposed by CPMP, as mentioned before.

Since we are going to use a LightGBM model, we have to add a suitable wrapper (gini_lgb) to return to the GBM algorithm the evaluation of the training and the validation datasets in a form that can work with it (see evaluation function should accept two parameters: preds, eval_data, and return (eval_name, eval_result, is_higher_better) or list of such tuples):

from numba import jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini
def gini_lgb(y_true, y_pred):
    eval_name = 'normalized_gini_coef'
    eval_result = eval_gini(y_true, y_pred)
    is_higher_better = True
    return eval_name, eval_result, is_higher_better

As for the training parameters, we found that the parameters suggested by Michael Jahrer in his post ( work perfectly.

You may also try to come up with the same parameters or similar performing ones by performing a search by Optuna ( if you set the optuna_lgb flag to True in the Config class. Here the optimization tries to find the best values for key parameters, such as the learning rate and the regularization parameters, based on a five-fold cross-validation test on training data. To speed up things, early stopping on the validation itself is taken into account (which, we are aware, could actually advantage picking some parameters that can better overfit the validation fold – a good alternative could be to remove the early stopping callback and keep a fixed number of rounds for the training):

if config.optuna_lgb:
    def objective(trial):
        params = {
    'learning_rate': trial.suggest_float("learning_rate", 0.01, 1.0),
    'num_leaves': trial.suggest_int("num_leaves", 3, 255),
    'min_child_samples': trial.suggest_int("min_child_samples", 
                                           3, 3000),
    'colsample_bytree': trial.suggest_float("colsample_bytree", 
                                            0.1, 1.0),
    'subsample_freq': trial.suggest_int("subsample_freq", 0, 10),
    'subsample': trial.suggest_float("subsample", 0.1, 1.0),
    'reg_alpha': trial.suggest_loguniform("reg_alpha", 1e-9, 10.0),
    'reg_lambda': trial.suggest_loguniform("reg_lambda", 1e-9, 10.0),
        score = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, 
        for train_idx, valid_idx in skf.split(train, target):
            X_train = train.iloc[train_idx]
            y_train = target.iloc[train_idx]
            X_valid = train.iloc[valid_idx] 
            y_valid = target.iloc[valid_idx]
            model = lgb.LGBMClassifier(**params,
  , y_train, 
                      eval_set=[(X_valid, y_valid)],  
                      eval_metric=gini_lgb, callbacks=callbacks)
        return np.mean(score)
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=300)
    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)
    params = {'objective': 'binary',
              'boosting_type': 'gbdt',
              'verbosity': 0,
              'random_state': 0}
    params = config.params

During the competition, Tilii tested feature elimination using Boruta ( You can find his kernel here: As you can check, there is no calc_feature considered a confirmed feature by Boruta.

Exercise 4

In The Kaggle Book, we explain hyperparameter optimization (page 241 onward) and provide some key hyperparameters for the LightGBM model.

As an exercise:

Try to improve the hyperparameter search by Optuna by reducing or increasing the explored parameters where you deem it necessary, and also try alternative optimization methods, such as the random search or the halving search from scikit-learn (pages 245–246).

Exercise Notes (write down any notes or workings that will help you):

Once we have got our best parameters (or we simply try Jahrer’s ones), we are ready to train and predict. Our strategy, as suggested by the best solution, is to train a model on each cross-validation fold and use that fold to contribute to an average of test predictions. The snippet of code will produce both the test predictions and the out-of-fold predictions on the training dataset, which will be useful for figuring out how to ensemble the results:

preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for idx, (train_idx, valid_idx) in enumerate(skf.split(train, 
    print(f"CV fold {idx}")
    X_train, y_train = train.iloc[train_idx], target.iloc[train_idx]
    X_valid, y_valid = train.iloc[valid_idx], target.iloc[valid_idx]
    model = lgb.LGBMClassifier(**params,
               lgb.log_evaluation(period=100, show_stdv=False)]
                                                                                      , y_train, 
              eval_set=[(X_valid, y_valid)], 
              eval_metric=gini_lgb, callbacks=callbacks)
    preds += (model.predict_proba(test,  
              / skf.n_splits)
    oof[valid_idx] = model.predict_proba(X_valid, 

The model training shouldn’t take too long. In the end you can get the reported Normalized Gini Coefficient obtained during the cross-validation procedure:

print(f"LightGBM CV normalized Gini coefficient: 

The results are quite encouraging because the average score is 0.289 and the standard deviation of the values is quite small:

LightGBM CV Gini Normalized Score: 0.289 (0.015)

All that is left is to save the out-of-fold and test predictions as a submission and to verify the results on the public and private leaderboards:

submission['target'] = preds
oofs = pd.DataFrame({'id':train_index, 'target':oof})
oofs.to_csv('lgb_oof.csv', index=False)

The obtained public score should be around 0.28442. The associated private score is about 0.29121, placing you in the 29th position on the final leaderboard. A quite good result, but we still have to blend it with a different model, a neural network.

Bagging the training set (i.e., taking multiple bootstraps of the training data and training multiple models based on the bootstraps) should increase the performance, although, as Michael Jahrer himself noted in his post, not all that much.