Book Image

The Kaggle Workbook

By : Konrad Banachewicz, Luca Massaron
5 (1)
Book Image

The Kaggle Workbook

5 (1)
By: Konrad Banachewicz, Luca Massaron

Overview of this book

More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they’ve come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist. In this book, you’ll get up close and personal with four extensive case studies based on past Kaggle competitions. You’ll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering. You can use this workbook as a supplement alongside The Kaggle Book or on its own alongside resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.
Table of Contents (7 chapters)

Setting up a denoising autoencoder and a DNN

The next step is to set up a denoising autoencoder (DAE) and a neural network that can learn and predict from it. You can find the running code in this notebook: The notebook can be run in GPU mode (it will therefore be speedier if you turn on the accelerators in the Kaggle Notebook), but it can also run in CPU mode with some slight modifications.

You can read more about denoising autoencoders being used in Kaggle competitions in The Kaggle Book, from page 230 onward.

Actually there are no examples that reproduce Michael Jahrer’s approach in the competition using DAEs, so we took an example from a TensorFlow implementation in another competition coded by OsciiArt (

Here we start by importing all the necessary packages, especially TensorFlow and Keras. Since we are going to create multiple neural networks, we point out to TensorFlow not to use all the GPU memory available by using the experimental set_memory_growth command. This will help us avoid having memory overflow problems along the way. We also record the Leaky ReLu activation as a custom one, so we can just mention it as an activation by a string in the Keras layers:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from path import Path
import gc
import optuna
from sklearn.model_selection import StratifiedKFold
from scipy.special import erfinv
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dense, BatchNormalization, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.regularizers import l2
from tensorflow.keras.metrics import AUC
from tensorflow.keras.utils import get_custom_objects
from tensorflow.keras.layers import Activation, LeakyReLU
get_custom_objects().update({'leaky-relu': Activation(LeakyReLU(alpha=0.2))})

Related to our intention of creating multiple neural networks without running out of memory, we also define a simple function for cleaning the memory in GPU and removing models that are no longer needed:

def gpu_cleanup(objects):
    if objects:

We also reconfigure the Config class to take into account multiple parameters related to the denoising autoencoder and the neural network. As previously stated about the LightGBM, having all the parameters in a unique place simplifies the process when you have to change them in a consistent way:

class Config:
    input_path = Path('../input/porto-seguro-safe-driver-prediction')
    dae_batch_size = 128
    dae_num_epoch = 50
    dae_architecture = [1500, 1500, 1500]
    reuse_autoencoder = False
    batch_size = 128
    num_epoch = 150
    units = [64, 32]
    cv_folds = 5
    nas = False
    random_state = 0
config = Config()

As shown previously, we load the datasets and proceed to process the features by removing the calc features and one-hot encoding the categorical ones. We leave missing cases valued at -1, as Michael Jahrer pointed out in his solution:

train = pd.read_csv(config.input_path / 'train.csv', index_col='id')
test = pd.read_csv(config.input_path / 'test.csv', index_col='id')
submission = pd.read_csv(config.input_path / 'sample_submission.csv', index_col='id')
calc_features = [feat for feat in train.columns if "_calc" in feat]
cat_features = [feat for feat in train.columns if "_cat" in feat]
target = train["target"]
train = train.drop("target", axis="columns")
train = train.drop(calc_features, axis="columns")
test = test.drop(calc_features, axis="columns")
train = pd.get_dummies(train, columns=cat_features)
test = pd.get_dummies(test, columns=cat_features)

However, since we are dealing with neural networks, we have to normalize all the features that are not binary or one-hot-encoded categorical. Normalization implies rescaling (setting a limited range of values) and centering (your distribution will be centered to a certain value, usually zero)

Normalization will allow the optimization algorithm of both the autoencoder and the neural network to converge to a good solution faster because it reduces the danger of oscillations of the loss function during the optimization. In addition, normalization facilitates the propagation of the input through the activation functions.

Instead of using statistical normalization (bringing your distribution of values to have zero mean and unit standard deviation), GaussRank is a procedure that also allows the modification of the distribution of the variables into a transformed Gaussian one. As also stated in some papers, such as in Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (, neural networks perform even better if you provide them with a Gaussian input. Accordingly to this NVIDIA blog post,, GaussRank works most of the time, except when features are already normally distributed or are extremely asymmetrical (in such cases applying the transformation may lead to worsened performance):

print("Applying GaussRank to columns: ", end='')
to_normalize = list()
for k, col in enumerate(train.columns):
    if '_bin' not in col and '_cat' not in col and '_missing' not in col:
def to_gauss(x): return np.sqrt(2) * erfinv(x) 
def normalize(data, norm_cols):
    n = data.shape[0]
    for col in norm_cols:
        sorted_idx = data[col].sort_values().index.tolist()
        uniform = np.linspace(start=-0.99, stop=0.99, num=n)
        normal = to_gauss(uniform)
        normalized_col = pd.Series(index=sorted_idx, data=normal)
        data[col] = normalized_col
    return data
train = normalize(train, to_normalize)
test = normalize(test, to_normalize)

We can apply the GaussRank transformation separately on the train and test features on all the numeric features of our dataset:

Applying GaussRank to columns: ['ps_ind_01', 'ps_ind_03', 'ps_ind_14', 'ps_ind_15', 'ps_reg_01', 'ps_reg_02', 'ps_reg_03', 'ps_car_11', 'ps_car_12', 'ps_car_13', 'ps_car_14', 'ps_car_15']

When normalizing the features, we simply turn our data into a NumPy array of float32 values, the ideal input for a GPU:

features = train.columns
train_index = train.index
test_index = test.index
train = train.values.astype(np.float32)
test = test.values.astype(np.float32)

Next, we just prepare some useful functions, such as the evaluation function, the normalized Gini coefficient (based on the code described before), and a plotting function that helpfully represents a Keras model history of fitting on both training and validation sets:

def plot_keras_history(history, measures):
    rows = len(measures) // 2 + len(measures) % 2
    fig, panels = plt.subplots(rows, 2, figsize=(15, 5))
    plt.subplots_adjust(top = 0.99, bottom=0.01, 
                        hspace=0.4, wspace=0.2)
        panels = [item for sublist in panels for item in sublist]
    for k, measure in enumerate(measures):
        panel = panels[k]
        panel.set_title(measure + ' history')
        panel.plot(history.epoch, history.history[measure],  
                   label="Train "+measure)
                       label="Validation "+measure)
        panel.set(xlabel='epochs', ylabel=measure)
from numba import jit
def eval_gini(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_pred)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n-1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini

The next functions are actually a bit more complex and more related to the functioning of both the denoising autoencoder and the supervised neural network. The batch_generator is a function that will create a generator that provides shuffled chunks of the data based on batch size. It isn’t actually used as a standalone generator but as part of a more complex batch generator that we will soon describe, the mixup_generator:

def batch_generator(x, batch_size, shuffle=True, random_state=None):
    batch_index = 0
    n = x.shape[0]
    while True:
        if batch_index == 0:
            index_array = np.arange(n)
            if shuffle:
                index_array = np.random.permutation(n)
        current_index = (batch_index * batch_size) % n
        if n >= current_index + batch_size:
            current_batch_size = batch_size
            batch_index += 1
            current_batch_size = n - current_index
            batch_index = 0
        batch = x[index_array[current_index: current_index + current_batch_size]]
        yield batch

The mixup_generator is a generator that returns batches of data whose values have been partially swapped to create some noise and augment the data to avoid the DAE overfitting to the training dataset. You can look at this generator as a way to inject random values into the dataset and create many more examples to be used for training. It works based on a swap rate, fixed at 15%, of features as suggested by Michael Jahrer, implying that at every batch, you will have 15% of the random values in the sample. It is also important to point out that having the random values picked randomly from the very same features means that the replacing random values are not completely random, since they are from the same distribution of the original features.

The function generates two distinct batches of data, one to be released to the model and another to be used as a source for the value to be swapped in the batch to be released. Based on a random choice, whose base probability is the swap rate, at each batch, a certain number of features will be swapped between the two batches.

This means that the DAE cannot always rely on the same features (since they can be randomly swapped from time to time) but instead has to concentrate on the whole of the features (something similar to dropout in a certain sense) to find relationships between them and correctly reconstruct the data at the end of the process:

def mixup_generator(X, batch_size, swaprate=0.15, shuffle=True, random_state=None):
    if random_state is None:
        random_state = np.randint(0, 999)
    num_features = X.shape[1]
    num_swaps = int(num_features * swaprate)    
    generator_a = batch_generator(X, batch_size, shuffle, 
    generator_b = batch_generator(X, batch_size, shuffle, 
                                  random_state + 1)
    while True:
        batch = next(generator_a)
        mixed_batch = batch.copy()
        effective_batch_size = batch.shape[0]
        alternative_batch = next(generator_b)
        assert((batch != alternative_batch).any())
        for i in range(effective_batch_size):
            swap_idx = np.random.choice(num_features, num_swaps, 
            mixed_batch[i, swap_idx] = alternative_batch[i, swap_idx]
        yield (mixed_batch, batch)

The get_DAE is the function that builds the denoising autoencoder. It accepts a parameter for defining the architecture, which in our case has been set to three layers of 1,500 nodes each (as suggested by Michael Jahrer’s solution). The first layer should act as an encoder, the second is a bottleneck layer ideally containing the latent features capable of expressing the information in the data, and the last layer is a decoding layer capable of reconstructing the initial input data. The three layers have a relu activation function, no bias, and each one is followed by a batch normalization layer. The final output with the reconstructed input data has a linear activation. The training is optimized using an adam optimizer with standard settings (the optimized cost function is the mean squared error – mse):

def get_DAE(X, architecture=[1500, 1500, 1500]):
    features = X.shape[1]
    inputs = Input((features,))
    for i, nodes in enumerate(architecture):
        layer = Dense(nodes, activation='relu', 
                      use_bias=False, name=f"code_{i+1}")
        if i==0:
            x = layer(inputs)
            x = layer(x)
        x = BatchNormalization()(x)
    outputs = Dense(features, activation='linear')(x)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(optimizer='adam', loss='mse', 
                  metrics=['mse', 'mae'])
    return model

The extract_dae_features function is reported here only for educational purposes. The function helps in the extraction of the values of specific layers of the trained denoising autoencoder. The extraction works by building a new model, combining the DAE input layer and the desired output one. A simple predict will then extract the values we need (the predict also allows us to fix the preferred batch size in order to fit any memory requirement).

In the case of the competition, given the number of observations and the number of features to be taken out from the autoencoder, if we were to use this function, the resulting dense matrix would be too large to be handled by the memory of a Kaggle Notebook. For this reason, our strategy won’t be to transform the original data into the autoencoder node values of the bottleneck layer but to instead fuse the autoencoder with its frozen layers up to the bottleneck with the supervised neural network, as we will be discussing soon:

def extract_dae_features(autoencoder, X, layers=[3], batch_size=128):
    data = []
    for layer in layers:
        if layer==0:
            get_layer_output = Model([autoencoder.layers[0].input], 
            layer_output = get_layer_output.predict(X, 
                                              batch_size= batch_size)
    data = np.hstack(data)
    return data

To complete the work with the DAE, we have a final function wrapping all the previous ones into an unsupervised training procedure (at least partially unsupervised since there is an early stop monitor set on a validation set). The function sets up the mix-up generator, creates the denoising autoencoder architecture, and then trains it, monitoring its fit on a validation set for an early stop if there are signs of overfitting. Finally, before returning the trained DAE, it plots a graph of the training and validation fit and stores the model on disk.

Even if we try to fix a seed on this model, contrary to the LightGBM model, the results are extremely variable, and they may influence the final ensemble results. Though the result will be a high scoring one, it may land higher or lower on the public and private leaderboards (public results are very correlated to the private leaderboard) and it will be easy for you to always pick up the best final submission based on its public results:

def autoencoder_fitting(X_train, X_valid, filename='dae',  
                        random_state=None, suppress_output=False):
    if suppress_output:
        verbose = 0
        verbose = 2
        print("Fitting a denoising autoencoder")
    generator = mixup_generator(X_train, 
    dae = get_DAE(X_train, architecture=config.dae_architecture)
    steps_per_epoch = np.ceil(X_train.shape[0] / 
    early_stopping = EarlyStopping(monitor='val_mse', 
    history =,
                    validation_data=(X_valid, X_valid),
    if not suppress_output: plot_keras_history(history, 
                                           measures=['mse', 'mae'])
    return dae

Having dealt with the DAE, we take the chance also to define the supervised neural model down the line that should predict our claim expectations. As a first step, we define a function to define a single layer of the work:

  • Random normal initialization, since empirically it has been found to converge to better results in this problem.
  • A dense layer with L2 regularization and a customizable activation function.
  • A tunable dropout layer, which can be easily included or excluded from the architecture.

Here is the code for creating the dense blocks:

def dense_blocks(x, units, activation, regL2, dropout):
    kernel_initializer = keras.initializers.RandomNormal(mean=0.0, 
                                stddev=0.1, seed=config.random_state)
    for k, layer_units in enumerate(units):
        if regL2 > 0:
            x = Dense(layer_units, activation=activation, 
            x = Dense(layer_units, 
        if dropout > 0:
            x = Dropout(dropout)(x)
    return x

As you may have already noticed, the function defining the single layer is quite customizable. The same goes for the wrapper architecture function, taking inputs for the number of layers and units in them, dropout probabilities, regularization, and activation type. The idea is to be able to run a neural architecture search (NAS) and figure out what configuration should perform better in our problem.

As a final note on the function, among the inputs, it is required to provide the trained DAE because its inputs are used as the neural network model inputs while its first layers are connected to the DAE’s bottleneck layer (the middle layer in the DAE architecture). In such a way we are de facto concatenating the two models into one (although the DAE weights are frozen anyway and not trainable).

This solution has been devised to avoid having to transform all your training data and instead only the single batches that the neural network is processing, thus saving memory in the system:

def dnn_model(dae, units=[4500, 1000, 1000], 
            input_dropout=0.1, dropout=0.5,
    inputs = dae.get_layer("code_2").output
    if input_dropout > 0:
        x = Dropout(input_dropout)(inputs)
        x = tf.keras.layers.Layer()(inputs)
    x = dense_blocks(x, units, activation, regL2, dropout)
    outputs = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=dae.input, outputs=outputs)
    return model

We conclude with a wrapper for the training process, including all the steps in order to train the entire pipeline on a cross-validation fold:

def model_fitting(X_train, y_train, X_valid, y_valid, autoencoder, 
                 filename, random_state=None, suppress_output=False):
        if suppress_output:
            verbose = 0
            verbose = 2
            print("Fitting model")
        early_stopping = EarlyStopping(monitor='val_auc', 
        rlrop = ReduceLROnPlateau(monitor='val_auc', 
        model = dnn_model(autoencoder,
        history =, y_train, 
                            validation_data=(X_valid, y_valid),
                            callbacks=[early_stopping, rlrop],
        if not suppress_output:  
            plot_keras_history(history, measures=['loss', 'auc'])
        return model, history

Since our DAE implementation is surely different from Jahrer’s, although the idea behind it is the same, we cannot rely completely on his observations on the architecture of the supervised neural network, and we have to look for the ideal indications as we have been looking for the best hyperparameters in the LightGBM model. Using Optuna and leveraging the multiple parameters that we set to configure the network’s architecture, we can run this code snippet for some hours and get an idea about what could work better.

In our experiments we found that:

  • We should use a two-layer network with fewer nodes, 64 and 32 respectively.
  • Input dropout, dropout between layers, and some L2 regularization do help.
  • It is better to use the SELU activation function.

Here is the code snippet for running the entire optimization experiments:

if config.nas is True:
    def evaluate():
        metric_evaluations = list()
        skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
        for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
            X_train, y_train = train[train_idx, :], target[train_idx]
            X_valid, y_valid = train[valid_idx, :], target[valid_idx]
            if config.reuse_autoencoder:
                autoencoder = load_model(f"./dae_fold_{k}")
                autoencoder = autoencoder_fitting(X_train, X_valid,
            model, _ = model_fitting(X_train, y_train, X_valid, y_valid,
            val_preds = model.predict(X_valid, batch_size=128, verbose=0)
            best_score = eval_gini(y_true=y_valid, y_pred=np.ravel(val_preds))
            gpu_cleanup([autoencoder, model])
        return np.mean(metric_evaluations)
    def objective(trial):
        params = {
                'first_layer': trial.suggest_categorical("first_layer", [8, 16, 32, 64, 128, 256, 512]),
                'second_layer': trial.suggest_categorical("second_layer", [0, 8, 16, 32, 64, 128, 256]),
                'third_layer': trial.suggest_categorical("third_layer", [0, 8, 16, 32, 64, 128, 256]),
                'input_dropout': trial.suggest_float("input_dropout", 0.0, 0.5),
                'dropout': trial.suggest_float("dropout", 0.0, 0.5),
                'regL2': trial.suggest_uniform("regL2", 0.0, 0.1),
                'activation': trial.suggest_categorical("activation", ['relu', 'leaky-relu', 'selu'])
        config.units = [nodes for nodes in [params['first_layer'], params['second_layer'], params['third_layer']] if nodes > 0]
        config.input_dropout = params['input_dropout']
        config.dropout = params['dropout']
        config.regL2 = params['regL2']
        config.activation = params['activation']
        return evaluate()
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=60)
    print("Best Gini Normalized Score", study.best_value)
    print("Best parameters", study.best_params)
    config.units = [nodes for nodes in [study.best_params['first_layer'], study.best_params['second_layer'], study.best_params['third_layer']] if nodes > 0]
    config.input_dropout = study.best_params['input_dropout']
    config.dropout = study.best_params['dropout']
    config.regL2 = study.best_params['regL2']
    config.activation = study.best_params['activation']

Exercise 5

If you are looking for more information about NAS, you can have a look at The Kaggle Book, on page 276 onward. In the case of the DAE and the supervised neural network, it is critical to look for the best architecture since we are implementing something surely different from Michael Jahrer’s solution.

As an exercise, try to improve the hyperparameter search by using KerasTuner (to be found on page 285 onward in The Kaggle Book), a fast solution for optimizing neural networks that includes the contribution of François Chollet, the creator of Keras.

Exercise Notes (write down any notes or workings that will help you):

Having finally set everything ready, we are set to start the training. In about one hour, on a Kaggle Notebook with GPU, you can obtain complete test and out-of-fold predictions:

preds = np.zeros(len(test))
oof = np.zeros(len(train))
metric_evaluations = list()
skf = StratifiedKFold(n_splits=config.cv_folds, shuffle=True, random_state=config.random_state)
for k, (train_idx, valid_idx) in enumerate(skf.split(train, target)):
    print(f"CV fold {k}")
    X_train, y_train = train[train_idx, :], target[train_idx]
    X_valid, y_valid = train[valid_idx, :], target[valid_idx]
    if config.reuse_autoencoder:
        print("restoring previously trained dae")
        autoencoder = load_model(f"./dae_fold_{k}")
        autoencoder = autoencoder_fitting(X_train, X_valid,
    model, history = model_fitting(X_train, y_train, X_valid, y_valid,
    val_preds = model.predict(X_valid, batch_size=128)
    best_score = eval_gini(y_true=y_valid, 
    best_epoch = np.argmax(history.history['val_auc']) + 1
    print(f"[best epoch is {best_epoch}]\tvalidation_0-gini_dnn: {best_score:0.5f}\n")
    preds += (model.predict(test, batch_size=128).ravel() / 
    oof[valid_idx] = model.predict(X_valid, batch_size=128).ravel()
    gpu_cleanup([autoencoder, model])

As we did with the LighGBM model, we can get an idea of the results by looking at the average fold normalized Gini coefficient:

print(f"DNN CV normalized Gini coefficient: {np.mean(metric_evaluations):0.3f} ({np.std(metric_evaluations):0.3f})")

The results won’t be quite in line with what was previously obtained using the LightGBM:

DNN CV Gini Normalized Score: 0.276 (0.015)

Producing the submission and submitting it will result in a public score of about 0.27737 and a private score of about 0.28471 (results may vary wildly as we previously mentioned) – not quite a high score:

submission['target'] = preds
oofs = pd.DataFrame({'id':train_index, 'target':oof})
oofs.to_csv('dnn_oof.csv', index=False)

The scarce results from the neural network seem to confirm the idea that neural networks underperform in tabular problems. As Kagglers, anyway, we know that all models are useful for a successful placing on the leaderboard; we just need to figure out how to best use them. Surely, a neural network feed with an autoencoder has worked out a solution less affected by noise in data and elaborated the information in a different way than a GBM.