The Kaggle Workbook

By: Konrad Banachewicz, Luca Massaron

Overview of this book

More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they’ve come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist. In this book, you’ll get up close and personal with four extensive case studies based on past Kaggle competitions. You’ll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering. You can use this workbook as a supplement alongside The Kaggle Book or on its own alongside resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.

Understanding the evaluation metric

The metric used in the competition is the normalized Gini coefficient (named after the similar Gini coefficient/index used in economics), which was previously used in another competition, the Allstate Claim Prediction Challenge (https://www.kaggle.com/competitions/ClaimPredictionChallenge). That competition provides a very clear explanation of what this metric is about:

When you submit an entry, the observations are sorted from “largest prediction” to “smallest prediction.” This is the only step where your predictions come into play, so only the order determined by your predictions matters. Visualize the observations arranged from left to right, with the largest predictions on the left. We then move from left to right, asking “In the leftmost x% of the data, how much of the actual observed loss have you accumulated?” With no model, you can expect to accumulate 10% of the loss in 10% of the predictions, so no model (or a “null” model) achieves a straight line. We call the area between your curve and this straight line the Gini coefficient.

There is a maximum achievable area for a “perfect” model. We will use the normalized Gini coefficient by dividing the Gini coefficient of your model by the Gini coefficient of the perfect model.

The organizers of the competition give no formula for the normalized Gini coefficient beyond this verbal description. However, by reading the notebook from Mohsin Hasan (https://www.kaggle.com/code/tezdhar/faster-gini-calculation/notebook), we can figure out that it is calculated in two steps and derive some easy-to-understand pseudocode that reveals its inner workings. First, you compute the Gini coefficient for your predictions; then, you normalize it by dividing it by the Gini coefficient computed as if you had perfect predictions. Here is the pseudocode for the Gini coefficient:

order = indexes of sorted predictions (expressed as probabilities from lowest to highest)

sorted_actual = actual[order] = ground truth values sorted based on indexes of sorted predictions

cumsum_sorted_actual = cumulated sum of the sorted ground truth values

n = number of predictions

gini_coef = (sum(cumsum_sorted_actual) / sum(sorted_actual) - (n + 1) / 2) / n

Once you have the Gini coefficient for your predictions, you need to divide it by the Gini coefficient you compute using the ground truth values as if they were your predictions (the case of having perfect predictions):

norm_gini_coef = gini_coef(predictions) / gini_coef(ground truth)
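The pseudocode above translates almost line by line into NumPy. What follows is a minimal sketch rather than the code of any specific notebook: the function names mirror the pseudocode, and the toy arrays are invented for illustration. Because the predictions are sorted from lowest to highest, the raw gini_coef value comes out negative for a better-than-random model; the sign cancels in the ratio, so the normalized value is positive. Note also that np.argsort breaks ties between equal predictions arbitrarily, so this sketch assumes distinct prediction values.

```python
import numpy as np

def gini_coef(actual, predictions):
    """Gini coefficient as in the pseudocode (predictions sorted ascending)."""
    actual = np.asarray(actual, dtype=float)
    n = len(actual)
    # indexes of the predictions sorted from lowest to highest
    order = np.argsort(predictions)
    # ground truth values rearranged in the order of the predictions
    sorted_actual = actual[order]
    cumsum_sorted_actual = np.cumsum(sorted_actual)
    return (cumsum_sorted_actual.sum() / sorted_actual.sum() - (n + 1) / 2) / n

def norm_gini_coef(actual, predictions):
    """Gini of the predictions divided by the Gini of perfect predictions."""
    return gini_coef(actual, predictions) / gini_coef(actual, actual)

# Toy data, invented for illustration
actual = np.array([0, 0, 1, 0, 1])
predictions = np.array([0.10, 0.40, 0.35, 0.20, 0.80])

score = norm_gini_coef(actual, predictions)
print(round(score, 4))  # 0.6667
```

Using the ground truth as predictions returns 1, and predictions that rank the observations in exactly the opposite order return -1.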

Another good explanation is provided in the notebook by Kilian Batzner: https://www.kaggle.com/code/batzner/gini-coefficient-an-intuitive-explanation. Using clear plots and some toy examples, Kilian tries to make sense of a metric that is not so common, yet is routinely used by the actuarial departments of insurance companies.

The metric can be approximated by the ROC-AUC score, or equivalently by the Mann–Whitney U non-parametric statistical test (the U statistic is equivalent to the area under the receiver operating characteristic curve, AUC), because it approximately corresponds to 2 * ROC-AUC - 1. Hence, maximizing the ROC-AUC is the same as maximizing the normalized Gini coefficient (for a reference, see the Relation to other statistical measures section in the Wikipedia entry: https://en.wikipedia.org/wiki/Gini_coefficient).
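To make the relationship concrete, the following sketch computes the ROC-AUC from the Mann–Whitney U statistic and compares 2 * ROC-AUC - 1 with the normalized Gini coefficient. The helper names and the synthetic data are assumptions made for this illustration. The data is built so that the target is binary and no two predictions are tied; under those conditions the two quantities should agree up to floating-point error, while tied predictions are one reason to speak of an approximation in general.

```python
import numpy as np

def norm_gini(actual, predictions):
    """Normalized Gini coefficient (ascending-sort formulation)."""
    actual = np.asarray(actual, dtype=float)
    n = len(actual)

    def gini(pred):
        sorted_actual = actual[np.argsort(pred)]
        return (np.cumsum(sorted_actual).sum() / sorted_actual.sum()
                - (n + 1) / 2) / n

    return gini(predictions) / gini(actual)

def auc_mann_whitney(actual, predictions):
    """ROC-AUC from the Mann-Whitney U statistic (assumes no tied predictions)."""
    actual = np.asarray(actual)
    # ordinal ranks 1..n of the predictions
    ranks = np.argsort(np.argsort(predictions)) + 1
    n_pos = int(actual.sum())
    n_neg = len(actual) - n_pos
    # U = rank sum of the positive class minus its minimum possible value
    u = ranks[actual == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Synthetic data: 50 positives, 150 negatives, all predictions distinct
rng = np.random.default_rng(0)
actual = rng.permutation(np.repeat([0, 1], [150, 50]))
predictions = rng.permutation(200) / 200

gini_value = norm_gini(actual, predictions)
auc_value = auc_mann_whitney(actual, predictions)
print(np.isclose(gini_value, 2 * auc_value - 1))  # True
```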

The metric can also be approximately expressed as the covariance of scaled prediction rank and scaled target value, resulting in a more understandable rank association measure (see Dmitriy Guller: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/40576).
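One way to read the covariance view is sketched below. It is an illustration under stated assumptions, not necessarily the exact formulation used in the linked discussion: the function name norm_gini_cov is hypothetical, ranks are ordinal ranks scaled by the number of observations, and the scaling of the target is left out because it cancels when the model's covariance is divided by that of perfect predictions. With distinct prediction values, this ratio should coincide with the normalized Gini coefficient computed from the cumulative sums.

```python
import numpy as np

def norm_gini(actual, predictions):
    """Normalized Gini coefficient (ascending-sort formulation)."""
    actual = np.asarray(actual, dtype=float)
    n = len(actual)

    def gini(pred):
        sorted_actual = actual[np.argsort(pred)]
        return (np.cumsum(sorted_actual).sum() / sorted_actual.sum()
                - (n + 1) / 2) / n

    return gini(predictions) / gini(actual)

def norm_gini_cov(actual, predictions):
    """Hypothetical helper: covariance of target and scaled prediction rank,
    divided by the same covariance for perfect predictions."""
    actual = np.asarray(actual, dtype=float)
    n = len(actual)
    # ordinal ranks scaled to (0, 1]
    rank_pred = (np.argsort(np.argsort(predictions)) + 1) / n
    rank_true = (np.argsort(np.argsort(actual)) + 1) / n
    cov_model = np.cov(actual, rank_pred)[0, 1]
    cov_perfect = np.cov(actual, rank_true)[0, 1]
    return cov_model / cov_perfect

# Toy data, invented for illustration
actual = np.array([0, 0, 1, 0, 1])
predictions = np.array([0.10, 0.40, 0.35, 0.20, 0.80])

cov_score = norm_gini_cov(actual, predictions)
cumsum_score = norm_gini(actual, predictions)
print(np.isclose(cov_score, cumsum_score))  # True
```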

From the point of view of the objective function, you can optimize for the binary log-loss (as you would do in a classification problem). Neither ROC-AUC nor the normalized Gini coefficient is directly differentiable, so they may be used only for metric evaluation on the validation set (for instance, for early stopping or for reducing the learning rate in a neural network). Keep in mind, however, that optimizing for the log-loss does not always improve the ROC-AUC and the normalized Gini coefficient.
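A small, made-up example shows how the log-loss and the normalized Gini coefficient can disagree. In the sketch below, model_a is fairly well calibrated but swaps one pair of observations, whereas model_b ranks everything correctly but outputs badly calibrated probabilities. The arrays and function names are invented for illustration only.

```python
import numpy as np

def log_loss(actual, predictions):
    """Binary log-loss (predictions must lie strictly between 0 and 1)."""
    actual = np.asarray(actual, dtype=float)
    p = np.asarray(predictions, dtype=float)
    return float(-np.mean(actual * np.log(p) + (1 - actual) * np.log(1 - p)))

def norm_gini(actual, predictions):
    """Normalized Gini coefficient (ascending-sort formulation)."""
    actual = np.asarray(actual, dtype=float)
    n = len(actual)

    def gini(pred):
        sorted_actual = actual[np.argsort(pred)]
        return (np.cumsum(sorted_actual).sum() / sorted_actual.sum()
                - (n + 1) / 2) / n

    return gini(predictions) / gini(actual)

actual = np.array([0, 0, 1, 1])
model_a = np.array([0.05, 0.55, 0.45, 0.95])  # calibrated, one pair misordered
model_b = np.array([0.90, 0.91, 0.92, 0.93])  # perfect order, poor calibration

loss_a, loss_b = log_loss(actual, model_a), log_loss(actual, model_b)
gini_a, gini_b = norm_gini(actual, model_a), norm_gini(actual, model_b)

print(loss_a < loss_b)  # True: model_a wins on log-loss (roughly 0.42 vs 1.22)
print(gini_a < gini_b)  # True: model_b wins on normalized Gini (0.5 vs 1.0)
```

Since only the ordering of the predictions affects the normalized Gini coefficient, a model can lose on log-loss and still win on the competition metric, which is why it helps to monitor the metric itself on the validation set.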

There is actually a differentiable ROC-AUC approximation. You can read about how it works in Toon Calders and Szymon Jaroszewicz, Efficient AUC Optimization for Classification, European Conference on Principles of Data Mining and Knowledge Discovery, Springer, Berlin, Heidelberg, 2007: https://link.springer.com/content/pdf/10.1007/978-3-540-74976-9_8.pdf.

However, it seems that it is not necessary to use anything different from log-loss as an objective function and ROC-AUC or normalized Gini coefficient as an evaluation metric in the competition.

There are actually a few Python implementations for computing the normalized Gini coefficient among the Kaggle Notebooks. Here we have used, and recommend, the work by CPMP (https://www.kaggle.com/code/cpmpml/extremely-fast-gini-computation/notebook), which uses Numba to speed up computations: it is both exact and fast.

Exercise 2

In Chapter 5 of The Kaggle Book (page 95 onward), we explained how to deal with competition metrics, especially if they are new and generally unknown.

As an exercise, can you find out how many competitions on Kaggle have used the normalized Gini coefficient as an evaluation metric?

Exercise Notes (write down any notes or workings that will help you):