The Kaggle Workbook

By Konrad Banachewicz and Luca Massaron
Overview of this book

More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they’ve come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist. In this book, you’ll get up close and personal with four extensive case studies based on past Kaggle competitions. You’ll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering. You can use this workbook as a supplement alongside The Kaggle Book or on its own alongside resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.

Understanding the evaluation metric

The metric used in the competition is the normalized Gini coefficient (named after the similar Gini coefficient/index used in economics), which has been previously used in another competition, the Allstate Claim Prediction Challenge (https://www.kaggle.com/competitions/ClaimPredictionChallenge). From that competition, we can get a very clear explanation of what this metric is about:

When you submit an entry, the observations are sorted from “largest prediction” to “smallest prediction.” This is the only step where your predictions come into play, so only the order determined by your predictions matters. Visualize the observations arranged from left to right, with the largest predictions on the left. We then move from left to right, asking “In the leftmost x% of the data, how much of the actual observed loss have you accumulated?” With no model, you can expect to accumulate 10% of the loss in 10% of the predictions, so no model (or a “null” model) achieves a straight line. We call the area between your curve and this straight line the Gini coefficient.

There is a maximum achievable area for a “perfect” model. We will use the normalized Gini coefficient by dividing the Gini coefficient of your model by the Gini coefficient of the perfect model.
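To make this verbal description concrete, here is a small toy example with made-up numbers (not taken from the competition data) that traces how much of the total loss is accumulated as we sweep from the largest to the smallest prediction:

import numpy as np

# Hypothetical actual losses and model predictions for 5 policies.
actual = np.array([0.0, 1.0, 0.0, 2.0, 1.0])
pred = np.array([0.1, 0.8, 0.3, 0.9, 0.2])

# Sort observations from largest to smallest prediction, as the metric does.
order = np.argsort(pred)[::-1]
cum_loss_share = np.cumsum(actual[order]) / actual.sum()

# Share of loss accumulated in the leftmost x% of the data vs. the "null model" diagonal.
x = np.arange(1, len(actual) + 1) / len(actual)
for xi, share in zip(x, cum_loss_share):
    print(f"top {xi:.0%} of predictions -> {share:.0%} of the loss (null model: {xi:.0%})")

A good model front-loads the loss, so its curve runs ahead of the diagonal, and the area between the two is the Gini coefficient described above.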

The competition organizers did not propose any formula for the normalized Gini beyond this verbose description, but by reading the notebook by Mohsin Hasan (https://www.kaggle.com/code/tezdhar/faster-gini-calculation/notebook), we can figure out that it is calculated in two steps and derive some easy-to-understand pseudocode that reveals its inner workings. First, you compute the Gini coefficient for your predictions; then you normalize it by dividing it by another Gini coefficient, computed by pretending you have perfect predictions. Here is the pseudocode for the Gini coefficient:

order = indexes of the sorted predictions (expressed as probabilities, from lowest to highest)
sorted_actual = actual[order] = ground truth values sorted by the indexes of the sorted predictions
cumsum_sorted_actual = cumulative sum of the sorted ground truth values
n = number of predictions
gini_coef = (sum(cumsum_sorted_actual) / sum(sorted_actual) - (n + 1) / 2) / n

Once you have the Gini coefficient for your predictions, you need to divide it by the Gini coefficient computed using the ground truth values as if they were your predictions (the case of perfect predictions):

norm_gini_coef = gini_coef(predictions) / gini_coef(ground truth)
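Putting the two steps together, a minimal NumPy implementation of this pseudocode (a sketch along the lines of Mohsin Hasan's notebook, not a verbatim copy of it) could look like the following:

import numpy as np

def gini_coef(actual, pred):
    # Step 1: Gini coefficient of the predictions.
    actual = np.asarray(actual, dtype=float)
    n = len(actual)
    # Ground truth sorted by the predictions, from lowest to highest.
    sorted_actual = actual[np.argsort(pred)]
    cumsum_sorted_actual = np.cumsum(sorted_actual)
    return (cumsum_sorted_actual.sum() / sorted_actual.sum() - (n + 1) / 2) / n

def norm_gini_coef(actual, pred):
    # Step 2: normalize by the Gini coefficient of a "perfect" submission,
    # i.e. one that uses the ground truth itself as the predictions.
    return gini_coef(actual, pred) / gini_coef(actual, actual)

# Example: a prediction that ranks all positives above all negatives scores 1.0.
y_true = np.array([0, 0, 1, 0, 1])
y_pred = np.array([0.10, 0.30, 0.80, 0.20, 0.60])
print(norm_gini_coef(y_true, y_pred))  # -> 1.0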

Another good explanation is provided in the notebook by Kilian Batzner: https://www.kaggle.com/code/batzner/gini-coefficient-an-intuitive-explanation. Using clear plots and some toy examples, Kilian makes sense of a metric that is not so common in machine learning, yet is routinely used by the actuarial departments of insurance companies.

The metric can be approximated using the ROC-AUC score or the Mann–Whitney U non-parametric statistic (the U statistic is equivalent to the area under the receiver operating characteristic curve, the AUC), because the normalized Gini coefficient approximately corresponds to 2 * ROC-AUC - 1. Hence, maximizing the ROC-AUC is the same as maximizing the normalized Gini coefficient (for a reference, see the Relation to other statistical measures section of the Wikipedia entry: https://en.wikipedia.org/wiki/Gini_coefficient).
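As a quick numerical sanity check of this relationship (reusing the norm_gini_coef function sketched above, and assuming scikit-learn is available), the two quantities can be compared on random data:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1_000)
y_pred = rng.random(size=1_000)  # continuous scores, so no ties

print(norm_gini_coef(y_true, y_pred))         # normalized Gini
print(2 * roc_auc_score(y_true, y_pred) - 1)  # 2 * ROC-AUC - 1, essentially the same value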

The metric can also be approximately expressed as the covariance of scaled prediction rank and scaled target value, resulting in a more understandable rank association measure (see Dmitriy Guller: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/40576).
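One way to see this numerically (a sketch assuming SciPy is available and that the predictions have no ties; norm_gini_coef is the function sketched earlier) is to take the ratio of two such covariances, one for the actual predictions and one for a perfect submission:

import numpy as np
from scipy.stats import rankdata

def rank_covariance(target, score):
    # Covariance between the scaled rank of the score and the scaled target value.
    n = len(target)
    scaled_rank = rankdata(score) / n
    scaled_target = np.asarray(target, dtype=float) / np.sum(target)
    return np.cov(scaled_rank, scaled_target)[0, 1]

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=1_000)
pred = rng.random(size=1_000)

# The ratio of covariances reproduces the normalized Gini coefficient (up to ties).
print(rank_covariance(y, pred) / rank_covariance(y, y))
print(norm_gini_coef(y, pred))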

From the point of view of the objective function, you can optimize for the binary log-loss (as you would in any classification problem). Neither the ROC-AUC nor the normalized Gini coefficient is differentiable, so they can be used only as evaluation metrics on the validation set (for instance, for early stopping or for reducing the learning rate in a neural network). Keep in mind, however, that optimizing for the log-loss does not always improve the ROC-AUC or the normalized Gini coefficient.
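For example, a gradient-boosting model can be trained on the binary log-loss while the normalized Gini (via the 2 * ROC-AUC - 1 identity) is monitored for early stopping. The following is only a minimal sketch, assuming the lightgbm and scikit-learn packages and that X_train, y_train, X_valid, and y_valid have already been prepared:

import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def gini_lgb(preds, eval_data):
    # Custom evaluation metric: normalized Gini via 2 * ROC-AUC - 1.
    y_true = eval_data.get_label()
    return 'gini', 2 * roc_auc_score(y_true, preds) - 1, True  # higher is better

params = {'objective': 'binary', 'metric': 'None', 'learning_rate': 0.05}
train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

model = lgb.train(
    params,
    train_set,
    num_boost_round=5_000,
    valid_sets=[valid_set],
    feval=gini_lgb,
    callbacks=[lgb.early_stopping(stopping_rounds=200)],
)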

There is actually a differentiable ROC-AUC approximation. You can read about how it works in Toon Calders and Szymon Jaroszewicz, Efficient AUC Optimization for Classification, European Conference on Principles of Data Mining and Knowledge Discovery, Springer, Berlin, Heidelberg, 2007: https://link.springer.com/content/pdf/10.1007/978-3-540-74976-9_8.pdf.
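The general idea behind such approximations (the following is a generic illustration only, not the specific method proposed in the paper) is to replace the non-differentiable pairwise indicator underlying the AUC with a smooth function:

import numpy as np

def soft_auc_loss(y_true, scores, temperature=1.0):
    # For every (positive, negative) pair, penalize the negative scoring higher
    # than the positive through a smooth sigmoid instead of a 0/1 indicator.
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]                         # pairwise score differences
    return np.mean(1.0 / (1.0 + np.exp(diff / temperature)))   # roughly 1 - smoothed AUC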

However, it seems there is no need to use anything other than the log-loss as the objective function and the ROC-AUC or the normalized Gini coefficient as the evaluation metric in this competition.

There are actually a few Python implementations of the normalized Gini coefficient among the Kaggle Notebooks. The one we have used here and suggest is the work by CPMP (https://www.kaggle.com/code/cpmpml/extremely-fast-gini-computation/notebook), which uses Numba to speed up computations: it is both exact and fast.
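The idea behind that implementation is to count, in a single pass over the examples sorted by prediction, the pairs in which a negative is ranked above a positive. The following is a rough sketch in the same spirit (assuming the numba package; it is not a verbatim copy of the CPMP notebook):

import numpy as np
from numba import njit

@njit
def fast_norm_gini(y_true, y_prob):
    # Exact normalized Gini for binary targets (assuming no tied predictions).
    # y_true and y_prob are expected to be 1-D NumPy arrays of floats.
    y_sorted = y_true[np.argsort(y_prob)]
    n = len(y_sorted)
    n_pos = 0.0
    discordant = 0.0
    higher_negatives = 0.0
    for i in range(n - 1, -1, -1):                # from highest to lowest prediction
        y_i = y_sorted[i]
        n_pos += y_i
        discordant += y_i * higher_negatives      # negatives ranked above this positive
        higher_negatives += 1.0 - y_i
    # normalized Gini = 2 * AUC - 1 = 1 - 2 * discordant / (positives * negatives)
    return 1.0 - 2.0 * discordant / (n_pos * (n - n_pos))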

Exercise 2

In chapter 5 of The Kaggle Book (page 95 onward), we explained how to deal with competition metrics, especially if they are new and generally unknown.

As an exercise, can you find out how many competitions on Kaggle have used the normalized Gini coefficient as an evaluation metric?

Exercise Notes (write down any notes or workings that will help you):
