The Kaggle Workbook

By: Konrad Banachewicz, Luca Massaron
Overview of this book

More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they’ve come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist. In this book, you’ll get up close and personal with four extensive case studies based on past Kaggle competitions. You’ll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering. You can use this workbook as a supplement alongside The Kaggle Book or on its own alongside resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.

Understanding the competition and the data

Porto Seguro is the third largest insurance company in Brazil (it also operates in Uruguay), offering car insurance coverage as well as many other insurance products. For the past 20 years, the company has used analytical methods and machine learning to tailor its prices and make auto insurance coverage accessible to more drivers. To explore new ways to achieve this, Porto Seguro sponsored a competition, expecting Kagglers to come up with new and better methods of solving some of its core analytical problems.

The competition asked Kagglers to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year, which is quite a common task (the sponsor describes it as a “classical challenge for insurance”). Information about the probability of filing a claim is precious for an insurance company. Without such a model, an insurer can only charge a flat premium to every customer irrespective of their risk, or, with a poorly performing model, it may charge mismatched premiums. Inaccuracies in profiling customers’ risk therefore mean charging good drivers too much and bad drivers too little. The impact on the company is two-fold: good drivers will look elsewhere for their insurance, and the company’s portfolio becomes overweighted with bad ones (technically, the company ends up with a bad loss ratio, the ratio of claims paid to premiums earned). If, instead, the company can correctly estimate the claim likelihood, it can ask a fair price of its customers, thus increasing its market share, keeping customers more satisfied, holding a more balanced customer portfolio (a better loss ratio), and better managing its reserves (the money the company sets aside for paying claims).
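To make the loss ratio concrete, here is a minimal worked example; the figures are made up purely to illustrate the calculation:

```python
# Loss ratio: claims paid out divided by premiums earned.
# A ratio well below 1.0 leaves room for expenses and profit;
# a ratio near or above 1.0 means the book of business loses money.
claims_paid = 750_000
premiums_earned = 1_000_000

loss_ratio = claims_paid / premiums_earned
print(f"loss ratio: {loss_ratio:.0%}")  # prints "loss ratio: 75%"
```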

For the competition, the sponsor provided training and test datasets, and the competition was ideal for anyone to enter, since the dataset was not very large and was very well prepared.

As stated on the competition page devoted to presenting the data:

Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).

In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
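Since -1 is the missing-value sentinel, a common first step is to map it back to NaN so pandas treats it as missing. The tiny frame below is an illustrative stand-in for the real competition files:

```python
import numpy as np
import pandas as pd

# In this competition's files, -1 marks a missing value. Converting it
# to NaN lets pandas count and handle missing data natively.
df = pd.DataFrame({
    "ps_car_01_cat": [10, -1, 7],
    "ps_reg_01": [0.7, 0.8, -1],
    "target": [0, 1, 0],
})

df = df.replace(-1, np.nan)
print(df.isna().sum().sum())  # 2 missing values in total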

The data preparation for the competition was carefully conducted to avoid any leak of information, and although the meaning of the features has been kept secret, it is quite clear that the different tags used refer to specific kinds of features commonly used in motor insurance modeling:

  • ind refers to “individual characteristics”
  • car refers to “car characteristics”
  • calc refers to “calculated features”
  • reg refers to “regional/geographic features”
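These tags make it easy to sort the columns into groups programmatically. The sketch below assumes column names following the competition's ps_&lt;tag&gt;_&lt;nn&gt; scheme (with the _bin / _cat postfixes described above); the sample list is a small illustrative subset, not the full feature set:

```python
# Sort feature columns into the four tag groups (ind, reg, car, calc)
# and pick out binary/categorical features via the _bin / _cat postfix.
feature_cols = [
    "ps_ind_01", "ps_ind_06_bin", "ps_reg_01",
    "ps_car_01_cat", "ps_calc_01",
]

groups: dict[str, list[str]] = {}
for col in feature_cols:
    tag = col.split("_")[1]  # "ind", "reg", "car", or "calc"
    groups.setdefault(tag, []).append(col)

binary_cols = [c for c in feature_cols if c.endswith("_bin")]
categorical_cols = [c for c in feature_cols if c.endswith("_cat")]

print(sorted(groups))    # ['calc', 'car', 'ind', 'reg']
print(binary_cols)       # ['ps_ind_06_bin']
print(categorical_cols)  # ['ps_car_01_cat']
```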

As for the individual features, there was much speculation about their meaning during the competition, with several discussion threads devoted to reverse-engineering them.

Despite these and other efforts, the meaning of most of the features remains a mystery to this day.

There are several interesting facts about this competition:

  1. The data is real-world, though the features are anonymous.
  2. The data is very well prepared, without leakages of any sort (no magic features here – a magic feature is a feature that by skillful processing can provide high predictive power to your models in a Kaggle competition).
  3. The test dataset not only holds the same categorical levels as the training dataset; it also seems to come from the same distribution, although Yuya Yamamoto argues that preprocessing the data with t-SNE leads to a failing adversarial validation test.

Exercise 1

As a first exercise, referring to the contents and the code in The Kaggle Book related to adversarial validation (starting from page 179), prove that the training and test data most probably originated from the same data distribution.
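A minimal adversarial validation sketch, assuming scikit-learn is available. The idea: label each row by its origin (train vs. test), train a classifier to tell them apart, and check the cross-validated ROC AUC — a score near 0.5 means the classifier cannot distinguish them, so the two sets likely share the same distribution. The two frames below are synthetic stand-ins drawn from the same distribution; swap in the competition's actual train and test features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins for the real train/test files: two samples from the same
# distribution, so adversarial AUC should land near 0.5.
train = pd.DataFrame(rng.normal(size=(1000, 5)),
                     columns=[f"f{i}" for i in range(5)])
test = pd.DataFrame(rng.normal(size=(1000, 5)),
                    columns=[f"f{i}" for i in range(5)])

# Origin label: 0 = train row, 1 = test row.
X = pd.concat([train, test], ignore_index=True)
y = np.r_[np.zeros(len(train)), np.ones(len(test))]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"adversarial AUC: {auc:.3f}")  # ~0.5 → same distribution
```

An AUC substantially above 0.5 would instead mean the classifier has found features whose distributions differ between train and test, which is exactly what the exercise asks you to rule out.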

Exercise Notes (write down any notes or workings that will help you):

An interesting post by Tilii (Mensur Dlakic, Associate Professor at Montana State University) demonstrates using t-SNE that “there are many people who are very similar in terms of their insurance parameters, yet some of them will file a claim and others will not.” What Tilii mentions is quite typical of insurance, where certain priors (insurance parameters) imply the same probability of something happening, but whether that event actually happens depends on how long we observe the sequence of events.

Take, for instance, IoT and telematics data in insurance. It is quite common to analyze a driver’s behavior to predict whether they will file a claim in the future. If your observation period is too short (for instance, one year, as in the case of this competition), even very bad drivers may not have a claim, because such an event has a low probability of occurring within a short period, even for a bad driver. Similar ideas are discussed by Andy Harless, who argues instead that the real task of the competition is to guess "the value of a latent continuous variable that determines which drivers are more likely to have accidents" because actually "making a claim is not a characteristic of a driver; it’s a result of chance."
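The observation-window effect can be made concrete with a standard Poisson claim-frequency model (the rates below are made up for illustration, not taken from the competition data): for a driver with claim rate lam per year, the chance of at least one claim in t years is 1 - exp(-lam * t).

```python
import math

def p_claim(lam: float, t: float) -> float:
    """Probability of at least one claim in t years, given a Poisson
    claims process with rate lam (expected claims per year)."""
    return 1.0 - math.exp(-lam * t)

# Even a "bad" driver (0.30 claims/year) usually files no claim
# within a single one-year window:
print(round(p_claim(0.30, 1.0), 3))  # ≈ 0.259
print(round(p_claim(0.05, 1.0), 3))  # "good" driver, ≈ 0.049

# Over a longer window the difference becomes far more visible:
print(round(p_claim(0.30, 5.0), 3))  # ≈ 0.777
```

So within one observed year, roughly three out of four "bad" drivers look identical to claim-free good drivers — which is Harless's point that the label records chance outcomes of a latent risk variable, not the variable itself.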