Understanding the competition and the data
Porto Seguro is the third-largest insurance company in Brazil (it operates in Brazil and Uruguay) and offers car insurance coverage as well as many other insurance products. The company has used analytical methods and machine learning for the past 20 years to tailor its prices and make auto insurance coverage more accessible to more drivers. To explore new ways of doing so, it sponsored a competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction), expecting Kagglers to come up with new and better methods of solving some of its core analytical problems.
The competition asks Kagglers to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year, which is quite a common kind of task (the sponsor describes it as a “classical challenge for insurance”). An estimate of the probability of a claim is very valuable to an insurance company. Without such a model, an insurer can only charge customers a flat premium irrespective of their risk, and with a poorly performing model it may charge premiums that do not match that risk. Inaccuracies in profiling customers’ risk can therefore result in charging good drivers too much and bad drivers too little. The impact on the company is two-fold: good drivers will look elsewhere for their insurance, and the company’s portfolio will be overweighted with bad ones (technically, the company would have a bad loss ratio: https://www.investopedia.com/terms/l/loss-ratio.asp). If, instead, the company can correctly estimate the claim likelihood, it can ask its customers for a fair price, thus increasing its market share, gaining more satisfied customers and a more balanced customer portfolio (a better loss ratio), and managing its reserves (the money the company sets aside for paying claims) better.
To this end, the sponsor provided training and test datasets. The competition was accessible to anyone because the dataset was not very large and was very well prepared.
As stated on the page of the competition devoted to presenting the data (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data):
Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).
In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.
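The naming convention quoted above can be turned into a quick inventory of the features. The following is a minimal sketch, not the competition's official tooling: the `describe_columns` helper is hypothetical, and the toy frame only imitates the naming pattern with made-up values. It reads the group tag and the `bin`/`cat` postfix from each column name and converts the -1 missing-value marker into NaN.

```python
import numpy as np
import pandas as pd

def describe_columns(columns):
    """Hypothetical helper: read group tag and feature type from each name."""
    rows = []
    for col in columns:
        parts = col.split("_")  # e.g. "ps_car_01_cat" -> ["ps", "car", "01", "cat"]
        if col.endswith("_bin"):
            kind = "binary"
        elif col.endswith("_cat"):
            kind = "categorical"
        else:
            kind = "continuous_or_ordinal"
        rows.append({"feature": col, "group": parts[1], "kind": kind})
    return pd.DataFrame(rows)

# Toy data: column names follow the documented pattern, values are invented.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "target": [0, 1, 0],
    "ps_ind_01": [2, -1, 5],
    "ps_ind_06_bin": [0, 1, 0],
    "ps_car_01_cat": [10, -1, 7],
    "ps_reg_03": [0.7, -1.0, 0.9],
    "ps_calc_01": [0.6, 0.3, 0.2],
})

feature_cols = [c for c in toy.columns if c not in ("id", "target")]
summary = describe_columns(feature_cols)

# -1 marks a missing value, so replace it with NaN before computing statistics.
features = toy[feature_cols].replace(-1, np.nan)
missing_share = features.isna().mean()

print(summary)
print(missing_share)
```

Whether to keep -1 as its own level or to treat it as NaN is a modeling choice; tree-based models often work with either, so it is worth testing both.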
The data preparation for the competition was carefully conducted to avoid any leak of information. Although the meaning of the features has been kept secret, the tags used in the feature names quite clearly refer to specific kinds of features commonly used in motor insurance modeling:
- ind refers to “individual characteristics”
- car refers to “car characteristics”
- calc refers to “calculated features”
- reg refers to “regional/geographic features”
Kagglers also tried to work out the meaning of individual features, for example in these discussions:
- https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41489, where Raddar suggests that the feature ps_car_13 could represent the distance driven between bi-yearly mandatory car checkups.
- https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41488, where Raddar suggests that the feature ps_car_12 instead represents the car's engine cylinder capacity.
- https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/41057, where you can read about the suggestion that some features are derived from Porto Seguro’s online quote form.
Despite these and other efforts, the meaning of most of the features remains a mystery.
The interesting facts about this competition are that:
- The data is real-world, though the features are anonymous.
- The data is very well prepared, without leakages of any sort (no magic features here – a magic feature is a feature that by skillful processing can provide high predictive power to your models in a Kaggle competition).
- The test dataset not only holds the same categorical levels as the training dataset; it also seems to be from the same distribution, although Yuya Yamamoto argues that preprocessing the data with t-SNE leads to a failing adversarial validation test (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/44784).
As a first exercise, referring to the contents and the code in The Kaggle Book related to adversarial validation (starting from page 179), prove that the training and test data most probably originated from the same data distribution.
Exercise Notes (write down any notes or workings that will help you):
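As a possible starting point for the exercise, here is a minimal sketch of adversarial validation on synthetic data. It is not the code from The Kaggle Book: the `adversarial_auc` helper, the random forest settings, and the toy features are all assumptions made for illustration. The idea is to label training rows 0 and test rows 1, fit a classifier to tell them apart, and read the out-of-fold ROC AUC: a value near 0.5 suggests the two sets are indistinguishable, while a value well above 0.5 points to a distribution shift.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def adversarial_auc(train_X, test_X, seed=0):
    """Hypothetical helper: out-of-fold AUC of a train-vs-test classifier."""
    combined = pd.concat([train_X, test_X], axis=0, ignore_index=True)
    # 0 = row comes from the training set, 1 = row comes from the test set
    is_test = np.r_[np.zeros(len(train_X)), np.ones(len(test_X))].astype(int)
    model = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=20, random_state=seed
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    proba = cross_val_predict(
        model, combined, is_test, cv=cv, method="predict_proba"
    )[:, 1]
    return roc_auc_score(is_test, proba)

# Synthetic stand-ins for the real feature matrices (assumption: on the real
# data you would pass the feature columns of the training and test files,
# dropping the id and target columns first).
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(5)]
train_X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=cols)
test_X = pd.DataFrame(rng.normal(size=(1000, 5)), columns=cols)
shifted_X = test_X + 1.0  # deliberately shifted copy, for contrast

same_auc = adversarial_auc(train_X, test_X)
shift_auc = adversarial_auc(train_X, shifted_X)
print(f"same distribution: AUC = {same_auc:.3f}")
print(f"shifted test set:  AUC = {shift_auc:.3f}")
```

Using out-of-fold predictions matters here: scoring the classifier on the rows it was fitted on would inflate the AUC and could suggest a shift that is not there.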
An interesting post by Tilii (Mensur Dlakic, Associate Professor at Montana State University: https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42197) demonstrates using t-SNE that “there are many people who are very similar in terms of their insurance parameters, yet some of them will file a claim and others will not.” What Tilii mentions is quite typical of insurance: drivers with the same priors (insurance parameters) have the same probability of an event, but whether that event actually occurs depends on chance and on how long we observe them.
Take, for instance, IoT and telematics data in insurance. It is quite common to analyze a driver’s behavior to predict whether they will file a claim in the future. If your observation period is too short (for instance, one year, as in this competition), even very bad drivers may have no claim, because the probability of such an event in a short period of time is low even for a bad driver. Similar ideas are discussed by Andy Harless (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/discussion/42735), who argues that the real task of the competition is to guess "the value of a latent continuous variable that determines which drivers are more likely to have accidents" because actually "making a claim is not a characteristic of a driver; it’s a result of chance."
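The effect of the observation period can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the competition data: the split into "good" and "bad" drivers, the annual claim rates of 0.03 and 0.10, and the Poisson claim process are all invented for illustration. Under a Poisson process with annual rate λ, the probability of at least one claim in t years is 1 - exp(-λt), so with λ = 0.10 about 90% of the bad drivers show no claim after a single year.

```python
import numpy as np

rng = np.random.default_rng(42)
n_drivers = 100_000

# Assumed latent risk: 20% "bad" drivers with a higher annual claim rate.
is_bad = rng.random(n_drivers) < 0.20
annual_rate = np.where(is_bad, 0.10, 0.03)

def share_with_claim(years):
    """Share of bad and good drivers with at least one claim in `years`."""
    # Claim counts follow a Poisson law with mean rate * years (assumption).
    claims = rng.poisson(annual_rate * years)
    has_claim = claims > 0
    return has_claim[is_bad].mean(), has_claim[~is_bad].mean()

bad_1y, good_1y = share_with_claim(1)
bad_10y, good_10y = share_with_claim(10)

print(f"1 year:   bad {bad_1y:.3f}  good {good_1y:.3f}")
print(f"10 years: bad {bad_10y:.3f}  good {good_10y:.3f}")
```

With a one-year window, the observed target is a very noisy signal of the latent rate, which is consistent with the point made by Tilii and Andy Harless: two drivers with identical parameters can end up with different labels purely by chance.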