The Kaggle Workbook

By Konrad Banachewicz and Luca Massaron
Overview of this book

More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they’ve come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist. In this book, you’ll get up close and personal with four extensive case studies based on past Kaggle competitions. You’ll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering. You can use this workbook as a supplement alongside The Kaggle Book or on its own with resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.

Learnings from top solutions

In this section, we gather aspects of the top solutions that could allow us to rise above the level of the baseline solution. Keep in mind that the leaderboards (both public and private) in this competition were quite tight; this was the combined result of a few factors:

  • The noisy data - it was easy to get to 0.89 accuracy by correctly classifying a large part of the training data, after which each additional correct prediction moved the score up only marginally
  • The metric - accuracy can be tricky to ensemble for (see the voting sketch after this list)
  • The limited size of the data
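
To make the ensembling point concrete, the toy sketch below (synthetic predictions, not competition data) contrasts the two usual ways of combining classifiers when the metric is plain accuracy: hard voting on predicted labels versus soft voting on averaged probabilities. Because accuracy only counts discrete hits, the two schemes can land on noticeably different scores even for the same base models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 3, 1000, 5

# Synthetic ground truth and per-model class-probability predictions.
y_true = rng.integers(0, n_classes, size=n_samples)
probs = rng.dirichlet(np.ones(n_classes), size=(n_models, n_samples))
# Nudge every model toward the truth so each beats random guessing.
probs[:, np.arange(n_samples), y_true] += 0.3
probs /= probs.sum(axis=-1, keepdims=True)

# Hard voting: each model casts one discrete vote per sample.
votes = probs.argmax(axis=-1)  # shape: (n_models, n_samples)
hard = np.array(
    [np.bincount(v, minlength=n_classes).argmax() for v in votes.T]
)

# Soft voting: average the probabilities, then take the argmax.
soft = probs.mean(axis=0).argmax(axis=-1)

print("hard-voting accuracy:", (hard == y_true).mean())
print("soft-voting accuracy:", (soft == y_true).mean())
```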

Pretraining

The first and most obvious remedy to the issue of limited data size was pretraining: using more data. The Cassava competition had also been held a year earlier:

https://www.kaggle.com/competitions/cassava-disease/overview

With minimal adjustments, the data from the 2019 edition could be leveraged in the context of the current one.
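
As a rough illustration, the sketch below builds a single image-level table spanning both editions. The paths, the 2019 per-class folder names, and the folder-to-label mapping are assumptions about the two data layouts, so adjust them to your local copies:

```python
import pandas as pd
from pathlib import Path

# Assumed locations of the two competitions' data (adjust as needed).
OLD_DIR = Path("cassava-disease/train")                # 2019 edition
NEW_DIR = Path("cassava-leaf-disease-classification")  # 2020 edition

# Assumed mapping from the 2019 per-class folders to 2020 numeric labels.
FOLDER_TO_LABEL = {"cbb": 0, "cbsd": 1, "cgm": 2, "cmd": 3, "healthy": 4}

# 2019 edition: images stored in one folder per class.
old = pd.DataFrame(
    [(str(p), FOLDER_TO_LABEL[p.parent.name])
     for cls in FOLDER_TO_LABEL
     for p in (OLD_DIR / cls).glob("*.jpg")],
    columns=["image_path", "label"],
)

# 2020 edition: a train.csv with image_id and numeric label columns.
new = pd.read_csv(NEW_DIR / "train.csv")
new["image_path"] = (NEW_DIR / "train_images").as_posix() + "/" + new["image_id"]
new = new[["image_path", "label"]]

# One combined table that a standard image dataloader can consume.
combined = pd.concat([old, new], ignore_index=True)
print(combined["label"].value_counts())
```

Several competitors addressed the topic: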

  • A combined 2019 + 2020 dataset in TFRecords format was released in the forum: https... (a minimal reading sketch follows this list)
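
Consuming such TFRecords typically means declaring a feature specification and decoding the images inside a tf.data pipeline. The sketch below shows that general pattern; the feature names ("image" and "target"), the image size, and the file glob are assumptions about how the records might be serialized, not the verified schema of the shared files:

```python
import tensorflow as tf

# Assumed per-record schema: JPEG bytes plus an integer class label.
FEATURE_SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "target": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    # Decode one serialized record into an (image, label) pair.
    example = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, (512, 512)) / 255.0
    return image, example["target"]

files = tf.io.gfile.glob("train_tfrecords/*.tfrec")  # assumed location
dataset = (
    tf.data.TFRecordDataset(files)
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(2048)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```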