The Kaggle Book

By: Konrad Banachewicz, Luca Massaron

Overview of this book

Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with an amazing community of data scientists, and gain valuable experience to help grow your career. The first book of its kind, The Kaggle Book assembles in one place the techniques and skills you’ll need for success in competitions, data science projects, and beyond. Two Kaggle Grandmasters walk you through modeling strategies you won’t easily find elsewhere, and the knowledge they’ve accumulated along the way. As well as Kaggle-specific tips, you’ll learn more general techniques for approaching tasks based on image, tabular, textual data, and reinforcement learning. You’ll design better validation schemes and work more comfortably with different evaluation metrics. Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you. Plus, join our Discord Community to learn along with more than 1,000 members and meet like-minded people!
Table of Contents (20 chapters)
Preface
Part I: Introduction to Competitions
Part II: Sharpening Your Skills for Competitions
Part III: Leveraging Competitions for Your Career
Other Books You May Enjoy
Index

The rise of data science competition platforms

Competitive programming has a long history, starting in the 1970s with the first iterations of the ICPC, the International Collegiate Programming Contest. In the original ICPC, small teams from universities and companies participated in a competition that required solving a series of problems using a computer program (at the beginning, participants coded in FORTRAN). To achieve a good final rank, teams had to display good skills in teamwork, problem solving, and programming.

The experience of participating in the heat of such a competition and the opportunity to stand in the spotlight before recruiting companies provided the students with ample motivation, and it made the competition popular for many years. Among ICPC finalists, a few have become renowned: Adam D’Angelo, the former CTO of Facebook and founder of Quora; Nikolai Durov, the co-founder of Telegram Messenger; and Matei Zaharia, the creator of Apache Spark. Together with many other professionals, they all share the same experience: having taken part in an ICPC.

After the ICPC, programming competitions flourished, especially after 2000, when remote participation became more feasible, allowing international competitions to run more easily and at a lower cost. The format is similar for most of these competitions: there is a series of problems and you have to code solutions to them. The winners are given a prize, but they also make themselves known to recruiting companies or simply become famous.

Typically, problems in competitive programming range from combinatorics and number theory to graph theory, algorithmic game theory, computational geometry, string analysis, and data structures. Recently, problems relating to artificial intelligence have successfully emerged, in particular after the launch of the KDD Cup, a contest in knowledge discovery and data mining, held by the Association for Computing Machinery’s (ACM’s) Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) during its annual conference (https://kdd.org/conferences).

The first KDD Cup, held in 1997, involved a direct marketing problem centered on lift curve optimization, and it started a long series of competitions that continues today. You can find the archives containing datasets, instructions, and winners at https://www.kdd.org/kdd-cup. Here is the latest available at the time of writing: https://ogb.stanford.edu/kddcup2021/. KDD Cups proved quite effective in establishing best practices: many published papers describe solutions and techniques, and the shared competition datasets have been useful to many practitioners for experimentation, education, and benchmarking.

The successful examples of both competitive programming events and the KDD Cup inspired companies (such as Netflix) and entrepreneurs (such as Anthony Goldbloom, the founder of Kaggle) to create the first data science competition platforms, where companies can host data science challenges that are hard to solve and might benefit from crowdsourcing. Given that there is no golden approach that works for all the problems in data science, many problems require a time-consuming approach that can be summed up as “try all that you can try.”

In the long run, no algorithm can beat all the others on all problems, as stated by the No Free Lunch theorem by David Wolpert and William Macready. The theorem tells you that each machine learning algorithm performs well if and only if its hypothesis space comprises the solution. Consequently, as you cannot know beforehand whether a machine learning algorithm can best tackle your problem, you have to try it, testing it directly on your problem before being assured that you are doing the right thing. There are no theoretical shortcuts or other holy grails of machine learning; only empirical experimentation can tell you what works.

For more details, you can look up the No Free Lunch theorem for a theoretical explanation of this practical truth. Here is a complete article from Analytics India Magazine on the topic: https://analyticsindiamag.com/what-are-the-no-free-lunch-theorems-in-data-science/.
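The practical point of the No Free Lunch theorem can be illustrated with a small experiment. The following is a toy sketch in the spirit of the theorem, not a proof of it: the two learners, the training points, and the test inputs are all hypothetical and chosen only to keep the arithmetic easy to follow. It averages each learner's accuracy over every possible labeling of the unseen inputs, and then scores both learners on two specific problems.

```python
from itertools import product


def majority_learner(train):
    """Always predict the most frequent training label (ties go to 0)."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones > len(train) else 0
    return lambda x: guess


def nearest_neighbor_learner(train):
    """Predict the label of the closest training point (earliest point wins ties)."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict


def accuracy(model, points):
    """Fraction of (x, label) pairs that the model classifies correctly."""
    return sum(model(x) == label for x, label in points) / len(points)


# Hypothetical toy data: five labeled training points and four unseen inputs.
train = [(0, 0), (1, 0), (2, 0), (6, 1), (7, 1)]
test_inputs = [3, 4, 5, 8]
learners = {"majority": majority_learner, "nearest_neighbor": nearest_neighbor_learner}
models = {name: fit(train) for name, fit in learners.items()}

# Average accuracy over every possible labeling of the unseen inputs.
average_accuracy = {}
for name, model in models.items():
    scores = [
        accuracy(model, list(zip(test_inputs, labels)))
        for labels in product([0, 1], repeat=len(test_inputs))
    ]
    average_accuracy[name] = sum(scores) / len(scores)

# Two specific problems: a threshold rule, and a target that is always 0.
threshold_problem = [(x, int(x >= 5)) for x in test_inputs]
constant_problem = [(x, 0) for x in test_inputs]

for name, model in models.items():
    print(name, average_accuracy[name],
          accuracy(model, threshold_problem), accuracy(model, constant_problem))
```

Averaged over all sixteen labelings, both learners score exactly 0.5. On the threshold problem the nearest-neighbor learner is perfect and the majority learner gets half right, while on the constant problem the ranking flips. Which learner is "better" depends entirely on the problem, which is why you end up testing candidates directly on your own data.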

Crowdsourcing proves ideal in such conditions, where you need to test algorithms and data transformations extensively to find the best possible combinations, but you lack the manpower and computing power for it. That’s why, for instance, governments and companies resort to competitions in order to advance in certain fields:

  • On the government side, we can cite DARPA and its many competitions surrounding self-driving cars, robotic operations, machine translation, speaker identification, fingerprint recognition, information retrieval, OCR, automatic target recognition, and many others.
  • On the business side, we can cite a company such as Netflix, which relied on the outcome of a competition to improve its algorithm for predicting user movie selection.

The Netflix competition was based on the idea of improving existing collaborative filtering. Its purpose was simply to predict the rating a user would give a film, based solely on the ratings that they gave other films, without knowing specifically who the user was or what the films were. Since no user description, movie title, or movie description was available (all being replaced with identity codes), the competition required entrants to develop smart ways to use the past ratings available. The grand prize of US $1,000,000 was to be awarded only if the solution could improve the existing Netflix algorithm, Cinematch, above a certain threshold.
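To make the task concrete, here is a minimal sketch of predicting a rating from past ratings alone, using only opaque user and movie codes. It is a generic baseline (global mean plus a per-user offset plus a per-movie offset), not Cinematch and not the winning solution; the function names and the tiny ratings table are made up for illustration.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit a global mean, then per-user and per-movie offsets, from (user, movie, rating) triples."""
    global_mean = sum(rating for _, _, rating in ratings) / len(ratings)

    # How far each user's ratings sit above or below the global mean, on average.
    user_residuals = defaultdict(list)
    for user, _, rating in ratings:
        user_residuals[user].append(rating - global_mean)
    user_bias = {u: sum(v) / len(v) for u, v in user_residuals.items()}

    # What is left for each movie once the user's offset is accounted for.
    movie_residuals = defaultdict(list)
    for user, movie, rating in ratings:
        movie_residuals[movie].append(rating - global_mean - user_bias[user])
    movie_bias = {m: sum(v) / len(v) for m, v in movie_residuals.items()}

    return global_mean, user_bias, movie_bias


def predict(model, user, movie):
    """Predict a rating on a 1-5 scale; unseen users or movies get a zero offset."""
    global_mean, user_bias, movie_bias = model
    raw = global_mean + user_bias.get(user, 0.0) + movie_bias.get(movie, 0.0)
    return min(5.0, max(1.0, raw))


# Hypothetical ratings: (user code, movie code, stars). No names, no titles.
ratings = [
    (1, 10, 5), (1, 11, 3),
    (2, 10, 4), (2, 12, 2),
    (3, 11, 1), (3, 12, 3),
]
model = fit_baseline(ratings)

print(predict(model, 1, 12))   # generous user, average movie
print(predict(model, 3, 10))   # harsh user, well-liked movie
print(predict(model, 99, 10))  # unseen user falls back to mean plus movie offset
```

With this data the global mean is 3.0, user 1 rates one star above it, and movie 12 has no offset, so the first prediction is 4.0. Competitive entries went far beyond such a baseline, but the input has the same shape: identity codes and past ratings, nothing else.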

The competition ran from 2006 to 2009 and saw victory for a team formed by merging several earlier competition teams: a team from Commendo Research & Consulting GmbH, Andreas Töscher and Michael Jahrer, quite renowned also in Kaggle competitions; two researchers from AT&T Labs; and two others from Yahoo!. In the end, winning the competition required so much computational power and the ensembling of so many different solutions that teams were forced to merge in order to keep pace. This situation was also reflected in the actual usage of the solution by Netflix, which preferred not to implement it, but simply took the most interesting insights from it in order to improve its existing Cinematch algorithm. You can read more about it in this Wired article: https://www.wired.com/2012/04/netflix-prize-costs/.

At the end of the Netflix competition, what mattered was not the solution per se, which was quickly superseded by the change in Netflix’s business focus from DVDs to online movies. The real benefit lay in the insights gained from the competition, both for the participants, who gained a huge reputation in collaborative filtering, and for the company, which could transfer its improved recommendation knowledge to its new business.

The Kaggle competition platform

Companies other than Netflix have also benefited from data science competitions. The list is long, but we can cite a few examples where the company running the competition reported a clear benefit from it. For instance:

  • The insurance company Allstate was able to improve the actuarial models built by its own experts, thanks to a competition involving hundreds of data scientists (https://www.kaggle.com/c/ClaimPredictionChallenge)
  • As another well-documented example, General Electric was able to improve by 40% on the industry-standard performance (measured by the root mean squared error metric) for predicting arrival times of airline flights, thanks to a similar competition (https://www.kaggle.com/c/flight)
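For readers unfamiliar with the metric mentioned above, the following sketch shows how root mean squared error (RMSE) is computed and how a relative improvement such as 40% is derived from two RMSE values. The arrival-delay figures are invented so the arithmetic is easy to check; they are not data from the General Electric competition.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which the new error is lower than the baseline error."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical arrival delays in minutes, with two sets of predictions.
actual = [10, 20, 30, 40]
baseline_predictions = [15, 15, 35, 35]  # every prediction is off by 5 minutes
improved_predictions = [13, 17, 33, 37]  # every prediction is off by 3 minutes

baseline_rmse = rmse(actual, baseline_predictions)
improved_rmse = rmse(actual, improved_predictions)
improvement = relative_improvement(baseline_rmse, improved_rmse)

print(f"baseline RMSE: {baseline_rmse:.1f}")
print(f"improved RMSE: {improved_rmse:.1f}")
print(f"relative improvement: {improvement:.0%}")
```

In this made-up case the RMSE drops from 5.0 to 3.0 minutes, a 40% relative improvement. Because errors are squared before averaging, RMSE penalizes a few large misses more heavily than many small ones.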

The Kaggle competition platform has to this day held hundreds of competitions, and these two are just a couple of examples of companies that used them successfully. Let’s take a step back from specific competitions for a moment and talk about the Kaggle company, which is the common thread through this book.

A history of Kaggle

Kaggle took its first steps in February 2010, thanks to Anthony Goldbloom, an Australian-trained economist with a degree in Economics and Econometrics. After working at Australia’s Department of the Treasury and the Research department at the Reserve Bank of Australia, Goldbloom interned in London at The Economist, the international weekly newspaper on current affairs, international business, politics, and technology. At The Economist, he had occasion to write an article about big data, which inspired his idea to build a competition platform that could crowdsource the best analytical experts to solve interesting machine learning problems (https://www.smh.com.au/technology/from-bondi-to-the-big-bucks-the-28yearold-whos-making-data-science-a-sport-20111104-1myq1.html). Since crowdsourcing dynamics played an important part in the business idea for this platform, he chose the name Kaggle, which rhymes with gaggle, a flock of geese; the goose is also the symbol of the platform.

After moving to Silicon Valley in the USA, his Kaggle start-up received $11.25 million in Series A funding from a round led by Khosla Ventures and Index Ventures, two renowned venture capital firms. The first competitions were rolled out, the community grew, and some of the initial competitors came to be quite prominent, such as Jeremy Howard, the Australian data scientist and entrepreneur, who, after winning a couple of competitions on Kaggle, became the President and Chief Scientist of the company.

Jeremy Howard left his position as President in December 2013 and established a new start-up, fast.ai (www.fast.ai), offering machine learning courses and a deep learning library for coders.

At the time, there were some other prominent Kagglers (the name for frequent participants in competitions held by Kaggle) such as Jeremy Achin and Thomas de Godoy. After reaching the top 20 in the global rankings on the platform, they promptly decided to retire and found their own company, DataRobot. Soon after, they started hiring their employees from among the best participants in the Kaggle competitions in order to instill the best machine learning knowledge and practices into the software they were developing. Today, DataRobot is one of the leading companies in developing AutoML solutions (software for automatic machine learning).

The Kaggle competitions claimed more and more attention from a growing audience. Even Geoffrey Hinton, the “godfather” of deep learning, participated in (and won) a Kaggle competition hosted by Merck in 2012 (https://www.kaggle.com/c/MerckActivity/overview/winners). Kaggle was also the platform where François Chollet launched his deep learning package Keras during the Otto Group Product Classification Challenge (https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/13632) and Tianqi Chen launched XGBoost, a speedier and more accurate version of gradient boosting machines, in the Higgs Boson Machine Learning Challenge (https://www.kaggle.com/c/higgs-boson/discussion/10335).

Besides Keras, François Chollet has also provided the most useful and insightful perspective on how to win a Kaggle competition in an answer of his on the Quora website: https://www.quora.com/Why-has-Keras-been-so-successful-lately-at-Kaggle-competitions.

Fast iterations of multiple attempts, guided by empirical (more than theoretical) evidence, are actually all that you need. We don’t think that there are many more secrets to winning a Kaggle competition than the ones he pointed out in his answer.

Notably, François Chollet also hosted his own competition on Kaggle (https://www.kaggle.com/c/abstraction-and-reasoning-challenge/), which is widely recognized as being the first general AI competition in the world.

Competition after competition, the community revolving around Kaggle grew to reach one million members in 2017, the same year that Fei-Fei Li, Chief Scientist at Google, announced during her keynote at Google Next that Google was going to acquire Kaggle. Since then, Kaggle has been part of Google.

Today, the Kaggle community is still active and growing. In a tweet of his (https://twitter.com/antgoldbloom/status/1400119591246852096), Anthony Goldbloom reported that most Kaggle users, besides participating in competitions, have downloaded public data (Kaggle has become an important data hub), created a public Notebook in Python or R, or learned something new in one of the courses offered:

Figure 1.1: A bar chart showing how users used Kaggle in 2020, 2019, and 2018

Through the years, Kaggle has offered many of its participants even more opportunities, such as:

  • Creating their own company
  • Launching machine learning software and packages
  • Getting interviews in magazines (https://www.wired.com/story/solve-these-tough-data-problems-and-watch-job-offers-roll-in/)
  • Writing machine learning books (https://twitter.com/antgoldbloom/status/745662719588589568)
  • Finding their dream job

And, most importantly, learning more about the skills and technicalities involved in data science.

Other competition platforms

Though this book focuses on competitions on Kaggle, we cannot forget that many data competitions are held on private platforms or on other competition platforms. In truth, most of the information you will find in this book will also hold for other competitions, since they essentially all operate under similar principles and the benefits for the participants are more or less the same.

Although many other platforms are localized in specific countries or are specialized only for certain kinds of competitions, for completeness we will briefly introduce some of them, at least those we have some experience and knowledge of:

  • DrivenData (https://www.drivendata.org/competitions/) is a crowdsourcing competition platform devoted to social challenges (see https://www.drivendata.co/blog/intro-to-machine-learning-social-impact/). The company itself is a social enterprise whose aim is to bring data science solutions to organizations tackling the world’s biggest challenges, thanks to data scientists building algorithms for social good. For instance, as you can read in this article, https://www.engadget.com/facebook-ai-hate-speech-covid-19-160037191.html, Facebook has chosen DrivenData for its competition on building models against hate speech and misinformation.
  • Numerai (https://numer.ai/) is an AI-powered, crowdsourced hedge fund based in San Francisco. It hosts a weekly tournament in which you can submit your predictions on the hedge fund’s obfuscated data and earn prizes in the company’s cryptocurrency, Numeraire.
  • CrowdANALYTIX (https://www.crowdanalytix.com/community) is a bit less active now, but this platform used to host quite a few challenging competitions a short while ago, as you can read from this blog post: https://towardsdatascience.com/how-i-won-top-five-in-a-deep-learning-competition-753c788cade1. The community blog is quite interesting for getting an idea of what challenges you can find on this platform: https://www.crowdanalytix.com/jq/communityBlog/listBlog.html.
  • Signate (https://signate.jp/competitions) is a Japanese data science competition platform. It is quite rich in contests and it offers a ranking system similar to Kaggle’s (https://signate.jp/users/rankings).
  • Zindi (https://zindi.africa/competitions) is a data science competition platform from Africa. It hosts competitions focused on solving Africa’s most pressing social, economic, and environmental problems.
  • Alibaba Cloud (https://www.alibabacloud.com/campaign/tianchi-competitions) is a Chinese cloud computing and AI provider that has launched the Tianchi Academic competitions, partnering with academic conferences such as SIGKDD, IJCAI-PRICAI, and CVPR and featuring challenges such as image-based 3D shape retrieval, 3D object reconstruction, and instance segmentation.
  • Analytics Vidhya (https://datahack.analyticsvidhya.com/) is the largest Indian community for data science, offering a platform for data science hackathons.
  • CodaLab (https://codalab.lri.fr/) is a French-based data science competition platform, created as a joint venture between Microsoft and Stanford University in 2013. It features a free cloud-based notebook called Worksheets (https://worksheets.codalab.org/) for knowledge sharing and reproducible modeling.

Other minor platforms include CrowdAI (https://www.crowdai.org/) from École Polytechnique Fédérale de Lausanne in Switzerland, InnoCentive (https://www.innocentive.com/), Grand-Challenge (https://grand-challenge.org/) for biomedical imaging, DataFountain (https://www.datafountain.cn/business?lang=en-US), and OpenML (https://www.openml.org/), and the list could go on. You can always find a large list of ongoing major competitions at the Russian community Open Data Science (https://ods.ai/competitions) and even discover new competition platforms from time to time.

You can see an overview of running competitions on the mlcontests.com website, along with the current costs for renting GPUs. The website is often updated and it is an easy way to get a glance at what’s going on with data science competitions across different platforms.

Kaggle is always the best platform where you can find the most interesting competitions and obtain the widest recognition for your competition efforts. However, picking up a challenge outside of it makes sense, and we recommend it as a strategy, when you find a competition matching your personal and professional interests. As you can see, there are quite a lot of alternatives and opportunities besides Kaggle, which means that if you consider more competition platforms alongside Kaggle, you can more easily find a competition that might interest you because of its specialization or data.

In addition, you can expect less competitive pressure during these challenges (and consequently a better ranking or even winning something), since they are less known and advertised. Just expect less sharing among participants, since no other competition platform has reached the same richness of sharing and networking opportunities as Kaggle.