The Kaggle Book

By: Konrad Banachewicz, Luca Massaron

Overview of this book

Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with an amazing community of data scientists, and gain valuable experience to help grow your career. The first book of its kind, The Kaggle Book assembles in one place the techniques and skills you’ll need for success in competitions, data science projects, and beyond. Two Kaggle Grandmasters walk you through modeling strategies you won’t easily find elsewhere, and the knowledge they’ve accumulated along the way. As well as Kaggle-specific tips, you’ll learn more general techniques for approaching tasks based on image, tabular, textual data, and reinforcement learning. You’ll design better validation schemes and work more comfortably with different evaluation metrics. Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you. Plus, join our Discord Community to learn along with more than 1,000 members and meet like-minded people!
Table of Contents (20 chapters)
Preface
Part I: Introduction to Competitions
Part II: Sharpening Your Skills for Competitions
Part III: Leveraging Competitions for Your Career
Other Books You May Enjoy
Index

Preface

Having competed on Kaggle for over ten years, both of us have experienced highs and lows over many competitions. We often found ourselves refocusing our efforts on different activities relating to Kaggle. Over time, we devoted ourselves not just to competitions but also to creating content and code based on the demands of the data science market and our own professional aspirations. At this point in our journey, we felt that our combined experience and still-burning passion for competitions could really help other participants who have just started, or who would like to get inspired, to get hold of the essential expertise they need, so they can start their own journey in data science competitions.

We then decided to work on this book with a purpose:

  • To offer, in a single place, the best tips for being competitive and approaching most of the problems you may find when participating in Kaggle and other data science competitions.
  • To offer enough suggestions to allow anyone to reach at least the Expert level in any Kaggle discipline: Competitions, Datasets, Notebooks, or Discussions.
  • To provide tips on how to learn the most from Kaggle and leverage this experience for professional growth in data science.
  • To gather in a single source the largest number of perspectives on the experience of participating in competitions, by interviewing Kaggle Masters and Grandmasters and listening to their stories.

In short, we have written a book that demonstrates how to participate in competitions successfully and make the most of all the opportunities that Kaggle offers. The book is also intended as a practical reference that saves you time and effort, through its selection of many competition tips and tricks that are hard to learn about and find on the internet or on Kaggle forums. Nevertheless, the book doesn’t limit itself to providing practical help; it also aspires to help you figure out how to boost your career in data science by participating in competitions.

Please be aware: this book doesn’t teach you data science from the basics. We don’t explain in detail how linear regression, random forests, or gradient boosting work; instead, we show how to use them in the best way and obtain the best results from them in a data problem. We expect solid foundations and at least a basic proficiency in data science topics and Python usage from our readers. If you are still a data science beginner, you need to supplement this book with other books on data science, machine learning, and deep learning, and train up on online courses, such as those offered by Kaggle itself or by MOOC platforms such as edX or Coursera.

If you want to start learning data science in a practical way, if you want to challenge yourself with tricky and intriguing data problems and simultaneously build a network of great fellow data scientists as passionate about their work in data as you are, this is indeed the book for you. Let’s get started!

Who this book is for

At the time of completion of this book, there are 96,190 Kaggle novices (users who have just registered on the website) and 67,666 Kaggle contributors (users who have just filled in their profile) enlisted in Kaggle competitions. This book has been written for all of them and for anyone else wanting to break the ice and start taking part in competitions on Kaggle and learning from them.

What this book covers

Part 1: Introduction to Competitions

Chapter 1, Introducing Kaggle and Other Data Science Competitions, discusses how competitive programming evolved into data science competitions. It explains why the Kaggle platform is the most popular site for these competitions and provides you with an idea about how it works.

Chapter 2, Organizing Data with Datasets, introduces you to Kaggle Datasets, the standard method of data storage on the platform. We discuss setup, gathering data, and utilizing it in your work on Kaggle.

Chapter 3, Working and Learning with Kaggle Notebooks, discusses Kaggle Notebooks, the baseline coding environment. We talk about the basics of Notebook usage, how to leverage the GCP environment, and how to use Notebooks to build up your data science portfolio.

Chapter 4, Leveraging Discussion Forums, allows you to familiarize yourself with discussion forums, the primary manner of communication and idea exchange on Kaggle.

Part 2: Sharpening Your Skills for Competitions

Chapter 5, Competition Tasks and Metrics, details how evaluation metrics for certain kinds of problems strongly influence the way you can operate when building your model solution in a data science competition. The chapter also addresses the large variety of metrics available in Kaggle competitions.

Chapter 6, Designing Good Validation, will introduce you to the importance of validation in data competitions, discussing overfitting, shake-ups, leakage, adversarial validation, different kinds of validation strategies, and strategies for your final submissions.

Chapter 7, Modeling for Tabular Competitions, discusses tabular competitions, mostly focusing on the more recent reality of Kaggle, the Tabular Playground Series. Tabular problems are standard practice for the majority of data scientists, and there is a lot to learn from Kaggle.

Chapter 8, Hyperparameter Optimization, explores how to extend the cross-validation approach to find the best hyperparameters for your models – in other words, those that can generalize in the best way on the private leaderboard – under the pressure and scarcity of time and resources that you experience in Kaggle competitions.

Chapter 9, Ensembling with Blending and Stacking Solutions, explains ensembling techniques for multiple models such as averaging, blending, and stacking. We will provide you with some theory, some practice, and some code examples you can use as templates when building your own solutions on Kaggle.

Chapter 10, Modeling for Computer Vision, discusses problems related to computer vision, one of the most popular topics in AI in general, and on Kaggle specifically. We demonstrate full pipelines for building solutions to challenges in image classification, object detection, and image segmentation.

Chapter 11, Modeling for NLP, focuses on the frequently encountered types of Kaggle challenges related to natural language processing. We demonstrate how to build an end-to-end solution for popular problems like open domain question answering.

Chapter 12, Simulation and Optimization Competitions, provides an overview of simulation competitions, a new class of contests gaining popularity on Kaggle over the last few years.

Part 3: Leveraging Competitions for Your Career

Chapter 13, Creating Your Portfolio of Projects and Ideas, explores ways you can stand out by showcasing your work on Kaggle itself and other sites in an appropriate way.

Chapter 14, Finding New Professional Opportunities, concludes the overview of how Kaggle can positively affect your career by discussing the best ways to leverage all your Kaggle experience in order to find new professional opportunities.

To get the most out of this book

The Python code in this book has been designed to be run on a Kaggle Notebook, without any installation on a local computer. Therefore, don’t worry about what machine you have available or what version of Python packages you should install.

All you need is a computer with access to the internet and a free Kaggle account. In fact, to run the code on a Kaggle Notebook (you will find instructions about the procedure in Chapter 3), you first need to open an account on Kaggle. If you don’t have one yet, just go to www.kaggle.com and follow the instructions on the website.

We link out to many different resources throughout the book that we think you will find useful. When referred to a link, explore it: you will find code available on public Kaggle Notebooks that you can reuse, or further materials to illustrate concepts and ideas that we have discussed in the book.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/The-Kaggle-Book. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801817479_ColorImages.pdf.

Conventions used

There are a few text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “The dataset will be downloaded to the Kaggle folder as a .zip archive – unpack it and you are good to go.”

A block of code is set as follows:

from google.colab import drive
drive.mount('/content/gdrive')

Any command-line input or output is written as follows:

I genuinely have no idea what the output of this sequence of words will be - it will be interesting to find out what nlpaug can do with this!

Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes. For example: “The specific limits at the time of writing are 100 GB per private dataset and a 100 GB total quota.”

Further notes, references, and links to useful places appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you’ve read The Kaggle Book, we’d love to hear your thoughts! Please go to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.

The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

  1. Scan the QR code or visit the link below

    https://packt.link/free-ebook/9781801817479

  2. Submit your proof of purchase
  3. That’s it! We’ll send your free PDF and other benefits to your email directly.