scikit-learn Cookbook

By: Trent Hauck

Overview of this book

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. Its consistent API and plethora of features help solve a wide range of machine learning problems.

The book starts by walking through different methods to prepare your data, be it a dataset with missing values or text columns that require the categories to be turned into indicator variables. After the data is ready, you'll learn different techniques aligned with different objectives, be it a dataset with known outcomes such as sales by state, or more complicated problems such as clustering similar customers. Finally, you'll learn how to polish your algorithm to ensure that it's both accurate and resilient to new datasets.

Preface

This book is designed in the same way that many data science and analytics projects play out. First, we need to acquire data; the data is often messy, incomplete, or incorrect in some way. Therefore, we spend the first chapter talking about strategies for dealing with bad data and with other problems that arise from data. For example, what happens if we have too many features? How do we handle that? The first chapter is your guide. The meat of the book will walk you through various algorithms and how to work them into your workflow. Finally, we'll end with the postmodel workflow. That final chapter is largely independent of the others, and its techniques can be applied to any of the algorithms you'll learn before it.

What this book covers

Chapter 1, Premodel Workflow, walks you through preparing a dataset for modeling and shows how scikit-learn can help to ease the burden of preprocessing.

Chapter 2, Working with Linear Models, discusses how many problems can be viewed as linear models upon the appropriate application of a transformation, and therefore walks you through what may be the most used class of models.

Chapter 3, Building Models with Distance Metrics, covers a range of techniques that work by measuring the similarity between data points. Because similarity and distance are closely related, clustering can often be used as long as a distance function can be defined.

Chapter 4, Classifying Data with scikit-learn, focuses on the various methods within scikit-learn that are used to assign a data point to one of N classes.

Chapter 5, Postmodel Workflow, teaches us how we can take a basic model produced from one of the recipes and tune it so that we can achieve better results than we could with the basic model.

What you need for this book

Here are the contents of the requirements.txt file that will get the environment set up. This will allow you to follow along with the code in the book.

I've also included a conda requirements file; this method may be easier for less-experienced Python developers:

dateutil==2.1
ipython==2.2.0
ipython-notebook==2.1.0
jinja2==2.7.3
markupsafe==0.18
matplotlib==1.3.1
numpy==1.8.1
patsy==0.3.0
pandas==0.14.1
pip==1.5.6
pydot==1.0.28
pyparsing==1.5.6
pytz==2014.4
pyzmq==14.3.1
scikit-learn==0.15.0
scipy==0.14.0
setuptools==3.6
six==1.7.3
ssl_match_hostname==3.4.0.2
tornado==3.2.2
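
As a rough way to confirm that an environment matches pins like the ones listed above, the following is a minimal sketch using only the standard library (Python 3.8 or later). The helper names `parse_pins`, `installed_version`, and `find_mismatches` are hypothetical and not part of the book's code. Note also that some of the names above (for example, `dateutil` or `ipython-notebook`) may be conda package names that differ from the distribution names pip reports, so a "not installed" result is not necessarily conclusive.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict; blank lines and comments are skipped."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every pin the environment does not satisfy."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # A few pins copied from the list above; paste in the full list as needed.
    sample = "numpy==1.8.1\nscikit-learn==0.15.0\nscipy==0.14.0\n"
    for name, (wanted, found) in find_mismatches(parse_pins(sample)).items():
        print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

The `lookup` parameter is there only so that the comparison can be exercised without touching the real environment; by default it queries the installed distributions.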

Who this book is for

This book can help budding analysts who are familiar with Python to take the next step into machine learning with scikit-learn. It is assumed that you are familiar with Python, but beyond that we'll touch on many of the important aspects of scikit-learn. On top of that, we'll discuss enough theory to help you ask the next question after you've figured out the nuances of scikit-learn.

Sections

This book contains the following sections:

Getting ready

This section tells us what to expect in the recipe, and describes how to set up any software or any preliminary settings needed for the recipe.

How to do it…

This section contains the steps to be followed for "cooking" the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.

See also

This section may contain references to other material related to the recipe.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "From within IPython, run datasets.*?, which will list everything available within the datasets module."

Any command-line input or output is written as follows:

>>> transformed = dl.fit_transform(iris_data[::2])
>>> transformed[:5]

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Notice the peak around 0. This will naturally lead to the zero coefficients in lasso regression."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from: https://www.packtpub.com/sites/default/files/downloads/9485OS_GraphicsBundle.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.