Supervised Machine Learning with Python

By: Taylor Smith

Overview of this book

Supervised machine learning is used across a wide range of sectors, such as finance, online advertising, and analytics, to train systems that make pricing predictions, campaign adjustments, customer recommendations, and much more, all by learning from labeled data. This makes it crucial to know how a machine 'learns' under the hood. This book will guide you through the implementation and nuances of many popular supervised machine learning algorithms, and help you understand how they work. You'll embark on this journey with a quick overview of supervised learning and see how it differs from unsupervised learning. You'll then explore parametric models, such as linear and logistic regression, non-parametric methods, such as decision trees, and a variety of clustering techniques that facilitate decision-making and predictions. As you advance, you'll work hands-on with recommender systems, which are widely used by online companies to increase user interaction and enrich shopping potential. Finally, you'll wrap up with a brief foray into neural networks and transfer learning. By the end of this book, you'll be equipped with hands-on techniques and will have gained the practical know-how you need to quickly and effectively apply algorithms to solve new problems.

Model evaluation and data splitting


In this chapter, we will define what it means to evaluate a model, cover best practices for gauging the efficacy of a model, explain how to split your data, and discuss several considerations you'll need to make when preparing your split.

It is important to understand some core best practices of machine learning. One of our primary tasks as ML practitioners is to create a model that is effective for making predictions on new data. But how do we know that a model is good? If you recall from the previous section, we defined supervised learning as a task that learns a function from labeled data such that we can approximate the target for new data. Therefore, we can test our model's effectiveness by determining how it performs on data it has never seen, just as if it were taking a test.

Out-of-sample versus in-sample evaluation

Let's say we are training a model on a simple classification task. Here's some nomenclature you'll need: the in-sample data is the data the model learns from, and the out-of-sample data is data the model has never seen before. One of the pitfalls many new data scientists fall into is measuring their model's effectiveness on the same data that the model learned from. What this ends up doing is rewarding the model's ability to memorize rather than its ability to generalize, which is a huge difference.

If you take a look at the two examples here, the first presents a sample that the model learned from, and we can be reasonably confident that the model is going to predict class one, which would be correct. The second example presents a new sample, which appears to resemble the zero class more closely. Of course, the model doesn't know that, but a good model should be able to recognize and generalize this pattern.

So, now the question is how we can set aside out-of-sample data so that the model can prove its worth. Even more precisely, our out-of-sample data needs to be labeled; new or unlabeled data won't suffice, because we have to know the actual answer in order to determine how correct the model is. One of the ways we handle this in machine learning is to split our data into two parts: a training set and a testing set. The training set is what our model will learn on; the testing set is what our model will be evaluated on. How much data you have matters a lot. In fact, in the next sections, when we discuss the bias-variance trade-off, you'll see how some models require much more data to learn from than others do.
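As a minimal sketch of this idea (the dataset and classifier here are illustrative assumptions, not taken from the book's example), we can hold out a labeled test set, fit on the training portion only, and then compare the in-sample score against the out-of-sample score:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# illustrative data and model; most estimators will show the same pattern
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# fit on the training set only
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# an unconstrained tree can memorize its training data, so the in-sample
# score is typically near-perfect; the out-of-sample score is what matters
print("In-sample accuracy:     %.3f" % clf.score(X_train, y_train))
print("Out-of-sample accuracy: %.3f" % clf.score(X_test, y_test))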

Another thing to keep in mind is that if some of your variables have highly skewed distributions, or you have rare categorical levels embedded throughout, or even class imbalance in your y vector, you may end up with a bad split. As an example, let's say you have a binary feature in your X matrix that indicates a very rare sensor event, one that trips only once in every 10,000 observations. If you randomly split your data and all of the positive sensor events end up in your test set, then your model will learn from the training data that the sensor is never tripped and may deem it an unimportant variable when, in reality, it could be hugely important and hugely predictive. You can control these types of issues with stratification.
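As a rough sketch of what stratification buys you (the synthetic, imbalanced data below is an illustrative assumption, not from the book), passing stratify=y to train_test_split preserves the class ratio in both halves, whereas a purely random split may not:

import numpy as np
from sklearn.model_selection import train_test_split

# synthetic target with roughly 1% positives (illustrative only)
rng = np.random.RandomState(42)
X = rng.rand(2000, 5)
y = (rng.rand(2000) < 0.01).astype(int)

# random split: the handful of positives may land unevenly across the halves
_, _, y_train_rand, y_test_rand = train_test_split(X, y, test_size=0.2,
                                                   random_state=1)

# stratified split: both halves keep roughly the same positive rate
_, _, y_train_strat, y_test_strat = train_test_split(X, y, test_size=0.2,
                                                     random_state=1,
                                                     stratify=y)

print("Overall positive rate:   %.4f" % y.mean())
print("Random (train/test):     %.4f / %.4f" % (y_train_rand.mean(), y_test_rand.mean()))
print("Stratified (train/test): %.4f / %.4f" % (y_train_strat.mean(), y_test_strat.mean()))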

 

Splitting made easy

Here, we have a simple snippet that demonstrates how we can use the scikit-learn library to split our data into training and test sets. We load the data from the datasets module and pass both X and y into the split function. The train_test_split function from the model_selection submodule in sklearn accepts any number of arrays; with test_size=0.2, 20% of the data goes to the test set and the remaining 80% is used for training. We also set random_state so that our split is reproducible if we ever have to prove exactly how we obtained it. Finally, there's the stratify keyword, which we're not using here, but which can be used to stratify a split for rare features or an imbalanced y vector:

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

boston_housing = load_boston()  # load data
X, y = boston_housing.data, boston_housing.target  # get X, y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# show num samples (there are no duplicates in either set!)
print("Num train samples: %i" % X_train.shape[0])
print("Num test samples: %i" % X_test.shape[0])

The output of the preceding code is as follows:

Num train samples: 404
Num test samples: 102
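
To close the loop on out-of-sample evaluation, the model is then fit on X_train and scored against the held-out X_test. The following continuation is a sketch rather than part of the book's snippet; it assumes the variables from the code above and a scikit-learn version that still ships load_boston:

from sklearn.linear_model import LinearRegression

# fit on the training split only, then evaluate on the unseen test split
model = LinearRegression().fit(X_train, y_train)
print("Test R^2: %.3f" % model.score(X_test, y_test))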