
Working with categorical variables


Categorical variables are a problem. On one hand, they provide valuable information; on the other hand, they usually arrive as text, either the literal category names or integers that correspond to them, like an index in a lookup table.

So, we clearly need to represent our text as integers for the model's sake, but we can't just use an ID field or encode the categories naively. This is because we need to avoid a problem similar to the one in the Creating binary features through thresholding recipe: if we hand the model plain integers, it will interpret them as continuous values, implying an ordering and spacing that the categories don't actually have.
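To see why a naive integer encoding misleads a model, consider what the codes imply about the distance between categories (a minimal sketch; the code mapping here is just for illustration):

>>> codes = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
>>> # to a model, virginica now looks twice as "far" from setosa as
>>> # versicolor does, an ordering the species don't actually have
>>> abs(codes['virginica'] - codes['setosa'])
2
>>> abs(codes['versicolor'] - codes['setosa'])
1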

Getting ready

The boston dataset won't be useful for this section. While it works well for feature binarization, it contains no categorical variables to encode. For that, the iris dataset will suffice.

For this to work, the problem needs to be turned on its head. Imagine a problem where the goal is to predict the sepal width; in that case, the species of the flower would probably be useful as a feature.

Let's get the data sorted first:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data
>>> y = iris.target

Now, with X and y defined as usual, we'll stack them into a single array so that we can operate on the data as one:

>>> import numpy as np
>>> d = np.column_stack((X, y))

How to do it...

Convert the species column (the last column of d) into three indicator features:

>>> from sklearn import preprocessing
>>> text_encoder = preprocessing.OneHotEncoder()
>>> text_encoder.fit_transform(d[:, -1:]).toarray()[:5]
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.]])

How it works...

The encoder creates an additional feature for each level of the categorical variable, and the value returned is a sparse matrix. The result is sparse almost by definition: each row of the new features is 0 everywhere except for the single column associated with that row's category, so it makes sense to store the data in a sparse matrix.
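We can check the sparsity claim directly; the fitted encoder returns a SciPy sparse matrix with exactly one stored entry per row (a quick verification, not part of the original recipe):

>>> species_sparse = text_encoder.transform(d[:, -1:])
>>> species_sparse.shape
(150, 3)
>>> species_sparse.nnz  # one nonzero entry per row
150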

text_encoder is now a fitted, standard scikit-learn transformer, which means that it can be reused on new data. Here, three new samples whose value is 1 all map to the second indicator column:

>>> text_encoder.transform(np.ones((3, 1))).toarray()
array([[ 0.,  1.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.]])
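To close the loop on the problem posed in the Getting ready section, here is a minimal sketch (not part of the original recipe) of assembling the reframed dataset, where sepal width becomes the target and the encoded species becomes a feature:

>>> species_features = text_encoder.transform(d[:, -1:]).toarray()
>>> sepal_width = X[:, 1]  # sepal width is the second column of X
>>> species_features.shape, sepal_width.shape
((150, 3), (150,))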

There's more...

Other options exist for encoding categorical variables in scikit-learn and Python at large. DictVectorizer is a good option if you want to limit your project's dependencies to scikit-learn alone and you have a fairly simple encoding scheme. However, if you require more sophisticated categorical encoding, patsy is a very good option.

DictVectorizer

Another option is to use DictVectorizer. This can be used to directly convert strings to features:

>>> from sklearn.feature_extraction import DictVectorizer
>>> dv = DictVectorizer()
>>> my_dict = [{'species': iris.target_names[i]} for i in y]
>>> dv.fit_transform(my_dict).toarray()[:5]
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 1.,  0.,  0.]])

Tip

Dictionaries can be viewed as sparse matrices: they only contain entries for the nonzero values.
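A convenient side effect of DictVectorizer is that it records human-readable feature names. After fitting, the feature_names_ attribute shows which column corresponds to which category:

>>> dv.feature_names_
['species=setosa', 'species=versicolor', 'species=virginica']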

Patsy

patsy is another package that is useful for encoding categorical variables. Often used in conjunction with StatsModels, patsy can turn an array of strings into a design matrix.

Tip

This section does not directly pertain to scikit-learn, so it can be skipped without affecting your understanding of how scikit-learn works.

For example, dm = patsy.dmatrix("x + y", data) will create the appropriate indicator columns if x or y (looked up in data) are strings. If they aren't, wrapping a variable as C(x) inside the formula signals that it should be treated as a categorical variable.

For example, iris.target is an array of integers, so patsy would interpret it as a continuous variable if we didn't tell it otherwise. Therefore, wrap it in C():

>>> import patsy
>>> patsy.dmatrix("0 + C(species)", {'species': iris.target})
DesignMatrix with shape (150, 3)
  C(species)[0]  C(species)[1]  C(species)[2]
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
              1              0              0
[...]
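If the column already contains strings rather than integers, patsy treats it as categorical automatically, without the C() wrapper; here is a quick sketch using the iris species names:

>>> labels = iris.target_names[iris.target]  # string labels such as 'setosa'
>>> dm = patsy.dmatrix("0 + species", {'species': labels})
>>> dm.shape  # three indicator columns, as before
(150, 3)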