scikit-learn Cookbook - Second Edition

By : Trent Hauck

Overview of this book

Python is quickly becoming the go-to language for analysts and data scientists due to its simplicity and flexibility, and within the Python data space, scikit-learn is the unequivocal choice for machine learning. This book includes walkthroughs and solutions to common as well as not-so-common problems in machine learning, and shows how scikit-learn can be leveraged to perform various machine learning tasks effectively. The second edition begins by taking you through recipes on evaluating the statistical properties of data and generating synthetic data for machine learning modelling. As you progress through the chapters, you will come across recipes that teach you to implement techniques like data pre-processing, linear regression, logistic regression, K-NN, Naïve Bayes, classification, decision trees, ensembles, and much more. Furthermore, you'll learn to optimize your models with multi-class classification, cross-validation, and model evaluation, and dive deeper into implementing deep learning with scikit-learn. Along with covering the enhanced features on model selection, the API, and new features like classifiers, regressors, and estimators, the book also contains recipes on evaluating and fine-tuning the performance of your model. By the end of this book, you will have explored a plethora of features offered by scikit-learn for Python to solve any machine learning problem you come across.

Machine learning overview – classification versus regression

In this recipe, we will examine how regression can be viewed as being very similar to classification. This is done by reconsidering the categorical labels of classification as real numbers. In this section, we will also look at several aspects of machine learning from a very broad perspective, including the purpose of scikit-learn. scikit-learn allows us to find models that work well incredibly quickly. We do not have to work out all the details of a model, or optimize it, until we have found one that works well. Consequently, your company saves precious development time and computational resources, because scikit-learn makes it possible to develop models relatively quickly.

The purpose of scikit-learn

As we have seen before, scikit-learn allowed us to find, fairly quickly, a model that works. We tried SVC, logistic regression, and a few KNN classifiers. Through cross-validation, we selected models that performed better than others. In industry, after trying SVMs and logistic regression, we might focus on SVMs and optimize them further. Thanks to scikit-learn, we saved a lot of time and resources, including mental energy. After optimizing the SVM at work on a realistic dataset, we might re-implement it for speed in Java or C and gather more data.

Supervised versus unsupervised

Classification and regression are supervised, as we know the target variables for the observations. Clustering, which creates regions in space for each category without being given any labels, is unsupervised learning.
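As a minimal sketch of the distinction, assuming the iris data used throughout this chapter: the supervised estimator is fitted with both observations and labels, while the clustering estimator is fitted with observations only. KMeans and n_clusters=3 are choices made here for illustration and are not part of the recipe.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Supervised: the estimator sees both the observations and the labels
svc = SVC(kernel='linear').fit(X, y)

# Unsupervised: the estimator sees only the observations
# (KMeans and n_clusters=3 are illustrative assumptions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X)

print(svc.predict(X[:3]))    # predicted class labels
print(kmeans.labels_[:3])    # cluster ids, which carry no class meaning
```

Note that the cluster ids are arbitrary: cluster 0 need not correspond to target 0.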

Getting ready

In classification, the target variable is one of several categories, and there must be more than one instance of every category. In regression, each target value may occur only once, as the only requirement is that the target is a real number.

In the case of logistic regression, we saw previously that the algorithm first performs a regression and estimates a real number for the target. Then the target class is estimated by using thresholds. In scikit-learn, there are predict_proba methods that yield probabilistic estimates, which relate regression-like real number estimates with classification classes in the style of logistic regression.
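The relationship between the real-valued estimate and the class can be sketched as follows. This is an illustration under assumptions: the two-class iris subset is the one built later in this recipe, and the 0.5 threshold is the conventional choice for a binary problem, stated here rather than taken from the text.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]

log_reg = LogisticRegression().fit(X, y)

# Column 1 holds the estimated probability of class 1
proba_class_1 = log_reg.predict_proba(X)[:, 1]

# Thresholding the real-valued estimate (0.5 is an assumed threshold)
thresholded = (proba_class_1 > 0.5).astype(int)

# Compare the thresholded probabilities with the predicted classes
print(np.array_equal(thresholded, log_reg.predict(X)))
```

Moving the threshold away from 0.5 trades one kind of error for another, which is what varying the classification threshold amounts to.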

Any regression can be turned into classification by using thresholds. A binary classification can be viewed as a regression problem by using a regressor. The target variables produced will be real numbers, not the original class variables.
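A small sketch of both directions, assuming the two-class iris subset used in this recipe. LinearRegression and the 0.5 threshold are illustrative choices, not the recipe's own (the recipe itself uses SVR and rounding):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]

# Binary classification viewed as regression: fit a regressor on the 0/1 labels
lin_reg = LinearRegression().fit(X, y)
real_valued = lin_reg.predict(X)          # real numbers, not the original classes

# Regression turned back into classification with a threshold (assumed: 0.5)
as_classes = (real_valued > 0.5).astype(int)

print(real_valued[:3])
print(accuracy_score(y, as_classes))
```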

How to do it...

Quick SVC – a classifier and regressor

Load iris from the datasets module:

import numpy as np
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

For simplicity, consider only targets 0 and 1, corresponding to Setosa and Versicolor. Use the Boolean array iris.target < 2 to filter out target 2. Place it within brackets to use it as a filter in defining the observation set X and the target set y:

X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]

Now import train_test_split and apply it:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state= 7)

Prepare and run an SVC by importing it and scoring it with cross-validation:

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

svc_clf = SVC(kernel = 'linear').fit(X_train, y_train)
svc_scores = cross_val_score(svc_clf, X_train, y_train, cv=4)

As done in previous sections, view the average of the scores:

svc_scores.mean()

0.94795321637426899

Perform the same with support vector regression by importing SVR from sklearn.svm, the same module that contains SVC:

from sklearn.svm import SVR

Then write the necessary syntax to fit the model. It is almost identical to the syntax for SVC, just replacing some c keywords with r:

svr_clf = SVR(kernel = 'linear').fit(X_train, y_train)
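To see why the regressor's output cannot be scored with accuracy directly, it can help to look at the raw predictions. The following sketch repeats the setup so that it runs on its own; raw_pred is a name introduced here for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

iris = datasets.load_iris()
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

svr_clf = SVR(kernel='linear').fit(X_train, y_train)

# raw_pred (hypothetical name) holds floating-point estimates, not class labels
raw_pred = svr_clf.predict(X_test)

print(raw_pred[:5])
print(y_test[:5])
```

The predictions are real numbers, so they have to be mapped back to 0 or 1 before they can be compared with the integer targets.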

Making a scorer

Before passing the SVR regressor to cross-validation, make a scorer by supplying two elements:

A scoring function that compares y_test, the ground truth, with y_pred, the predictions
A flag that states whether a high score is good or bad

In practice, begin by importing the make_scorer function:

from sklearn.metrics import make_scorer

Use this sample scoring function:

#Only works for this iris example with targets 0 and 1
def for_scorer(y_test, orig_y_pred):
    y_pred = np.rint(orig_y_pred).astype(int)   #rounds prediction to the nearest integer (np.int was removed in NumPy 1.24)
    return accuracy_score(y_test, y_pred)

The np.rint function rounds the prediction to the nearest integer, hopefully one of the targets, 0 or 1. The astype method changes the type of the prediction to an integer type, as the original target is an integer type and consistent types are preferred. After the rounding occurs, the scoring function uses the familiar accuracy_score function.

Now, determine whether a higher score is better. Higher accuracy is better, so for this situation, a higher score is better. In scikit-learn code:

svr_to_class_scorer = make_scorer(for_scorer, greater_is_better=True)

Finally, run the cross-validation with a new parameter, the scoring parameter:

svr_scores = cross_val_score(svr_clf, X_train, y_train, cv=4, scoring = svr_to_class_scorer)

Find the mean:

svr_scores.mean()

0.94663742690058483

The accuracy scores are similar for the SVR regressor-based classifier and the traditional SVC classifier.
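The greater_is_better flag matters most when the scoring function is a loss. As a contrasting sketch, mean_squared_error (not used in this recipe, chosen here only for illustration) is a case where lower is better. With greater_is_better=False, make_scorer negates the result so that larger values still mean better models:

```python
from sklearn import datasets
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVR

iris = datasets.load_iris()
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

# A loss: lower is better, so tell make_scorer to flip the sign
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

mse_scores = cross_val_score(SVR(kernel='linear'), X_train, y_train,
                             cv=4, scoring=mse_scorer)

print(mse_scores)   # negated errors, so every entry is <= 0
```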

How it works...

You might ask, why did we take class 2 out of the target set?

The reason is that, to use a regressor, our intent has to be to predict a real number. The categories had to have a property of real numbers: that they are ordered (informally, if we have three ordered categories x, y, z and x < y and y < z then x < z). By eliminating the third category, the remaining flowers (Setosa and Versicolor) became ordered by a property we invented: Setosaness or Versicolorness.

The next time you encounter categories, you can consider whether they can be ordered. For example, if the dataset consists of shoe sizes, they can be ordered and a regressor can be applied, even though no one has a shoe size of 12.125.

There's more...

Linear versus nonlinear

Linear algorithms involve lines or hyperplanes. Hyperplanes are flat surfaces in any n-dimensional space. They tend to be easy to understand and explain, as they involve ratios (with an offset). Some functions that consistently and monotonically increase or decrease can be mapped to a linear function with a transformation. For example, exponential growth can be mapped to a line with the log transformation.
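The log transformation can be checked numerically. In this sketch, the data is synthetic and the constants 2.0 and 0.3 are arbitrary values chosen for illustration. Taking the logarithm of y = 2.0 * exp(0.3 * t) gives log(y) = log(2.0) + 0.3 * t, which is a line in t:

```python
import numpy as np

t = np.arange(10, dtype=float)
y = 2.0 * np.exp(0.3 * t)          # synthetic exponential growth (assumed constants)

# Fit a straight line to log(y): slope is the growth rate,
# intercept is the log of the initial value
slope, intercept = np.polyfit(t, np.log(y), deg=1)

print(slope, np.exp(intercept))    # approximately 0.3 and 2.0
```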

Nonlinear algorithms tend to be tougher to explain to colleagues and investors, yet ensembles of decision trees that are nonlinear tend to perform very well. KNN, which we examined earlier, is nonlinear. In some cases, functions not increasing or decreasing in a familiar manner are acceptable for the sake of accuracy.

Try a simple SVC with a polynomial kernel, as follows:

from sklearn.svm import SVC   #Usual import of SVC
svc_poly_clf = SVC(kernel = 'poly', degree= 3).fit(X_train, y_train)  #Polynomial Kernel of Degree 3

The polynomial kernel of degree 3 looks like a cubic curve in two dimensions. It leads to a slightly better fit, but note that it can be harder to explain to others than a linear kernel with consistent behavior throughout all of the Euclidean space:

svc_poly_scores = cross_val_score(svc_poly_clf, X_train, y_train, cv=4)
svc_poly_scores.mean()

0.95906432748538006

Black box versus not

For the sake of efficiency, we did not examine the classification algorithms used very closely. When we compared SVC and logistic regression, we chose SVMs. At that point, both algorithms were black boxes, as we did not know any internal details. Once we decided to focus on SVMs, we could proceed to compute coefficients of the separating hyperplanes involved, optimize the hyperparameters of the SVM, use the SVM for big data, and do other processes. The SVMs have earned our time investment because of their superior performance.
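As one example of opening the black box, the coefficients of the separating hyperplane of a linear SVC can be read from the fitted estimator. The sketch below repeats the recipe's setup so that it runs on its own; manual is a name introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

svc_clf = SVC(kernel='linear').fit(X_train, y_train)

# coef_ is only available for the linear kernel: one weight per feature
print(svc_clf.coef_)
print(svc_clf.intercept_)

# The hyperplane is w . x + b = 0; the sign of w . x + b decides the class
manual = (X_test @ svc_clf.coef_.T + svc_clf.intercept_).ravel()
print(np.allclose(manual, svc_clf.decision_function(X_test)))
```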

Interpretability

Some machine learning algorithms are easier to understand than others. These are usually easier to explain to others as well. For example, linear regression is well known and easy to understand and explain to potential investors of your company. SVMs are more difficult to entirely understand.

My general advice: if SVMs are highly effective for a particular dataset, try to increase your personal interpretability of SVMs in the particular problem context. Also, consider merging algorithms somehow, using linear regression as an input to SVMs, for example. This way, you have the best of both worlds.

This is really context-specific, however. Linear SVMs are relatively simple to visualize and understand. Merging linear regression with SVM could complicate things. You can start by comparing them side by side.

However, if you cannot understand every detail of the math and practice of SVMs, be kind to yourself, as machine learning focuses more on prediction performance than traditional statistics does.

A pipeline

In programming, a pipeline is a set of procedures connected in series, one after the other, where the output of one process is the input to the next:

You can replace any procedure in the process with a different one, perhaps better in some way, without compromising the whole system. For the model in the middle step, you can use an SVC or logistic regression:

One can also keep track of the classifier itself and build a flow diagram from the classifier. Here is a pipeline keeping track of the SVC classifier:

In the upcoming chapters, we will see how scikit-learn uses the intuitive notion of a pipeline. So far, we have used a simple one: train, predict, test.
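As a preview, and only as a sketch, here is how swapping the middle step might look with scikit-learn's Pipeline class. The step names 'scale' and 'model' and the use of StandardScaler as the first step are assumptions made for this illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris.data[iris.target < 2]
y = iris.target[iris.target < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=7)

# Steps run in series: the output of 'scale' is the input to 'model'
# (step names and the scaler are illustrative assumptions)
pipe = Pipeline([('scale', StandardScaler()),
                 ('model', SVC(kernel='linear'))])
svc_pipe_scores = cross_val_score(pipe, X_train, y_train, cv=4)

# Replace the model step with logistic regression; the rest is untouched
pipe.set_params(model=LogisticRegression())
log_pipe_scores = cross_val_score(pipe, X_train, y_train, cv=4)

print(svc_pipe_scores.mean(), log_pipe_scores.mean())
```

Because the scaler is part of the pipeline, it is re-fitted on the training portion of every cross-validation fold rather than on the whole training set.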
