Python Machine Learning Cookbook - Second Edition

By: Giuseppe Ciaburro, Prateek Joshi

Overview of this book

This eagerly anticipated second edition of the popular Python Machine Learning Cookbook will enable you to adopt a fresh approach to dealing with real-world machine learning and deep learning tasks. With the help of over 100 recipes, you will learn to build powerful machine learning applications using modern libraries from the Python ecosystem. The book will also guide you on how to implement various machine learning algorithms for classification, clustering, and recommendation engines, using a recipe-based approach. With emphasis on practical solutions, dedicated sections in the book will help you to apply supervised and unsupervised learning techniques to real-world problems. Toward the concluding chapters, you will get to grips with recipes that teach you advanced techniques including reinforcement learning, deep neural networks, and automated machine learning. By the end of this book, you will be equipped with the skills you need to apply machine learning techniques and leverage the full capabilities of the Python ecosystem through real-world examples.

Estimating housing prices

It's time to apply our knowledge to a real-world problem. Let's apply all these principles to estimate house prices. This is one of the most popular examples used to understand regression, and it serves as a good entry point. The problem is intuitive and relatable, which makes the concepts easier to understand before we move on to more complex machine learning tasks. We will use a decision tree regressor with AdaBoost to solve this problem.

Getting ready

A decision tree is a tree where each node makes a simple decision that contributes to the final output. The leaf nodes represent the output values, and the branches represent the intermediate decisions that were made, based on input features. AdaBoost stands for adaptive boosting, a technique that is used to boost the accuracy of the results from another system. It combines the outputs of several versions of the algorithm, called weak learners, using a weighted summation to get the final output. The information that's collected at each stage of the AdaBoost algorithm is fed back into the system so that the learners at the later stages focus on the training samples that are difficult to predict. In this way, it increases the accuracy of the system.

Using AdaBoost, we fit a regressor on the dataset. We compute the error and then fit the regressor on the same dataset again, based on this error estimate. We can think of this as fine-tuning of the regressor until the desired accuracy is achieved. You are given a dataset that contains various parameters that affect the price of a house. Our goal is to estimate the relationship between these parameters and the house price so that we can use this to estimate the price given unknown input parameters.

How to do it...

Let's see how to estimate housing prices in Python:

Create a new file called housing.py and add the following lines:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn import datasets
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.utils import shuffle
import matplotlib.pyplot as plt

There is a standard housing dataset that people tend to use to get started with machine learning. You can download it at https://archive.ics.uci.edu/ml/machine-learning-databases/housing/. We will be using a slightly modified version of the dataset, which has been provided along with the code files.
The good thing is that scikit-learn provides a function to directly load this dataset. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2, so the following line requires an earlier release:

housing_data = datasets.load_boston()

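Each data point has 12 input parameters that affect the price of a house, as listed below. (The array returned by load_boston() itself has 13 columns, because it also contains a B column that is not listed here.) You can access the input data using housing_data.data and the corresponding price using housing_data.target. The following attributes are available: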

crim: Per capita crime rate by town
zn: Proportion of residential land zoned for lots that are over 25,000 square feet
indus: Proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: Nitric oxides concentration (parts per ten million)
rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built prior to 1940
dis: Weighted distances to the five Boston employment centers
rad: Index of accessibility to radial highways
tax: Full-value property-tax rate per $10,000
ptratio: Pupil-teacher ratio by town
lstat: Percentage of the population with lower status
target: Median value of owner-occupied homes in $1000s

Of these, target is the response variable, while the other 12 variables are possible predictors. The goal of this analysis is to fit a regression model that best explains the variation in target.

Let's separate this into input and output. To make this independent of the ordering of the data, let's shuffle it as well:

X, y = shuffle(housing_data.data, housing_data.target, random_state=7)

The sklearn.utils.shuffle() function applies the same random permutation to all of the arrays or sparse matrices passed to it, so each row of X stays paired with its entry in y. Shuffling removes any ordering present in the original file, so that the training and test sets we obtain by slicing are both random samples of the data. The random_state parameter seeds the random number generator so that the results are reproducible.
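As a quick check of this behavior, the following sketch shuffles a small toy dataset (the names X_demo and y_demo and their values are made up for illustration and are not part of housing.py) and verifies that rows and labels stay aligned and that the same seed gives the same permutation:

```python
import numpy as np
from sklearn.utils import shuffle

# Toy data (hypothetical): row k of X_demo is [2k, 2k + 1] and y_demo[k] == k,
# so the pairing between a row and its label is easy to verify.
X_demo = np.arange(10).reshape(5, 2)
y_demo = np.arange(5)

X_s1, y_s1 = shuffle(X_demo, y_demo, random_state=7)
X_s2, y_s2 = shuffle(X_demo, y_demo, random_state=7)

# Each shuffled row still matches its original label
print("Rows and labels aligned:", np.array_equal(X_s1[:, 0] // 2, y_s1))
# The same seed produces the same permutation
print("Reproducible:", np.array_equal(y_s1, y_s2))
```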

Let's divide the data into training and testing. We'll allocate 80% for training and 20% for testing:

num_training = int(0.8 * len(X))
X_train, y_train = X[:num_training], y[:num_training]
X_test, y_test = X[num_training:], y[num_training:]

Remember, machine learning algorithms train models by using a finite set of training data. In the training phase, the model is evaluated based on its predictions of the training set. However, the goal of the algorithm is to produce a model that predicts previously unseen observations, in other words, one that is able to generalize from known data to unknown data. For this reason, the data is divided into two datasets: training and test. The training set is used to train the model, while the test set is used to verify the ability of the system to generalize.
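The recipe shuffles and then slices the arrays by hand. As an alternative sketch (not what housing.py uses), scikit-learn's train_test_split() function can shuffle and split in a single call. The data here is random and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholder data (hypothetical): 100 samples with 3 features each
rng = np.random.RandomState(7)
X_demo = rng.rand(100, 3)
y_demo = rng.rand(100)

# test_size=0.2 keeps 80% of the samples for training and 20% for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7)

print("Training samples:", X_tr.shape[0])
print("Test samples:", X_te.shape[0])
```

Because train_test_split() uses a different permutation from shuffle(), the resulting split, and therefore the scores, would not match the output shown in this recipe exactly.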

We are now ready to fit a decision tree regression model. Let's pick a tree with a maximum depth of 4, which means that we are not letting the tree become arbitrarily deep:

dt_regressor = DecisionTreeRegressor(max_depth=4)
dt_regressor.fit(X_train, y_train)

The DecisionTreeRegressor class has been used to build a decision tree regressor, and its fit() method trains it on the training data.
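To get a feel for what the max_depth limit does, the following sketch fits a depth-limited tree and an unrestricted tree on a synthetic noisy sine curve (the data and variable names are assumptions made for illustration) and reports how large each tree grows:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): a noisy sine curve with one input feature
rng = np.random.RandomState(7)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

# One tree limited to depth 4, one allowed to grow without a depth limit
shallow = DecisionTreeRegressor(max_depth=4, random_state=7).fit(X_demo, y_demo)
deep = DecisionTreeRegressor(random_state=7).fit(X_demo, y_demo)

print("max_depth=4:", shallow.get_depth(), "levels,", shallow.get_n_leaves(), "leaves")
print("no limit:", deep.get_depth(), "levels,", deep.get_n_leaves(), "leaves")
```

A binary tree of depth 4 can have at most 2**4 = 16 leaves, so the limited tree can output at most 16 distinct values, whereas the unrestricted tree keeps splitting and tends to memorize the noise in the training data.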

Let's also fit the decision tree regression model with AdaBoost:

ab_regressor = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=400, random_state=7)
ab_regressor.fit(X_train, y_train)

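The AdaBoostRegressor class wraps the decision tree regressor as its base estimator and fits up to 400 of them in sequence. We fit it alongside the single tree so that we can compare the results and see how AdaBoost really boosts the performance of a decision tree regressor.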

Let's evaluate the performance of the decision tree regressor:

y_pred_dt = dt_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_dt)
evs = explained_variance_score(y_test, y_pred_dt)
print("#### Decision Tree performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

First, we used the predict() function to predict the response variable based on the test data. Next, we calculated the mean squared error and the explained variance score. The mean squared error is the average of the squared differences between the actual and predicted values across all the data points in the test set. The explained variance score is the proportion of the variability in the data that is explained by the model; a score of 1.0 corresponds to a perfect prediction.
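Both metrics can be reproduced by hand, which may help to make their definitions concrete. The following sketch uses a handful of made-up actual and predicted values (not taken from the housing dataset) and compares the manual results with the scikit-learn functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, explained_variance_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y_hat = np.array([22.0, 24.0, 27.0, 36.0, 41.0])

residuals = y_true - y_hat

# Mean squared error: the average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# Explained variance: 1 - Var(residuals) / Var(actual values)
evs_manual = 1.0 - np.var(residuals) / np.var(y_true)

print("MSE matches:", np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print("Explained variance matches:",
      np.isclose(evs_manual, explained_variance_score(y_true, y_hat)))
```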

Now, let's evaluate the performance of AdaBoost:

y_pred_ab = ab_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred_ab)
evs = explained_variance_score(y_test, y_pred_ab)
print("#### AdaBoost performance ####")
print("Mean squared error =", round(mse, 2))
print("Explained variance score =", round(evs, 2))

Here is the output on the Terminal:

#### Decision Tree performance ####
Mean squared error = 14.79
Explained variance score = 0.82

#### AdaBoost performance ####
Mean squared error = 7.54
Explained variance score = 0.91

The error is lower and the variance score is closer to 1 when we use AdaBoost, as shown in the preceding output.
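To look at how the ensemble behaves as estimators are added, one option is the staged_predict() method of AdaBoostRegressor, which yields the prediction of the ensemble after each boosting round. The following is a minimal sketch on synthetic data (the dataset, the split, and n_estimators=50 are assumptions chosen to keep the example small, not the settings of the recipe):

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): a noisy sine curve, split 80/20 by slicing
rng = np.random.RandomState(7)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=300)
X_tr, y_tr = X_demo[:240], y_demo[:240]
X_te, y_te = X_demo[240:], y_demo[240:]

model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=50, random_state=7)
model.fit(X_tr, y_tr)

# staged_predict() yields one prediction array per fitted boosting round
staged_mse = [mean_squared_error(y_te, y_stage)
              for y_stage in model.staged_predict(X_te)]

print("Boosting rounds used:", len(staged_mse))
print("Test MSE after the first round:", round(staged_mse[0], 3))
print("Test MSE after the last round:", round(staged_mse[-1], 3))
```

Plotting staged_mse against the round number is a simple way to judge whether a large value such as n_estimators=400 is actually needed for a given dataset.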

How it works...

DecisionTreeRegressor builds a decision tree regressor. Decision trees are used to predict a response or class y from several input variables x1, x2,…,xn. If y is a continuous response, the tree is called a regression tree; if y is categorical, it is called a classification tree. The algorithm is based on the following procedure: at each node of the tree, we test the value of one input x_i and, based on the answer, continue to the left or to the right branch. When we reach a leaf, we find the prediction. In regression trees, we try to divide the data space into small regions and fit a different simple model to each of them. The non-leaf part of the tree is just the way to find out which model we will use for a given prediction.

A regression tree is formed by a series of nodes, starting at the root, each of which splits its branch into two child branches. This subdivision continues to cascade: each new branch can either lead to another node or end in a leaf that holds the predicted value.
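The traversal described above can be sketched in a few lines of plain Python. The Node class, the predict_one() function, and the thresholds and leaf values below are all hypothetical and hand-written for illustration; they are not how scikit-learn stores its trees, and a real tree learns its splits from the data:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes set feature, threshold, left, and right; leaves set only value
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None


def predict_one(node, x):
    """Walk from the root to a leaf and return the value stored in that leaf."""
    while node.value is None:
        # Go left if the tested feature is at or below the threshold, else right
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


# Hand-built tree: split on the number of rooms (index 0), then on crime rate (index 1)
tree = Node(feature=0, threshold=6.5,
            left=Node(feature=1, threshold=5.0,
                      left=Node(value=22.0),
                      right=Node(value=14.0)),
            right=Node(value=35.0))

print(predict_one(tree, [7.1, 0.3]))  # many rooms -> 35.0
print(predict_one(tree, [5.9, 0.3]))  # fewer rooms, low crime -> 22.0
print(predict_one(tree, [5.9, 9.0]))  # fewer rooms, high crime -> 14.0
```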

There's more...

An AdaBoost regressor is a meta-estimator that starts by fitting a regressor on the original dataset and then fits additional copies of the regressor on the same dataset, but with the weights of the instances adjusted according to the error of the current prediction. As such, subsequent regressors focus more on difficult cases. Comparing it with the single decision tree, as we did in this recipe, shows how AdaBoost boosts the performance of a decision tree regressor.
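To illustrate the reweighting idea, the following sketch performs a single weight update in the style of the AdaBoost.R2 algorithm with a linear loss. It is a simplified illustration under stated assumptions: the absolute errors are made-up numbers, and the details of scikit-learn's own implementation (such as its learning rate and resampling step) are left out:

```python
import numpy as np

# Made-up absolute errors of the current regressor on five training samples;
# the third sample (index 2) is the difficult case
abs_errors = np.array([0.1, 0.2, 2.0, 0.3, 0.05])

# Every sample starts with the same weight
weights = np.full(len(abs_errors), 1.0 / len(abs_errors))

# Linear loss: scale the errors into the range [0, 1]
loss = abs_errors / abs_errors.max()

# Weighted average loss of this regressor (should be below 0.5 to be useful)
avg_loss = np.sum(weights * loss)

# beta is below 1 for a useful regressor; a smaller beta means more confidence
beta = avg_loss / (1.0 - avg_loss)

# Well-predicted samples (low loss) are scaled down the most, so the
# difficult samples end up with a larger share of the total weight
new_weights = weights * beta ** (1.0 - loss)
new_weights /= new_weights.sum()

print("Old weights:", np.round(weights, 3))
print("New weights:", np.round(new_weights, 3))
```

After the update, the sample with the largest error carries the largest weight, so the next regressor in the sequence is pushed to fit it better.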
