Estimating bicycle demand distribution


Let's use a different regression method to solve the bicycle demand distribution problem. We will use a random forest regressor to estimate the output values. A random forest is an ensemble of decision trees, each built on a different random subset of the dataset; averaging the predictions of the individual trees improves the overall performance and reduces overfitting.
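
To make the averaging idea concrete, here is a minimal sketch on synthetic data (the dataset and parameter values are illustrative assumptions, not part of this recipe) showing that a random forest's prediction is simply the mean of its individual trees' predictions:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic regression data, purely for illustration
    X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=7)

    forest = RandomForestRegressor(n_estimators=10, random_state=7)
    forest.fit(X_demo, y_demo)

    # Averaging the trees' predictions reproduces the forest's prediction
    averaged = np.mean([tree.predict(X_demo) for tree in forest.estimators_], axis=0)
    print(np.allclose(averaged, forest.predict(X_demo)))  # prints True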

Getting ready

We will use the bike_day.csv file that is provided to you. This is also available at https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset. There are 16 columns in this dataset. The first two columns correspond to the serial number and the date, so we won't use them for our analysis. The last three columns correspond to different types of outputs. The last column is simply the sum of the values in the fourteenth and fifteenth columns (the casual and registered rental counts), so we can leave those two out when we build our model.
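
If you want to double-check this column layout before writing any code, a quick sketch such as the following (assuming bike_day.csv sits in your working directory) prints the header row with one-based column numbers:

    import csv

    with open('bike_day.csv', 'r') as f:
        header = next(csv.reader(f))

    # Print each column name with its one-based position
    for i, name in enumerate(header, start=1):
        print(i, name)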

How to do it…

Let's go ahead and see how to do this in Python. You have been provided with a file called bike_sharing.py that contains the full code; it takes the dataset filename as a command-line argument. We will discuss the important parts of it, as follows:

  1. We first need to import all the packages we will use, including a couple of new ones, as follows:

    import sys
    import csv
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from housing import plot_feature_importances
  2. We are processing a CSV file, so Python's csv module is useful for handling it. As this is a new dataset, we have to define our own dataset loading function:

    def load_dataset(filename):
        file_reader = csv.reader(open(filename, 'r'), delimiter=',')
        X, y = [], []
        for row in file_reader:
            # Columns 3 to 13 are the input features; the last column is the output
            X.append(row[2:13])
            y.append(row[-1])

        # Extract the feature names from the header row
        feature_names = np.array(X[0])

        # Remove the first row because it contains the feature names, not data
        return np.array(X[1:]).astype(np.float32), np.array(y[1:]).astype(np.float32), feature_names

    In this function, we just read all the data from the CSV file. The feature names are useful when we display them on a graph. We separate the input data from the output values and return them.

  3. Let's read the data and shuffle it to make our analysis independent of the order in which the rows are arranged in the file:

    from sklearn.utils import shuffle

    X, y, feature_names = load_dataset(sys.argv[1])
    X, y = shuffle(X, y, random_state=7)
  4. As we did earlier, we need to separate the data into training and testing. This time, let's use 90% of the data for training and the remaining 10% for testing:

    num_training = int(0.9 * len(X))
    X_train, y_train = X[:num_training], y[:num_training]
    X_test, y_test = X[num_training:], y[num_training:]
  5. Let's go ahead and train the regressor:

    rf_regressor = RandomForestRegressor(n_estimators=1000, max_depth=10, min_samples_split=2)
    rf_regressor.fit(X_train, y_train)

    Here, n_estimators refers to the number of estimators, which is the number of decision trees that we want to use in our random forest. The max_depth parameter refers to the maximum depth of each tree, and the min_samples_split parameter refers to the minimum number of data samples needed to split a node in the tree (recent versions of scikit-learn require this to be at least 2). These values are sensible choices rather than tuned optima; see the grid search sketch after this list.

  6. Let's evaluate the performance of the random forest regressor:

    from sklearn.metrics import mean_squared_error, explained_variance_score

    y_pred = rf_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    evs = explained_variance_score(y_test, y_pred)
    print("\n#### Random Forest regressor performance ####")
    print("Mean squared error =", round(mse, 2))
    print("Explained variance score =", round(evs, 2))
  7. As we already have the function to plot feature importances, let's just call it directly:

    plot_feature_importances(rf_regressor.feature_importances_, 'Random Forest regressor', feature_names)

    Once you run this code, you will see the following graph:

It looks like temperature is the most important factor controlling bicycle rentals.
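
The hyperparameters used in step 5 were picked by hand. If you want to tune them yourself, a minimal grid search sketch such as the following can help; it reuses X_train and y_train from the recipe, and the parameter ranges are illustrative assumptions rather than recommended values:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Illustrative parameter grid; these ranges are assumptions, not tuned values
    param_grid = {
        'n_estimators': [100, 500, 1000],
        'max_depth': [6, 10, 14]
    }
    grid = GridSearchCV(RandomForestRegressor(min_samples_split=2),
                        param_grid, cv=5, scoring='neg_mean_squared_error')
    grid.fit(X_train, y_train)
    print("Best parameters:", grid.best_params_)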

There's more…

Let's see what happens when you include the fourteenth and fifteenth columns in the dataset. In the feature importance graph, the importance of every feature other than these two has to go to zero. The reason is that the output can be obtained by simply summing up the fourteenth and fifteenth columns, so the algorithm doesn't need any other features to compute the output. In the load_dataset function, make the following change inside the for loop:

X.append(row[2:15])

If you plot the feature importance graph now, you will see the following:

As expected, it says that only these two features are important. This makes sense intuitively because the final output is a simple summation of these two features. So, there is a direct relationship between these two variables and the output value. Hence, the regressor says that it doesn't need any other variable to predict the output. This is an extremely useful tool to eliminate redundant variables in your dataset.
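
You can verify this summation directly from the data. After the preceding change, the last two feature columns are the fourteenth and fifteenth columns, so a quick check that reuses the X and y arrays returned by load_dataset looks like this:

    # The sum of the last two feature columns should reproduce the output exactly
    print(np.allclose(X[:, -2] + X[:, -1], y))  # prints True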

There is another file called bike_hour.csv that contains data about how the bicycles are shared hourly. We need to consider columns 3 to 14, so let's make this change inside the load_dataset function:

X.append(row[2:14])

If you run this, you will see the performance of the regressor displayed, as follows:

#### Random Forest regressor performance ####
Mean squared error = 2619.87
Explained variance score = 0.92

The feature importance graph will look like the following:

This shows that the hour of the day is the most important feature, which makes sense intuitively if you think about it! The next important feature is temperature, which is consistent with our earlier analysis.
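
If you prefer a textual ranking to a graph, a short sketch like the following (reusing the trained rf_regressor and feature_names from the recipe) prints the features sorted by importance:

    # Sort the feature importances in descending order and print them
    for idx in np.argsort(rf_regressor.feature_importances_)[::-1]:
        print(feature_names[idx], '->', round(rf_regressor.feature_importances_[idx], 4))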