Python Machine Learning Cookbook

By: Prateek Joshi, Vahid Mirjalili

Overview of this book

Machine learning is becoming increasingly pervasive in the modern data-driven world. It is used extensively across many fields such as search engines, robotics, self-driving cars, and more. With this book, you will learn how to perform various machine learning tasks in different environments. We’ll start by exploring a range of real-life scenarios where machine learning can be used, and look at various building blocks. Throughout the book, you’ll use a wide variety of machine learning algorithms to solve real-world problems and use Python to implement these algorithms. You’ll discover how to deal with various types of data and explore the differences between machine learning paradigms such as supervised and unsupervised learning. We also cover a range of regression techniques, classification algorithms, predictive modeling, data visualization techniques, recommendation engines, and more with the help of real-world examples.

Preprocessing data using different techniques

In the real world, we usually have to deal with a lot of raw data. This raw data is not readily ingestible by machine learning algorithms. To prepare the data for machine learning, we have to preprocess it before we feed it into various algorithms.

Getting ready

Let's see how to preprocess data in Python. To start off, open a file with a .py extension, for example, preprocessor.py, in your favorite text editor. Add the following lines to this file:

import numpy as np
from sklearn import preprocessing

We just imported a couple of necessary packages. Let's create some sample data. Add the following line to this file:

data = np.array([[3, -1.5,  2, -5.4], [0,  4,  -0.3, 2.1], [1,  3.3, -1.9, -4.3]])

We are now ready to operate on this data.

How to do it…

Data can be preprocessed in many ways. We will discuss a few of the most commonly used preprocessing techniques.

Mean removal

It's usually beneficial to remove the mean from each feature so that it's centered on zero. This helps us remove any bias from the features. The preprocessing.scale function used here also scales each feature to unit standard deviation. Add the following lines to the file that we opened earlier:

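# Center each feature (column) on zero and scale it to unit standard deviation
data_standardized = preprocessing.scale(data)
print("\nMean = {}".format(data_standardized.mean(axis=0)))
print("Std deviation = {}".format(data_standardized.std(axis=0)))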

We are now ready to run the code. To do this, run the following command on your Terminal:

$ python preprocessor.py

You will see the following output on your Terminal:

Mean = [  5.55111512e-17  -1.11022302e-16  -7.40148683e-17  -7.40148683e-17]
Std deviation = [ 1.  1.  1.  1.]

You can see that the mean is almost 0 (the tiny nonzero values are floating-point rounding error) and the standard deviation is 1.
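To see what mean removal does under the hood, here is a minimal sketch that reproduces the same result with NumPy alone. It assumes that the scaling divides by the population standard deviation (NumPy's default, ddof=0); the names col_mean, col_std, and manual_standardized are our own and are used only for illustration:

```python
import numpy as np

# Same sample data as in the recipe
data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Per-feature (column-wise) mean and population standard deviation
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# Subtract the mean, then divide by the standard deviation
manual_standardized = (data - col_mean) / col_std

print("Mean = {}".format(manual_standardized.mean(axis=0)))
print("Std deviation = {}".format(manual_standardized.std(axis=0)))
```

Each column of manual_standardized should have a mean that is zero up to rounding error and a standard deviation of 1, just like data_standardized.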

Scaling

The values of different features in a datapoint can span very different ranges. So, it is sometimes important to scale them so that all the features are on a level playing field. Add the following lines to the file and run the code:

# Rescale each feature (column) linearly to the range [0, 1]
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print("\nMin max scaled data:\n{}".format(data_scaled))

After scaling, all the feature values range between the specified values. The output will be displayed, as follows:

Min max scaled data: 
[[ 1.          0.          1.          0.        ]
 [ 0.          1.          0.41025641  1.        ]
 [ 0.33333333  0.87272727  0.          0.14666667]]
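The scaling above follows a simple formula: for each feature, subtract the column minimum and divide by the column range. The following sketch applies that formula directly with NumPy for the (0, 1) range used in the recipe; col_min, col_max, and manual_scaled are illustrative names of our own:

```python
import numpy as np

# Same sample data as in the recipe
data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Per-feature (column-wise) minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Map each column linearly so that its minimum becomes 0 and its maximum becomes 1
manual_scaled = (data - col_min) / (col_max - col_min)

print("Manually scaled data:\n{}".format(manual_scaled))
```

For instance, the last entry of the second column is (3.3 - (-1.5)) / (4 - (-1.5)) = 4.8 / 5.5, which is approximately 0.8727, in agreement with the output shown above.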

Normalization

Data normalization is used when you want to adjust the values in the feature vector so that they can be measured on a common scale. One of the most common forms of normalization that is used in machine learning adjusts the values of a feature vector so that their absolute values sum up to 1. This is known as L1 normalization. Add the following lines to the previous file:

# Divide each datapoint (row) by the sum of the absolute values of its features
data_normalized = preprocessing.normalize(data, norm='l1')
print("\nL1 normalized data:\n{}".format(data_normalized))

If you run the Python file, you will get the following output:

L1 normalized data: 
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]

Normalization is used a lot to make sure that datapoints don't get boosted artificially just because their features happen to have large magnitudes.
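Note that, unlike mean removal and scaling, this normalization works on each datapoint (row) rather than on each feature (column). As a quick check, the following NumPy-only sketch divides every row by the sum of the absolute values of its entries; row_l1 and manual_l1 are names that we made up for this illustration:

```python
import numpy as np

# Same sample data as in the recipe
data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# L1 norm of each row: the sum of the absolute values of its entries
row_l1 = np.abs(data).sum(axis=1, keepdims=True)

# Divide each row by its own L1 norm
manual_l1 = data / row_l1

print("Manually L1 normalized data:\n{}".format(manual_l1))
print("Sum of absolute values per row = {}".format(np.abs(manual_l1).sum(axis=1)))
```

The first row has an L1 norm of 3 + 1.5 + 2 + 5.4 = 11.9, so its first entry becomes 3 / 11.9, which is approximately 0.2521, matching the output shown above.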

Binarization

Binarization is used when you want to convert your numerical feature vector into a Boolean vector. Add the following lines to the Python file:

# Values greater than the threshold become 1; all other values become 0
data_binarized = preprocessing.Binarizer(threshold=1.4).transform(data)
print("\nBinarized data:\n{}".format(data_binarized))

Run the code again, and you will see the following output:

Binarized data:
[[ 1.  0.  1.  0.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  0.]]

Binarization is a very useful technique that's usually used when we have some prior knowledge of the data, for example, a threshold above which a feature should count as present.
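Binarization boils down to an element-wise comparison against the threshold. The following sketch shows the equivalent NumPy expression for the threshold of 1.4 used above; manual_binarized is an illustrative name of our own:

```python
import numpy as np

# Same sample data as in the recipe
data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

threshold = 1.4

# Values strictly greater than the threshold become 1.0; the rest become 0.0
manual_binarized = (data > threshold).astype(float)

print("Manually binarized data:\n{}".format(manual_binarized))
```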

One Hot Encoding

A lot of times, we deal with numerical values that are sparse and scattered over a wide range, and we don't really need to store these large values as they are. This is where One Hot Encoding comes into the picture. We can think of One Hot Encoding as a tool to tighten the feature vector. It looks at each feature, identifies the total number of distinct values, and uses a one-of-k scheme to encode them. Each feature in the feature vector is encoded this way, which helps us be more efficient in terms of space.

For example, let's say we are dealing with 4-dimensional feature vectors. To encode the n-th feature, the encoder goes through the n-th feature of every feature vector and counts the number of distinct values. If the number of distinct values is k, it transforms the feature into a k-dimensional vector in which exactly one value is 1 and all the other values are 0. Add the following lines to the Python file:

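encoder = preprocessing.OneHotEncoder()
# Learn the distinct values of each feature (column) from four sample vectors
encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]])
# Encode a new vector; toarray() converts the sparse result into a dense array
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector:\n{}".format(encoded_vector))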

This is the expected output:

Encoded vector:
[[ 0.  0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]

In the preceding example, let's consider the third feature in each feature vector. The values are 1, 5, 2, and 4. There are four distinct values here, which means the one-hot encoded vector for this feature will be of length 4. The encoder orders the distinct values as 1, 2, 4, and 5, so the value 5 is encoded as the vector [0, 0, 0, 1]. Only one value can be 1 in this vector. Here, the fourth element is 1, which indicates that the value is 5. This vector corresponds to the sixth through ninth entries of the encoded output.
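To make the layout of the encoded vector explicit, the following NumPy-only sketch rebuilds it one feature at a time. It assumes that the distinct values of each feature are ordered in ascending order, which is consistent with the output shown above; training, sample, blocks, and manual_encoded are names that we made up for this illustration:

```python
import numpy as np

# The vectors used to fit the encoder and the vector to encode
training = np.array([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]])
sample = np.array([2, 3, 5, 3])

blocks = []
for column, value in zip(training.T, sample):
    # Sorted distinct values seen for this feature
    distinct = np.unique(column)
    # One-of-k block: 1.0 at the position of the matching value, 0.0 elsewhere
    block = (distinct == value).astype(float)
    print("Distinct values {} -> block {}".format(distinct, block))
    blocks.append(block)

# Concatenate the per-feature blocks into a single encoded vector
manual_encoded = np.concatenate(blocks)
print("Manually encoded vector = {}".format(manual_encoded))
```

The four features have 3, 2, 4, and 2 distinct values, so the blocks have lengths 3, 2, 4, and 2, and the full vector has 11 entries.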
