Applied Supervised Learning with Python

By: Benjamin Johnston, Ishita Mathur

Overview of this book

Machine learning, the ability of a machine to give the right answers based on input data, has revolutionized the way we do business. Applied Supervised Learning with Python shows how you can apply machine learning techniques in your data science projects using Python. You'll explore Jupyter Notebooks, a technology commonly used in academic and commercial circles that supports running code in-line. With the help of fun examples, you'll gain experience with the Python machine learning toolkit, from basic data cleaning and processing to working with a range of regression and classification algorithms. Once you've grasped the basics, you'll learn how to build and train your own models using techniques such as decision trees and ensemble modeling, and how to assess them with validation and error metrics. You'll also learn data visualization techniques using Python libraries such as Matplotlib and Seaborn. The book also covers random forest classifiers along with other methods for combining results from multiple models, and concludes by delving into cross-validation to test your algorithm and check how well the model works on unseen data. By the end of this book, you'll be equipped not only to work with machine learning algorithms, but also to create some of your own!

Chapter 4: Classification

Activity 11: Linear Regression Classifier – Two-Class Classifier

Solution

Import the required dependencies:

```
import struct
import numpy as np
import gzip
import urllib.request
import matplotlib.pyplot as plt
from array import array
from sklearn.linear_model import LinearRegression
```

Load the MNIST data into memory:

```
# Training images: the 16-byte header holds the magic number, image count, rows, and columns
with gzip.open('train-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img = np.array(array("B", f.read())).reshape((size, rows, cols))

# Training labels: the 8-byte header holds the magic number and label count
with gzip.open('train-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels = np.array(array("B", f.read()))

# Test images
with gzip.open('t10k-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img_test = np.array(array("B", f.read())).reshape((size, rows, cols))

# Test labels
with gzip.open('t10k-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels_test = np.array(array("B", f.read()))
```
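
The four loading blocks above repeat the same pattern: read a big-endian header, then treat the remaining bytes as unsigned 8-bit values. A minimal sketch of a reusable loader follows, assuming the standard IDX layout used by the MNIST files. The function name `load_idx` and the magic-number check (2051 for image files, 2049 for label files) are additions for illustration and are not part of the original solution.

```python
import gzip
import struct

import numpy as np


def load_idx(path):
    """Load a gzipped IDX file (hypothetical helper, not from the book)."""
    with gzip.open(path, 'rb') as f:
        # The first 4 bytes of the header are the magic number
        magic, = struct.unpack(">I", f.read(4))
        if magic == 2051:      # image file: count, rows, cols follow
            shape = struct.unpack(">III", f.read(12))
        elif magic == 2049:    # label file: only the count follows
            shape = struct.unpack(">I", f.read(4))
        else:
            raise ValueError(f"Unexpected magic number {magic} in {path}")
        # The remaining bytes are unsigned 8-bit values
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)
```

With such a helper, `img = load_idx('train-images-idx3-ubyte.gz')` could stand in for the first block, and `rows, cols = img.shape[1:]` would recover the image dimensions used in the later steps.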

Visualize a sample of the data:

```
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(img[i], cmap='gray');
    plt.title(f'{labels[i]}');
    plt.axis('off')
```

We'll get the following output:

Figure 4.76: Sample data

Construct a linear classifier model to classify the digits zero and one. The model will determine whether a sample is the digit zero or the digit one, so we first need to select only those samples:

```
samples_0_1 = np.where((labels == 0) | (labels == 1))[0]
images_0_1 = img[samples_0_1]
labels_0_1 = labels[samples_0_1]

# The test images are flattened here; the training images are flattened in a later step
samples_0_1_test = np.where((labels_test == 0) | (labels_test == 1))[0]
images_0_1_test = img_test[samples_0_1_test].reshape((-1, rows * cols))
labels_0_1_test = labels_test[samples_0_1_test]
```
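
The selection step relies on NumPy index arrays. As a small illustration on made-up labels (the arrays below are synthetic, not MNIST), `np.where` returns the positions at which the condition holds, and indexing with those positions keeps images and labels aligned. `np.isin` is an equivalent way to write the condition that extends more easily to more than two digits.

```python
import numpy as np

# Synthetic stand-ins for the MNIST arrays: six 2 x 2 "images" and their labels
labels = np.array([5, 0, 4, 1, 9, 1])
img = np.arange(6 * 2 * 2).reshape((6, 2, 2))

# Positions of the samples labelled zero or one
samples_0_1 = np.where((labels == 0) | (labels == 1))[0]

# Equivalent selection written with np.isin
samples_isin = np.where(np.isin(labels, [0, 1]))[0]

images_0_1 = img[samples_0_1]
labels_0_1 = labels[samples_0_1]

print(samples_0_1)       # positions 1, 3 and 5
print(images_0_1.shape)  # three images of 2 x 2 pixels
```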

Visualize the selected information. Here's the code for zero:
```
sample_0 = np.where((labels == 0))[0][0]
plt.imshow(img[sample_0], cmap='gray');
```
The output will be as follows:
Figure 4.77: First sample data
Here's the code for one:
```
sample_1 = np.where((labels == 1))[0][0]
plt.imshow(img[sample_1], cmap='gray');
```
The output will be:
Figure 4.78: Second sample data
In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape:
```
images_0_1 = images_0_1.reshape((-1, rows * cols))
images_0_1.shape
```
The output will be:
```
(12665, 784)
```
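
As a quick check of what the flattening does, the following sketch applies the same `reshape` call to a small synthetic array (made-up values, not MNIST). It assumes NumPy's default row-major ordering, under which pixel `(r, c)` of an image ends up in column `r * cols + c` of the flattened row.

```python
import numpy as np

rows, cols = 28, 28

# Three synthetic "images" with distinct pixel values
images = np.arange(3 * rows * cols).reshape((3, rows, cols))

# -1 lets NumPy infer the number of samples; each row becomes one image
flat = images.reshape((-1, rows * cols))
print(flat.shape)  # (3, 784)

# Row-major flattening: pixel (r, c) lands in column r * cols + c
r, c = 2, 5
print(flat[1, r * cols + c] == images[1, r, c])  # True

# Reshaping back recovers the original images
restored = flat.reshape((-1, rows, cols))
```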

Let's construct the model; use the LinearRegression API and call the fit function:

```
model = LinearRegression()
model.fit(X=images_0_1, y=labels_0_1)
```

The output will be:

```
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
```

Determine the R² score (the coefficient of determination) against the training set:
```
model.score(X=images_0_1, y=labels_0_1)
```
The output will be:
```
0.9705320567708795
```
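
For a regression model, `score` reports the coefficient of determination, R² = 1 - SS_res / SS_tot. A minimal sketch of that calculation on made-up numbers follows; the function name `r2_score_manual` is hypothetical and the values are for illustration only.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = [0, 0, 1, 1]
print(r2_score_manual(y_true, [0, 0, 1, 1]))          # perfect predictions
print(r2_score_manual(y_true, [0.5, 0.5, 0.5, 0.5]))  # always predicting the mean
```

A score of 1 means the predictions match the labels exactly, while a score of 0 means the model does no better than always predicting the mean label.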
Determine the label predictions for each of the training samples, using a threshold of 0.5. Values greater than 0.5 classify as one; values less than or equal to 0.5 classify as zero:
```
y_pred = model.predict(images_0_1) > 0.5
y_pred = y_pred.astype(int)
y_pred
```
The output will be:
```
array([0, 1, 1, ..., 1, 0, 1])
```
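
To make the thresholding step concrete, here is a minimal sketch on a synthetic one-feature dataset. It fits the line with `np.linalg.lstsq` rather than scikit-learn's `LinearRegression` to keep the example dependency-light; both perform an ordinary least-squares fit. The data values are made up for illustration.

```python
import numpy as np

# Synthetic data: small x values belong to class 0, large x values to class 1
x = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Least-squares fit of y = slope * x + intercept
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# The regression output is continuous; thresholding at 0.5 turns it into a label
scores = slope * x + intercept
y_pred = (scores > 0.5).astype(int)

print(scores.round(2))
print(y_pred)
```

In this sketch some of the raw scores fall below 0 or above 1, which illustrates that a linear regression output is not a probability and only becomes a class label after thresholding.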
Compute the classification accuracy of the predicted training values versus the ground truth:
```
np.sum(y_pred == labels_0_1) / len(labels_0_1)
```
The output will be:
```
0.9947887879984209
```
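
The accuracy figure counts how often the predicted label equals the true label. The sketch below repeats the calculation on made-up predictions and also splits the errors by type, which a single accuracy number hides. The arrays are synthetic and the variable names `false_pos` and `false_neg` are chosen here for illustration.

```python
import numpy as np

# Made-up predictions and ground-truth labels for a two-class problem
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

# Accuracy: np.mean of the boolean matches equals np.sum(...) / len(...)
accuracy = np.mean(y_pred == y_true)

# Break the errors down by type
false_pos = int(np.sum((y_pred == 1) & (y_true == 0)))  # zeros predicted as ones
false_neg = int(np.sum((y_pred == 0) & (y_true == 1)))  # ones predicted as zeros

print(accuracy, false_pos, false_neg)
```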

Compare the performance against the test set:

```
y_pred = model.predict(images_0_1_test) > 0.5
y_pred = y_pred.astype(int)
np.sum(y_pred == labels_0_1_test) / len(labels_0_1_test)
```

The output will be:

```
0.9938534278959811
```

Activity 12: Iris Classification Using Logistic Regression

Solution

Import the required packages. For this activity, we will require the pandas package for loading the data, the Matplotlib package for plotting, and scikit-learn for creating the logistic regression model. Import all the required packages and relevant modules for these tasks:
```
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
```
Load the Iris dataset using pandas and examine the first five rows:
```
df = pd.read_csv('iris-data.csv')
df.head()
```
The output will be:
Figure 4.79: The first five rows of the Iris dataset

The next step is feature selection. We need to choose the features that will provide the most powerful classification model. Plot a number of different features against the allocated species classifications, for example, sepal width versus petal length, grouped by species. Visually inspect the plots and look for any patterns that could indicate separation between the species:

```
markers = {
    'Iris-setosa': {'marker': 'x'},
    'Iris-versicolor': {'marker': '*'},
    'Iris-virginica': {'marker': 'o'},
}
plt.figure(figsize=(10, 7))
for name, group in df.groupby('Species'):
    plt.scatter(group['Sepal Width'], group['Petal Length'],
                label=name,
                marker=markers[name]['marker'],
               )

plt.title('Species Classification Sepal Width vs Petal Length');
plt.xlabel('Sepal Width (mm)');
plt.ylabel('Petal Length (mm)');
plt.legend();
```

The output will be:

Figure 4.80: Species classification plot

Select the features by writing the column names in the following list:

```
selected_features = [
    'Sepal Width', # List features here
    'Petal Length'
]
```

Before we can construct the model, we must first convert the species values into labels that can be used within the model. Replace the Iris-setosa species string with the value 0, the Iris-versicolor species string with the value 1, and the Iris-virginica species string with the value 2:
```
species = [
    'Iris-setosa', # 0
    'Iris-versicolor', # 1
    'Iris-virginica', # 2
]
output = [species.index(spec) for spec in df.Species]
```
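
The encoding step can also be written with a lookup dictionary, which avoids scanning the list for every row. This is a minimal sketch on a few made-up rows standing in for `df.Species`; the name `species_to_label` is introduced here for illustration.

```python
species = [
    'Iris-setosa',      # 0
    'Iris-versicolor',  # 1
    'Iris-virginica',   # 2
]

# Build the name -> integer lookup once
species_to_label = {name: idx for idx, name in enumerate(species)}

# A few made-up rows standing in for df.Species
sample_species = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']
output = [species_to_label[name] for name in sample_species]
print(output)  # [2, 0, 1, 0]

# Mapping labels back to names uses the original list
print([species[label] for label in output])
```

With the DataFrame from this activity, `df.Species.map(species_to_label)` should give the same labels as a pandas Series.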

Create the model using the selected_features and the assigned species labels:

```
model = LogisticRegression(multi_class='auto', solver='lbfgs')
model.fit(df[selected_features], output)
```

The output will be:

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='auto',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
```

Compute the accuracy of the model against the training set:
```
model.score(df[selected_features], output)
```
The output will be:
```
0.9533333333333334
```
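
The score above is measured on the same rows the model was fitted to, so it tends to be optimistic. A minimal sketch of holding out part of the data follows, using NumPy only. The function name `train_test_indices`, the 80/20 ratio, and the seed are assumptions made for illustration and are not part of the original solution.

```python
import numpy as np


def train_test_indices(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled, non-overlapping train and test row positions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]


# The Iris dataset has 150 rows
train_idx, test_idx = train_test_indices(150)
print(len(train_idx), len(test_idx))  # 120 30
```

With the DataFrame from this activity, `df.iloc[train_idx]` could be used for fitting and `df.iloc[test_idx]` for scoring; `output` would need to be indexed the same way, for example after converting it to a NumPy array.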

Construct another model using your second choice selected_features and compare the performance:

```
selected_features = [
    'Sepal Length', # List features here
    'Petal Width'
]
model.fit(df[selected_features], output)
model.score(df[selected_features], output)
```

The output will be:

```
0.96
```

Construct another model using the sepal length and sepal width features and compare the performance:

```
selected_features = [
    'Sepal Length', # List features here
    'Sepal Width'
]
model.fit(df[selected_features], output)
model.score(df[selected_features], output)
```

The output will be:

```
0.82
```

Activity 13: K-NN Multiclass Classifier

Solution

Import the following packages:

```
import struct
import numpy as np
import gzip
import urllib.request
import matplotlib.pyplot as plt
from array import array
from sklearn.neighbors import KNeighborsClassifier as KNN
```

Load the MNIST data into memory.

Training images:

```
with gzip.open('train-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img = np.array(array("B", f.read())).reshape((size, rows, cols))
```

Training labels:

```
with gzip.open('train-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels = np.array(array("B", f.read()))
```

Test images:

```
with gzip.open('t10k-images-idx3-ubyte.gz', 'rb') as f:
    magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    img_test = np.array(array("B", f.read())).reshape((size, rows, cols))
```

Test labels:

```
with gzip.open('t10k-labels-idx1-ubyte.gz', 'rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    labels_test = np.array(array("B", f.read()))
```

Visualize a sample of the data:

```
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(img[i], cmap='gray');
    plt.title(f'{labels[i]}');
    plt.axis('off')
```

The output will be:

Figure 4.81: Sample images

Construct a K-NN classifier with three nearest neighbors to classify the MNIST dataset. Again, to save processing power, randomly sample 5,000 images for use in training:
```
selection = np.random.choice(len(img), 5000)
selected_images = img[selection]
selected_labels = labels[selection]
```
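
Note that `np.random.choice` samples with replacement by default, so the 5,000 selected indices can contain duplicates. If distinct images are wanted, one option is sketched below using NumPy's `Generator` API; the seed is an illustrative assumption, and the array size is the size of the MNIST training set.

```python
import numpy as np

n_images = 60000   # size of the MNIST training set
n_samples = 5000

rng = np.random.default_rng(42)  # seeded so the selection is reproducible

# replace=False guarantees that no index is drawn twice
selection = rng.choice(n_images, size=n_samples, replace=False)

print(len(selection), len(np.unique(selection)))  # 5000 5000
```

The resulting `selection` can be used exactly as in the step above, for example `img[selection]` and `labels[selection]`.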
In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape:
```
selected_images = selected_images.reshape((-1, rows * cols))
selected_images.shape
```
The output will be:
```
(5000, 784)
```

Build the three-neighbor KNN model and fit the data to the model. Note that, in this activity, we are providing 784 features or dimensions to the model, not simply 2:

```
model = KNN(n_neighbors=3)
model.fit(X=selected_images, y=selected_labels)
```

The output will be:

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')
```
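
The settings shown in the output (Minkowski metric with `p=2`, uniform weights) correspond to Euclidean distance with an unweighted majority vote. To show the idea behind the classifier, here is a minimal brute-force sketch in NumPy on a tiny synthetic dataset. The function name `knn_predict` is hypothetical, and the sketch does not reproduce scikit-learn's internals such as its search structures or its tie-breaking rules.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, n_neighbors=3):
    """Classify each query row by majority vote among its nearest training rows."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    predictions = []
    for query in X_query:
        # Euclidean distance from the query to every training sample
        distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
        # Indices of the n_neighbors closest training samples
        nearest = np.argsort(distances)[:n_neighbors]
        # Majority vote; in this sketch ties go to the smallest label
        votes = np.bincount(y_train[nearest])
        predictions.append(int(np.argmax(votes)))
    return np.array(predictions)


# Tiny synthetic dataset: two clusters in two dimensions
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_train, y_train, [[0.5, 0.5], [5.5, 5.5]]))  # [0 1]
```

The same logic applies to the 784-dimensional MNIST vectors; only the length of each row changes.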

Determine the score against the training set:
```
model.score(X=selected_images, y=selected_labels)
```
The output will be:
```
0.9692
```

Display the first two predictions for the model against the training data:

```
model.predict(selected_images)[:2]
```

Plot the corresponding two images to compare them with the predicted values:

```
plt.subplot(1, 2, 1)
plt.imshow(selected_images[0].reshape((28, 28)), cmap='gray');
plt.axis('off');
plt.subplot(1, 2, 2)
plt.imshow(selected_images[1].reshape((28, 28)), cmap='gray');
plt.axis('off');
```

The output will be as follows:

Figure 4.82: First predicted values

Compare the performance against the test set:

```
model.score(X=img_test.reshape((-1, rows * cols)), y=labels_test)
```

The output will be:

```
0.9376
```
