Applied Unsupervised Learning with Python

By: Benjamin Johnston, Aaron Jones, Christopher Kruger
Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you in learning the best practices for using unsupervised learning techniques in tandem with Python libraries and extracting meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well-versed with the k-means algorithm and how it operates, you’ll learn what dimensionality reduction is and where to apply it. As you progress, you’ll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also understand how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. Finally, you will be able to put your knowledge to work through interesting activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.

Chapter 4: Dimension Reduction and PCA


Activity 6: Manual PCA versus scikit-learn

Solution

  1. Import the pandas, numpy, and matplotlib plotting libraries and the scikit-learn PCA model:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
  2. Load the dataset and select only the sepal features as per the previous exercises. Display the first five rows of the data:

    df = pd.read_csv('iris-data.csv')
    df = df[['Sepal Length', 'Sepal Width']]
    df.head()

    The output is as follows:

    Figure 4.43: The first five rows of the data

  3. Compute the covariance matrix for the data:

    cov = np.cov(df.values.T)  # np.cov treats each row as a variable, hence the transpose
    cov

    The output is as follows:

    Figure 4.44: The covariance matrix for the data
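
    For reference, with its default normalization np.cov computes the sample covariance between two features x and y over n samples:

    \operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})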

  4. Transform the data using the scikit-learn API and only the first principal component. Store the transformed data in the sklearn_pca variable:

    model = PCA(n_components=1)
    sklearn_pca = model.fit_transform(df.values)
  5. Transform the data using the manual PCA and only the first principal component. Store the transformed data in the manual_pca variable:

    # SVD of the symmetric covariance matrix gives its eigenvectors and eigenvalues
    eigenvectors, eigenvalues, _ = np.linalg.svd(cov, full_matrices=False)
    # Take the first principal component and project the data onto it
    P = eigenvectors[0]
    manual_pca = P.dot(df.values.T)
  6. Plot the sklearn_pca and manual_pca values on the same plot to visualize the difference:

    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.45: A plot of the data

  7. Notice that the two plots look almost identical, except that one is a mirror image of the other and there is an offset between the two. Display the components of the scikit-learn model and the manual PCA:

    model.components_

    The output is as follows:

    array([[ 0.99693955, -0.07817635]])

    Now print P:

    P

    The output is as follows:

    array([-0.99693955,  0.07817635])

    Notice the difference in the signs: the values are identical, but the signs are flipped, producing the mirror-image result. This is just a difference in convention, nothing meaningful; a short sketch of one way to align the signs follows this step.
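
    The sign of each principal component is arbitrary; scikit-learn resolves the ambiguity with a deterministic sign convention internally (its svd_flip utility), and a similar convention can be applied to the manual result. A minimal sketch, assuming the P and model variables from the previous steps (align_sign is a hypothetical helper, not part of either API):

    def align_sign(component):
        # Flip a 1D component so its largest-magnitude entry is positive,
        # mimicking the kind of deterministic convention scikit-learn applies
        return component * np.sign(component[np.argmax(np.abs(component))])

    align_sign(P)                     # array([ 0.99693955, -0.07817635])
    align_sign(model.components_[0])  # array([ 0.99693955, -0.07817635])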

  8. Multiply the manual_pca data by -1 and re-plot:

    manual_pca *= -1
    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.46: Re-plotted data

  9. Now, all we need to do is deal with the offset between the two. The scikit-learn API subtracts the mean of the data prior to the transform (a quick check of this is sketched after this step). Subtract the mean of each column from the dataset before completing the transform with the manual PCA:

    mean_vals = np.mean(df.values, axis=0)
    offset_vals = df.values - mean_vals
    manual_pca = P.dot(offset_vals.T)
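
    The mean-centering performed by scikit-learn can be confirmed directly: a fitted PCA object stores the subtracted mean in its mean_ attribute, which should match the column means computed above. A quick check, using the model and mean_vals variables already defined:

    print(model.mean_)                          # per-column means stored by scikit-learn
    print(np.allclose(model.mean_, mean_vals))  # expected to print True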
  10. Multiply the result by -1:

    manual_pca *= -1
  11. Re-plot the individual sklearn_pca and manual_pca values:

    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.47: Re-plotting the data

The final plot demonstrates that the dimensionality reduction completed by the two methods is, in fact, the same. The differences lie in the signs of the computed components, as the two methods simply adopt opposite sign conventions for the same direction, and in an offset between the two sets of values, which arises because the scikit-learn PCA subtracts the mean of the samples before executing the transform.
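
With the sign flipped and the mean removed, the equivalence can also be checked numerically rather than just visually. A minimal check, assuming the sklearn_pca and manual_pca variables from the steps above:

    # sklearn_pca has shape (n_samples, 1), so flatten it before comparing
    np.allclose(sklearn_pca.ravel(), manual_pca)  # expected to return True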

Activity 7: PCA Using the Expanded Iris Dataset

Solution

  1. Import pandas, numpy, and matplotlib, along with the scikit-learn PCA model. To enable 3D plotting, you will also need to import Axes3D:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from mpl_toolkits.mplot3d import Axes3D # Required for 3D plotting
  2. Read in the dataset and select the columns Sepal Length, Sepal Width, and Petal Width:

    df = pd.read_csv('iris-data.csv')[['Sepal Length', 'Sepal Width', 'Petal Width']]
    df.head()

    The output is as follows:

    Figure 4.48: Sepal Length, Sepal Width, and Petal Width

  3. Plot the data in three dimensions:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width']);
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Expanded Iris Dataset');

    The output is as follows:

    Figure 4.49: Expanded Iris dataset plot

  4. Create a PCA model without specifying the number of components:

    model = PCA()
  5. Fit the model to the dataset:

    model.fit(df.values)

    The output is as follows:

    Figure 4.50: The model fitted to the dataset

  6. Display the proportion of variance explained by each component using explained_variance_ratio_:

    model.explained_variance_ratio_

    The output is as follows:

    array([0.8004668 , 0.14652357, 0.05300962])
  7. We want to reduce the dimensionality of the dataset but still keep at least 90% of the variance. What is the minimum number of components required to keep 90% of the variance?

    The first two components are required to retain at least 90% of the variance; together they account for 94.7% of the variance within the dataset, as the cumulative sum sketched below confirms.
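
    One way to confirm this, which also scales to datasets with many more features, is to take the cumulative sum of the explained variance ratios and find the first component count that crosses the 90% threshold. A minimal sketch, reusing the model fitted in step 5:

    cumulative = np.cumsum(model.explained_variance_ratio_)
    print(cumulative)                        # approximately [0.800 0.947 1.000]
    print(np.argmax(cumulative >= 0.9) + 1)  # expected to print 2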

  8. Create a new PCA model, this time specifying the number of components required to keep at least 90% of the variance:

    model = PCA(n_components=2)
  9. Transform the data using the new model:

    data_transformed = model.fit_transform(df.values)
  10. Plot the transformed data:

    plt.figure(figsize=(10, 7))
    plt.scatter(data_transformed[:,0], data_transformed[:,1]);

    The output is as follows:

    Figure 4.51: Plot of the transformed data

  11. Restore the transformed data to the original dataspace:

    data_restored = model.inverse_transform(data_transformed)
  12. Plot the restored data in three dimensions in one subplot and the original data in a second subplot to visualize the effect of removing some of the variance:

    fig = plt.figure(figsize=(10, 14))
    
    # Original Data
    ax = fig.add_subplot(211, projection='3d')
    ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width'], label='Original Data');
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Expanded Iris Dataset');
    
    # Transformed Data
    ax = fig.add_subplot(212, projection='3d')
    ax.scatter(data_restored[:,0], data_restored[:,1], data_restored[:,2], label='Restored Data');
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Restored Iris Dataset');

    The output is as follows:

    Figure 4.52: Plot of the expanded and the restored Iris datasets

Looking at Figure 4.52, we can see that, as with the 2D plots, we have removed much of the noise within the data while retaining the most important information regarding its trends. It can be seen that, in general, sepal length increases with petal width and that there seem to be two clusters of data within the plots, one sitting above the other.
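
The information discarded by the transform can also be quantified. A minimal sketch, assuming the df, data_restored, and two-component model variables from the preceding steps:

    # Mean squared difference between the original and restored measurements
    mse = np.mean((df.values - data_restored) ** 2)
    # Fraction of the total variance discarded by dropping the third component
    discarded = 1 - model.explained_variance_ratio_.sum()
    print(mse, discarded)  # the discarded fraction should be roughly 5%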

Note

When applying PCA, it is important to keep in mind the size of the data being modelled, as well as the available system memory. The singular value decomposition process involves decomposing the data into its eigenvalues and eigenvectors and can be quite memory intensive. If the dataset is too large, you may be unable to complete the process, suffer a significant performance loss, or lock up your system.
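
If the data is too large for a single decomposition, one option worth considering is scikit-learn's IncrementalPCA, which processes the data in batches and keeps only a running estimate of the components in memory. A minimal sketch (the batch size shown is arbitrary):

    from sklearn.decomposition import IncrementalPCA

    # Fit on batches of 50 rows rather than decomposing the whole matrix at once
    ipca = IncrementalPCA(n_components=2, batch_size=50)
    data_transformed = ipca.fit_transform(df.values)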