Applied Unsupervised Learning with Python

By: Benjamin Johnston, Aaron Jones, Christopher Kruger
Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you in learning the best practices for using unsupervised learning techniques in tandem with Python libraries and extracting meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well-versed with the k-means algorithm and how it operates, you’ll learn what dimensionality reduction is and where to apply it. As you progress, you’ll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also understand how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. Finally, you will be able to put your knowledge to work through interesting activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.

Chapter 4: Dimension Reduction and PCA


Activity 6: Manual PCA versus scikit-learn

Solution

  1. Import the pandas, numpy, and matplotlib plotting libraries and the scikit-learn PCA model:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
  2. Load the dataset and select only the sepal features as per the previous exercises. Display the first five rows of the data:

    df = pd.read_csv('iris-data.csv')
    df = df[['Sepal Length', 'Sepal Width']]
    df.head()

    The output is as follows:

    Figure 4.43: The first five rows of the data

  3. Compute the covariance matrix for the data:

    cov = np.cov(df.values.T)  # np.cov treats each row as a variable, hence the transpose
    cov

    The output is as follows:

    Figure 4.44: The covariance matrix for the data
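
    For reference, with its default normalization np.cov computes the sample covariance between two features x and y over n samples:

    \operatorname{cov}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})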

  4. Transform the data using the scikit-learn API and only the first principal component. Store the transformed data in the sklearn_pca variable:

    model = PCA(n_components=1)
    sklearn_pca = model.fit_transform(df.values)
  5. Transform the data using the manual PCA and only the first principal component. Store the transformed data in the manual_pca variable:

    # SVD of the symmetric covariance matrix gives its eigenvectors and eigenvalues
    eigenvectors, eigenvalues, _ = np.linalg.svd(cov, full_matrices=False)
    # Take the first principal component and project the data onto it
    P = eigenvectors[0]
    manual_pca = P.dot(df.values.T)
  6. Plot the sklearn_pca and manual_pca values on the same plot to visualize the difference:

    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.45: A plot of the data

  7. Notice that the two plots look almost identical, except that one is a mirror image of the other and there is an offset between the two. Display the components of the scikit-learn model and the manual PCA:

    model.components_

    The output is as follows:

    array([[ 0.99693955, -0.07817635]])

    Now print P:

    P

    The output is as follows:

    array([-0.99693955,  0.07817635])

    Notice the difference in the signs: the values are identical, but the signs are flipped, producing the mirror-image result. This is just a difference in convention, nothing meaningful; a short sketch of one way to align the signs follows this step.
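
    The sign of each principal component is arbitrary; scikit-learn resolves the ambiguity with a deterministic sign convention internally (its svd_flip utility), and a similar convention can be applied to the manual result. A minimal sketch, assuming the P and model variables from the previous steps (align_sign is a hypothetical helper, not part of either API):

    def align_sign(component):
        # Flip a 1D component so its largest-magnitude entry is positive,
        # mimicking the kind of deterministic convention scikit-learn applies
        return component * np.sign(component[np.argmax(np.abs(component))])

    align_sign(P)                     # array([ 0.99693955, -0.07817635])
    align_sign(model.components_[0])  # array([ 0.99693955, -0.07817635])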

  8. Multiply the manual_pca data by -1 and re-plot:

    manual_pca *= -1
    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.46: Re-plotted data

  9. Now, all we need to do is deal with the offset between the two. The scikit-learn API subtracts the mean of the data prior to the transform (a quick check of this is sketched after this step). Subtract the mean of each column from the dataset before completing the transform with the manual PCA:

    mean_vals = np.mean(df.values, axis=0)
    offset_vals = df.values - mean_vals
    manual_pca = P.dot(offset_vals.T)
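
    The mean-centering performed by scikit-learn can be confirmed directly: a fitted PCA object stores the subtracted mean in its mean_ attribute, which should match the column means computed above. A quick check, using the model and mean_vals variables already defined:

    print(model.mean_)                          # per-column means stored by scikit-learn
    print(np.allclose(model.mean_, mean_vals))  # expected to print True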
  10. Multiply the result by -1:

    manual_pca *= -1
  11. Re-plot the individual sklearn_pca and manual_pca values:

    plt.figure(figsize=(10, 7));
    plt.plot(sklearn_pca, label='Scikit-learn PCA');
    plt.plot(manual_pca, label='Manual PCA', linestyle='--');
    plt.xlabel('Sample');
    plt.ylabel('Transformed Value');
    plt.legend();

    The output is as follows:

    Figure 4.47: Re-plotting the data

The final plot demonstrates that the dimensionality reduction completed by the two methods is, in fact, the same. The differences lie in the signs of the computed components, as the two methods simply adopt opposite sign conventions for the same direction, and in an offset between the two sets of values, which arises because the scikit-learn PCA subtracts the mean of the samples before executing the transform.
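
With the sign flipped and the mean removed, the equivalence can also be checked numerically rather than just visually. A minimal check, assuming the sklearn_pca and manual_pca variables from the steps above:

    # sklearn_pca has shape (n_samples, 1), so flatten it before comparing
    np.allclose(sklearn_pca.ravel(), manual_pca)  # expected to return True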

Activity 7: PCA Using the Expanded Iris Dataset

Solution

  1. Import pandas, numpy, and matplotlib, along with the scikit-learn PCA model. To enable 3D plotting, you will also need to import Axes3D:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from mpl_toolkits.mplot3d import Axes3D # Required for 3D plotting
  2. Read in the dataset and select the columns Sepal Length, Sepal Width, and Petal Width:

    df = pd.read_csv('iris-data.csv')[['Sepal Length', 'Sepal Width', 'Petal Width']]
    df.head()

    The output is as follows:

    Figure 4.48: Sepal Length, Sepal Width, and Petal Width

  3. Plot the data in three dimensions:

    fig = plt.figure(figsize=(10, 7))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width']);
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Expanded Iris Dataset');

    The output is as follows:

    Figure 4.49: Expanded Iris dataset plot

  4. Create a PCA model without specifying the number of components:

    model = PCA()
  5. Fit the model to the dataset:

    model.fit(df.values)

    The output is as follows:

    Figure 4.50: The model fitted to the dataset

  6. Display the proportion of variance explained by each component using explained_variance_ratio_:

    model.explained_variance_ratio_

    The output is as follows:

    array([0.8004668 , 0.14652357, 0.05300962])
  7. We want to reduce the dimensionality of the dataset but still keep at least 90% of the variance. What is the minimum number of components required to keep 90% of the variance?

    The first two components are required to retain at least 90% of the variance; together they account for 94.7% of the variance within the dataset, as the cumulative sum sketched below confirms.
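
    One way to confirm this, which also scales to datasets with many more features, is to take the cumulative sum of the explained variance ratios and find the first component count that crosses the 90% threshold. A minimal sketch, reusing the model fitted in step 5:

    cumulative = np.cumsum(model.explained_variance_ratio_)
    print(cumulative)                        # approximately [0.800 0.947 1.000]
    print(np.argmax(cumulative >= 0.9) + 1)  # expected to print 2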

  8. Create a new PCA model, this time specifying the number of components required to keep at least 90% of the variance:

    model = PCA(n_components=2)
  9. Transform the data using the new model:

    data_transformed = model.fit_transform(df.values)
  10. Plot the transformed data:

    plt.figure(figsize=(10, 7))
    plt.scatter(data_transformed[:,0], data_transformed[:,1]);

    The output is as follows:

    Figure 4.51: Plot of the transformed data

  11. Restore the transformed data to the original dataspace:

    data_restored = model.inverse_transform(data_transformed)
  12. Plot the restored data in three dimensions in one subplot and the original data in a second subplot to visualize the effect of removing some of the variance:

    fig = plt.figure(figsize=(10, 14))
    
    # Original Data
    ax = fig.add_subplot(211, projection='3d')
    ax.scatter(df['Sepal Length'], df['Sepal Width'], df['Petal Width'], label='Original Data');
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Expanded Iris Dataset');
    
    # Transformed Data
    ax = fig.add_subplot(212, projection='3d')
    ax.scatter(data_restored[:,0], data_restored[:,1], data_restored[:,2], label='Restored Data');
    ax.set_xlabel('Sepal Length (mm)');
    ax.set_ylabel('Sepal Width (mm)');
    ax.set_zlabel('Petal Width (mm)');
    ax.set_title('Restored Iris Dataset');

    The output is as follows:

    Figure 4.52: Plot of the expanded and the restored Iris datasets

Looking at Figure 4.52, we can see that, as with the 2D plots, we have removed much of the noise within the data while retaining the most important information regarding its trends. It can be seen that, in general, sepal length increases with petal width and that there seem to be two clusters of data within the plots, one sitting above the other.
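
The information discarded by the transform can also be quantified. A minimal sketch, assuming the df, data_restored, and two-component model variables from the preceding steps:

    # Mean squared difference between the original and restored measurements
    mse = np.mean((df.values - data_restored) ** 2)
    # Fraction of the total variance discarded by dropping the third component
    discarded = 1 - model.explained_variance_ratio_.sum()
    print(mse, discarded)  # the discarded fraction should be roughly 5%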

Note

When applying PCA, it is important to keep in mind the size of the data being modelled, as well as the available system memory. The singular value decomposition process involves decomposing the data into its eigenvalues and eigenvectors and can be quite memory intensive. If the dataset is too large, you may be unable to complete the process, suffer a significant performance loss, or lock up your system.
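
If the data is too large for a single decomposition, one option worth considering is scikit-learn's IncrementalPCA, which processes the data in batches and keeps only a running estimate of the components in memory. A minimal sketch (the batch size shown is arbitrary):

    from sklearn.decomposition import IncrementalPCA

    # Fit on batches of 50 rows rather than decomposing the whole matrix at once
    ipca = IncrementalPCA(n_components=2, batch_size=50)
    data_transformed = ipca.fit_transform(df.values)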