Applied Unsupervised Learning with Python

By: Benjamin Johnston, Aaron Jones, Christopher Kruger
Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you in learning the best practices for using unsupervised learning techniques in tandem with Python libraries and extracting meaningful information from unstructured data. The book begins by explaining how basic clustering works to find similar data points in a set. Once you are well-versed with the k-means algorithm and how it operates, you’ll learn what dimensionality reduction is and where to apply it. As you progress, you’ll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also understand how to mine topics that are trending on Twitter and Facebook and build a news recommendation engine for users. Finally, you will be able to put your knowledge to work through interesting activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.

Chapter 6: t-Distributed Stochastic Neighbor Embedding (t-SNE)


Activity 12: Wine t-SNE

Solution:

  1. Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
  2. Load the Wine dataset using the wine.data file included in the accompanying source code and display the first five rows of data:

    df = pd.read_csv('wine.data', header=None)
    df.head()

    The output is as follows:

    Figure 6.24: The first five rows of the wine dataset.

  3. The first column contains the labels; extract this column and remove it from the dataset:

    labels = df[0]
    del df[0]
  4. Execute PCA to reduce the dataset to the first six components:

    model_pca = PCA(n_components=6)
    wine_pca = model_pca.fit_transform(df)
  5. Determine the amount of variance within the data described by these six components:

    np.sum(model_pca.explained_variance_ratio_)

    The output is as follows:

    0.99999314824536
  6. Create a t-SNE model using a specified random state and a verbose value of 1:

    tsne_model = TSNE(random_state=0, verbose=1)
    tsne_model

    The output is as follows:

    Figure 6.25: Creating a t-SNE model

  7. Fit the PCA data to the t-SNE model:

    wine_tsne = tsne_model.fit_transform(wine_pca.reshape((len(wine_pca), -1)))

    The output is as follows:

    Figure 6.26: Fitting the PCA data to the t-SNE model

  8. Confirm that the shape of the t-SNE fitted data is two-dimensional:

    wine_tsne.shape

    The output is as follows:

    (178, 2)
  9. Create a scatter plot of the two-dimensional data:

    plt.figure(figsize=(10, 7))
    plt.scatter(wine_tsne[:,0], wine_tsne[:,1]);
    plt.title('Low Dimensional Representation of Wine');
    plt.show()

    The output is as follows:

    Figure 6.27: Scatterplot of two-dimensional data

  10. Create a secondary scatter plot of the two-dimensional data with the class labels applied to visualize any clustering that may be present:

    MARKER = ['o', 'v', '^']
    plt.figure(figsize=(10, 7))
    plt.title('Low Dimensional Representation of Wine');
    for i in range(1, 4):
        selections = wine_tsne[labels == i]
        plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1], label=f'Wine {i}', s=30);
    plt.legend();
    plt.show()

    The output is as follows:

    Figure 6.28: Secondary plot of two-dimensional data

Note that while there is some overlap between the wine classes, some clustering is also evident in the data. The first wine class sits predominantly in the top-left corner of the plot, the second in the bottom-right, and the third between the first two. This representation certainly couldn't be used to classify individual wine samples with great confidence, but it reveals an overall trend and a series of clusters within the high-dimensional data that we were unable to see earlier.
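
The degree of separation visible in this plot can also be quantified rather than judged by eye alone. The following is a minimal sketch beyond the original activity, assuming wine_tsne and labels are still in scope from the preceding steps; scikit-learn's silhouette_score gives a rough measure of how well the embedded points group by wine class:

    from sklearn.metrics import silhouette_score

    # Silhouette scores range from -1 to 1; values near 0 indicate
    # overlapping clusters, consistent with the overlap seen in the plot.
    score = silhouette_score(wine_tsne, labels)
    print(f'Silhouette score of the t-SNE embedding: {score:.3f}')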

Activity 13: t-SNE Wine and Perplexity

Solution:

  1. Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
  2. Load the Wine dataset and inspect the first five rows:

    df = pd.read_csv('wine.data', header=None)
    df.head()

    The output is as follows:

    Figure 6.29: The first five rows of wine data.

  3. The first column provides the labels; extract them from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:

    labels = df[0]
    del df[0]
  4. Execute PCA on the dataset and extract the first six components:

    model_pca = PCA(n_components=6)
    wine_pca = model_pca.fit_transform(df)
    wine_pca = wine_pca.reshape((len(wine_pca), -1))
  5. Construct a loop that iterates through the perplexity values (1, 5, 20, 30, 80, 160, 320). For each value, generate a t-SNE model with the corresponding perplexity and produce a scatter plot of the labeled wine classes. Note the effect of different perplexity values:

    MARKER = ['o', 'v', '^']
    for perp in [1, 5, 20, 30, 80, 160, 320]:
        tsne_model = TSNE(random_state=0, verbose=1, perplexity=perp)
        wine_tsne = tsne_model.fit_transform(wine_pca)
        plt.figure(figsize=(10, 7))
        plt.title(f'Low Dimensional Representation of Wine. Perplexity {perp}');
        for i in range(1, 4):
            selections = wine_tsne[labels == i]
            plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1], label=f'Wine {i}', s=30);
        plt.legend();
        plt.show()

    A perplexity value of 1 fails to separate the data into any particular structure:

    Figure 6.30: Plot for perplexity value 1

    Increasing the perplexity to 5 leads to a very non-linear structure that is difficult to separate, and it's hard to identify any clusters or patterns:

    Figure 6.31: Plot for perplexity of 5

    A perplexity of 20 finally starts to show some sort of horseshoe structure. While this structure is visually obvious, it can still be tricky to exploit for separating the classes:

    Figure 6.32: Plot for perplexity of 20

    A perplexity of 30 demonstrates quite good results: the projected structure is more linear, with some separation between the types of wine:

    Figure 6.33: Plot for perplexity of 30

    Finally, the last two images in the activity show the extent to which the plots can become increasingly complex and non-linear with increasing perplexity:

    Figure 6.34: Plot for perplexity of 80

    Here's the plot for a perplexity of 160:

    Figure 6.35: Plot for perplexity of 160

Looking at the individual plots for each of the perplexity values, the effect perplexity has on the visualization of the data is immediately obvious. Very small or very large perplexity values produce a range of unusual shapes that don't indicate the presence of any persistent pattern. The most plausible value seems to be 30, which produced the most linear plot, matching the one we saw in the previous activity.

In this activity, we demonstrated the need to take care when selecting the perplexity value and showed that some experimentation may be required to determine an appropriate one.
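
One way to make this experimentation less ad hoc is to record the final Kullback-Leibler (KL) divergence that t-SNE minimizes for each run. The following sketch is an addition beyond the activity, reusing wine_pca from step 4. Because the KL divergence is computed against a perplexity-dependent target distribution, the values are only loosely comparable across perplexities, so they complement rather than replace the visual inspection above:

    # Record the final KL divergence for each perplexity value.
    # Note: recent scikit-learn versions require perplexity < n_samples,
    # so the largest values here may need a smaller list or an older release.
    for perp in [1, 5, 20, 30, 80, 160, 320]:
        tsne_model = TSNE(random_state=0, perplexity=perp)
        tsne_model.fit_transform(wine_pca)
        print(f'Perplexity {perp}: KL divergence = {tsne_model.kl_divergence_:.3f}')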

Activity 14: t-SNE Wine and Iterations

Solution:

  1. Import pandas, numpy, matplotlib, and the t-SNE and PCA models from scikit-learn:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE
  2. Load the Wine dataset and inspect the first five rows:

    df = pd.read_csv('wine.data', header=None)
    df.head()

    The output is as follows:

    Figure 6.36: The first five rows of the wine dataset

  3. The first column provides the labels; extract these from the DataFrame and store them in a separate variable. Ensure that the column is removed from the DataFrame:

    labels = df[0]
    del df[0]
  4. Execute PCA on the dataset and extract the first six components:

    model_pca = PCA(n_components=6)
    wine_pca = model_pca.fit_transform(df)
    wine_pca = wine_pca.reshape((len(wine_pca), -1))
  5. Construct a loop that iterates through the iteration values (250, 500, 1000). For each value, generate a t-SNE model with the corresponding number of iterations and the same value for the number of iterations without progress:

    MARKER = ['o', 'v', '^']
    for iterations in [250, 500, 1000]:
        model_tsne = TSNE(random_state=0, verbose=1, n_iter=iterations, n_iter_without_progress=iterations)
        wine_tsne = model_tsne.fit_transform(wine_pca)
  6. Construct a scatter plot of the labeled wine classes. Note the effect of different iteration values:

        plt.figure(figsize=(10, 7))
        plt.title(f'Low Dimensional Representation of Wine (iterations = {iterations})');
        for i in range(1, 4):
            selections = wine_tsne[labels == i]
            plt.scatter(selections[:,0], selections[:,1], marker=MARKER[i-1], label=f'Wine {i}', s=30);
        plt.legend();
        plt.show()

    The output is as follows:

    Figure 6.37: Scatterplot of wine classes with 250 iterations

    Here's the plot for 500 iterations:

    Figure 6.38: Scatterplot of wine classes with 500 iterations

    Here's the plot for 1,000 iterations:

    Figure 6.39: Scatterplot of wine classes with 1,000 iterations

Again, we can see the improvement in the structure of the data as the number of iterations increases. Even in a relatively simple dataset such as this, 250 iterations are not sufficient to project any structure of the data into the lower-dimensional space.

As this activity demonstrates, there is a balance to strike when setting the iteration parameter: in this example, 250 iterations were insufficient, and at least 1,000 iterations were required for the structure of the data to stabilize.
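
As a rough numerical check on this stabilization (again, an addition beyond the activity itself), the final KL divergence can be compared across iteration budgets. With a fixed perplexity the values are directly comparable, and the point at which they stop decreasing appreciably suggests convergence. Note that n_iter has been renamed max_iter in recent scikit-learn releases:

    # Compare the final KL divergence across iteration budgets, assuming
    # wine_pca from step 4 is still in scope.
    for iterations in [250, 500, 1000]:
        model_tsne = TSNE(random_state=0, n_iter=iterations,
                          n_iter_without_progress=iterations)
        model_tsne.fit_transform(wine_pca)
        print(f'{iterations} iterations: KL divergence = {model_tsne.kl_divergence_:.3f}')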