Applied Unsupervised Learning with Python

By: Benjamin Johnston, Aaron Jones, Christopher Kruger

Overview of this book

Unsupervised learning is a useful and practical solution in situations where labeled data is not available. Applied Unsupervised Learning with Python guides you through best practices for using unsupervised learning techniques with Python libraries to extract meaningful information from unstructured data. The book begins by explaining how basic clustering finds similar data points in a set. Once you are familiar with the k-means algorithm and how it operates, you'll learn what dimensionality reduction is and where to apply it. As you progress, you'll learn various neural network techniques and how they can improve your model. While studying the applications of unsupervised learning, you will also learn how to mine topics that are trending on Twitter and Facebook and how to build a news recommendation engine for users. Finally, you will put your knowledge to work through activities such as performing a Market Basket Analysis and identifying relationships between different products. By the end of this book, you will have the skills you need to confidently build your own models using Python.
Table of Contents (12 chapters)

Chapter 1: Introduction to Clustering

Activity 1: Implementing k-means Clustering


  1. Load the Iris data file using pandas, a package that makes data wrangling much easier through the use of DataFrames:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import silhouette_score
    from scipy.spatial.distance import cdist
    iris = pd.read_csv('iris_data.csv', header=None)
    iris.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'species']
  2. Separate out the X features and the provided y species labels, since we want to treat this as an unsupervised learning problem:

    X = iris[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
    y = iris['species']
  3. Get an idea of what our features look like:

    X.head()

    The output is as follows:

    Figure 1.22: First five rows of the data

  4. Bring back the k_means function we made earlier for reference:

    def k_means(X, K):
        # Keep track of history so you can see k-means in action
        centroids_history = []
        labels_history = []
        # Pick K distinct samples as the initial centroids
        rand_index = np.random.choice(X.shape[0], K, replace=False)
        centroids = X[rand_index]
        centroids_history.append(centroids)
        while True:
            # Euclidean distances are calculated for each point relative to
            # the centroids, and then np.argmin returns the index of the
            # minimal distance, that is, the cluster a point is assigned to
            labels = np.argmin(cdist(X, centroids), axis=1)
            labels_history.append(labels)
            # Take the mean of the points within each cluster to find new centroids
            new_centroids = np.array([X[labels == i].mean(axis=0)
                                      for i in range(K)])
            centroids_history.append(new_centroids)
            # If the centroids no longer change, k-means has converged and
            # the loop ends; otherwise continue with the updated centroids
            if np.all(centroids == new_centroids):
                break
            centroids = new_centroids
        return centroids, labels, centroids_history, labels_history
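
To make the two steps inside the loop of k_means concrete, the following is a minimal sketch of a single assignment and update pass on six hand-made points. The toy data and the names points and toy_labels are assumptions for illustration only, and the distances are computed with NumPy broadcasting rather than cdist so that the snippet needs nothing beyond NumPy:

```python
import numpy as np

# Six hand-made 2D points forming two obvious groups (hypothetical toy data)
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
# Start from two of the points as the initial centroids
centroids = points[[0, 3]]

# Assignment step: Euclidean distance from every point to every centroid
# (broadcasting gives an array of shape (6, 2)), then pick the nearest centroid
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
toy_labels = np.argmin(distances, axis=1)

# Update step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array([points[toy_labels == i].mean(axis=0)
                          for i in range(2)])

print(toy_labels)
print(new_centroids)
```

Repeating these two steps until new_centroids stops changing is what the while loop in k_means does.
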
  5. Convert our Iris X feature DataFrame to a NumPy matrix:

    X_mat = X.values
  6. Run our k_means function on the Iris matrix:

    centroids, labels, centroids_history, labels_history = k_means(X_mat, 3)
  7. See what labels we get by looking at just the list of predicted species per sample:

    print(labels)

    The output is as follows:

    Figure 1.23: List of predicted species
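
The cluster ids returned by k-means are arbitrary integers, so they will not line up with the species names by themselves. One way to see how the two relate is a contingency table, sketched below with pandas' crosstab. The small true_species and cluster_ids Series are hypothetical stand-ins; in the activity you would pass y and labels instead:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real y and labels from the steps above
true_species = pd.Series(['setosa', 'setosa', 'versicolor',
                          'versicolor', 'virginica', 'virginica'],
                         name='species')
cluster_ids = pd.Series(np.array([1, 1, 0, 0, 2, 0]), name='cluster')

# Rows are the known species, columns are cluster ids;
# each cell counts the samples with that combination
table = pd.crosstab(true_species, cluster_ids)
print(table)
```

A species whose samples fall mostly in one column has been recovered as a single cluster; a row spread across several columns indicates that the species was split or merged with another.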

  8. Visualize how our k-means implementation performed on the dataset. First, plot the raw samples without any cluster information:

    plt.scatter(X['SepalLengthCm'], X['SepalWidthCm'])
    plt.title('Iris - Sepal Length vs Width')

    The output is as follows:

    Figure 1.24: Plot of performed k-means implementation

    Then color each sample by its assigned cluster to visualize the clusters of Iris species:

    plt.scatter(X['SepalLengthCm'], X['SepalWidthCm'], c=labels, cmap='tab20b')
    plt.title('Iris - Sepal Length vs Width - Clustered')

    The output is as follows:

    Figure 1.25: Clusters of Iris species

  9. Calculate the Silhouette Score using the scikit-learn implementation:

    # Calculate Silhouette Score
    silhouette_score(X[['SepalLengthCm','SepalWidthCm']], labels)

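    You will get a silhouette score (SSI) of roughly 0.369; the exact value varies from run to run because the initial centroids are chosen at random. Since only two of the four features are used for the score, this is acceptable, especially when combined with the cluster memberships visible in the final plot.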
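
To show what the silhouette score measures, here is a hand-rolled sketch in plain NumPy. For each sample it compares a, the mean distance to the other members of its own cluster, with b, the smallest mean distance to the members of any other cluster, and averages (b - a) / max(a, b) over all samples. The function name silhouette_by_hand and the toy data are assumptions for illustration, and the sketch assumes that every cluster contains at least two points; for real work, prefer scikit-learn's silhouette_score.

```python
import numpy as np

def silhouette_by_hand(points, labels):
    """Mean silhouette coefficient; assumes every cluster has >= 2 points."""
    # Pairwise Euclidean distances, shape (n, n)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = dist[i, same].mean()
        # Mean distance to each other cluster; keep the nearest one
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical toy data: two tight, well-separated groups
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
good = silhouette_by_hand(points, np.array([0, 0, 0, 1, 1, 1]))
bad = silhouette_by_hand(points, np.array([0, 1, 0, 1, 0, 1]))
print(good, bad)
```

Labels that match the two visible groups score close to 1, while labels that mix the groups score much lower, which is why a value such as 0.369 indicates clusters that overlap to some degree.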