
Introduction to k-medoids Clustering


k-medoids is another clustering algorithm that can be used to find natural groupings in a dataset. It is very similar to k-means clustering, the main difference being that it uses a slightly different optimization function. In this section, we're going to study k-medoids clustering.

The k-medoids Clustering Algorithm

There are many different types of algorithms to perform k-medoids clustering, the simplest and most efficient of which is Partitioning Around Medoids, or PAM for short. In PAM, we do the following steps to find cluster centers:

  1. Choose k data points from the scatter plot as starting points for cluster centers.

  2. Calculate their distance from all the points in the scatter plot.

  3. Classify each point into the cluster whose center it is closest to.

  4. Select a new point in each cluster that minimizes the sum of distances of all points in that cluster from itself.

  5. Repeat Steps 2 to 4 until the centers stop changing.

You can see that the PAM algorithm is identical to the k-means clustering algorithm, except for Step 1 and Step 4. For most practical purposes, k-medoids clustering gives almost identical results to k-means clustering. But in some special cases where we have outliers in a dataset, k-medoids clustering is preferred because it's more robust to outliers. We'll study their differences, and when to use which type of clustering, in later sections.
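
To make these steps concrete, here is a minimal sketch of the PAM loop in R. This is not the book's code: the function name pam_sketch and its arguments are hypothetical, and the pam() function we use in the next exercise is both faster and more carefully implemented.

    # A rough sketch of the PAM steps described above (illustration only)
    pam_sketch <- function(data, k, max_iter = 100) {
      data <- as.matrix(data)
      d <- as.matrix(dist(data))                 # pair-wise distances between all points
      medoids <- sample(nrow(data), k)           # Step 1: pick k data points as starting centers

      for (iter in seq_len(max_iter)) {
        # Steps 2 and 3: assign every point to its closest medoid
        clusters <- apply(d[, medoids, drop = FALSE], 1, which.min)

        # Step 4: within each cluster, pick the point with the smallest
        # total distance to the other points in that cluster
        new_medoids <- sapply(seq_len(k), function(i) {
          members <- which(clusters == i)
          members[which.min(rowSums(d[members, members, drop = FALSE]))]
        })

        # Step 5: stop when the centers no longer change
        if (identical(sort(new_medoids), sort(medoids))) break
        medoids <- new_medoids
      }
      list(medoids = medoids, clusters = clusters)
    }

Calling pam_sketch(iris[, 1:2], 3) should produce three clusters broadly similar to the ones we get from the library call in the next exercise.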

k-medoids Clustering Code

In this section, we're going to use the same iris flowers dataset that we used in the last two sections and see whether the results are visibly different from the ones we got last time. Instead of writing code to perform each step of the k-medoids algorithm, we're going to use R libraries directly to do PAM clustering.

Exercise 5: Implementing k-medoids Clustering

In this exercise, we're going to perform k-medoids with R's pre-built libraries:

  1. Store the first two columns of the iris dataset in the iris_data variable:

    iris_data<-iris[,1:2]
  2. Install the cluster package:

    install.packages("cluster")
  3. Import the cluster package:

    library("cluster")
  5. Store the PAM clustering results in the km variable:

    km<-pam(iris_data,3)
  5. Import the factoextra library:

    library("factoextra")
  6. Plot the PAM clustering results in a graph:

    fviz_cluster(km, data = iris_data, palette = "jco", ggtheme = theme_minimal())

    The output is as follows:

    Figure 1.21: Results of k-medoids clustering

The results of k-medoids clustering are not very different from those of the k-means clustering we did in the previous section.

So, we can see that the preceding PAM algorithm classifies our dataset into three clusters that are similar to the clusters we got with k-means clustering. If we plot the results of both types of clustering side by side, we can clearly see how similar they are:

Figure 1.22: The results of k-medoids clustering versus k-means clustering

In the preceding graphs, observe that the centers found by k-means and k-medoids clustering are very close to each other, but the k-medoids centers coincide exactly with points already in the data, while the k-means centers do not.

k-means Clustering versus k-medoids Clustering

Now that we've studied both k-means and k-medoids clustering, which are almost identical to each other, we're going to study the differences between them and when to use which type of clustering:

  • Computational complexity: Of the two methods, k-medoids clustering is computationally more expensive. When our dataset is large (say, more than 10,000 points) and we want to save computation time, we'll prefer k-means clustering over k-medoids clustering.

    Note

    Whether your dataset is large or not is entirely dependent on the computation power available. As computation gets cheaper over time, what is considered a large dataset will change in the future.

  • Presence of outliers: k-means clustering is more sensitive to outliers than k-medoids. Cluster center positions can shift significantly due to the presence of outliers in the dataset, so we use k-medoids clustering when we have to build clusters resilient to outliers.

  • Cluster centers: The k-means and k-medoids algorithms find cluster centers in different ways. The center of a k-medoids cluster is always a data point in the dataset, whereas the center of a k-means cluster need not be a data point in the dataset.
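
To see this last difference directly, here is a small sketch (assuming the cluster package is installed, as in Exercise 5) that prints both sets of centers for the iris data:

    library(cluster)
    iris_data <- iris[, 1:2]

    km.res  <- kmeans(iris_data, 3)   # k-means: centers are averages of cluster members
    pam.res <- pam(iris_data, 3)      # k-medoids: centers are actual rows of the data

    km.res$centers    # generally not rows of iris_data
    pam.res$medoids   # always rows of iris_data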

Activity 3: Performing Customer Segmentation with k-medoids Clustering

Use the Wholesale customer dataset to perform both k-means and k-medoids clustering, and then compare the results. Read the data downloaded from the UCI machine learning repository into a variable. The data can be found at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Data/wholesale_customers_data.csv.

Note

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00292/. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity03/wholesale_customers_data.csv.

These steps will help you complete the activity:

  1. Select only two columns, Grocery and Frozen, for easy two-dimensional visualization of clusters.

  2. Use k-medoids clustering to plot a graph showing four clusters for this data.

  3. Use k-means clustering to plot a four-cluster graph.

  4. Compare the two graphs to comment on how the results of the two methods differ.

The outcome will be a k-means plot of the clusters as follows:

Figure 1.23: The expected k-means plot of the cluster

Note

The solution for this activity can be found on page 206.

Deciding the Optimal Number of Clusters

Until now, we've been working with the iris flowers dataset, in which we know how many categories of flowers there are, and we chose to divide the dataset into three clusters based on this knowledge. But in unsupervised learning, our primary task is to work with data about which we have no such information, such as how many natural clusters or categories a dataset contains. Clustering can also be a form of exploratory data analysis, in which case you won't have much information about the data. And when the data has more than two dimensions, it becomes hard to visualize it and determine the number of clusters manually. So, how do we find the optimal number of clusters in these scenarios? In this section, we're going to learn techniques for determining the optimal number of clusters.

Types of Clustering Metrics

There is more than one way of determining the optimal number of clusters in unsupervised learning. The following are the ones that we're going to study in this chapter:

  • The Silhouette score

  • The Elbow method / WSS

  • The Gap statistic

Silhouette Score

The silhouette score or average silhouette score calculation is used to quantify the quality of clusters achieved by a clustering algorithm. Let's take a point, a, in a cluster, x:

  1. Calculate the average distance between point a and all the points in cluster x (denoted by dxa):

    Figure 1.24: Calculating the average distance between point a and all the points of cluster x

  2. Calculate the average distance between point a and all the points in the cluster nearest to a (denoted by dya):

    Figure 1.25: Calculating the average distance between point a and all the points of the cluster nearest to a

  3. Calculate the silhouette score for that point by subtracting the result of Step 1 from the result of Step 2 and dividing by the larger of the two: (dya - dxa) / max(dxa, dya).

  4. Repeat the first three steps for all points in the cluster.

  5. After getting the silhouette score for every point in the cluster, the average of all those scores is the silhouette score of that cluster:

    Figure 1.26: Calculating the silhouette score

  6. Repeat the preceding steps for all the clusters in the dataset.

  7. After getting the silhouette score for all the clusters in the dataset, the average of all those scores is the silhouette score of that dataset:

    Figure 1.27: Calculating the average silhouette score

The silhouette score ranges between -1 and 1. If the silhouette score of a cluster is low (between 0 and -1), it means that the cluster is spread out or that the distances between the points of that cluster are high. If the silhouette score of a cluster is high (close to 1), it means that the cluster is well defined, the distances between its points are low, and their distances from points in other clusters are high. So, the ideal silhouette score is near 1.
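
To make the preceding steps concrete, here is a minimal sketch that computes the silhouette score of a single point from a distance matrix and a vector of cluster labels. It is not part of the exercises that follow, and the function name silhouette_point and its arguments are hypothetical:

    # d is a full distance matrix, for example as.matrix(dist(iris_data)),
    # and cl is a vector of cluster labels, one per row of the data
    silhouette_point <- function(i, d, cl) {
      same <- which(cl == cl[i] & seq_along(cl) != i)       # other points in i's cluster
      dxa  <- mean(d[i, same])                              # Step 1: average within-cluster distance
      other_clusters <- setdiff(unique(cl), cl[i])
      dya  <- min(sapply(other_clusters,
                         function(g) mean(d[i, cl == g])))  # Step 2: average distance to nearest other cluster
      (dya - dxa) / max(dxa, dya)                           # Step 3
    }

Averaging this value over all points in a cluster gives that cluster's silhouette score, and averaging it over all points in the dataset gives the dataset's silhouette score.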

The preceding algorithm is important for building an intuition for silhouette scores, but you don't need to implement it yourself. So, we're going to learn how to do silhouette analysis in R using some pre-built libraries.

Exercise 6: Calculating the Silhouette Score

In this exercise, we're going to learn how to calculate the silhouette score of a dataset with a fixed number of clusters:

  1. Put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable:

    iris_data<-iris[,1:2]
  2. Import the cluster library to perform k-means clustering:

    library(cluster)
  3. Store the k-means clusters in the km.res variable:

    km.res<-kmeans(iris_data,3)
  4. Store the pair-wise distance matrix for all data points in the pair_dis variable:

    pair_dis<-daisy(iris_data)
  5. Calculate the silhouette score for each point in the dataset:

    sc<-silhouette(km.res$cluster, pair_dis)
  6. Plot the silhouette score plot:

    plot(sc,col=1:8,border=NA)

    The output is as follows:

    Fig 1.28: The silhouette score for each point in every cluster is represented by a single bar

The preceding plot gives the average silhouette score of the dataset as 0.45. It also shows the average silhouette score cluster-wise and point-wise.
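
If you want the average silhouette score as a number rather than reading it off the plot, one way (a small aside, not part of the exercise) is to average the sil_width column of the silhouette object:

    mean(sc[, "sil_width"])   # average silhouette score of the dataset (about 0.45 here)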

In the preceding exercise, we calculated the silhouette score for three clusters. But to decide how many clusters to use, we'll have to calculate the silhouette score for several different numbers of clusters. In the next exercise, we're going to learn how to do this with the help of a library called factoextra in R.

Exercise 7: Identifying the Optimum Number of Clusters

In this exercise, we're going to identify the optimal number of clusters by calculating the silhouette score for various values of k in one line of code with the help of an R library:

  1. Put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable:

    iris_data<-iris[,1:2]
  2. Import the factoextra library:

    library("factoextra")
  3. Plot a graph of the silhouette score versus number of clusters (up to 20):

    fviz_nbclust(iris_data, kmeans, method = "silhouette", k.max = 20)

    Note

    In the second argument, you can replace kmeans with pam to use k-medoids clustering, or with any other clustering function. The k.max argument is the maximum number of clusters for which the score is calculated. In the method argument, you can pass any of the three clustering metrics discussed in this chapter.

    The output is as follows:

    Fig 1.29: The number of clusters versus average silhouette score

From the preceding graph, you select the value of k that has the highest score; that is, 2. Two is the optimal number of clusters as per the silhouette score.

WSS/Elbow Method

To identify a cluster in a dataset, we try to minimize the distance between points in a cluster, and the Within-Sum-of-Squares (WSS) method measures exactly that. The WSS score is the sum of the squares of the distances of all points within a cluster. In this method, we perform the following steps:

  1. Calculate clusters by using different values of k.

  2. For every value of k, calculate WSS using the following formula:

    Figure 1.30: The formula to calculate WSS where p is the total number of dimensions of the data

    This formula is illustrated here:

    Figure 1.31: Illustration of WSS score

    Note

    Figure 1.31 illustrates WSS relative to two points, but in reality WSS measures the sums of all distances relative to all points within every cluster.

  3. Plot number of clusters k versus WSS score.

  4. Identify the k value after which the WSS score doesn't decrease significantly and choose this k as the ideal number of clusters. This point is also known as the elbow of the graph, hence the name "elbow method".
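
As a rough illustration of these steps (not the book's code), the following sketch computes the WSS score for k = 1 to 10 using the tot.withinss value returned by kmeans(), and plots it so we can look for the elbow:

    iris_data <- iris[, 1:2]

    # WSS for each value of k; nstart = 10 reruns k-means with several random starts
    wss <- sapply(1:10, function(k) kmeans(iris_data, centers = k, nstart = 10)$tot.withinss)

    # Plot the number of clusters versus the WSS score and look for the elbow
    plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "WSS")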

In the following exercise, we're going to learn how to identify the ideal number of clusters with the help of the factoextra library.

Exercise 8: Using WSS to Determine the Number of Clusters

In this exercise, we will see how we can use WSS to determine the number of clusters. Perform the following steps.

  1. Put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable:

    iris_data<-iris[,1:2]
  2. Import the factoextra library:

    library("factoextra")
  3. Plot a graph of WSS versus number of clusters (up to 20):

    fviz_nbclust(iris_data, kmeans, method = "wss", k.max = 20)

    The output is as follows:

    Fig 1.32: WSS versus number of clusters

In the preceding graph, we can choose k=3 as the elbow, since the value of WSS starts dropping more slowly after k=3. Choosing the elbow of the graph is always somewhat subjective, and there could be times when you would choose k=4 or k=2 instead of k=3. But from this graph, it's clear that values of k greater than 5 are inappropriate, as they lie well past the elbow, which is where the slope of the graph changes sharply.

The Gap Statistic

The Gap statistic is one of the most effective methods of finding the optimal number of clusters in a dataset. It is applicable to any type of clustering method. The Gap statistic is calculated by comparing the WSS value for the clusters generated on our observed dataset versus a reference dataset in which there are no apparent clusters. The reference dataset is a uniform distribution of data points between the minimum and maximum values of our dataset on which we want to calculate the Gap statistic.

So, in short, the Gap statistic measures the WSS values for both the observed and random datasets and finds the deviation of the observed dataset from the random dataset. To find the ideal number of clusters, we choose the value of k that gives us the maximum value of the Gap statistic. The mathematical details of how these deviations are measured are beyond the scope of this book. In the next exercise, we're going to learn how to calculate the Gap statistic with the help of the factoextra library.

Here is a reference dataset:

Figure 1.33: The reference dataset

The following is the observed dataset:

Figure 1.34: The observed dataset
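
Before turning to the library call, here is a rough sketch of the core comparison (not the book's code; the variable names are hypothetical). It builds a single uniform reference dataset spanning the same range as iris_data and compares its WSS with the observed WSS for one value of k. The actual Gap statistic averages this comparison over many reference datasets, which the library call in the next exercise handles for us:

    set.seed(7)
    k <- 3
    iris_data <- iris[, 1:2]

    # Reference data: uniform between the minimum and maximum of each column
    ref_data <- apply(iris_data, 2,
                      function(col) runif(nrow(iris_data), min(col), max(col)))

    obs_wss <- kmeans(iris_data, k)$tot.withinss
    ref_wss <- kmeans(ref_data, k)$tot.withinss

    # The gap for this k is roughly the difference of the log WSS values;
    # a large gap means the observed data clusters much better than random data
    log(ref_wss) - log(obs_wss)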

Exercise 9: Calculating the Ideal Number of Clusters with the Gap Statistic

In this exercise, we will calculate the ideal number of clusters using the Gap statistic:

  1. Put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable as follows:

    iris_data<-iris[,1:2]
  2. Import the factoextra library as follows:

    library("factoextra")
  3. Plot the graph of Gap statistics versus number of clusters (up to 20):

    fviz_nbclust(iris_data, kmeans, method = "gap_stat", k.max = 20)

    Fig 1.35: Gap statistics versus number of clusters

As we can see in the preceding graph, the highest value of the Gap statistic is for k=3. Hence, the ideal number of clusters in the iris dataset is three. Three is also the number of species in the dataset, indicating that the gap statistic has enabled us to reach a correct conclusion.

Activity 4: Finding the Ideal Number of Market Segments

Find the optimal number of clusters in the wholesale customers dataset with all three of the preceding methods:

Note

This dataset is taken from the UCI Machine Learning Repository. You can find the dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/00292/. We have downloaded the file and saved it at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity04/wholesale_customers_data.csv.

  1. Load columns 5 and 6 of the wholesale customers dataset into a variable.

  2. Calculate the optimal number of clusters for k-means clustering with the silhouette score.

  3. Calculate the optimal number of clusters for k-means clustering with the WSS score.

  4. Calculate the optimal number of clusters for k-means clustering with the Gap statistic.

The outcome will be three graphs representing the optimal number of clusters with the silhouette score, with the WSS score, and with the Gap statistic.

Note

The solution for this activity can be found on page 208.

As we have seen, each method can give a different value for the optimal number of clusters. Sometimes the results won't make sense, as you saw in the case of the Gap statistic: it gave an optimal number of clusters of one, which would mean that clustering shouldn't be done on this dataset at all and that all data points should belong to a single cluster.

All the points in a given cluster will have similar properties. Interpretation of those properties is left to domain experts. And in unsupervised learning, there's almost never a single right answer for the number of clusters.