Applied Unsupervised Learning with R

By: Alok Malik, Bradford Tuckfield

Overview of this book

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions. This book begins with the most important and commonly used method for unsupervised learning - clustering - and explains the three main clustering algorithms - k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models. By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.

Chapter 1: Introduction to Clustering Methods


Activity 1: k-means Clustering with Three Clusters

Solution:

  1. Load the Iris dataset in the iris_data variable:

    iris_data<-iris
  2. Create a t_color column with a default value of red. Then change the values for two of the species to green and blue, so that the third species remains red:

    iris_data$t_color='red'
    iris_data$t_color[which(iris_data$Species=='setosa')]<-'green'
    iris_data$t_color[which(iris_data$Species=='virginica')]<-'blue'

    Note

    Here, we change the color only for those rows whose species is setosa or virginica.

  3. Choose any three random cluster centers:

    k1<-c(7,3)
    k2<-c(5,3)
    k3<-c(6,2.5)
  4. Create a scatter plot by passing the sepal length, sepal width, and point colors to the plot() function, then mark the three cluster centers with points():

    plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$t_color)
    points(k1[1],k1[2],pch=4)
    points(k2[1],k2[2],pch=5)
    points(k3[1],k3[2],pch=6)

    Here is the output:

    Figure 1.36: Scatter plot for the given cluster centers

  5. Choose a number of iterations:

    number_of_steps<-10
  6. Set the initial value of n:

    n<-1
  7. Start the while loop for finding the cluster centers:

    while(n<number_of_steps){
  8. Calculate the distance of each point from the current cluster centers. We're calculating the Euclidean distance here using the sqrt function:

    iris_data$distance_to_clust1 <- sqrt((iris_data$Sepal.Length-k1[1])^2+(iris_data$Sepal.Width-k1[2])^2)
    iris_data$distance_to_clust2 <- sqrt((iris_data$Sepal.Length-k2[1])^2+(iris_data$Sepal.Width-k2[2])^2)
    iris_data$distance_to_clust3 <- sqrt((iris_data$Sepal.Length-k3[1])^2+(iris_data$Sepal.Width-k3[2])^2)
  9. Assign each point to the cluster whose center is closest to it (a more compact alternative is sketched in the note after the loop):

      iris_data$clust_1 <- 1*(iris_data$distance_to_clust1<=iris_data$distance_to_clust2 & iris_data$distance_to_clust1<=iris_data$distance_to_clust3)
      iris_data$clust_2 <- 1*(iris_data$distance_to_clust1>iris_data$distance_to_clust2 & iris_data$distance_to_clust3>iris_data$distance_to_clust2)
      iris_data$clust_3 <- 1*(iris_data$distance_to_clust3<iris_data$distance_to_clust1 & iris_data$distance_to_clust3<iris_data$distance_to_clust2)
  10. Calculate the new cluster centers by taking the mean x and y coordinates of the points assigned to each cluster with the mean() function:

      k1[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_1==1)])
      k1[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_1==1)])
      k2[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_2==1)])
      k2[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_2==1)])
      k3[1]<-mean(iris_data$Sepal.Length[which(iris_data$clust_3==1)])
      k3[2]<-mean(iris_data$Sepal.Width[which(iris_data$clust_3==1)])
      n=n+1
    }
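
    Note

    The three indicator columns clust_1, clust_2, and clust_3 implement nearest-center assignment through explicit pairwise comparisons. An equivalent, more compact sketch (not part of the original solution) builds one distance column per center and picks the nearest with which.min():

    centers<-rbind(k1,k2,k3)
    # One column of distances per center; 150 rows by 3 columns
    dists<-sapply(1:3,function(i) sqrt((iris_data$Sepal.Length-centers[i,1])^2+(iris_data$Sepal.Width-centers[i,2])^2))
    # nearest_cluster holds 1, 2, or 3 for each row
    nearest_cluster<-apply(dists,1,which.min)
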
  11. Assign a color to each cluster to plot the final scatter plot:

    iris_data$color='red'
    iris_data$color[which(iris_data$clust_2==1)]<-'blue'
    iris_data$color[which(iris_data$clust_3==1)]<-'green'
  12. Generate the final plot:

    plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=iris_data$color)
    points(k1[1],k1[2],pch=4)
    points(k2[1],k2[2],pch=5)
    points(k3[1],k3[2],pch=6)

    The output is as follows:

    Figure 1.37: Scatter plot representing different species in different colors
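
    Note

    As a quick cross-check (a sketch, not part of the activity), R's built-in kmeans() function applied to the same two features should converge to centers similar to the final k1, k2, and k3. set.seed() is used because kmeans() picks random starting centers:

    set.seed(123)
    builtin<-kmeans(iris_data[,c("Sepal.Length","Sepal.Width")],centers=3)
    builtin$centers # compare with the final k1, k2, and k3
    plot(iris_data$Sepal.Length,iris_data$Sepal.Width,col=builtin$cluster)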

Activity 2: Customer Segmentation with k-means

Solution:

  1. Download the data from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.

  2. Read the data into the ws variable:

    ws<-read.csv('wholesale_customers_data.csv')
  3. Store only columns 5 and 6 in the ws variable by discarding the rest of the columns:

    ws<-ws[5:6]
  4. Import the factoextra library:

    library(factoextra)
  5. Calculate the cluster centers for two clusters:

    clus<-kmeans(ws,2)
  6. Plot the chart for two clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.38: Chart for two clusters

    Notice how the outliers are absorbed into the two clusters rather than being separated out.

  7. Calculate the cluster centers for three clusters:

    clus<-kmeans(ws,3)
  8. Plot the chart for three clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.39: Chart for three clusters

    Notice that some outliers now form a separate cluster.

  9. Calculate the cluster centers for four clusters:

    clus<-kmeans(ws,4)
  10. Plot the chart for four clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.40: Chart for four clusters

    Notice how the outliers have started separating into two different clusters.

  11. Calculate the cluster centers for five clusters:

    clus<-kmeans(ws,5)
  12. Plot the chart for five clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.41: Chart for five clusters

    Notice how the outliers have clearly formed two separate clusters, shown in red and blue, while the rest of the data is grouped into three different clusters.

  13. Calculate the cluster centers for six clusters:

    clus<-kmeans(ws,6)
  14. Plot the chart for six clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.42: Chart for six clusters
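
    Note

    Steps 5 to 14 repeat the same pair of calls for each value of k. A compact alternative (a sketch, not part of the original solution) loops over k from 2 to 6; print() is needed because fviz_cluster() returns a ggplot object, and ggtitle() labels each chart:

    library(factoextra) # also attaches ggplot2, which provides ggtitle()
    for (k in 2:6) {
      clus<-kmeans(ws,k)
      print(fviz_cluster(clus,data=ws)+ggtitle(paste(k,"clusters")))
    }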

Activity 3: Performing Customer Segmentation with k-medoids Clustering

Solution:

  1. Read the CSV file into the ws variable:

    ws<-read.csv('wholesale_customers_data.csv')
  2. Store only columns 5 and 6 in the ws variable:

    ws<-ws[5:6]
  3. Import the factoextra library for visualization:

    library(factoextra)
  4. Import the cluster library for clustering by PAM:

    library(cluster)
  5. Calculate the clusters by passing the data and the number of clusters to the pam() function:

    clus<-pam(ws,4)
  6. Plot a visualization of the clusters:

    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.43: K-medoid plot of the clusters

  7. Calculate the clusters with k-means again and plot the output to compare it with the output of PAM clustering:

    clus<-kmeans(ws,4)
    fviz_cluster(clus,data=ws)

    The output is as follows:

    Figure 1.44: K-means plot of the clusters
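
    Note

    To see how the two algorithms differ in what they report (a sketch, not part of the activity), compare their cluster summaries directly: pam() returns actual observations as medoids, whereas kmeans() returns averaged centers that need not coincide with any real data point:

    pam_clus<-pam(ws,4)
    pam_clus$medoids # rows of ws chosen as the medoids
    km_clus<-kmeans(ws,4)
    km_clus$centers # coordinate means, not necessarily real observations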

Activity 4: Finding the Ideal Number of Market Segments

Solution:

  1. Read the downloaded dataset into the ws variable:

    ws<-read.csv('wholesale_customers_data.csv')
  2. Store only columns 5 and 6 in the variable by discarding other columns:

    ws<-ws[5:6]
  3. Calculate the optimal number of clusters with the silhouette score:

    fviz_nbclust(ws, kmeans, method = "silhouette",k.max=20)

    Here is the output:

    Figure 1.45: Graph representing optimal number of clusters with the silhouette score

    The optimal number of clusters, according to the silhouette score, is two.

  4. Calculate the optimal number of clusters with the WSS score:

    fviz_nbclust(ws, kmeans, method = "wss", k.max=20)

    Here is the output:

    Figure 1.46: Optimal number of clusters with the WSS score

    The optimal number of clusters according to the WSS elbow method is around six.

  5. Calculate the optimal number of clusters with the Gap statistic:

    fviz_nbclust(ws, kmeans, method = "gap_stat",k.max=20)

    Here is the output:

    Figure 1.47: Optimal number of clusters with the Gap statistic

    The optimal number of clusters according to the Gap statistic is one.
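
    Note

    To see what fviz_nbclust() computes for the elbow method, the WSS curve can be reproduced by hand (a sketch, not part of the original solution). tot.withinss is the total within-cluster sum of squares reported by kmeans(), and nstart=10 reduces the effect of random starting centers:

    wss<-sapply(1:20,function(k) kmeans(ws,k,nstart=10)$tot.withinss)
    plot(1:20,wss,type="b",xlab="Number of clusters k",ylab="Total WSS")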