## Chapter 1: Introduction to Clustering Methods

### Activity 1: k-means Clustering with Three Clusters

Solution:

Load the Iris dataset into the **iris_data** variable:

```r
iris_data <- iris
```

Create a **t_color** column with a default value of **red**. Change the value for two of the species to **green** and **blue** so that the third remains **red**:

```r
iris_data$t_color <- 'red'
iris_data$t_color[which(iris_data$Species == 'setosa')] <- 'green'
iris_data$t_color[which(iris_data$Species == 'virginica')] <- 'blue'
```

### Note

Here, we change the **t_color** column only for those rows whose species is **setosa** or **virginica**.

Choose any three random cluster centers:

```r
k1 <- c(7, 3)
k2 <- c(5, 3)
k3 <- c(6, 2.5)
```

Plot sepal length against sepal width by passing them to the **plot()** function along with the color column, and mark each cluster center with **points()**:

```r
plot(iris_data$Sepal.Length, iris_data$Sepal.Width, col = iris_data$t_color)
points(k1[1], k1[2], pch = 4)
points(k2[1], k2[2], pch = 5)
points(k3[1], k3[2], pch = 6)
```

Here is the output:

Choose a number of iterations:

```r
number_of_steps <- 10
```

Choose the initial value of **n**:

```r
n <- 1
```

Start the **while** loop that iterates toward the final cluster centers (the loop body continues through the next three steps):

```r
while (n < number_of_steps) {
```

Calculate the distance of each point from the current cluster centers. We're calculating the Euclidean distance here using the **sqrt** function:

```r
  iris_data$distance_to_clust1 <- sqrt((iris_data$Sepal.Length - k1[1])^2 + (iris_data$Sepal.Width - k1[2])^2)
  iris_data$distance_to_clust2 <- sqrt((iris_data$Sepal.Length - k2[1])^2 + (iris_data$Sepal.Width - k2[2])^2)
  iris_data$distance_to_clust3 <- sqrt((iris_data$Sepal.Length - k3[1])^2 + (iris_data$Sepal.Width - k3[2])^2)
```
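For reference, this is the standard two-dimensional Euclidean distance between a point $(x, y)$ and a center $(c_x, c_y)$:

$$d = \sqrt{(x - c_x)^2 + (y - c_y)^2}$$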

Assign each point to the cluster whose center is closest to it (ties are broken in favor of the lower-numbered cluster, so every point is assigned exactly once):

```r
  # multiplying a logical vector by 1 converts TRUE/FALSE to 1/0
  iris_data$clust_1 <- 1 * (iris_data$distance_to_clust1 <= iris_data$distance_to_clust2 & iris_data$distance_to_clust1 <= iris_data$distance_to_clust3)
  iris_data$clust_2 <- 1 * (iris_data$distance_to_clust2 < iris_data$distance_to_clust1 & iris_data$distance_to_clust2 <= iris_data$distance_to_clust3)
  iris_data$clust_3 <- 1 * (iris_data$distance_to_clust3 < iris_data$distance_to_clust1 & iris_data$distance_to_clust3 < iris_data$distance_to_clust2)
```

Calculate the new cluster centers by taking the mean **x** and **y** coordinates of each cluster's points with the **mean()** function in R, then increment **n** and close the loop:

```r
  k1[1] <- mean(iris_data$Sepal.Length[which(iris_data$clust_1 == 1)])
  k1[2] <- mean(iris_data$Sepal.Width[which(iris_data$clust_1 == 1)])
  k2[1] <- mean(iris_data$Sepal.Length[which(iris_data$clust_2 == 1)])
  k2[2] <- mean(iris_data$Sepal.Width[which(iris_data$clust_2 == 1)])
  k3[1] <- mean(iris_data$Sepal.Length[which(iris_data$clust_3 == 1)])
  k3[2] <- mean(iris_data$Sepal.Width[which(iris_data$clust_3 == 1)])
  n <- n + 1
}
```
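The same algorithm is implemented by base R's **kmeans()** function; a minimal sketch for comparison, seeded with the same three starting centers (cluster numbering and exact results may differ slightly, since **kmeans()** uses a different update scheme by default):

```r
# Sketch: built-in k-means on the same two features,
# starting from the same initial centers as the manual loop
km <- kmeans(iris[, c("Sepal.Length", "Sepal.Width")],
             centers = rbind(c(7, 3), c(5, 3), c(6, 2.5)))
km$centers  # final cluster centers
```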

Choose a color for each cluster to plot the final scatterplot:

```r
iris_data$color <- 'red'
iris_data$color[which(iris_data$clust_2 == 1)] <- 'blue'
iris_data$color[which(iris_data$clust_3 == 1)] <- 'green'
```

Generate the final plot:

```r
plot(iris_data$Sepal.Length, iris_data$Sepal.Width, col = iris_data$color)
points(k1[1], k1[2], pch = 4)
points(k2[1], k2[2], pch = 5)
points(k3[1], k3[2], pch = 6)
```

The output is as follows:

### Activity 2: Customer Segmentation with k-means

Solution:

Download the data from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-R/tree/master/Lesson01/Activity02/wholesale_customers_data.csv.

Read the data into the **ws** variable:

```r
ws <- read.csv('wholesale_customers_data.csv')
```

Store only columns 5 and 6 in the **ws** variable by discarding the rest of the columns:

```r
ws <- ws[5:6]
```

Import the **factoextra** library:

```r
library(factoextra)
```

Calculate the cluster centers for two clusters:

```r
clus <- kmeans(ws, 2)
```
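Note that **kmeans()** starts from randomly chosen centers, so the clusters (and the plots below) can vary between runs; a small sketch of fixing the random seed first for reproducibility (the seed value here is an arbitrary choice):

```r
set.seed(100)  # arbitrary seed, used only for reproducibility
clus <- kmeans(ws, 2)
```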

Plot the chart for two clusters:

```r
fviz_cluster(clus, data = ws)
```

The output is as follows:

Notice how the outliers are also absorbed into the two clusters.

Calculate the cluster centers for three clusters:

```r
clus <- kmeans(ws, 3)
```

Plot the chart for three clusters:

```r
fviz_cluster(clus, data = ws)
```

The output is as follows:

Notice that some outliers are now part of a separate cluster.

Calculate the cluster centers for four clusters:

```r
clus <- kmeans(ws, 4)
```

Plot the chart for four clusters:

```r
fviz_cluster(clus, data = ws)
```

The output is as follows:

Notice how the outliers have started separating into two different clusters.

Calculate the cluster centers for five clusters:

```r
clus <- kmeans(ws, 5)
```

Plot the chart for five clusters:

```r
fviz_cluster(clus, data = ws)
```

The output is as follows:

Notice how the outliers have clearly formed two separate clusters, in red and blue, while the rest of the data is divided among three other clusters.

Calculate the cluster centers for six clusters:

```r
clus <- kmeans(ws, 6)
```

Plot the chart for six clusters:

```r
fviz_cluster(clus, data = ws)
```

The output is as follows:

### Activity 3: Performing Customer Segmentation with k-medoids Clustering

Solution:

Read the CSV file into the **ws** variable:

```r
ws <- read.csv('wholesale_customers_data.csv')
```

Store only columns 5 and 6 in the **ws** variable:

```r
ws <- ws[5:6]
```

Import the **factoextra** library for visualization:

```r
library(factoextra)
```

Import the **cluster** library for clustering with PAM:

```r
library(cluster)
```

Calculate the clusters by passing the data and the number of clusters to the **pam()** function:

```r
clus <- pam(ws, 4)
```
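Unlike k-means, PAM (k-medoids) uses actual data points as cluster centers; if you want to inspect them, the fitted object exposes them directly:

```r
clus$medoids  # the rows of ws chosen as cluster medoids
```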

Plot a visualization of the clusters:

```r
fviz_cluster(clus, data = ws)
```

The output is as follows:

Then, calculate the clusters with k-means and plot the output to compare it with the output of the **pam** clustering:

```r
clus <- kmeans(ws, 4)
fviz_cluster(clus, data = ws)
```

The output is as follows:

### Activity 4: Finding the Ideal Number of Market Segments

Solution:

Read the downloaded dataset into the **ws** variable:

```r
ws <- read.csv('wholesale_customers_data.csv')
```

Store only columns 5 and 6 in the variable by discarding other columns:

```r
ws <- ws[5:6]
```

Calculate the optimal number of clusters with the silhouette score:

```r
fviz_nbclust(ws, kmeans, method = "silhouette", k.max = 20)
```

Here is the output:

The optimal number of clusters, according to the silhouette score, is two.
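The average silhouette width that **fviz_nbclust** plots for each *k* can also be computed directly; a minimal sketch using the **cluster** package's **silhouette()** function:

```r
library(cluster)
km <- kmeans(ws, 2)
sil <- silhouette(km$cluster, dist(ws))
mean(sil[, "sil_width"])  # average silhouette width for k = 2
```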

Calculate the optimal number of clusters with the WSS score:

```r
fviz_nbclust(ws, kmeans, method = "wss", k.max = 20)
```

Here is the output:

The optimal number of clusters according to the WSS elbow method is around six.
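The same elbow curve can be reproduced by hand, because each **kmeans()** fit reports its total within-cluster sum of squares; a sketch (here **nstart = 25** restarts each fit from 25 random initializations to stabilize the curve):

```r
# Sketch: total within-cluster sum of squares for k = 1..20
wss <- sapply(1:20, function(k) kmeans(ws, k, nstart = 25)$tot.withinss)
plot(1:20, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")
```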

Calculate the optimal number of clusters with the Gap statistic:

```r
fviz_nbclust(ws, kmeans, method = "gap_stat", k.max = 20)
```

Here is the output:

The optimal number of clusters according to the Gap statistic is one.
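For the Gap statistic, **fviz_nbclust** builds on the **cluster** package's **clusGap()** function, which can also be called directly; a sketch (**B** is the number of bootstrap reference samples, reduced here to keep the run quick):

```r
library(cluster)
gap <- clusGap(ws, FUNcluster = kmeans, K.max = 20, B = 50)
fviz_gap_stat(gap)  # factoextra's plot of the Gap statistic
```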