Applied Unsupervised Learning with R

By: Alok Malik, Bradford Tuckfield

Overview of this book

Starting with the basics, Applied Unsupervised Learning with R explains clustering methods, distribution analysis, data encoders, and features of R that enable you to understand your data better and get answers to your most pressing business questions. This book begins with the most important and commonly used method for unsupervised learning - clustering - and explains the three main clustering algorithms - k-means, divisive, and agglomerative. Following this, you'll study market basket analysis, kernel density estimation, principal component analysis, and anomaly detection. You'll be introduced to these methods using code written in R, with further instructions on how to work with, edit, and improve R code. To help you gain a practical understanding, the book also features useful tips on applying these methods to real business problems, including market segmentation and fraud detection. By working through interesting activities, you'll explore data encoders and latent variable models. By the end of this book, you will have a better understanding of different anomaly detection methods, such as outlier detection, Mahalanobis distances, and contextual and collective anomaly detection.
Table of Contents (9 chapters)

Introduction to k-means Clustering with Built-In Functions


In this section, we're going to use some built-in libraries of R to perform k-means clustering instead of writing custom code, which is lengthy and error-prone. Using pre-built libraries instead of writing our own code has other advantages, too:

  • Library functions are computationally efficient, as thousands of hours have gone into their development.

  • Library functions are almost bug-free, having been tested by thousands of users in almost all practical scenarios.

  • Using libraries saves time, as you don't have to write and debug your own code.

k-means Clustering with Three Clusters

In the previous activity, we performed k-means clustering with three clusters by writing our own code. In this section, we're going to achieve a similar result with the help of pre-built R libraries.

At first, we're going to start with a distribution of three types of flowers in our dataset, as represented in the following graph:

Figure 1.17: A graph representing three species of iris in three colors

In the preceding plot, setosa is represented in blue, virginica in gray, and versicolor in pink.

With this dataset, we're going to perform k-means clustering and see whether the built-in algorithm is able to find a pattern on its own to classify these three species of iris using their sepal sizes. This time, we're going to use just four lines of code.
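Before running the clustering, it can help to see how a plot like Figure 1.17 is produced. The following is a minimal sketch using only base R graphics; the color-to-species mapping follows the description in the text (setosa in blue, versicolor in pink, virginica in gray), and the exact colors are illustrative:

```r
# Plot sepal length against sepal width, colored by species.
# levels(iris$Species) is c("setosa", "versicolor", "virginica"),
# so the color vector maps setosa -> blue, versicolor -> pink, virginica -> gray.
species_colors <- c("blue", "pink", "gray")
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = species_colors[as.integer(iris$Species)],
     pch = 19,
     xlab = "Sepal Length", ylab = "Sepal Width",
     main = "Three species of iris")
legend("topright", legend = levels(iris$Species),
       col = species_colors, pch = 19)
```

Note that this plot shows the true species labels; the k-means algorithm in the exercise below will never see them.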

Exercise 3: k-means Clustering with R Libraries

In this exercise, we're going to learn to do k-means clustering in a much easier way with the pre-built libraries of R. By completing this exercise, you will be able to divide the three species of Iris into three separate clusters:

  1. We put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable:

    iris_data<-iris[,1:2]
  2. We find the k-means cluster centers and the cluster to which each point belongs, and store it all in the km.res variable. Here, in the kmeans function, we enter the dataset as the first parameter and the number of clusters we want as the second parameter:

    km.res<-kmeans(iris_data,3)

    Note

    The kmeans function has many input parameters, which can be altered to produce different final outputs. You can find out more about them in the documentation at https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/kmeans.
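    As a sketch of two commonly adjusted parameters from that documentation, nstart controls how many random initializations are tried (the best result is kept) and iter.max caps the number of iterations. Since k-means can converge to a poor local optimum from an unlucky start, a larger nstart usually gives more stable clusters:

    ```r
    # Run k-means with 25 random starts and up to 100 iterations;
    # set.seed makes the random initialization reproducible
    set.seed(123)
    km.res <- kmeans(iris_data, centers = 3, nstart = 25, iter.max = 100)
    km.res$size      # number of points assigned to each cluster
    km.res$centers   # coordinates of the three cluster centers
    ```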

  3. Install the factoextra library as follows:

    install.packages('factoextra')
  4. We import the factoextra library for visualization of the clusters we just created. factoextra is an R package used for plotting multivariate data:

    library("factoextra")
  5. Generate the plot of the clusters. Here, we need to enter the results of k-means as the first parameter. In data, we need to enter the data on which clustering was done. In palette, we're selecting the color scheme for the clusters, and in ggtheme, we're selecting the theme of the output plot:

    fviz_cluster(km.res, data = iris_data, palette = "jco", ggtheme = theme_minimal())

    The output will be as follows:

    Figure 1.18: Three species of Iris have been clustered into three clusters

Here, if you compare Figure 1.18 to Figure 1.17, you will see that we have clustered all three species almost correctly. The clusters we've generated don't exactly match the species shown in Figure 1.17, but we've come very close considering that we only used sepal length and sepal width to separate them.
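One way to make this visual comparison concrete (a sketch, not part of the original exercise) is to cross-tabulate the cluster assignments against the true species labels. Counts concentrated in one cluster per species row indicate good agreement:

```r
# Cross-tabulate true species labels against assigned cluster numbers;
# km.res is the kmeans result from the exercise above
table(iris$Species, km.res$cluster)
```

Keep in mind that cluster numbers are arbitrary: k-means may label the setosa group as cluster 1 on one run and cluster 3 on another, so it is the pattern of counts that matters, not the specific cluster indices.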

You can see from this example that clustering would've been a very useful way of categorizing the irises if we didn't already know their species. You will come across many examples of datasets where you don't have labeled categories, but are able to use clustering to form your own groupings.