In this section, we're going to use some built-in libraries of R to perform k-means clustering instead of writing custom code, which is lengthy and prone to bugs and errors. Using pre-built libraries instead of writing our own code has other advantages, too:
Library functions are usually computationally efficient, because a great deal of development effort has gone into optimizing them.
Library functions are well tested, because many users have already exercised them across a wide range of practical scenarios.
Using libraries saves time, as you don't have to invest time in writing your own code.
In the previous activity, we performed k-means clustering with three clusters by writing our own code. In this section, we're going to achieve a similar result with the help of pre-built R libraries.
We start with the distribution of the three types of flowers in our dataset, as represented in the following graph:
In the preceding plot, setosa is represented in blue, virginica in gray, and versicolor in pink.
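If you want to reproduce a plot like this yourself, the following is a minimal sketch using base R graphics. It assumes the graph shows sepal length against sepal width; the color names, the species_colors variable, and the legend position are illustrative choices, not the exact settings used for the figure.

```r
# Sketch: scatter plot of the iris species by sepal length and width.
# species_colors is a hypothetical helper mapping each species to a color.
species_colors <- c(setosa = "blue", versicolor = "pink", virginica = "gray")

plot(iris$Sepal.Length, iris$Sepal.Width,
     col = species_colors[as.character(iris$Species)],  # one color per row
     pch = 19,
     xlab = "Sepal length", ylab = "Sepal width")

# Add a legend so each color can be matched to its species
legend("topright", legend = names(species_colors),
       col = species_colors, pch = 19)
```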
With this dataset, we're going to perform k-means clustering and see whether the built-in algorithm is able to find a pattern on its own to classify these three species of iris using their sepal sizes. This time, we're going to use just four lines of code.
In this exercise, we're going to learn to do k-means clustering in a much easier way with the pre-built libraries of R. By completing this exercise, you will be able to divide the three species of Iris into three separate clusters:
We put the first two columns of the iris dataset, sepal length and sepal width, in the iris_data variable:
iris_data <- iris[, 1:2]
We find the k-means cluster centers and the cluster to which each point belongs, and store the result in the km.res variable. In the kmeans function, we pass the dataset as the first parameter and the number of clusters we want as the second parameter:
km.res <- kmeans(iris_data, 3)
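Because kmeans starts from randomly chosen centers, the cluster numbering and sometimes the clusters themselves can differ between runs. The following sketch shows one way to make the result repeatable and to look inside the returned object. The seed value 42 and nstart = 25 are assumptions chosen for illustration; they are not part of the original exercise.

```r
# Sketch: a reproducible k-means run and a look at what it returns.
set.seed(42)                      # assumed seed; any fixed value works
iris_data <- iris[, 1:2]          # sepal length and sepal width
# nstart = 25 tries 25 random starts and keeps the best one (illustrative value)
km.res <- kmeans(iris_data, centers = 3, nstart = 25)

km.res$centers        # matrix with one row per cluster center
km.res$size           # number of points assigned to each cluster
head(km.res$cluster)  # cluster number assigned to the first six rows
```

The centers, size, and cluster components are the ones the plotting step relies on; the cluster numbers themselves are arbitrary labels.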
Install the factoextra library as follows:
install.packages('factoextra')
We load the factoextra library to visualize the clusters we just created. factoextra is an R package for extracting and visualizing the results of multivariate data analyses:
library("factoextra")
Generate the plot of the clusters. Here, we need to enter the results of k-means as the first parameter. In data, we need to enter the data on which clustering was done. In palette, we're selecting the color palette used to distinguish the clusters, and in ggtheme, we're selecting the theme of the output plot:
fviz_cluster(km.res, data = iris_data, palette = "jco", ggtheme = theme_minimal())
The output will be as follows:
Here, if you compare Figure 1.18 to Figure 1.17, you will see that we have grouped all three species almost correctly. The clusters we've generated don't exactly match the species shown in Figure 1.17, but we've come very close considering the limitations of only using sepal length and width to separate them.
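Besides comparing the two figures by eye, one way to check how closely the clusters follow the species is to cross-tabulate them. The sketch below is an illustration rather than part of the original exercise; the seed, the nstart value, and the comparison variable name are assumptions.

```r
# Sketch: count how many flowers of each species fall into each cluster.
set.seed(42)                      # assumed seed for a repeatable result
iris_data <- iris[, 1:2]
km.res <- kmeans(iris_data, centers = 3, nstart = 25)

# Rows are the known species; columns are the cluster numbers from k-means.
comparison <- table(Species = iris$Species, Cluster = km.res$cluster)
print(comparison)
```

Each row sums to 50, the number of flowers per species. A species that was separated cleanly has nearly all of its count in a single column, while species that overlap are spread across two columns. The cluster numbers are arbitrary, so only the pattern of counts matters.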
You can see from this example that clustering would've been a very useful way of categorizing the irises if we didn't already know their species. You will come across many examples of datasets where you don't have labeled categories, but are able to use clustering to form your own groupings.