In the previous chapter, we discussed K-means clustering and used Mahout to run it on a text dataset. One of the main challenges we identified there was choosing the initial number of clusters, and we looked at different techniques for estimating it. One such technique is Canopy clustering, which is also called a pre-clustering algorithm. In this chapter, we will discuss Canopy clustering in detail. We will cover the following topics:
Learning Canopy clustering
Using Mahout to execute Canopy clustering
Visualizing Canopy clusters using Mahout
Working with CSV files
Canopy clustering is a pre-clustering algorithm used to estimate the number of clusters in a dataset, as well as approximate centroids for those clusters. The term canopy refers to an enclosure that contains a group of points. It is a very fast algorithm and runs without the initial set of...
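To make the idea of estimating cluster counts and approximate centroids concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation. It assumes the commonly described form of the algorithm, which uses a loose distance threshold and a tight distance threshold (called `t1` and `t2` here, with `t1 > t2`); the excerpt above does not define these. The function names `canopy_clusters` and `centroid` and the sample points are hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clusters(points, t1, t2, distance=euclidean):
    """Group points into canopies (illustrative sketch, not Mahout's code).

    t1 is the loose threshold and t2 the tight threshold (assumed names).
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next remaining point as the center of a new canopy.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within the loose threshold: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: the point may still start
                # or join another canopy later.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def centroid(members):
    """Mean of the member points, used as an approximate centroid."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
canopies = canopy_clusters(points, t1=3.0, t2=1.5)
print("estimated number of clusters:", len(canopies))
for center, members in canopies:
    print("approximate centroid:", centroid(members))
```

The number of canopies gives a starting estimate for the number of clusters, and the canopy centroids can serve as initial centroids for an algorithm such as K-means. The result depends on the chosen thresholds and on the order in which points are visited.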