Rapid - Apache Mahout Clustering designs

Chapter 3. Understanding Canopy Clustering

In the previous chapter, we discussed K-means clustering and used Mahout to run it on a text dataset. We saw that one of the main challenges is choosing the initial number of clusters, and we looked at different techniques for estimating the number of clusters in a dataset. One such technique is Canopy clustering, which is also called a pre-clustering algorithm. In this chapter, we will discuss Canopy clustering in detail. We will cover the following topics:

  • Learning Canopy clustering

  • Using Mahout to execute Canopy clustering

  • Visualizing Canopy clusters using Mahout

  • Working with CSV files

Canopy clustering is a pre-clustering algorithm used to estimate the approximate number of clusters in a dataset, as well as approximate centroids for those clusters. The "canopy" in Canopy clustering refers to an enclosure that contains points. It is a very fast algorithm and runs without the initial set of...
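
To make the idea of canopies as enclosures of points concrete, here is a minimal Python sketch of the commonly described two-threshold form of Canopy clustering. It is not Mahout's implementation. The function names canopy_clustering and canopy_centroids, the parameters t1 (loose distance) and t2 (tight distance), and the choice of Euclidean distance are all assumptions made for this illustration.

```python
import math


def canopy_clustering(points, t1, t2, distance=math.dist):
    """Group points into canopies (illustrative sketch, not Mahout's code).

    t1 is the loose threshold: points closer than t1 to a canopy center
    join that canopy. t2 is the tight threshold: points closer than t2
    are also removed from the candidate list, so they cannot start a
    new canopy. Requires t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")

    remaining = list(points)
    canopies = []
    while remaining:
        # Take the next candidate as the center of a new canopy.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within the loose threshold: belongs to this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: may still seed or join
                # another canopy later.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


def canopy_centroids(canopies):
    """Return the mean point of each canopy as an approximate centroid."""
    return [
        tuple(sum(coords) / len(members) for coords in zip(*members))
        for _center, members in canopies
    ]


if __name__ == "__main__":
    # Hypothetical sample data: two well-separated groups of 2D points.
    sample = [(0, 0), (0.5, 0), (1, 0), (10, 10), (10.5, 10), (11, 10)]
    found = canopy_clustering(sample, t1=3.0, t2=1.5)
    print(len(found), canopy_centroids(found))
```

In this sketch, the number of canopies returned serves as the estimate of the number of clusters, and the canopy means can be passed to K-means as initial centroids. Because a point within t1 but outside t2 stays in the candidate list, canopies may overlap.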