Rapid - Apache Mahout Clustering designs

Canopy clustering in Mahout runs on Hadoop in MapReduce mode. The algorithm is implemented as a sequence of map and reduce steps and takes its input in the Hadoop sequence file format. The steps are as follows:

Convert the data into a form that can be used as input. This is called data massaging.
Each mapper runs Canopy clustering on the portion of the input it receives and outputs its Canopy centers.
Reducers receive these Canopy centers and cluster them to produce the final Canopy centers.
Data points are assigned to these Canopies.
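
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not Mahout's implementation: the function names (distance, canopy_centers, run_canopy), the sample points, and the single threshold t2 are all assumptions made for the example. It also simplifies by using the first point of each Canopy as its center and by simulating mappers and the reducer as ordinary function calls instead of Hadoop tasks.

```python
import math


def distance(a, b):
    """Euclidean distance; stands in for whichever distance measure is chosen."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t2):
    """One pass over the points: a point becomes a new Canopy center only if
    it lies farther than t2 from every center found so far (simplified rule)."""
    centers = []
    for p in points:
        if all(distance(p, c) > t2 for c in centers):
            centers.append(p)
    return centers


def run_canopy(shards, t2):
    """Simulate the map, reduce, and assignment steps on in-memory shards."""
    # Map step: each "mapper" builds Canopy centers from its own shard.
    local_centers = []
    for shard in shards:
        local_centers.extend(canopy_centers(shard, t2))

    # Reduce step: cluster the mappers' centers to get the final centers.
    final_centers = canopy_centers(local_centers, t2)

    # Assignment step: attach every point to its nearest final center.
    assignment = {}
    for shard in shards:
        for p in shard:
            assignment[p] = min(
                range(len(final_centers)),
                key=lambda i: distance(p, final_centers[i]),
            )
    return final_centers, assignment


# Hypothetical input: two shards, as if split between two mappers.
shards = [
    [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)],
    [(0.0, 1.0), (11.0, 10.0), (10.0, 11.0)],
]
final_centers, assignment = run_canopy(shards, t2=3.0)
print(final_centers)
print(assignment)
```

Running the reduce step over the mappers' centers is what merges Canopies that different mappers discovered independently for the same region of the data.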

The whole process can be understood as two phases: the Canopy generation phase and the Canopy clustering phase. It is described at https://mahout.apache.org/users/clustering/canopy-clustering.html

The Canopy generation phase

During the map step, each mapper processes a subset of the total points and applies the chosen distance measure and thresholds to generate Canopies. In the mapper, each point that is found to be within...
