Rapid - Apache Mahout Clustering designs

Running Canopy clustering on Mahout


Canopy clustering in Mahout runs as a Hadoop MapReduce job: the algorithm is implemented as map and reduce steps, and it takes its input in the Hadoop sequence file format. The steps are as follows:

  1. Convert the data into a form that you can use as an input. This is called data massaging.

  2. Each mapper runs Canopy clustering on the input subset it receives and outputs its Canopy centers.

  3. The reducer receives the Canopy centers from the mappers and clusters these centers to produce the final Canopy centers.

  4. Data points are assigned to these Canopies.
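Steps 2 to 4 can be sketched in a single process as follows. This is a minimal illustration, not Mahout's implementation: the function names `canopy_centers` and `run_canopy` are hypothetical, the thresholds `t1` and `t2` (with `t1 > t2`) follow the usual Canopy formulation, the Euclidean distance is just one possible distance measure, and assigning each point to its nearest final center is a simplification assumed here. Each inner list in `splits` stands in for the subset of points that one mapper would see.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_centers(points, t1, t2, distance=euclidean):
    """Run one Canopy pass over `points` and return the canopy centroids.

    Assumes t1 > t2. A point within t1 of a canopy center joins that canopy;
    a point within t2 of any center does not start a new canopy.
    """
    canopies = []  # each entry is [center, members]
    for p in points:
        strongly_bound = False
        for canopy in canopies:
            d = distance(p, canopy[0])
            if d < t1:
                canopy[1].append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            canopies.append([p, [p]])
    # Represent each canopy by the mean of its members.
    return [
        tuple(sum(coords) / len(members) for coords in zip(*members))
        for _, members in canopies
    ]


def run_canopy(splits, t1, t2):
    """Simulate steps 2 to 4: mappers, one reducer, then point assignment."""
    # Step 2: each "mapper" produces canopy centers for its own subset.
    mapper_centers = [c for split in splits for c in canopy_centers(split, t1, t2)]
    # Step 3: the "reducer" clusters the mapper centers into final centers.
    final_centers = canopy_centers(mapper_centers, t1, t2)
    # Step 4: assign every point to its nearest final center (simplification).
    assignments = {}
    for split in splits:
        for p in split:
            nearest = min(
                range(len(final_centers)),
                key=lambda i: euclidean(p, final_centers[i]),
            )
            assignments.setdefault(nearest, []).append(p)
    return final_centers, assignments


if __name__ == "__main__":
    splits = [
        [(0, 0), (0, 1), (10, 10)],
        [(1, 0), (10, 11), (11, 10)],
    ]
    centers, assignments = run_canopy(splits, t1=3.0, t2=1.5)
    print(centers)
    print(assignments)
```

With the two small subsets in the example, the points near the origin and the points near (10, 10) end up in two separate Canopies, even though no single mapper sees all of the data.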

The whole process can be understood as two phases: the Canopy generation phase and the Canopy clustering phase. The process is described at https://mahout.apache.org/users/clustering/canopy-clustering.html

The Canopy generation phase

During the map step, each mapper processes a subset of the total points and applies the chosen distance measure and thresholds to generate Canopies. In the mapper, each point that is found to be within...