Apache Mahout Essentials

By: Jayani Withanawasam

Additional clustering algorithms


The K-Means algorithm is simple and fast, but it has limitations in certain scenarios. This section describes other clustering algorithms that are available in Apache Mahout.

Canopy clustering

The accuracy of the K-Means algorithm depends on the number of clusters (K) and on the initial cluster points, which are generated randomly.

K-Means uses org.apache.mahout.clustering.kmeans.RandomSeedGenerator to choose the initial clusters randomly. With this approach, there is no guarantee on the time to converge, so a large dataset might take a long time. Premature convergence may also occur when the algorithm cannot escape a local optimum.
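To make the effect of random seeding concrete, here is a minimal plain-Java sketch of picking K input points as initial centroids. It does not use Mahout's RandomSeedGenerator; the class name RandomSeedSketch, the method randomCentroids, and the sample data are hypothetical and only illustrate that the starting centroids depend entirely on the random seed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; this is not the Mahout implementation.
class RandomSeedSketch {

    // Picks k distinct input points as initial centroids.
    // The choice is driven only by the seed, not by the shape of the data.
    static List<double[]> randomCentroids(List<double[]> points, int k, long seed) {
        if (k > points.size()) {
            throw new IllegalArgumentException("k exceeds the number of points");
        }
        List<double[]> shuffled = new ArrayList<>(points);
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, k));
    }

    public static void main(String[] args) {
        // Assumed sample data: two visually obvious groups.
        List<double[]> points = Arrays.asList(
                new double[] {0.0, 0.0}, new double[] {0.5, 0.0},
                new double[] {10.0, 10.0}, new double[] {10.5, 10.0});
        // Different seeds may place both starting centroids in the same group.
        for (long seed = 1; seed <= 3; seed++) {
            List<double[]> start = randomCentroids(points, 2, seed);
            System.out.println("seed " + seed + ": "
                    + Arrays.toString(start.get(0)) + " " + Arrays.toString(start.get(1)));
        }
    }
}
```

When both starting centroids fall in the same group, K-Means needs more iterations to separate them, or it settles in a poor local optimum.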

As a solution, canopy clustering is used as an initial step before K-Means to determine the initial centroids, instead of choosing them randomly. This speeds up the clustering process for the K-Means algorithm and provides more accurate...
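The idea of deriving initial centroids from canopies can be sketched in plain Java. This is a simplified, single-machine illustration and not Mahout's canopy implementation: the class CanopySketch, the method canopyCenters, and the sample data are hypothetical. It assumes the usual formulation of canopy clustering with a loose distance threshold T1 and a tight threshold T2 (T1 > T2), neither of which is named in the text above. Each canopy center is taken as the mean of the points within T1 of a picked point, and points within T2 are removed from further consideration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; this is not the Mahout implementation.
class CanopySketch {

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns one center per canopy, usable as initial K-Means centroids.
    // t1 is the loose threshold (canopy membership), t2 the tight threshold.
    static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("expected t1 > t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Take the next unprocessed point as the seed of a new canopy.
            double[] seed = remaining.remove(0);
            double[] sum = seed.clone();
            int count = 1;
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(seed, p);
                if (d < t1) {
                    // Within the loose threshold: the point belongs to this canopy.
                    for (int i = 0; i < sum.length; i++) {
                        sum[i] += p[i];
                    }
                    count++;
                }
                if (d < t2) {
                    // Within the tight threshold: the point cannot seed another canopy.
                    it.remove();
                }
            }
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= count;
            }
            centers.add(sum);
        }
        return centers;
    }

    public static void main(String[] args) {
        // Assumed sample data and thresholds, chosen only for illustration.
        List<double[]> points = Arrays.asList(
                new double[] {0.0, 0.0}, new double[] {0.5, 0.0}, new double[] {1.0, 0.0},
                new double[] {10.0, 10.0}, new double[] {10.5, 10.0});
        for (double[] c : canopyCenters(points, 3.0, 2.0)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

The number of canopies found also gives an estimate of K, and the centers are spread according to the data, in contrast to randomly chosen starting points. The result depends on the chosen thresholds and on the order in which points are processed.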