Apache Mahout Essentials

K-Means is a simple and fast clustering algorithm, but it has limitations in certain scenarios. This section therefore covers the other clustering algorithms available in Apache Mahout.

Canopy clustering

The accuracy of the K-Means algorithm depends on the number of clusters (K) and on the initial cluster points, which are generated randomly.

K-Means uses org.apache.mahout.clustering.kmeans.RandomSeedGenerator to choose the initial clusters randomly. With this approach, there is no guarantee on the time needed to converge, so a large dataset might take a long time. The algorithm may also converge prematurely because it cannot escape a local optimum.
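The effect of the starting points can be seen on a toy dataset. The following is a minimal sketch in plain Java, not Mahout code: the class name KMeansInitDemo, the method finalSse, and the one-dimensional data are all assumptions made for illustration. It runs the usual assign-and-update loop from two different sets of initial centroids and reports the final sum of squared errors (SSE) for each.

```java
public class KMeansInitDemo {

    /** Index of the centroid closest to p (the first one wins on ties). */
    static int nearest(double p, double[] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs a 1-D K-Means loop from the given initial centroids and
     * returns the sum of squared errors of the final clustering.
     */
    static double finalSse(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: accumulate each point into its nearest centroid.
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];
            for (double p : points) {
                int c = nearest(p, centroids);
                sum[c] += p;
                count[c]++;
            }
            // Update step: move each non-empty centroid to the mean of its points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (count[c] == 0) {
                    continue; // an empty cluster keeps its old centroid
                }
                double updated = sum[c] / count[c];
                if (updated != centroids[c]) {
                    centroids[c] = updated;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no centroid moved
            }
        }
        double sse = 0.0;
        for (double p : points) {
            double d = p - centroids[nearest(p, centroids)];
            sse += d * d;
        }
        return sse;
    }

    public static void main(String[] args) {
        // Three well-separated groups of two points each (illustrative data).
        double[] points = {0, 1, 10, 11, 20, 21};
        System.out.println("SSE, two seeds in one group: "
                + finalSse(points, new double[] {0, 1, 10}, 100));
        System.out.println("SSE, one seed per group:     "
                + finalSse(points, new double[] {0, 10, 20}, 100));
    }
}
```

With two seeds placed in the same group, the loop stops at a clustering that merges the other two groups, and no further iteration improves it. This is the local optimum described above.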

As a solution, canopy clustering can be run as a preliminary step before K-Means to determine the initial centroids, instead of choosing them randomly. This speeds up the K-Means clustering process and provides more accurate...
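To make the idea concrete, here is a minimal sketch of the general canopy technique in plain Java. It is not Mahout's implementation or API: the class CanopySketch, its method names, and the sample points are hypothetical. It assumes the common formulation with two distance thresholds, a loose one (t1) and a tight one (t2, with t1 > t2), neither of which is named in the text above. Each pass picks a remaining point as a canopy center, collects every remaining point within t1 into that canopy, and removes points within t2 from further consideration. The mean of each canopy is returned as a candidate initial centroid.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CanopySketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Builds canopies with a loose threshold t1 and a tight threshold t2
     * and returns the mean of each canopy as a candidate initial centroid.
     */
    static List<double[]> canopyCentroids(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centroids = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Take the next remaining point as the center of a new canopy.
            double[] center = remaining.remove(0);
            double[] sum = center.clone();
            int members = 1;
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(center, p);
                if (d < t1) {
                    // Loosely close: the point joins this canopy.
                    for (int i = 0; i < sum.length; i++) {
                        sum[i] += p[i];
                    }
                    members++;
                }
                if (d < t2) {
                    // Tightly close: the point cannot start or join another canopy.
                    it.remove();
                }
            }
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= members;
            }
            centroids.add(sum);
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Two visually obvious groups (illustrative data).
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0, 0});
        points.add(new double[] {1, 0});
        points.add(new double[] {0, 1});
        points.add(new double[] {10, 10});
        points.add(new double[] {11, 10});
        points.add(new double[] {10, 11});
        for (double[] c : canopyCentroids(points, 5.0, 3.0)) {
            System.out.println(c[0] + ", " + c[1]);
        }
    }
}
```

In this sketch the number of canopies also suggests a value for K, and the returned centroids could be handed to K-Means as its starting points. The result depends on the order of the input points and on the choice of t1 and t2, so the thresholds usually need tuning for a given dataset.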

By: Jayani Withanawasam
