Rapid - Apache Mahout Clustering designs

Learning K-means


Just as you cannot do engineering without math, you cannot start a discussion of clustering without K-means. It is one of the most basic and most useful clustering algorithms.

The algorithm is called K-means because it divides the data set into K different clusters. It therefore places a hard limit on the number of clusters formed: K is fixed before the algorithm runs. The K-means algorithm follows these steps:

  1. The algorithm starts with the selection of the number of clusters, K.

  2. It initializes K centroid points, one for each cluster.

  3. Each data point is assigned to its closest centroid.

  4. The location of each centroid is recomputed as the mean of the points assigned to its cluster.

  5. Steps 3 and 4 are repeated until convergence is reached.

Convergence is reached when the centroids do not move from one iteration to the next. In practice, we also provide a convergence threshold: once no centroid moves more than this distance between two iterations, we stop the algorithm.
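The steps and the convergence check can be sketched in a few lines of plain Python. This is only an illustrative sketch, not Mahout's implementation: the function name `kmeans`, the parameters `threshold` and `max_iterations`, the use of Euclidean distance, and the choice to leave the centroid of an empty cluster where it is are all assumptions made for this example.

```python
import math


def kmeans(points, initial_centroids, threshold=1e-4, max_iterations=100):
    """Illustrative K-means; K is the number of initial centroids supplied."""
    # Steps 1 and 2: K and the starting centroids are given by the caller.
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]

    for _ in range(max_iterations):
        # Step 3: assign every point to its closest centroid
        # (Euclidean distance is an assumption of this sketch).
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Step 4: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its old centroid (a choice made here).
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)

        # Step 5: stop once no centroid moved more than the threshold.
        largest_shift = max(math.dist(a, b)
                            for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if largest_shift <= threshold:
            break

    return centroids, clusters


# Two obvious groups of 2-D points and K = 2 starting centroids.
data = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids, final_clusters = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)  # [(1.0, 1.5), (8.5, 8.0)]
```

On this small data set, the first pass moves the centroids to the means of the two groups, and the second pass leaves them unchanged, so the loop stops after two iterations. The `max_iterations` cap is a common safeguard for data on which the centroids keep shifting by small amounts.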

The K-means...