Programming MapReduce with Scalding

By: Antonios Chalkiopoulos

K-Means using Mahout


K-Means is a clustering algorithm that aims to partition n observations into k clusters.

Clustering is a form of unsupervised learning that can be successfully applied to a wide variety of problems. The algorithm is computationally expensive, and the open source project Mahout provides distributed implementations of many machine learning algorithms, including K-Means.

Note

Find more detailed information on K-Means at http://mahout.apache.org/users/clustering/k-means-clustering.html.

The K-Means algorithm assigns each observation to its nearest cluster. Initially, the algorithm is told how many clusters (k) to identify. For each cluster, a random centroid is generated. Samples are partitioned into clusters by minimizing a distance measure between the samples and the centroids of the clusters. Over a number of iterations, the centroids and the assignment of samples to clusters are refined.
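The loop described above can be illustrated with a minimal single-machine sketch in plain Python. This is not Mahout's distributed implementation. The function names (`euclidean`, `nearest`, `mean`, `kmeans`), the fixed iteration cap, and the choice to seed the centroids with k randomly picked samples are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(sample, centroids):
    """Index of the centroid closest to the sample."""
    return min(range(len(centroids)),
               key=lambda i: euclidean(sample, centroids[i]))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def kmeans(samples, k, iterations=10, seed=0):
    """Return (centroids, assignments) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct samples chosen at random.
    centroids = rng.sample(samples, k)
    for _ in range(iterations):
        # Assignment step: attach every sample to its nearest centroid.
        assignments = [nearest(s, centroids) for s in samples]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [s for s, a in zip(samples, assignments) if a == i]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean(members) if members else centroids[i])
        if new_centroids == centroids:
            break  # Nothing moved, so the clustering has converged.
        centroids = new_centroids
    # Recompute assignments so they match the final centroids.
    assignments = [nearest(s, centroids) for s in samples]
    return centroids, assignments


if __name__ == "__main__":
    points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
              (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    centroids, assignments = kmeans(points, k=2)
    print(centroids)
    print(assignments)
```

Each iteration scans every sample against every centroid, which is the part a distributed implementation spreads across many machines.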

The distance between each sample and a centroid can be measured in a number of ways. Euclidean distance is usually used for samples in numerical...