Hadoop MapReduce v2 Cookbook - Second Edition: RAW


Running K-means with Mahout


K-means is a clustering algorithm. A clustering algorithm takes data points defined in an N-dimensional space and groups them into multiple clusters by considering the distances between those data points. A cluster is a set of data points such that the distances between points inside the cluster are much smaller than the distances from points inside the cluster to points outside it. More details about K-means clustering can be found in lecture 4 (http://www.youtube.com/watch?v=1ZDybXl212Q) of the Cluster Computing and MapReduce lecture series by Google.
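To make the idea of distance-based grouping concrete, the following is a minimal single-machine sketch of the standard K-means iteration (often called Lloyd's algorithm), written in plain Python. It is not Mahout's implementation and does not use MapReduce. The function name `kmeans` and its parameters `k`, `max_iters`, and `seed` are assumptions made for this illustration, as are the use of Euclidean distance and the choice of random data points as initial centroids.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Illustrative K-means (Lloyd's algorithm); not Mahout's implementation.

    points: list of equal-length numeric tuples.
    Returns (centroids, assignments), where assignments[i] is the index
    of the centroid closest to points[i].
    """
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None

    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid
        # (Euclidean distance; ties go to the lower centroid index).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return centroids, assignments
```

Calling `kmeans(points, 2)` on a list of 2-D tuples returns two centroids and one cluster index per point. The two alternating steps, assigning points to the nearest centroid and then recomputing each centroid, are what a distributed implementation has to repeat on every pass over the data.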

In this recipe, we will use a dataset that includes the Human Development Report (HDR) by country. The HDR describes different countries based on several human development measures. You can find the dataset at http://hdr.undp.org/en/statistics/data/. A sample of this dataset is available in the chapter7/resources/hdi-data.csv file in the sample source code repository. This recipe will use...