Spark Cookbook

By : Rishi Yadav

Clustering using k-means


Cluster analysis, or clustering, is the process of dividing data into multiple groups so that the data points within a group are similar to one another and dissimilar to the data points in other groups.

The following are a few examples where clustering is used:

  • Market segmentation: Dividing the target market into multiple segments so that the needs of each segment can be served better

  • Social network analysis: Finding a coherent group of people in the social network for ad targeting through a social networking site such as Facebook

  • Data center computing clusters: Putting a set of computers together to improve performance

  • Astronomical data analysis: Understanding astronomical data and events such as galaxy formations

  • Real estate: Identifying neighborhoods based on similar features

  • Text analysis: Dividing text documents, such as novels or essays, into genres

The k-means algorithm is best illustrated using imagery, so let's look at our sample figure again:

The first step in k-means is to randomly select two points called cluster...
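
To make the random starting step concrete, here is a minimal plain-Python sketch. It is not Spark's MLlib implementation and does not come from the book. The function names (`squared_distance`, `assign`, `recompute`, `kmeans`) and the sample points are hypothetical. The choice of `k=2` mirrors the two starting points mentioned above. Everything after the random selection follows the standard textbook k-means loop, which is assumed here because the passage is cut off.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """For each point, return the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        for p in points
    ]


def recompute(points, labels, centers):
    """Move each center to the mean of the points currently assigned to it."""
    new_centers = []
    for i, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            # Assumption: a center with no assigned points stays where it is.
            new_centers.append(old_center)
            continue
        dims = len(old_center)
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
        )
    return new_centers


def kmeans(points, k=2, max_iterations=20, seed=0):
    """Textbook k-means: random start, then alternate assign and recompute."""
    rng = random.Random(seed)
    # First step: randomly pick k of the input points as starting centers.
    centers = rng.sample(points, k)
    labels = assign(points, centers)
    for _ in range(max_iterations):
        centers = recompute(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
    return centers, labels


# Hypothetical sample data: two visibly separated groups of 2-D points.
sample_points = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]

centers, labels = kmeans(sample_points, k=2, seed=42)
print(centers)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random starting points, so only the grouping itself is meaningful, not the label values.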