Clojure for Data Science

By: Henry Garner


Clustering with k-means and Incanter


Finally, having tokenized, stemmed, and vectorized our input documents—and with a selection of distance measures to choose from—we're in a position to run clustering on our data. The first clustering algorithm we'll look at is called k-means clustering.

k-means is an iterative algorithm that proceeds as follows:

  1. Randomly pick k cluster centroids.

  2. Assign each of the data points to the cluster with the closest centroid.

  3. Adjust each cluster centroid to the mean of its assigned data points.

  4. Repeat steps 2 and 3 until the centroids stop changing (convergence) or the maximum number of iterations is reached.
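
The steps above can be sketched in plain Clojure. This is only an illustration and not the Incanter-based implementation the chapter builds: it assumes each data point is a vector of numbers, uses Euclidean distance (any of the other distance measures could be swapped in), and treats convergence as "the centroids did not change at all between two iterations". The function names `euclidean-distance`, `centroid`, `closest-index`, `update-centroids`, `run-k-means`, and `k-means` are hypothetical names chosen for this sketch.

```clojure
(defn euclidean-distance
  "Euclidean distance between two equal-length numeric vectors."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn centroid
  "Component-wise mean of a non-empty collection of points."
  [points]
  (let [n (double (count points))]
    (mapv #(/ % n) (apply map + points))))

(defn closest-index
  "Index of the centroid nearest to point (step 2)."
  [centroids point]
  (apply min-key
         #(euclidean-distance point (nth centroids %))
         (range (count centroids))))

(defn update-centroids
  "Assigns every point to its closest centroid, then moves each centroid
  to the mean of its assigned points (step 3). A centroid with no
  assigned points is left where it is, an assumption of this sketch."
  [centroids points]
  (let [groups (group-by #(closest-index centroids %) points)]
    (vec (map-indexed (fn [i c]
                        (if-let [members (get groups i)]
                          (centroid members)
                          c))
                      centroids))))

(defn run-k-means
  "Iterates from the given initial centroids until they stop changing or
  max-iterations is reached (step 4). Always performs at least one update."
  [points initial-centroids max-iterations]
  (loop [centroids (vec initial-centroids)
         i 0]
    (let [updated (update-centroids centroids points)]
      (if (or (= updated centroids) (>= (inc i) max-iterations))
        {:centroids   updated
         :assignments (mapv #(closest-index updated %) points)}
        (recur updated (inc i))))))

(defn k-means
  "Step 1: picks k of the data points at random as the initial centroids."
  [points k max-iterations]
  (run-k-means points (take k (shuffle points)) max-iterations))
```

Passing the starting centroids explicitly makes the result repeatable, whereas `k-means` starts from a random selection and so can return the clusters in a different order from one run to the next:

```clojure
(def points [[0.0 0.0] [0.0 1.0] [1.0 0.0]
             [10.0 10.0] [10.0 11.0] [11.0 10.0]])

(:assignments (run-k-means points [[0.0 0.0] [10.0 10.0]] 10))
;; => [0 0 0 1 1 1]
```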

The process is visualized in the following diagram for k=3 clusters:

In the preceding figure, we can see that the initial cluster centroids at iteration 1 don't represent the structure of the data well. Although the points are clearly arranged in three groups, the initial centroids (represented by crosses) are all distributed around the top area of the graph. The points are colored according to their closest...