Clustering with k-means and Incanter
Finally, having tokenized, stemmed, and vectorized our input documents—and with a selection of distance measures to choose from—we're in a position to run clustering on our data. The first clustering algorithm we'll look at is called k-means clustering.
k-means is an iterative algorithm that proceeds as follows:
1. Randomly pick k cluster centroids.
2. Assign each of the data points to the cluster with the closest centroid.
3. Adjust each cluster centroid to the mean of its assigned data points.
4. Repeat steps 2 and 3 until convergence or until the maximum number of iterations is reached.
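The steps above can be sketched in a few lines of plain Clojure. This is only an illustrative sketch, not Incanter's API: the function names (squared-distance, centroid, closest-index, update-centroids, and k-means) are hypothetical, points are assumed to be equal-length vectors of numbers, squared Euclidean distance stands in for whichever distance measure is chosen, and leaving an empty cluster's centroid where it is counts as one assumption among several reasonable options.

```clojure
(defn squared-distance
  "Squared Euclidean distance between two equal-length numeric vectors."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn centroid
  "Component-wise mean of a non-empty collection of points."
  [points]
  (let [n (count points)]
    (mapv #(/ % (double n)) (apply map + points))))

(defn closest-index
  "Index of the centroid nearest to the given point."
  [centroids point]
  (first (apply min-key second
                (map-indexed (fn [i c] [i (squared-distance c point)])
                             centroids))))

(defn update-centroids
  "Assigns each point to its closest centroid, then moves each centroid
  to the mean of its assigned points. A centroid with no assigned points
  is left unchanged (an assumption of this sketch)."
  [centroids points]
  (let [groups (group-by #(closest-index centroids %) points)]
    (vec (map-indexed (fn [i c]
                        (if-let [members (get groups i)]
                          (centroid members)
                          c))
                      centroids))))

(defn k-means
  "Picks k random points as the initial centroids, then repeats the
  assign-and-adjust step until the centroids stop changing or
  max-iterations is reached."
  [points k max-iterations]
  (loop [centroids (vec (take k (shuffle points)))
         iteration 0]
    (let [updated (update-centroids centroids points)]
      (if (or (= updated centroids)
              (>= (inc iteration) max-iterations))
        updated
        (recur updated (inc iteration))))))

;; One assign-and-adjust step with fixed starting centroids:
(update-centroids [[0 0] [10 10]]
                  [[0 0] [0 2] [10 10] [10 12]])
;; => [[0.0 1.0] [10.0 11.0]]

;; A full run starts from random centroids, so results can vary:
(k-means [[0 0] [0 2] [10 10] [10 12]] 2 100)
```

Because the initial centroids are chosen at random, two runs over the same data can settle on different clusterings, which is why the starting positions matter.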
The process is visualized in the following diagram for k=3 clusters:
In the preceding figure, we can see that the initial cluster centroids at iteration 1 don't represent the structure of the data well. Although the points are clearly arranged in three groups, the initial centroids (represented by crosses) are all distributed around the top area of the graph. The points are colored according to their closest...