Running k-means clustering with Mahout
Now that we have a sequence file of vectors suitable for consumption by Mahout, it's time to actually run k-means clustering on the whole dataset. Unlike our local Incanter version, Mahout won't have any trouble dealing with the full Reuters corpus.
As with the SequenceFilesFromDirectory class, we've created a wrapper around another of Mahout's command-line programs, KMeansDriver. The Clojure variable names make it easier to see what each command-line argument is for.
(defn run-kmeans [in-path clusters-path out-path k]
  (let [distance-measure  "org.apache.mahout.common.distance.CosineDistanceMeasure"
        max-iterations    100
        convergence-delta 0.001]
    (KMeansDriver/main
     (->> (vector "-i"  in-path
                  "-c"  clusters-path
                  "-o"  out-path
                  "-dm" distance-measure
                  "-x"  max-iterations
                  "-k"  k
                  "-cd" convergence-delta
                  ...