Clojure for Data Science

By: Henry Garner

Running k-means clustering with Mahout


Now that we have a sequence file of vectors suitable for consumption by Mahout, it's time to actually run k-means clustering on the whole dataset. Unlike our local Incanter version, Mahout won't have any trouble dealing with the full Reuters corpus.

As with the SequenceFilesFromDirectory class, we've created a wrapper around another of Mahout's command-line programs, KMeansDriver. The Clojure variable names make it easier to see what each command-line argument is for.
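A Java `main` method expects a `String[]`, so a wrapper of this kind has to turn its Clojure values (numbers included) into an array of strings before making the call. The following is a minimal sketch of that conversion only. The helper name `->args` is hypothetical and is not part of Mahout or of the book's code.

```clojure
;; Hypothetical helper, for illustration only: converts any mix of
;; strings and numbers into the String[] that a Java main method expects.
(defn ->args
  [& args]
  (into-array String (map str args)))

;; Numeric options such as an iteration limit are stringified first.
(def example-args (->args "-x" 100 "-k" 5))

;; The result is a Java String array, suitable for (SomeDriver/main example-args),
;; where SomeDriver stands for any class with a public static main method.
(vec example-args)
```

Keeping the conversion in one place means the option values can stay as ordinary Clojure numbers and strings until the moment of the call.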

(defn run-kmeans [in-path clusters-path out-path k]
  (let [distance-measure  "org.apache.mahout.common.distance.CosineDistanceMeasure"
        max-iterations    100
        convergence-delta 0.001]
    (KMeansDriver/main
     (->> (vector "-i"  in-path
                  "-c"  clusters-path
                  "-o"  out-path
                  "-dm" distance-measure
                  "-x"  max-iterations
                  "-k"  k
                  "-cd" convergence-delta
            ...