Clojure for Data Science

By: Henry Garner

Creating term frequency vectors


To calculate the Euclidean distance between documents, let's first create a term frequency vector from our dictionary and each document. Because every term is assigned a fixed ID by the dictionary, a given term's frequency will occupy the same index in every document's vector, which makes the frequencies easy to compare between documents.

(defn term-id [dict term]
  (get-in @dict [:terms term]))

(defn term-frequencies [dict terms]
  (->> (map #(term-id dict %) terms)
       (remove nil?)
       (frequencies)))

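(defn map->vector [dictionary id-counts]
  ;; Start from a vector of zeros with one slot per term in the dictionary
  ;; (repeat replaces the deprecated replicate), then write each count
  ;; into the slot given by its term ID.
  (let [zeros (vec (repeat (:count @dictionary) 0))]
    (-> (reduce #(apply assoc! %1 %2) (transient zeros) id-counts)
        (persistent!))))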

(defn tf-vector [dict document]
  (map->vector dict (term-frequencies dict document)))

The term-frequencies function creates a map of term ID to frequency count for each term in the document. The map->vector function simply takes this map and associates the frequency count at the index of the vector given by the term ID. Since there may be many terms, and...
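As a rough illustration of how these functions fit together, the sketch below restates them (with `repeat` building the zero vector) and runs them on a tiny, hand-built dictionary. The dictionary contents, the two sample documents, and the `euclidean-distance` helper are assumptions made for this example only; the dictionary's shape (an atom holding `:count` and `:terms`) is inferred from how `term-id` and `map->vector` read it, not taken from the book's own dictionary-building code.

```clojure
;; The section's functions, restated so the sketch runs on its own.
(defn term-id [dict term]
  (get-in @dict [:terms term]))

(defn term-frequencies [dict terms]
  (->> (map #(term-id dict %) terms)
       (remove nil?)
       (frequencies)))

(defn map->vector [dictionary id-counts]
  (let [zeros (vec (repeat (:count @dictionary) 0))]
    (-> (reduce #(apply assoc! %1 %2) (transient zeros) id-counts)
        (persistent!))))

(defn tf-vector [dict document]
  (map->vector dict (term-frequencies dict document)))

;; Hypothetical dictionary: four known terms with IDs 0 to 3.
(def example-dict
  (atom {:count 4
         :terms {"the" 0 "cat" 1 "sat" 2 "mat" 3}}))

;; Hypothetical tokenized documents. "dog" is not in the dictionary,
;; so term-id returns nil for it and term-frequencies drops it.
(def doc-a ["the" "cat" "sat" "the" "dog"])
(def doc-b ["the" "mat"])

(def vec-a (tf-vector example-dict doc-a)) ; => [2 1 1 0]
(def vec-b (tf-vector example-dict doc-b)) ; => [1 0 0 1]

;; Hypothetical helper: Euclidean distance between two equal-length vectors.
(defn euclidean-distance [a b]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) a b))))

(euclidean-distance vec-a vec-b) ; => 2.0
```

Because both vectors are indexed by the same term IDs, the element-wise subtraction inside `euclidean-distance` always compares counts for the same term.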