Clojure for Data Science

By: Henry Garner


Better clustering with TF-IDF


Term Frequency-Inverse Document Frequency (TF-IDF) is a general approach to weighting terms within a document vector so that terms that are common across the whole dataset are weighted less heavily than terms that are rarer. This captures the intuition, confirmed by what we observed earlier, that words such as "said" are not a strong basis for building clusters.
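To make the weighting idea concrete, here is a minimal sketch using only clojure.core and clojure.string. It assumes one common variant of the formula, tf × ln(N / df), where tf is the count of a term in a document, N is the number of documents, and df is the number of documents containing the term. The names tokenize, document-frequencies, tf-idf, and example-docs are illustrative and are not taken from the book's code, which may use a different formulation.

```clojure
(require '[clojure.string :as str])

(defn tokenize
  "Lower-cases a string and splits it into word tokens (illustrative helper)."
  [s]
  (re-seq #"\w+" (str/lower-case s)))

(defn document-frequencies
  "Maps each term to the number of documents that contain it."
  [tokenized-docs]
  (frequencies (mapcat distinct tokenized-docs)))

(defn tf-idf
  "Returns, for each document, a map of term -> tf * ln(N / df).
  This is one common TF-IDF variant, assumed here for illustration."
  [docs]
  (let [tokenized (map tokenize docs)
        n         (count tokenized)
        df        (document-frequencies tokenized)]
    (for [tokens tokenized]
      (into {}
            (for [[term tf] (frequencies tokens)]
              [term (* tf (Math/log (/ (double n) (df term))))])))))

;; A tiny made-up corpus: "the" appears in every document,
;; "said" in two of the three, and "cat" in only one.
(def example-docs
  ["the cat said hello"
   "the dog said goodbye"
   "the bird sang"])

(tf-idf example-docs)
```

In this toy corpus, "the" occurs in every document, so its weight is zero. "said" occurs in two of the three documents and receives a small weight, while "cat" occurs in only one and is weighted most heavily of the three. This is the behavior we want for clustering: ubiquitous words contribute little to the distance between documents.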

Zipf's law

Zipf's law states that the frequency of any word is inversely proportional to its rank in the frequency table: the word at rank r occurs roughly 1/r times as often as the most frequent word. Thus, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. Let's see if this applies across our Reuters corpus:

(defn ex-6-13 []
  (let [documents (fs/glob "data/reuters-text/*.txt")
        doc-count 1000
        top-terms 25
        term-frequencies (->> (map slurp documents)
                              (remove too-short?)
                              (take...