Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Scaling document frequencies with TF-IDF


In the last few recipes, we've seen how to generate term frequencies and scale them by the size of the document so that the frequencies from two different documents can be compared.

Term frequencies have another problem: they don't tell you how important a term is relative to all of the documents in the corpus.

To address this, we will use term frequency-inverse document frequency (TF-IDF). This metric scales a term's frequency in a document down by how common the term is across the entire corpus (its document frequency), so that terms appearing in nearly every document count for less.
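To make the idea concrete before we build the real pieces, here is a minimal sketch of one common TF-IDF variant: the raw term count multiplied by the natural log of the number of documents divided by the number of documents containing the term. The names example-docs and tf-idf are hypothetical and chosen only for this illustration, and the sketch assumes that each document is already a sequence of tokens and that the document being scored is part of the corpus. The functions assembled in this recipe may differ in weighting and smoothing.

```clojure
;; Illustrative sketch only; these names are assumptions, not the recipe's code.

(def example-docs
  "A tiny, made-up corpus. Each document is a vector of tokens."
  [["the" "cat" "sat"]
   ["the" "dog" "sat"]
   ["the" "bird" "flew"]])

(defn tf-idf
  "Returns a map from each term in doc to tf * log(N / df), where tf is the
  term's raw count in doc, N is the number of documents in docs, and df is
  the number of documents in docs that contain the term. Assumes doc is one
  of docs, so df is always at least 1."
  [docs doc]
  (let [n  (count docs)
        ;; Count each term at most once per document to get document frequencies.
        df (frequencies (mapcat distinct docs))]
    (into {}
          (for [[term tf] (frequencies doc)]
            [term (* tf (Math/log (/ (double n) (df term))))]))))

;; Score the first document against the whole corpus.
(tf-idf example-docs (first example-docs))
```

In this toy corpus, "the" appears in all three documents, so its score is zero. "cat" appears in only one document, so it scores highest, and "sat" falls in between.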

In this recipe, we'll assemble the parts needed to implement TF-IDF.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We'll also use two functions that we created earlier in this chapter. From the Tokenizing text recipe, we'll use tokenize...