Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Scaling document frequencies by document size


While raw token frequencies can be useful, they have one major problem: comparing frequencies across documents is difficult when the documents are not the same size. If the word customer appears 23 times in a 500-word document and 40 times in a 1,000-word document, which one is more focused on that word? From the raw counts alone, it's difficult to say.

To work around this, it's common to scale the token frequencies for each document by the size of the document. That's what we'll do in this recipe.
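As a rough sketch of the idea, scaling amounts to dividing each token's count by the total number of tokens in the document. The function name relative-freqs and the :other placeholder key below are hypothetical and exist only for illustration; the sketch assumes the frequencies are held in a map from token to count, and it is not necessarily the function this recipe goes on to define.

```clojure
(defn relative-freqs
  "Divides each count in freqs (a map of token -> count) by the
  total number of tokens. Hypothetical helper for illustration."
  [freqs]
  (let [total (reduce + 0 (vals freqs))]
    (into {}
          (map (fn [[token n]] [token (double (/ n total))])
               freqs))))

;; The two documents from the customer example, with every other
;; word lumped under a placeholder key so the totals are 500 and 1,000.
(def short-doc (relative-freqs {"customer" 23, :other 477}))
(def long-doc  (relative-freqs {"customer" 40, :other 960}))

(get short-doc "customer") ; => 0.046
(get long-doc "customer")  ; => 0.04
```

Once scaled, the comparison becomes straightforward: customer makes up 4.6 percent of the shorter document but only 4 percent of the longer one, so the shorter document is slightly more focused on that word.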

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We'll use the token frequencies that we computed in the Getting document frequencies recipe, and we'll keep them bound to the name token-freqs.
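For readers who don't have the earlier recipe at hand, here is a minimal sketch of the kind of value this recipe expects. It assumes the frequencies are a map from token to count, such as the one clojure.core/frequencies returns; the names sample-tokens and sample-token-freqs and the token list itself are made up for illustration and are not the data used in the book.

```clojure
;; Hypothetical stand-in for a tokenized document from the earlier recipes.
(def sample-tokens ["the" "customer" "called" "the" "store"])

;; frequencies returns a map from each distinct item to the number
;; of times it occurs in the collection.
(def sample-token-freqs (frequencies sample-tokens))

(get sample-token-freqs "the")      ; => 2
(get sample-token-freqs "customer") ; => 1
```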

How to do it…

The function...