Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Getting document frequencies


One common and useful metric when working with text corpora is the count of each token in the documents. This can be computed quite easily by leveraging standard Clojure functions.

Let's see how.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of that, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

We'll also use tokenize, get-sentences, normalize, load-stopwords, and is-stopword from the earlier recipes.

We'll also use the tokens value that we defined in the Focusing on content words with stoplists recipe. Here it is again:

(def tokens
  (map #(remove is-stopword (normalize (tokenize %)))
       (get-sentences
         "I never saw a Purple Cow.
         I never hope to see one.
         But I can tell you, anyhow.
         I'd rather see than be one.")))
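Given token sequences like these, the counting itself can be done with Clojure's built-in frequencies function. The following is a minimal, self-contained sketch: it uses plain regex tokenization and lowercasing in place of the OpenNLP-based tokenize and normalize helpers from the earlier recipes, and it skips the stoplist step, so the names simple-tokenize and token-freqs are illustrative only.

```clojure
(require '[clojure.string :as str])

;; A stand-in for the OpenNLP tokenizer from the earlier recipes:
;; split on word characters and lowercase each token.
(defn simple-tokenize [text]
  (map str/lower-case (re-seq #"[\w']+" text)))

(def tokens
  (simple-tokenize "I never saw a Purple Cow. I never hope to see one."))

;; frequencies returns a map from each distinct token to its count.
(def token-freqs (frequencies tokens))

(get token-freqs "never")  ; => 2
(get token-freqs "cow")    ; => 1
```

Because frequencies works on any sequence, the same call applies unchanged to the stopword-filtered, normalized token sequences produced by the earlier recipes.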

How to do it…

Of course, the standard...