Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Mapping documents to a sparse vector space representation


Many text algorithms work with vector space representations of documents. This means that each document is normalized into a vector, and each token type is assigned one position that is shared across all of the documents' vectors. For instance, the word "text" might be assigned position 42, so index 42 in every document vector holds the frequency (or some other value) of the word "text" in that document.
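To make the mapping concrete, here is a minimal sketch in plain Clojure. It assumes the documents have already been tokenized into sequences of strings, and the names `build-index` and `->dense-vector` are hypothetical helpers introduced only for illustration; they are not part of the recipe.

```clojure
(defn build-index
  "Assigns each distinct token a position, in order of first appearance."
  [docs]
  (zipmap (distinct (apply concat docs)) (range)))

(defn ->dense-vector
  "Returns a vector with one slot per indexed token, holding that
  token's frequency in doc (0 when the token does not occur)."
  [index doc]
  (reduce-kv (fn [v token n] (assoc v (index token) n))
             (vec (repeat (count index) 0))
             (frequencies doc)))

;; Two tiny, already-tokenized documents (made-up sample data).
(def docs [["the" "text" "is" "text"]
           ["the" "vector"]])

(def token-index (build-index docs))
;; "the" -> 0, "text" -> 1, "is" -> 2, "vector" -> 3

(def dense-vectors (mapv #(->dense-vector token-index %) docs))
;; => [[1 2 1 0] [1 0 0 1]]
```

Every document vector has the same length, and a given index always refers to the same token, which is what lets vectors from different documents be compared position by position.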

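However, any single document contains only a small fraction of the words in the whole collection, so most positions in its vector are zero. This makes the vectors sparse, and we can store them in more efficient formats that record only the non-zero entries.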
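As a rough illustration of why a sparse format helps, the sketch below stores only the non-zero positions. It uses a plain Clojure sorted map as a stand-in for a dedicated sparse vector class; that choice, the sample index, and the names `->sparse-vector` and `sparse-get` are assumptions made for this example and are not Colt's API.

```clojure
(defn ->sparse-vector
  "Maps token positions to frequencies, storing only tokens that
  actually occur in doc."
  [index doc]
  (into (sorted-map)
        (for [[token n] (frequencies doc)]
          [(index token) n])))

(defn sparse-get
  "Looks up a position, treating missing entries as zero."
  [sparse-vector position]
  (get sparse-vector position 0))

;; A hypothetical token index; a real one would hold thousands of tokens.
(def sample-index {"the" 0, "text" 1, "is" 2, "vector" 3})

(def sparse-doc (->sparse-vector sample-index ["the" "vector"]))
;; => {0 1, 3 1}

(sparse-get sparse-doc 3) ;; => 1
(sparse-get sparse-doc 1) ;; => 0
```

The storage cost grows with the number of distinct tokens in the document rather than with the size of the whole vocabulary.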

The Colt library (http://acs.lbl.gov/ACSSoftware/colt/) contains implementations of sparse vectors. For this recipe, we'll see how to read a collection of documents into these.

Getting ready…

For this recipe, we'll need the following in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]
                 [colt/colt "1.2.0"]...