Book Image

Clojure Data Analysis Cookbook - Second Edition

By : Eric Richard Rochester
Book Image

Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Overview of this book

Table of Contents (19 chapters)
Clojure Data Analysis Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Tokenizing text


Before we can do any real analysis of a text or a corpus of texts, we have to identify the words in the text. This process is called tokenization. The output of this process is a list of words, and possibly includes punctuation in a text. This is different from tokenizing formal languages such as programming languages: it is meant to work with natural languages and its results are less structured.

It's easy to write your own tokenizer, but there are a lot of edge and corner cases to take into consideration and account for. It's also easy to include a natural language processing (NLP) library that includes one or more tokenizers. In this recipe, we'll use the OpenNLP (http://opennlp.apache.org/) and its Clojure wrapper (https://clojars.org/clojure-opennlp).

Getting ready

We'll need to include the clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3...