Book Image

Clojure Data Analysis Cookbook - Second Edition

By : Eric Richard Rochester
Book Image

Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Overview of this book

Table of Contents (19 chapters)
Clojure Data Analysis Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Finding people, places, and things with Named Entity Recognition


One thing that's fairly easy to pull out of documents is named items. This includes things such as people's names, organizations, locations, and dates. These algorithms are called Named Entity Recognition (NER), and while they are not perfect, they're generally pretty good. Error rates under 0.1 are normal.

The OpenNLP library has classes to perform NER, and depending on what you train them with, they will identify people, locations, dates, or a number of other things. The clojure-opennlp library also exposes these classes in a good, Clojure-friendly way.

Getting ready

We'll continue building on the previous recipes in this chapter. Because of this, we'll use the same project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3.2"]])

From the Tokenizing text recipe, we'll use tokenize, and from the Focusing on content words with stoplists...