Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Performing topic modeling with MALLET


Earlier in this chapter, we looked at several ways to programmatically discover what's in a document. We identified people, places, dates, and other entities, and we split text into sentences.

Another, more sophisticated way to discover what's in a document is topic modeling. Topic modeling attempts to identify a set of topics that run through a document collection. Each topic is a cluster of words that tend to be used together throughout the corpus. Every document is then treated as a mixture of several topics, with each topic present to a greater or lesser degree. We'll take a look at this in more detail in the explanation for this recipe.
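To make the idea of a document as a mixture of topics concrete, here is a minimal sketch using plain Clojure data. It does not use MALLET, and the topic names, words, and weights are all invented for illustration. Ratios are used instead of floating-point numbers so that the arithmetic stays exact.

```clojure
;; Toy illustration only: the topics, words, and weights are made up,
;; and nothing here calls MALLET.

(def topics
  "Each topic is a distribution over words (the weights sum to 1)."
  {:sports   {"game" 1/2, "team" 2/5,   "vote" 1/10}
   :politics {"vote" 3/5, "party" 3/10, "team" 1/10}})

(def doc-mixture
  "A document is a mixture of topics (the proportions sum to 1)."
  {:sports 3/4, :politics 1/4})

(defn word-probability
  "Returns the probability of seeing `word` in a document with the given
  topic mixture: the sum over topics of
  (topic proportion) * (weight of the word in that topic)."
  [topics mixture word]
  (reduce + (for [[topic proportion] mixture]
              (* proportion (get-in topics [topic word] 0)))))

(word-probability topics doc-mixture "team")
;; => 13/40
```

Here "team" belongs mostly to the hypothetical sports topic but also appears in politics, so both topics contribute to its overall probability in this document. A topic modeling tool works in the opposite direction: given only the documents, it estimates the topics and the per-document mixtures.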

To perform topic modeling, we'll use MALLET (http://mallet.cs.umass.edu/). MALLET is a Java library and command-line utility that implements topic modeling, along with document classification and several other text-analysis algorithms.

Getting ready

For this recipe, we'll need these lines...