Natural Language Processing with Java and LingPipe Cookbook

Latent Dirichlet allocation (LDA) for multitopic clustering


Latent Dirichlet allocation (LDA) is a statistical technique for document clustering based on the tokens or words that are present in the document. Clustering, like classification, generally assumes that categories are mutually exclusive. The neat thing about LDA is that it allows a document to belong to multiple topics at the same time, instead of just one category. This better reflects the fact that a tweet can be about Disney and Wally World, among other topics.
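
To make the idea of multiple topics per document concrete, here is a minimal sketch in plain Java (it does not use LingPipe). It assumes we already know how many tokens of one document have been assigned to each topic, and turns those counts into topic proportions using a symmetric smoothing value `alpha`. The class name `TopicMixture`, the method `proportions`, and the example counts are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

class TopicMixture {

    // Smoothed topic proportions for a single document:
    //   theta[k] = (topicCounts[k] + alpha) / (totalTokens + numTopics * alpha)
    // alpha is an assumed symmetric smoothing (prior) value.
    static double[] proportions(int[] topicCounts, double alpha) {
        int total = 0;
        for (int c : topicCounts) {
            total += c;
        }
        double denom = total + alpha * topicCounts.length;
        double[] theta = new double[topicCounts.length];
        for (int k = 0; k < theta.length; k++) {
            theta[k] = (topicCounts[k] + alpha) / denom;
        }
        return theta;
    }

    public static void main(String[] args) {
        // Hypothetical tweet with 10 tokens: 6 assigned to topic 0,
        // 4 assigned to topic 1, and none assigned to topic 2.
        double[] theta = proportions(new int[] {6, 4, 0}, 0.1);
        System.out.println(Arrays.toString(theta));
    }
}
```

The document is not forced into a single category: it carries weight on both topic 0 and topic 1, and the proportions always sum to 1.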

The other neat thing about LDA, like many clustering techniques, is that it is unsupervised, which means that no labeled training data is required! The closest thing to training data is that the number of topics must be specified beforehand.
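
The following is a rough sketch of how topics can be fit without any labels, using a small collapsed Gibbs sampler written in plain Java. It is not LingPipe's implementation or API; the class `TinyLda`, its fields, the toy corpus, and the values chosen for `alpha`, `beta`, and the seed are all assumptions made for illustration. Note that the only thing supplied besides the documents and smoothing values is the number of topics.

```java
import java.util.Arrays;
import java.util.Random;

class TinyLda {
    final int numTopics, numWords;
    final double alpha, beta;   // assumed symmetric smoothing values
    final int[][] docs;         // docs[d][i] = word id of token i in document d
    final int[][] z;            // z[d][i] = topic currently assigned to that token
    final int[][] docTopic;     // docTopic[d][k] = tokens of doc d assigned to topic k
    final int[][] topicWord;    // topicWord[k][w] = times word w is assigned to topic k
    final int[] topicTotal;     // topicTotal[k] = all tokens assigned to topic k
    final Random rng;

    TinyLda(int[][] docs, int numWords, int numTopics,
            double alpha, double beta, long seed) {
        this.docs = docs;
        this.numWords = numWords;
        this.numTopics = numTopics;
        this.alpha = alpha;
        this.beta = beta;
        this.rng = new Random(seed);
        z = new int[docs.length][];
        docTopic = new int[docs.length][numTopics];
        topicWord = new int[numTopics][numWords];
        topicTotal = new int[numTopics];
        // Start from random topic assignments; no labels are involved.
        for (int d = 0; d < docs.length; d++) {
            z[d] = new int[docs[d].length];
            for (int i = 0; i < docs[d].length; i++) {
                z[d][i] = rng.nextInt(numTopics);
                updateCounts(d, i, +1);
            }
        }
    }

    private void updateCounts(int d, int i, int delta) {
        int k = z[d][i];
        int w = docs[d][i];
        docTopic[d][k] += delta;
        topicWord[k][w] += delta;
        topicTotal[k] += delta;
    }

    // Resample the topic of every token, epochs times over the corpus.
    void run(int epochs) {
        double[] p = new double[numTopics];
        for (int e = 0; e < epochs; e++) {
            for (int d = 0; d < docs.length; d++) {
                for (int i = 0; i < docs[d].length; i++) {
                    updateCounts(d, i, -1);   // remove the token's current assignment
                    int w = docs[d][i];
                    double sum = 0.0;
                    for (int k = 0; k < numTopics; k++) {
                        p[k] = (docTopic[d][k] + alpha)
                             * (topicWord[k][w] + beta)
                             / (topicTotal[k] + beta * numWords);
                        sum += p[k];
                    }
                    // Draw a topic with probability proportional to p[k].
                    double u = rng.nextDouble() * sum;
                    int k = 0;
                    double cumulative = p[0];
                    while (u >= cumulative && k < numTopics - 1) {
                        k++;
                        cumulative += p[k];
                    }
                    z[d][i] = k;
                    updateCounts(d, i, +1);
                }
            }
        }
    }

    // Smoothed topic proportions for document d.
    double[] theta(int d) {
        double[] t = new double[numTopics];
        double denom = docs[d].length + alpha * numTopics;
        for (int k = 0; k < numTopics; k++) {
            t[k] = (docTopic[d][k] + alpha) / denom;
        }
        return t;
    }

    public static void main(String[] args) {
        // Toy corpus over a 4-word vocabulary; word ids are hypothetical.
        int[][] docs = {{0, 1, 0, 1}, {2, 3, 2, 3}, {0, 1, 2, 3}};
        TinyLda lda = new TinyLda(docs, 4, 2, 0.1, 0.01, 42L);
        lda.run(200);
        for (int d = 0; d < docs.length; d++) {
            System.out.println("doc " + d + ": " + Arrays.toString(lda.theta(d)));
        }
    }
}
```

Because the sampler is random, the topic numbering and the exact proportions depend on the seed and the number of epochs, which is part of why LDA can take some tuning.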

LDA can be a great way to explore a dataset where you don't know what you don't know. It can also be difficult to tune, but generally, it does something interesting. Let's get a system working.

For each document, LDA assigns a probability...