Natural Language Processing with Java

By: Richard M. Reese


Summary


In this chapter, we illustrated several approaches to tokenizing and normalizing text. We started with simple tokenization techniques based on core Java classes, such as the String class's split method and the StringTokenizer class. These approaches can be useful when we decide to forgo the use of NLP API classes.
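
As a reminder of what the core Java approaches look like, the following is a minimal sketch rather than the chapter's own listing. The class name SimpleTokenizers, its method names, and the sample sentence are assumptions made for illustration; only String.split and StringTokenizer come from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, not taken from the book.
class SimpleTokenizers {

    // Split on runs of whitespace using String.split with a regular expression.
    // Punctuation stays attached to the neighboring word.
    static List<String> splitTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Tokenize with StringTokenizer; every character in 'delimiters'
    // is treated as a separator and is not returned as a token.
    static List<String> stringTokenizerTokens(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        System.out.println(splitTokens(text));
        System.out.println(stringTokenizerTokens(text, " ,."));
    }
}
```

With whitespace splitting, the comma and period remain part of their tokens, whereas listing them as StringTokenizer delimiters drops them. Neither approach handles contractions or abbreviations specially, which is one reason to consider the NLP APIs.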

We then demonstrated how tokenization can be performed using the OpenNLP, Stanford, and LingPipe APIs. These APIs differ in how they tokenize text and in the options they support, and we provided a brief comparison of their outputs.

We also discussed normalization, which can involve converting characters to lowercase, expanding abbreviations, removing stopwords, stemming, and lemmatization. We illustrated how these techniques can be applied using both core Java classes and the NLP APIs.
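
Two of these normalization steps, lowercasing and stopword removal, can be sketched with core Java alone. The class name SimpleNormalizer and the very short stopword list below are assumptions for illustration, not the chapter's code. Stemming and lemmatization are left out because they generally need a dedicated algorithm or an NLP library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not taken from the book.
class SimpleNormalizer {

    // A tiny illustrative stopword list; practical lists are much longer.
    static final Set<String> STOPWORDS =
            Set.of("a", "an", "and", "the", "of", "to", "is");

    // Lowercase each token, then drop it if it is a stopword.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOPWORDS.contains(lower)) {
                result.add(lower);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens =
                List.of("The", "Quick", "brown", "Fox", "and", "the", "Dog");
        System.out.println(normalize(tokens));
    }
}
```

Lowercasing before the stopword check means that "The" and "the" are both removed. Locale.ROOT is used so that the result does not depend on the default locale of the machine running the code.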

In the next chapter, we will investigate the issues involved in determining where sentences end, using various NLP APIs.