Book Image

Natural Language Processing with Java Cookbook

By : Richard M. Reese
Book Image

Natural Language Processing with Java Cookbook

By: Richard M. Reese

Overview of this book

Natural Language Processing (NLP) has become one of the prime technologies for processing very large amounts of unstructured data from disparate information sources. This book includes a wide set of recipes and quick methods that solve challenges in text syntax, semantics, and speech tasks. At the beginning of the book, you'll learn important NLP techniques, such as identifying parts of speech, tagging words, and analyzing word semantics. You will learn how to perform lexical analysis and use machine learning techniques to speed up NLP operations. With independent recipes, you will explore techniques for customizing your existing NLP engines/models using Java libraries such as OpenNLP and the Stanford NLP library. You will also learn how to use NLP processing features from cloud-based sources, including Google and Amazon Web Services (AWS). You will master core tasks, such as stemming, lemmatization, part-of-speech tagging, and named entity recognition. You will also learn about sentiment analysis, semantic text similarity, language identification, machine translation, and text summarization. By the end of this book, you will be ready to become a professional NLP expert using a problem-solution approach to analyze any sort of text, sentence, or semantic word.
Table of Contents (14 chapters)

Preparing Text for Analysis and Tokenization

One of the first steps required for Natural Language Processing (NLP) is the extraction of tokens in text. The process of tokenization splits text into tokens—that is, words. Normally, tokens are split based upon delimiters, such as white space. White space includes blanks, tabs, and carriage-return line feeds. However, specialized tokenizers can split tokens according to other delimiters. In this chapter, we will illustrate several tokenizers that you will find useful in your analysis.

Another important NLP task involves determining the stem and lexical meaning of a word. This is useful for deriving more meaning about the words beings processed, as illustrated in the fifth and sixth recipe. The stem of a word refers to the root of a word. For example, the stem of the word antiquated is antiqu. While this may not seem to be the correct stem, the stem of a word is the ultimate base of the word.

The lexical meaning of a word is not concerned with the context in which it is being used. We will be examining the process of performing lemmatization of a word. This is also concerned with finding the root of a word, but uses a more detailed dictionary to find the root. The stem of a word may vary depending on the form the word takes. However, with lemmatization, the root will always be the same. Stemming is often used when we will be satisfied with possibly a less than precise determination of the root of a word. A more thorough discussion of stemming versus lemmatization can be found at: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/.

The last task in this chapter deals with the process of text normalization. Here, we are concerned with converting the token that is extracted to a form that can be more easily processed during later analysis. Typical normalization activities include converting cases, expanding abbreviations, removing stop words along with stemming, and lemmatization. Stop words are those words that can often be ignored with certain types of analyses. For example, in some contexts, the word the does not always need to be included.

In this chapter, we will cover the following recipes:

  • Tokenization using the Java SDK
  • Tokenization using OpenNLP
  • Tokenization using maximum entropy
  • Training a neural network tokenizer for specialized text
  • Identifying the stem of a word
  • Training an OpenNLP lemmatization model
  • Determining the lexical meaning of a word using OpenNLP
  • Removing stop words using LingPipe