Survey of NLP tools


There are many tools available that support NLP. Some of these ship with the Java SE SDK but are limited in their utility for all but the simplest problems. Other libraries, such as Apache OpenNLP and LingPipe, provide extensive and sophisticated support for NLP problems.

Low-level Java support includes string classes, such as String, StringBuilder, and StringBuffer. These classes provide methods for searching, matching, and text replacement. Regular expressions use special encodings to match substrings, and Java provides rich support for them through the java.util.regex package.
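
For example, the following minimal sketch uses the Pattern and Matcher classes from the java.util.regex package to find runs of digits; the sample text and pattern are illustrative only:

Pattern pattern = Pattern.compile("\\d+"); 
Matcher matcher = pattern.matcher("He lives at 1511 W. Randolph."); 
while (matcher.find()) { 
    // Prints each run of digits found, here the street number 
    System.out.println("Found: " + matcher.group()); 
} 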

As discussed earlier, tokenizers are used to split text into individual elements. Java provides support for tokenization with the following classes, as the short sketch after this list illustrates:

  • The String class' split method
  • The StreamTokenizer class
  • The StringTokenizer class
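
As a minimal sketch, the String class' split method breaks a string on a regular expression, here one or more whitespace characters; the sample text is illustrative only:

String text = "He lives at 1511 W. Randolph."; 
// Split on runs of whitespace; punctuation stays attached to each word 
String[] tokens = text.split("\\s+"); 
for (String token : tokens) { 
    System.out.print("[" + token + "] "); 
} 
System.out.println(); 

This produces [He] [lives] [at] [1511] [W.] [Randolph.], which illustrates why the more sophisticated, punctuation-aware tokenizers covered later are often needed.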

There also exist a number of NLP libraries/APIs for Java. A partial list of Java-based NLP APIs can be found in the following table. Most of these are open source. In addition, there are a number of commercial APIs available. We will focus on the open source APIs:

Many of these NLP tasks are combined to form a pipeline. A pipeline consists of various NLP tasks, which are integrated into a series of steps to achieve a processing goal. Examples of frameworks that support pipelines are General Architecture for Text Engineering (GATE) and Apache UIMA.

In the next section, we will cover several NLP APIs in more depth. A brief overview of their capabilities will be presented along with a list of useful links for each API.

Apache OpenNLP

The Apache OpenNLP project is a machine-learning-based toolkit for processing natural-language text; it addresses common NLP tasks and will be used throughout this book. It consists of several components that perform specific tasks, permit models to be trained, and support testing of those models. The general approach used by OpenNLP is to instantiate, from a file, a model that supports the task, and then execute methods against the model to perform the task.

For example, in the following sequence, we will tokenize a simple string. For this code to execute properly, it must handle the FileNotFoundException and IOException exceptions. We use a try-with-resources block to open a FileInputStream instance using the en-token.bin file. This file contains a model that has been trained on English text:

try (InputStream is = new FileInputStream( 
        new File(getModelDir(), "en-token.bin"))){ 
    // Insert code to tokenize the text 
} catch (FileNotFoundException ex) { 
    ... 
} catch (IOException ex) { 
    ... 
} 

An instance of the TokenizerModel class is then created using this file inside the try block. Next, we create an instance of the Tokenizer class, as shown here:

TokenizerModel model = new TokenizerModel(is); 
Tokenizer tokenizer = new TokenizerME(model); 

The tokenize method is then invoked with the text to be tokenized as its argument. The method returns an array of String objects:

String[] tokens = tokenizer.tokenize("He lives at 1511 W. " 
  + "Randolph."); 

A for-each statement displays the tokens, as shown here. Opening and closing brackets are used to clearly identify each token:

for (String a : tokens) { 
  System.out.print("[" + a + "] "); 
} 
System.out.println(); 

When we execute this, we will get the following output:

[He] [lives] [at] [1511] [W.] [Randolph] [.]  

In this case, the tokenizer recognized that W. was an abbreviation and that the last period was a separate token marking the end of the sentence.

We will use the OpenNLP API for many of the examples in this book. OpenNLP links are listed in the following table:

Stanford NLP

The Stanford NLP Group conducts NLP research and provides tools for NLP tasks. Stanford CoreNLP is one of these toolsets. In addition, there are other toolsets, such as the Stanford Parser, the Stanford POS Tagger, and the Stanford Classifier. The Stanford tools support English and Chinese and basic NLP tasks, including tokenization and named-entity recognition.

These tools are released under the full GPL, which does not permit their use in commercial applications, though a commercial license is available. The API is well organized and supports the core NLP functionality.

There are several tokenization approaches supported by the Stanford group. We will use the PTBTokenizer class to illustrate the use of this NLP library. The constructor demonstrated here takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies which of several options to use.

LexedTokenFactory is an interface that is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default.

The CoreLabelTokenFactory class is used in the following example. A StringReader is created using a string. The last argument is used for the option parameter, which is null for this example. The Iterator interface is implemented by the PTBTokenizer class, allowing us to use the hasNext and next methods to display the tokens:

PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>( 
    new StringReader("He lives at 1511 W. Randolph."), 
    new CoreLabelTokenFactory(), null); 
while (ptb.hasNext()) { 
  System.out.println(ptb.next()); 
} 

The output is as follows:

He
lives
at
1511
W.
Randolph
.  
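
Because CoreLabelTokenFactory was used, each token is a CoreLabel that retains its character offsets in the original text. The following is a minimal sketch of reading those positions using the CoreLabel class' word, beginPosition, and endPosition methods; the output formatting is illustrative only:

PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<>( 
    new StringReader("He lives at 1511 W. Randolph."), 
    new CoreLabelTokenFactory(), null); 
while (tokenizer.hasNext()) { 
    CoreLabel token = tokenizer.next(); 
    // beginPosition/endPosition give the token's character span in the input 
    System.out.println(token.word() + " [" + token.beginPosition() 
        + ", " + token.endPosition() + ")"); 
} 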

We will use the Stanford NLP library extensively in this book. A list of Stanford links is found in the following table. Documentation and download links are found in each of the distributions:

LingPipe

LingPipe consists of a set of tools to perform common NLP tasks. It supports model training and testing. There are both royalty-free and licensed versions of the tool. The production use of the free version is limited.

To demonstrate LingPipe, we will illustrate how its Tokenizer class can be used to tokenize text. Start by declaring two lists: one to hold the tokens and a second to hold the whitespace:

List<String> tokenList = new ArrayList<>(); 
List<String> whiteList = new ArrayList<>(); 

Next, declare a string to hold the text to be tokenized:

String text = "A sample sentence processed \nby \tthe " + 
    "LingPipe tokenizer."; 

Now, create an instance of the Tokenizer class. As shown in the following code block, a static tokenizer method is used to create an instance of the Tokenizer class based on an Indo-European factory class:

Tokenizer tokenizer = IndoEuropeanTokenizerFactory.INSTANCE 
    .tokenizer(text.toCharArray(), 0, text.length()); 

The tokenize method of this class is then used to populate the two lists:

tokenizer.tokenize(tokenList, whiteList); 

Use a for-each statement to display the tokens:

for(String element : tokenList) { 
  System.out.print(element + " "); 
} 
System.out.println(); 

The output of this example is shown here:

A sample sentence processed by the LingPipe tokenizer
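
The whiteList populated earlier can also be put to use. Assuming LingPipe's documented convention that the whitespace list holds one more entry than the token list (the whitespace before each token plus the trailing whitespace), the original text can be rebuilt by interleaving the two lists, as this hedged sketch shows:

// Rebuild the original text from the tokens and interleaved whitespace 
StringBuilder rebuilt = new StringBuilder(whiteList.get(0)); 
for (int i = 0; i < tokenList.size(); i++) { 
    rebuilt.append(tokenList.get(i)).append(whiteList.get(i + 1)); 
} 
System.out.println(rebuilt); 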

A list of LingPipe links can be found in the following table:

GATE

GATE is a set of tools written in Java and developed at the University of Sheffield in England. It supports many NLP tasks and languages, and it can also be used as a pipeline for NLP processing. It supports an API along with GATE Developer, a document viewer that displays text along with annotations, which is useful for examining a document using highlighted annotations. GATE Mimir, a tool for indexing and searching text generated by various sources, is also available. Using GATE for many NLP tasks involves a bit of code; GATE Embedded is used to embed GATE functionality directly in code, as the sketch after the following table illustrates. Useful GATE links are listed in the following table:

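To give a flavor of GATE Embedded, here is a minimal, hedged sketch; it assumes that GATE is installed and its JARs are on the classpath, and the sample text is illustrative only:

import gate.Document; 
import gate.Factory; 
import gate.Gate; 

public class GateSketch { 
    public static void main(String[] args) throws Exception { 
        Gate.init();  // initializes GATE Embedded; requires a GATE installation 
        Document doc = Factory.newDocument("GATE supports many NLP tasks."); 
        System.out.println(doc.getContent()); 
        Factory.deleteResource(doc);  // releases the document resource 
    } 
} 
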
TwitIE is an open source GATE pipeline for information extraction over tweets. It contains the following:

  • Language identification for social media data
  • A Twitter tokenizer for handling smileys, usernames, URLs, and so on
  • A POS tagger
  • Text normalization

It is available as part of the GATE Twitter plugin. The following table lists the required links:

UIMA

The Organization for the Advancement of Structured Information Standards (OASIS) is a consortium focused on information-oriented business technologies. It developed the Unstructured Information Management Architecture (UIMA) standard as a framework for NLP pipelines. The standard is implemented by the Apache UIMA project.

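As a small, hedged taste of working with UIMA from Java, the following sketch creates a CAS (Common Analysis Structure) and attaches document text to it; it assumes the uimaFIT helper library is on the classpath in addition to the UIMA core:

import org.apache.uima.fit.factory.JCasFactory; 
import org.apache.uima.jcas.JCas; 

public class UimaSketch { 
    public static void main(String[] args) throws Exception { 
        // Create an empty CAS and attach the text that a pipeline would analyze 
        JCas jcas = JCasFactory.createJCas(); 
        jcas.setDocumentText("UIMA structures NLP pipelines."); 
        System.out.println(jcas.getDocumentText()); 
    } 
} 
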
In addition to supporting pipeline creation, UIMA also describes a series of design patterns, data representations, and user roles for the analysis of text. UIMA links are listed in the following table:

Apache Lucene Core

Apache Lucene Core is an open source library, written in Java, for full-featured text search engines. It uses tokenization to break text into small chunks for indexing. It also provides pre- and post-tokenization options for analysis purposes, and it supports stemming, filtering, text normalization, and synonym expansion after tokenization. When used, it creates a directory of index files whose contents can then be searched. It should not be viewed as a complete NLP toolkit, but it provides powerful tools for working with text and advanced string manipulation built on tokenization, and it is freely available as a search engine. The following table lists the important links for Apache Lucene:
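
To illustrate Lucene's analysis chain, here is a minimal, hedged sketch that prints the tokens produced by the StandardAnalyzer; it assumes a reasonably recent Lucene release on the classpath, and the field name and sample text are illustrative only:

import org.apache.lucene.analysis.TokenStream; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; 

public class LuceneTokenSketch { 
    public static void main(String[] args) throws Exception { 
        try (StandardAnalyzer analyzer = new StandardAnalyzer(); 
             TokenStream stream = 
                 analyzer.tokenStream("contents", "Apache Lucene tokenizes text.")) { 
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class); 
            stream.reset();  // required before the first incrementToken call 
            while (stream.incrementToken()) { 
                System.out.println(term.toString()); 
            } 
            stream.end(); 
        } 
    } 
} 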