Natural Language Processing with Java - Second Edition

By: Richard M. Reese

Why is NLP so hard?


NLP is not easy. Several factors make this process hard. For example, there are hundreds of natural languages, each with its own syntax rules. Words can be ambiguous, with their meaning dependent on their context. Here, we will examine a few of the more significant problem areas.

At the character level, several factors need to be considered. For example, we need to know the encoding scheme used for a document; text can be encoded using schemes such as ASCII, UTF-8, UTF-16, or Latin-1. We may also need to decide whether the text should be treated as case-sensitive. Punctuation and numbers may require special processing. We sometimes need to handle emoticons (character combinations and special character images), hyperlinks, repeated punctuation (... or ---), file extensions, and usernames with embedded periods. Many of these issues are handled by preprocessing the text, as we will discuss in the Preparing data section.
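To make this concrete, here is a minimal, JDK-only sketch of that kind of preprocessing: it reads a file using an explicit encoding and normalizes the text by lowercasing it and collapsing repeated punctuation. The file name and the exact normalization rules are illustrative assumptions, not steps mandated by any particular library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PreprocessDemo {
    public static void main(String[] args) throws IOException {
        // Read the document with an explicit encoding rather than the platform default
        String raw = new String(
                Files.readAllBytes(Paths.get("review.txt")), StandardCharsets.UTF_8);

        // Case-fold so that "NLP" and "nlp" are treated alike
        String text = raw.toLowerCase();

        // Collapse repeated punctuation such as "..." or "---" into a single character
        text = text.replaceAll("\\.{2,}", ".").replaceAll("-{2,}", "-");

        // Normalize runs of whitespace
        text = text.replaceAll("\\s+", " ").trim();

        System.out.println(text);
    }
}
```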

When we tokenize text, it usually means we are breaking up the text into a sequence of words. These words are called tokens, and the process is referred to as tokenization. When a language uses whitespace characters to delineate words, this process is not too difficult. With a language such as Chinese, it can be quite difficult, since words are not separated by whitespace.
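The following sketch illustrates tokenization using only the JDK: a naive whitespace split alongside the locale-aware java.text.BreakIterator. The sample sentence is an assumption for illustration; the NLP libraries discussed later provide trained tokenizers that handle more difficult cases.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class TokenizeDemo {
    public static void main(String[] args) {
        String text = "The city is large but beautiful.";

        // A naive approach: split on whitespace (punctuation stays attached to words)
        String[] roughTokens = text.split("\\s+");
        System.out.println(Arrays.toString(roughTokens));

        // A locale-aware approach using the JDK's BreakIterator
        List<String> tokens = new ArrayList<>();
        BreakIterator wordIterator = BreakIterator.getWordInstance(Locale.ENGLISH);
        wordIterator.setText(text);
        int start = wordIterator.first();
        for (int end = wordIterator.next(); end != BreakIterator.DONE;
                start = end, end = wordIterator.next()) {
            String candidate = text.substring(start, end).trim();
            if (!candidate.isEmpty()) {   // skip pure whitespace segments
                tokens.add(candidate);
            }
        }
        System.out.println(tokens);       // [The, city, is, large, but, beautiful, .]
    }
}
```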

Words and morphemes may need to be assigned a Part-of-Speech (POS) label, identifying what type of unit they are. A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes. Often, we need to consider synonyms, abbreviations, acronyms, and spelling variations when we work with words.
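As a brief illustration, the sketch below tags each token with its POS label using Stanford CoreNLP, one of the libraries covered later in this book. It assumes a recent CoreNLP release and its English models are on the classpath; the sample sentence is illustrative.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class PosDemo {
    public static void main(String[] args) {
        // Build a pipeline that tokenizes, splits sentences, and tags parts of speech
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document = new CoreDocument("The city is large but beautiful.");
        pipeline.annotate(document);

        // Print each token with its POS tag, for example "city/NN"
        for (CoreLabel token : document.tokens()) {
            System.out.println(token.word() + "/" + token.tag());
        }
    }
}
```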

Stemming is another task that may need to be applied. Stemming is the process of finding the stem of a word. For example, words such as walking, walked, or walks have the word stem walk. Search engines often use stemming to assist in answering queries.
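The following deliberately naive suffix-stripping stemmer is only meant to make the idea concrete; production systems use carefully designed algorithms such as the Porter stemmer, which the libraries covered later provide.

```java
public class NaiveStemmer {

    // Strip a few common English suffixes; real stemmers (e.g., Porter) are far more careful
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        if (w.endsWith("ed") && w.length() > 4) {
            return w.substring(0, w.length() - 2);
        }
        if (w.endsWith("s") && !w.endsWith("ss") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    public static void main(String[] args) {
        for (String word : new String[]{"walking", "walked", "walks", "walk"}) {
            System.out.println(word + " -> " + stem(word));   // all reduce to "walk"
        }
    }
}
```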

Closely related to stemming is the process of lemmatization. This process determines the base form of a word, called its lemma. For example, for the word operating, its stem is oper but its lemma is operate. Lemmatization is a more refined process than stemming, and uses vocabulary and morphological techniques to find a lemma. This can result in more precise analysis in some situations.
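The sketch below produces lemmas using Stanford CoreNLP's lemma annotator, which depends on POS tags. As before, it assumes a recent CoreNLP release with its English models on the classpath, and the sample sentence is illustrative.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class LemmaDemo {
    public static void main(String[] args) {
        // The lemma annotator needs POS tags, so pos comes before lemma in the pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document = new CoreDocument("He was operating the machines.");
        pipeline.annotate(document);

        // Print each token with its lemma, for example "operating -> operate"
        for (CoreLabel token : document.tokens()) {
            System.out.println(token.word() + " -> " + token.lemma());
        }
    }
}
```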

Words are combined into phrases and sentences. Sentence detection can be problematic and is not as simple as looking for the period at the end of a sentence. Periods appear in many other places, including abbreviations such as Ms. and numbers such as 12.834.
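As a simple illustration, the JDK's BreakIterator can split text into sentences, as shown below. It handles common cases but can still be confused by abbreviations and other uses of the period; the trained sentence detectors provided by the libraries covered later generally do better.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class SentenceDemo {
    public static void main(String[] args) {
        String text = "The city is large but beautiful. It fills the entire valley.";

        BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentenceIterator.setText(text);

        // Walk the boundaries and print each sentence on its own line
        int start = sentenceIterator.first();
        for (int end = sentenceIterator.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIterator.next()) {
            System.out.println(text.substring(start, end).trim());
        }
    }
}
```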

We often need to understand which words in a sentence are nouns and which are verbs. We are often concerned with the relationship between words. For example, coreference resolution determines the relationship between certain words in one or more sentences. Consider the following sentence:

"The city is large but beautiful. It fills the entire valley."

The word it is a coreference to city. When a word has multiple meanings, we might need to perform word-sense disambiguation (WSD) to determine the intended meaning. This can be difficult to do at times. For example, in "John went back home," does home refer to a house, a city, or some other unit? Its meaning can sometimes be inferred from the context in which it is used, for example, "John went back home. It was situated at the end of a cul-de-sac."
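The sketch below shows what coreference resolution looks like in code, using Stanford CoreNLP's coref annotator. It assumes a recent CoreNLP release with the full English models on the classpath; annotator and class names have changed between versions, so treat this as a sketch rather than a definitive recipe.

```java
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import java.util.Properties;

public class CorefDemo {
    public static void main(String[] args) {
        // Coreference needs the upstream annotators it depends on
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,coref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document =
                new CoreDocument("The city is large but beautiful. It fills the entire valley.");
        pipeline.annotate(document);

        // Each chain groups mentions that refer to the same entity, e.g., "The city" and "It"
        for (CorefChain chain : document.corefChains().values()) {
            System.out.println(chain);
        }
    }
}
```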

Note

Despite these difficulties, NLP is able to perform these tasks reasonably well in most situations and provide added value to many problem domains. For example, sentiment analysis can be performed on customer tweets, resulting in possible free product offers for dissatisfied customers. Medical documents can be readily summarized to highlight the relevant topics and improve productivity.

Summarization is the process of producing a short description of different units. These units can include multiple sentences, paragraphs, a document, or multiple documents. The intent may be to identify the sentences that convey the meaning of the unit, to determine the prerequisites for understanding a unit, or to find items within these units. Frequently, the context of the text is important in accomplishing this task.
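To make the idea of extractive summarization concrete, here is a toy, JDK-only sketch that scores each sentence by the overall frequency of its words and keeps the highest-scoring sentence. Real summarizers are considerably more sophisticated; the sample text and scoring scheme are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveSummarizer {
    public static void main(String[] args) {
        String document = "The city is large but beautiful. It fills the entire valley. "
                + "Tourists visit the city every summer because the city has mild weather.";

        // Split into sentences after each period followed by whitespace
        String[] sentences = document.split("(?<=\\.)\\s+");

        // Count how often each word occurs in the whole document
        Map<String, Integer> frequency = new HashMap<>();
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                frequency.merge(word, 1, Integer::sum);
            }
        }

        // Pick the sentence whose words are most frequent overall
        String best = sentences[0];
        int bestScore = -1;
        for (String sentence : sentences) {
            int score = 0;
            for (String word : sentence.toLowerCase().split("\\W+")) {
                score += frequency.getOrDefault(word, 0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = sentence;
            }
        }
        System.out.println("Summary: " + best);
    }
}
```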