
Natural Language Processing with Java

By: Richard M. Reese

Why is NLP so hard?


NLP is not easy. Several factors make this process hard. For example, there are hundreds of natural languages, each with its own syntax rules. Words can be ambiguous, with a meaning that depends on their context. Here, we will examine a few of the more significant problem areas.

At the character level, several factors need to be considered. One is the encoding scheme used for a document: text can be encoded using schemes such as ASCII, UTF-8, UTF-16, or Latin-1. Another is whether the text should be treated as case-sensitive. Punctuation and numbers may require special processing. We sometimes need to account for emoticons (character combinations and special character images), hyperlinks, repeated punctuation (… or ---), file extensions, and usernames with embedded periods. Many of these are handled by preprocessing the text, as we will discuss in Preparing data later in the chapter.
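To make the character-level concerns concrete, the following is a minimal sketch using only the standard Java library. The class and method names (CharacterLevelDemo, decodeUtf8, foldCase, collapseRepeatedPunctuation) are hypothetical and chosen for illustration; they are not part of any NLP library, and the cleanup rule shown is only one possible choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharacterLevelDemo {
    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Fold case with a fixed locale so the result does not vary by machine.
    static String foldCase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Assumption for illustration: replace runs of three or more periods
    // or hyphens with a single space.
    static String collapseRepeatedPunctuation(String text) {
        return text.replaceAll("[.\\-]{3,}", " ");
    }

    public static void main(String[] args) {
        // "Caf\u00e9" contains one non-ASCII character (e with acute accent).
        byte[] raw = "Caf\u00e9 NLP".getBytes(StandardCharsets.UTF_8);

        // Decoding with the right charset recovers the original 8 characters.
        System.out.println(decodeUtf8(raw).length());

        // Decoding the same bytes as Latin-1 yields 9 characters, because the
        // two-byte UTF-8 sequence is read as two separate characters.
        System.out.println(new String(raw, StandardCharsets.ISO_8859_1).length());

        System.out.println(foldCase("NLP Is Hard"));
        System.out.println(collapseRepeatedPunctuation("wait---what...now"));
    }
}
```

The same bytes produce different text depending on the charset assumed, which is why the encoding needs to be known before any other processing is attempted.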

When we tokenize text, we usually mean breaking the text into a sequence of words. These words are called tokens, and the process is referred to as tokenization. When a language uses whitespace characters to delineate words, this process is not too difficult. With a language like Chinese, it can be quite difficult, since words are written without whitespace between them.
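For a whitespace-delimited language, the simplest form of tokenization can be sketched with the standard String.split method. This is only an illustration; WhitespaceTokenizerDemo and tokenize are hypothetical names, and the approach ignores punctuation entirely.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class WhitespaceTokenizerDemo {
    // Split on runs of whitespace; trim first so no empty leading token appears.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Prints [He, lives, in, the, RPI, area.]
        System.out.println(tokenize("He  lives in the RPI area."));
    }
}
```

Note that the final token is "area." with the period still attached. Even for English, whitespace splitting alone is rarely enough, which is one reason dedicated tokenizers exist.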

Words and morphemes may need to be assigned a part-of-speech label identifying what type of unit they are. A morpheme is the smallest division of text that has meaning. Prefixes and suffixes are examples of morphemes. Often, we need to consider synonyms, abbreviations, acronyms, and spellings when we work with words.

Stemming is another task that may need to be applied. Stemming is the process of finding the word stem of a word. For example, words such as "walking", "walked", or "walks" have the word stem "walk". Search engines often use stemming to help match a query against documents.
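The idea behind stemming can be sketched as simple suffix stripping. The code below is deliberately crude: it is not the Porter algorithm or any library's stemmer, and the suffix list and minimum stem length are assumptions made for this example. NaiveStemmerDemo and stem are hypothetical names.

```java
import java.util.Locale;

class NaiveStemmerDemo {
    // Assumed suffix list for illustration; real stemmers apply many more rules.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Strip the first matching suffix, keeping at least three characters.
    static String stem(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = lower.length() - suffix.length();
            if (lower.endsWith(suffix) && stemLength >= 3) {
                return lower.substring(0, stemLength);
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        // Each of these prints "walk".
        System.out.println(stem("walking"));
        System.out.println(stem("walked"));
        System.out.println(stem("walks"));
    }
}
```

A rule set this small handles the "walk" family but breaks down quickly on irregular forms, which is why practical stemmers use larger, ordered rule sets.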

Closely related to stemming is the process of lemmatization. This process determines the base form of a word, called its lemma. For example, for the word "operating", its stem is "oper" but its lemma is "operate". Lemmatization is a more refined process than stemming and uses vocabulary and morphological techniques to find a lemma. This can result in more precise analysis in some situations.
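The vocabulary-based side of lemmatization can be illustrated with a plain dictionary lookup. This is a toy sketch: the four entries below are hand-picked assumptions for the example, the names LemmaLookupDemo and lemma are hypothetical, and no morphological analysis is performed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class LemmaLookupDemo {
    // Tiny hand-built vocabulary, assumed for illustration only.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("operating", "operate");
        LEMMAS.put("operated", "operate");
        LEMMAS.put("mice", "mouse");
        LEMMAS.put("was", "be");
    }

    // Return the known lemma, or the lowercased word itself if it is not listed.
    static String lemma(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    public static void main(String[] args) {
        System.out.println(lemma("operating")); // prints operate
        System.out.println(lemma("mice"));      // prints mouse
        System.out.println(lemma("valley"));    // prints valley (not in the map)
    }
}
```

Unlike suffix stripping, a lookup can map irregular forms such as "mice" to a real word, at the cost of needing a vocabulary that covers the text.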

Words are combined into phrases and sentences. Sentence detection can be problematic and is not as simple as looking for the periods at the end of a sentence. Periods are found in many places including abbreviations such as Ms. and in numbers such as 12.834.
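The problem with treating every period as a sentence boundary can be seen in a short sketch. The sample sentence and the names NaiveSentenceSplitDemo and splitOnPeriods are made up for this illustration; the code shows the naive approach failing, not a recommended technique.

```java
class NaiveSentenceSplitDemo {
    // Naive approach: treat every period as the end of a sentence.
    static String[] splitOnPeriods(String text) {
        return text.split("\\.");
    }

    public static void main(String[] args) {
        // Two sentences, but four periods: one in "Ms.", one in "12.834",
        // and one at the end of each sentence.
        String text = "Ms. Smith paid 12.834 dollars. She left.";
        String[] pieces = splitOnPeriods(text);

        // Prints 4, not 2: the abbreviation and the number are both split.
        System.out.println(pieces.length);
        for (String piece : pieces) {
            System.out.println("[" + piece.trim() + "]");
        }
    }
}
```

The pieces are "Ms", "Smith paid 12", "834 dollars", and "She left", so a usable sentence detector has to distinguish sentence-ending periods from those in abbreviations and numbers.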

We often need to understand which words in a sentence are nouns and which are verbs. We are sometimes concerned with the relationship between words. For example, coreference resolution determines the relationship between certain words in one or more sentences. Consider the following sentences:

"The city is large but beautiful. It fills the entire valley."

The word "It" is a coreference to "city". When a word has multiple meanings, we might need to perform word sense disambiguation to determine the meaning that was intended. This can be difficult to do at times. For example, consider "John went back home".

Does "home" refer to a house, a city, or some other unit? Its meaning can sometimes be inferred from the context in which it is used. For example, "John went back home. It was situated at the end of a cul-de-sac."

Note

In spite of these difficulties, NLP is able to perform these tasks reasonably well in most situations and provide added value to many problem domains. For example, sentiment analysis can be performed on customer tweets, resulting in possible free product offers for dissatisfied customers. Medical documents can be readily summarized to highlight the relevant topics and improve productivity.

Summarization is the process of producing a short description of different units. These units can include multiple sentences, paragraphs, a document, or multiple documents. The intent may be to identify the sentences that convey the meaning of the unit, to determine the prerequisites for understanding a unit, or to find items within these units. Frequently, the context of the text is important in accomplishing this task.