Natural Language Processing with Java

By: Richard M. Reese

Understanding the parts of text


There are several ways to categorize the parts of a text. At the character level, we may be concerned with punctuation, including whether to ignore or expand contractions. At the word level, we may need to perform operations such as:

  • Identifying morphemes using stemming and/or lemmatization

  • Expanding abbreviations and acronyms

  • Isolating number units
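As a rough illustration of the expansion step listed above, the following sketch replaces contractions and abbreviations by table lookup. The class name `ExpansionSketch`, the `expand` method, and the three table entries are assumptions made for this example, not taken from the book. A real system would need a much larger table and some way to resolve ambiguous entries.

```java
import java.util.HashMap;
import java.util.Map;

public class ExpansionSketch {

    // Hypothetical lookup table; the entries are illustrative only.
    private static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("can't", "cannot");
        EXPANSIONS.put("Dr.", "Doctor");
        EXPANSIONS.put("NLP", "natural language processing");
    }

    // Splits on whitespace and replaces each token that has a table entry.
    // Tokens without an entry are passed through unchanged.
    static String expand(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(EXPANSIONS.getOrDefault(token, token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("Dr. Lee can't attend the NLP talk"));
    }
}
```

The lookup works on whitespace-separated tokens on purpose: splitting on punctuation first would break "can't" and "Dr." apart before they could be matched.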

We cannot always split words at punctuation, because a punctuation mark is sometimes part of the word, as in "can't". We may also want to group multiple words into meaningful phrases. Sentence detection is a factor here as well, since we do not necessarily want to group words across sentence boundaries.
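The two concerns above can be sketched with the standard JDK alone. The class and method names below are assumptions made for this example, and the regular expressions are one possible choice rather than a definitive tokenizer. The first method shows how splitting on every punctuation mark breaks "can't", the second keeps word-internal apostrophes, and the third uses `java.text.BreakIterator` to form word pairs that stay within a sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationTokenizerSketch {

    // Naive approach: every character that is not a letter or digit is a delimiter.
    static List<String> splitOnPunctuation(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // A token is a run of letters or digits, optionally followed by
    // apostrophe-plus-letters groups, so "can't" stays in one piece.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:'\\p{L}+)*");

    static List<String> keepInternalApostrophes(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // Pairs adjacent words, but never across a sentence boundary.
    static List<String> bigramsWithinSentences(String text) {
        List<String> bigrams = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            List<String> words = keepInternalApostrophes(text.substring(start, end));
            for (int i = 0; i + 1 < words.size(); i++) {
                bigrams.add(words.get(i) + " " + words.get(i + 1));
            }
        }
        return bigrams;
    }

    public static void main(String[] args) {
        System.out.println(splitOnPunctuation("We can't stop."));
        System.out.println(keepInternalApostrophes("We can't stop."));
        System.out.println(bigramsWithinSentences("We went home. Then we slept."));
    }
}
```

With the sample input, the naive splitter yields "can" and "t" as separate tokens, while the pattern-based version keeps "can't" whole. The bigram method produces no "home Then" pair, because those two words belong to different sentences.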

In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques such as stemming. We will not show how these techniques are used in other NLP tasks; that is reserved for later chapters.