Natural Language Processing with Java - Second Edition

By: Richard M. Reese

Overview of this book

Natural Language Processing (NLP) allows you to take any sentence and identify patterns, special names, company names, and more. The second edition of Natural Language Processing with Java teaches you how to perform language analysis with the help of Java libraries, while constantly gaining insights from the outcomes. You’ll start by understanding how NLP and its various concepts work. Having got to grips with the basics, you’ll explore important tools and libraries in Java for NLP, such as CoreNLP, OpenNLP, Neuroph, and Mallet. You’ll then start performing NLP on different inputs and tasks, such as tokenization, model training, part-of-speech tagging, and parse trees. You’ll learn about statistical machine translation, summarization, dialog systems, complex searches, supervised and unsupervised NLP, and more. By the end of this book, you’ll have learned more about NLP, neural networks, and various other trained models in Java for enhancing the performance of NLP applications.

What is tokenization?


Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method; note that this method does not treat non-breaking spaces such as U+00A0 as whitespace. At times, however, a different set of delimiters may be needed. For example, different delimiters can be useful when whitespace delimiters obscure text breaks, such as paragraph boundaries, and detecting these text breaks is important. The characters that isWhitespace recognizes are listed in the following table:

Character                  Meaning
-------------------------  ---------------------------------------------------------
Unicode space character    (space_separator, line_separator, or paragraph_separator)
\t                         U+0009 horizontal tabulation
\n                         U+000A line feed
\u000B                     U+000B vertical tabulation
\f                         U+000C form feed
\r                         U+000D carriage return
\u001C                     U+001C file separator
\u001D                     U+001D group separator
\u001E                     U+001E record separator
\u001F                     U+001F unit separator
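
The following is a minimal sketch of the two ideas above, not code taken from the book. It first splits text on every character that Character.isWhitespace accepts, and then splits the same text on paragraph boundaries instead. The class name WhitespaceTokenizerSketch, the method names, and the definition of a paragraph boundary as one or more blank lines are assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of any NLP library.
final class WhitespaceTokenizerSketch {

    // Splits text into tokens, using every code point that
    // Character.isWhitespace accepts as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isWhitespace(codePoint)) {
                // A delimiter ends the token collected so far, if any.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Splits text into paragraphs. Assumption: a paragraph boundary is a
    // line break followed by one or more further line breaks, with only
    // spaces or tabs in between (that is, at least one blank line).
    static List<String> paragraphs(String text) {
        List<String> result = new ArrayList<>();
        for (String candidate : text.split("\\R(?:[ \\t]*\\R)+")) {
            String paragraph = candidate.trim();
            if (!paragraph.isEmpty()) {
                result.add(paragraph);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Let's pause\tand\nthen continue.\n\nNew paragraph here.";

        // Whitespace tokenization loses the paragraph break.
        // Prints: [Let's, pause, and, then, continue., New, paragraph, here.]
        System.out.println(tokenize(text));

        // Splitting on blank lines keeps it. Prints: 2
        System.out.println(paragraphs(text).size());

        // A non-breaking space is not whitespace for isWhitespace. Prints: 1
        System.out.println(tokenize("a\u00A0b").size());
    }
}
```

In this sketch, the whitespace tokenizer treats the blank line like any other delimiter, so the paragraph boundary disappears from the token list. Splitting on blank lines first, and tokenizing each paragraph afterwards, is one way to keep that boundary when it matters.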

 

The tokenization process is complicated by a large...