Natural Language Processing with Java

By: Richard M. Reese


What is tokenization?


Tokenization is the process of breaking text down into simpler units. For most text, the units we want to isolate are words. Text is split into tokens at delimiters, which are most often whitespace characters. In Java, whitespace is defined by the Character class's isWhitespace method, and the characters it recognizes are listed in the following table. At times, however, a different set of delimiters is needed. For example, treating all whitespace as equivalent can obscure text breaks such as paragraph boundaries, so when detecting those breaks is important, other delimiters are more useful.

Character                Meaning
Unicode space character  (space_separator, line_separator, or paragraph_separator)
\t                       U+0009 horizontal tabulation
\n                       U+000A line feed
\u000B                   U+000B vertical tabulation
\f                       U+000C form feed
\r                       U+000D carriage return
\u001C                   U+001C file separator
\u001D                   U+001D group separator
\u001E                   U+001E record separator
\u001F                   U+001F unit...
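
As a minimal sketch of the ideas above, the following example splits text at any character that Character.isWhitespace accepts, and separately splits on blank lines as one possible alternative delimiter for keeping paragraph boundaries visible. The class name WhitespaceTokenizerSketch and the methods tokenize and splitParagraphs are hypothetical names chosen for illustration, not part of any library, and the blank-line rule is only one assumed definition of a paragraph break.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not from the book or any library.
class WhitespaceTokenizerSketch {

    // Splits text into tokens, treating every character for which
    // Character.isWhitespace returns true as a delimiter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // A delimiter ends the token being built, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Assumption: a paragraph break is one or more blank lines,
    // that is, two or more consecutive line endings.
    static List<String> splitParagraphs(String text) {
        return Arrays.asList(text.split("(\\r?\\n){2,}"));
    }

    public static void main(String[] args) {
        String text = "Let's pause,\tand then\n\nreflect.";
        // Whitespace tokenization discards the paragraph break.
        System.out.println(tokenize(text));
        // Splitting on blank lines keeps the two paragraphs apart.
        System.out.println(splitParagraphs(text));
    }
}
```

Note that whitespace tokenization leaves punctuation attached to the neighboring word (the second token here is "pause," with its comma), and that the tab and both line feeds are all treated as the same kind of delimiter, which is why the paragraph break is lost in the first result.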