Natural Language Processing with Java and LingPipe Cookbook

Introduction to tokenizer factories – finding words in a character stream


LingPipe tokenizers follow a common pattern: a base tokenizer that can be used on its own or can serve as the source for subsequent filtering tokenizers. Filtering tokenizers manipulate the tokens and whitespaces provided by the base tokenizer. This recipe covers our most commonly used tokenizer, IndoEuropeanTokenizerFactory, which suits languages that use the Indo-European style of punctuation and word separators, such as English, Spanish, and French. As always, the Javadoc has useful information.
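The base-plus-filter pattern can be illustrated with a minimal, dependency-free sketch. Everything in it is hypothetical and exists only for illustration: SimpleTokenizerFactory, base(), lowerCase(), and removeStopWords() are not LingPipe classes or methods, and the regular expression is a rough stand-in for real tokenization rules. LingPipe's own TokenizerFactory interface in com.aliasi.tokenizer works on character arrays and also tracks whitespace, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerPatternSketch {

    /** Hypothetical stand-in for a tokenizer factory: text in, tokens out. */
    public interface SimpleTokenizerFactory {
        List<String> tokenize(String text);
    }

    /**
     * Base tokenizer (assumed rules): a run of letters or digits is one token,
     * and every other non-whitespace character is a token on its own.
     */
    public static SimpleTokenizerFactory base() {
        Pattern tokenPattern = Pattern.compile("[\\p{L}\\p{N}]+|\\S");
        return text -> {
            List<String> tokens = new ArrayList<>();
            Matcher matcher = tokenPattern.matcher(text);
            while (matcher.find()) {
                tokens.add(matcher.group());
            }
            return tokens;
        };
    }

    /** Filtering tokenizer: wraps another factory and lowercases its tokens. */
    public static SimpleTokenizerFactory lowerCase(SimpleTokenizerFactory inner) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String token : inner.tokenize(text)) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    /** Filtering tokenizer: wraps another factory and drops tokens found in stopWords. */
    public static SimpleTokenizerFactory removeStopWords(
            SimpleTokenizerFactory inner, Set<String> stopWords) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (String token : inner.tokenize(text)) {
                if (!stopWords.contains(token)) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        // Filters are stacked on top of the base tokenizer, innermost first.
        SimpleTokenizerFactory factory =
                removeStopWords(lowerCase(base()), Set.of("the"));
        System.out.println(factory.tokenize("The cat sat, then left."));
    }
}
```

The order of wrapping matters in this sketch: lowercasing runs before stop-word removal, so "The" is matched against the lowercase stop list. The base tokenizer remains usable on its own by calling base().tokenize(...) directly.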

Note

IndoEuropeanTokenizerFactory creates tokenizers with built-in support for alpha-numerics, numbers, and other common constructs in Indo-European languages.

The tokenization rules are roughly based on those used in MUC-6 but are necessarily more fine-grained, because the MUC tokenizers are based on lexical and semantic information, such as whether a string is an abbreviation.

MUC-6 refers to the Message...