Natural Language Processing with Java and LingPipe Cookbook

Using Lucene/Solr tokenizers


The very popular Lucene search engine includes many analysis modules, which provide general-purpose tokenizers as well as language-specific tokenizers for languages from Arabic to Thai. As of Lucene 4, most of these analyzers are packaged in separate JAR files. We cover Lucene tokenizers here because they can be used as LingPipe tokenizers, as you will see in the next recipe.

Much like LingPipe tokenizers, Lucene tokenizers can be divided into basic tokenizers and filtered tokenizers: a basic tokenizer takes a reader as input, while a filtered tokenizer takes another tokenizer as input. We will look at an example that uses the standard Lucene tokenizer together with a lowercase token filter. A Lucene analyzer essentially maps a field to a token stream, so if you have an existing Lucene index, you can use an analyzer with a field name instead of the raw tokenizer, as we will show later in this chapter.
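As a minimal sketch of the basic-plus-filtered pattern described above, the following assumes Lucene 4.x (lucene-core and lucene-analyzers-common on the classpath); the StandardTokenizer constructor and the Version.LUCENE_46 constant shown here vary across Lucene releases, and the class name LuceneTokenizeDemo is our own:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class LuceneTokenizeDemo {

    // Runs text through a basic tokenizer and then a lowercase filter,
    // collecting the resulting tokens into a list.
    static List<String> tokenize(String text) throws IOException {
        // Basic tokenizer: consumes characters from a Reader.
        StandardTokenizer tokenizer =
                new StandardTokenizer(Version.LUCENE_46, new StringReader(text));
        // Filtered tokenizer: wraps another TokenStream.
        TokenStream stream = new LowerCaseFilter(Version.LUCENE_46, tokenizer);

        List<String> tokens = new ArrayList<String>();
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();                      // required before incrementToken()
        while (stream.incrementToken()) {
            tokens.add(term.toString());     // current token text
        }
        stream.end();
        stream.close();
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("Hello World"));
    }
}
```

The reset/incrementToken/end/close sequence is Lucene's standard token-stream consumption contract; the same loop works for any filter chain you build this way.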

Getting ready

You will need to download the JAR...