We mentioned in the previous recipe that LingPipe tokenizers can be basic or filtered. Basic tokenizers need little parameterization; the Indo-European tokenizer, for one, needs none at all. Filtered tokenizers, however, take another tokenizer as a parameter. A filtered tokenizer chains tokenizers together: a filter wraps a base tokenizer and modifies its output, producing a different tokenizer.
LingPipe provides several basic tokenizers, such as IndoEuropeanTokenizerFactory or CharacterTokenizerFactory. A complete list can be found in the Javadoc for LingPipe. In this section, we'll show you how to combine an Indo-European tokenizer with a lowercase tokenizer. This is a fairly common process that many search engines implement for Indo-European languages.
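Before turning to the LingPipe classes themselves, the base-plus-filter idea can be sketched in plain Java with no library on the classpath. Everything in this sketch is a hypothetical stand-in rather than LingPipe's API: the TokenizerFactory interface, the BASE tokenizer (a rough approximation that splits on whitespace and treats each punctuation character as its own token), and the lowerCase filter exist only to show how a filter takes a tokenizer as its parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class FilteredTokenizerSketch {

    /** Hypothetical stand-in for a tokenizer factory: text in, tokens out. */
    interface TokenizerFactory {
        List<String> tokenize(String text);
    }

    /**
     * A "basic" tokenizer: it takes no parameters. Runs of letters and digits
     * become tokens, whitespace is dropped, and any other character is emitted
     * as a single-character token. This is an assumption for illustration,
     * not the rules LingPipe's Indo-European tokenizer applies.
     */
    static final TokenizerFactory BASE = text -> {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (!Character.isWhitespace(c)) {
                    tokens.add(String.valueOf(c));
                }
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    };

    /**
     * A "filtered" tokenizer: it takes another tokenizer as its parameter and
     * returns a new tokenizer whose tokens are the base tokens, lowercased.
     */
    static TokenizerFactory lowerCase(TokenizerFactory base) {
        return text -> {
            List<String> lowered = new ArrayList<>();
            for (String token : base.tokenize(text)) {
                lowered.add(token.toLowerCase(Locale.ENGLISH));
            }
            return lowered;
        };
    }

    public static void main(String[] args) {
        // Compose the filter with the base tokenizer, then tokenize a sample.
        TokenizerFactory filtered = lowerCase(BASE);
        System.out.println(filtered.tokenize("Hello, World!"));
    }
}
```

Because the filter only depends on the interface, the same lowerCase wrapper could be applied to any other base tokenizer, and filters could be stacked on top of one another in the same way.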
You will need to download the JAR file for the book and have Java and Eclipse set up so that you can run the example.