Lucene 4 Cookbook

By: Edwood Ng, Vineeth Mohan

Defining custom tokenizers

Lucene ships with several excellent built-in tokenizers, but you may still need one that behaves slightly differently. In that case, you will have to build a custom Tokenizer. Lucene provides a character-based tokenizer called CharTokenizer that should be suitable for most types of tokenization. You override its isTokenChar method to decide which characters are part of a token and which are treated as delimiters. Note that both LetterTokenizer and WhitespaceTokenizer extend CharTokenizer.
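To make the role of the character predicate concrete, here is a minimal plain-Java sketch of the idea. It does not use Lucene at all: the class CharPredicateSplitter and its tokenize method are hypothetical names invented for this illustration, and the sketch omits the offsets, attributes, and other bookkeeping that a real Tokenizer handles. It only shows how a per-character test can separate token characters from delimiters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in, NOT part of Lucene: it mimics the idea behind
// CharTokenizer, where a per-character predicate decides token membership.
final class CharPredicateSplitter {

    private CharPredicateSplitter() {}

    // Collects maximal runs of code points accepted by isTokenChar.
    // Every rejected code point acts as a delimiter and is discarded.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar.test(c)) {
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // A delimiter ends the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        // Flush a token that runs to the end of the input.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

With this sketch, passing c -> !Character.isSpaceChar(c) splits on space characters only, while passing Character::isLetter keeps only runs of letters. Swapping the predicate is the only change, which is the same design choice that lets different tokenizers share CharTokenizer as a base class.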

How to do it…

In this example, we will create our own tokenizer that splits text on space characters only. It is similar to WhitespaceTokenizer, but simpler. Here is the sample code:

public class MyTokenizer extends CharTokenizer {

    public MyTokenizer(Reader input) {
        super(input);
    }

    public MyTokenizer(AttributeFactory factory, Reader input) {
        super(factory, input)...