A tokenizer is an analysis component, declared with the <tokenizer>
element, that takes text in the form of a character stream and splits it into so-called tokens, most of the time skipping insignificant bits like whitespace and adjoining punctuation. An analyzer has exactly one tokenizer. Your tokenizer choices are as follows; each is illustrated in the configuration sketch after this list:
KeywordTokenizerFactory
: This tokenizer doesn't actually do any tokenization! The entire character stream becomes a single token. The string field type has a similar effect but doesn't allow configuration of text analysis, such as lower-casing. Any field used for sorting, and most uses of faceting, will require an indexed field with no more than one term per original value.

WhitespaceTokenizerFactory
: Text is tokenized by whitespace: spaces, tabs, carriage returns, and line feeds.

StandardTokenizerFactory
: This is a general-purpose tokenizer for most Western languages. It tokenizes on whitespace and other points specified by the Unicode standard's...
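To ground the <tokenizer> element in context, here is a minimal schema.xml sketch showing one analyzer, with exactly one tokenizer, per field type. The field type names used here (text_ws, text_sort, text_general) are illustrative choices, not definitions prescribed by this text:

<!-- WhitespaceTokenizerFactory: splits on spaces, tabs, carriage returns, line feeds -->
<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- KeywordTokenizerFactory: the entire input becomes one token, yet filters
     such as lower-casing can still be configured; one term per original
     value makes this field type suitable for sorting -->
<fieldType name="text_sort" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- StandardTokenizerFactory: general-purpose splitting at Unicode word boundaries -->
<fieldType name="text_general" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

An <analyzer> element with no type attribute, as above, applies to both indexing and querying; separate index-time and query-time analysis chains can be declared with type="index" and type="query".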