Using custom tokenizers
While Flair ships with several tokenizers covering the most commonly spoken languages, it is entirely possible that you will be working with a language whose tokenization rules Flair does not yet cover. Luckily, Flair offers a simple interface that allows us to implement our own tokenizers or plug in third-party libraries.
Using the TokenizerWrapper class
The TokenizerWrapper class provides an easy interface for building custom tokenizers. To build one, you simply need to instantiate the class, passing the tokenizer_func parameter. The parameter is a function that receives the entire sentence text as input and returns a list of token strings.
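To make the contract concrete, here is a minimal sketch in plain Python (no Flair import needed): any callable that maps the full sentence text to a list of token strings qualifies as a tokenizer_func. The whitespace splitter below is purely illustrative and not part of Flair's API:

```python
def whitespace_splitter(text):
    # tokenizer_func contract: full sentence text in,
    # list of token strings out, in reading order.
    return text.split()

print(whitespace_splitter("custom tokenizers are easy"))
# ['custom', 'tokenizers', 'are', 'easy']
```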
As an exercise, let's try to implement a custom tokenizer that splits the text into characters. This tokenizer will treat every character as a token:
from flair.tokenization import TokenizerWrapper

def char_splitter(sentence):
    # Treat every character of the text as its own token
    return list(sentence)

char_tokenizer = TokenizerWrapper(char_splitter)
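Before wiring the function into Flair, you can sanity-check the splitting logic on its own (plain Python, no Flair import required):

```python
def char_splitter(sentence):
    # Every character of the input text becomes one token string
    return list(sentence)

print(char_splitter("Flair"))  # ['F', 'l', 'a', 'i', 'r']
```

Once wrapped in TokenizerWrapper, the resulting tokenizer can be passed to a Sentence via its use_tokenizer parameter, for example Sentence("Flair", use_tokenizer=char_tokenizer).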