We will not evaluate Indo-European tokenizers like the other components of LingPipe, with measures such as precision and recall. Instead, we will develop them with unit tests, because our tokenizers are heuristically constructed and expected to perform perfectly on example data: if a tokenizer fails on a known case, that is a bug, not a reduction in performance. Why is this? There are a few reasons:
Many tokenizers are quite "mechanistic" and are amenable to the rigidity of a unit-test framework. For example, the RegExTokenizerFactory is an obvious candidate for unit tests rather than an evaluation harness.

The heuristic rules that drive most tokenizers are very general, so there is no risk of over-fitting training data at the expense of a deployed system. If you find a known bad case, you can simply fix the tokenizer and add a unit test.
Tokens and whitespace are assumed to be semantically neutral, which means that tokens don't change...