Sequence labeling metrics
When comparing one sequence tagger to another, we can't simply try them out by hand and guess which one performs better. Their performance needs to be evaluated on the same dataset and measured with the same predefined metric. The most common metrics used in sequence labeling are accuracy and the F1 score.
Measuring accuracy for sequence labeling tasks
Accuracy is a measure ranging from 0 to 1 that simply computes the proportion of correctly tagged tokens.
Assuming correctly_tagged_tokens is the number of correctly tagged tokens and all_tokens is the total number of tokens, accuracy can be defined as:

accuracy = correctly_tagged_tokens / all_tokens
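To make this concrete, here is a minimal sketch of token-level accuracy in Python, assuming the gold and predicted tags are available as flat, equal-length lists of strings (the function name token_accuracy and the tag values are illustrative only):

def token_accuracy(gold_tags, predicted_tags):
    # Count tokens whose predicted tag matches the gold tag
    correctly_tagged_tokens = sum(
        1 for gold, pred in zip(gold_tags, predicted_tags) if gold == pred
    )
    all_tokens = len(gold_tags)
    return correctly_tagged_tokens / all_tokens

gold = ["B-PER", "I-PER", "O", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "O", "B-LOC"]
print(token_accuracy(gold, pred))  # 0.8 -- 4 of the 5 tokens are tagged correctly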
The preceding formula is simple and provides an easily interpretable result, but the metric can be misleading when dealing with imbalanced datasets (datasets where classes/tag names are not represented equally). This is particularly noticeable with NER, where the majority of tokens belong to a single class. For example...
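As a hypothetical sketch of this effect (the tag sequence below is made up for illustration): in a typical NER corpus most tokens carry the non-entity O tag, so a tagger that blindly predicts O for every token still scores very high accuracy while never finding a single entity:

gold = ["O"] * 95 + ["B-PER", "I-PER", "O", "B-LOC", "O"]  # 100 tokens, only 3 entity tokens
pred = ["O"] * 100                                         # always predict the majority class

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.97 -- high accuracy, yet not one entity was recognized

This is why accuracy alone is a poor yardstick for such tasks and why metrics such as the F1 score, which account for how well the minority (entity) classes are predicted, are preferred.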