In Chapter 4, Tagging Words and Tokens, we used HMMs and CRFs to apply tags to words/tokens. This recipe addresses the case of creating chunks from taggings that use the Begin, In, and Out (BIO) tags to encode chunkings that can span multiple words/tokens. This, in turn, is the basis of modern named-entity detection systems.
The standard BIO-tagging scheme tags the first token in a chunk of type X as B-X (begin) and every subsequent token in the same chunk as I-X (in). All tokens that are not part of any chunk are tagged O (out). For example, consider the following string, shown with its character offsets:
John Jones Mary and Mr. Jones
01234567890123456789012345678
0         1         2
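This string can be tagged as follows, one token per line (the tags here are written with an underscore, B_PERSON, rather than a hyphen):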
John    B_PERSON
Jones   I_PERSON
Mary    B_PERSON
and     O
Mr      B_PERSON
.       I_PERSON
Jones   I_PERSON
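The corresponding chunks are as follows, each given as a start offset, an exclusive end offset, the covered text, and the chunk type: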
0-10 "John Jones" PERSON
11-15 "Mary" PERSON
20-29 "Mr. Jones" PERSON
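The decoding step from tags to chunks can be sketched in a few lines. The following is a minimal, library-independent illustration in Python, not the implementation used by any particular toolkit. The names `tokenize` and `bio_to_chunks` are hypothetical, the regular-expression tokenizer is an assumption chosen only so that "Mr." splits into "Mr" and "." as in the example above, and the strict handling of an I_ tag that does not continue an open chunk is one possible design choice.

```python
import re
from typing import List, Optional, Tuple

Token = Tuple[str, int, int]  # (text, start offset, exclusive end offset)
Chunk = Tuple[int, int, str]  # (start offset, exclusive end offset, type)


def tokenize(text: str) -> List[Token]:
    """Hypothetical tokenizer: runs of word characters or single punctuation marks."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def bio_to_chunks(tokens: List[Token], tags: List[str]) -> List[Chunk]:
    """Decode one BIO tag per token into character-offset chunks."""
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    chunks: List[Chunk] = []
    start = end = 0
    ctype: Optional[str] = None  # type of the currently open chunk, if any
    for (_, tok_start, tok_end), tag in zip(tokens, tags):
        if ctype is not None and tag == "I_" + ctype:
            end = tok_end  # the token continues the open chunk
            continue
        if ctype is not None:
            # Any other tag closes the open chunk.
            chunks.append((start, end, ctype))
            ctype = None
        if tag.startswith("B_"):
            start, end, ctype = tok_start, tok_end, tag[2:]
        elif tag != "O":
            # Assumption: an I_ tag with no matching open chunk is an error.
            raise ValueError(f"illegal tag {tag!r} at offset {tok_start}")
    if ctype is not None:
        chunks.append((start, end, ctype))  # chunk running to the last token
    return chunks


text = "John Jones Mary and Mr. Jones"
tags = ["B_PERSON", "I_PERSON", "B_PERSON", "O",
        "B_PERSON", "I_PERSON", "I_PERSON"]
for chunk_start, chunk_end, chunk_type in bio_to_chunks(tokenize(text), tags):
    print(f'{chunk_start}-{chunk_end} "{text[chunk_start:chunk_end]}" {chunk_type}')
# 0-10 "John Jones" PERSON
# 11-15 "Mary" PERSON
# 20-29 "Mr. Jones" PERSON
```

Because a chunk's end offset is taken from its last token, the chunk text is recovered by slicing the original string, which keeps the whitespace between "Mr." and "Jones" without the decoder having to know how the tokenizer treated it.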