Natural Language Processing with Java and LingPipe Cookbook

Translating between word tagging and chunks – BIO codec


In Chapter 4, Tagging Words and Tokens, we used HMMs and CRFs to apply tags to words/tokens. This recipe shows how to create chunks from taggings that use Begin, In, and Out (BIO) tags to encode chunks that can span multiple words/tokens. This encoding, in turn, is the basis of modern named-entity detection systems.

Getting ready

The standard BIO-tagging scheme has the first token in a chunk of type X tagged B_X (begin), with all the subsequent tokens in the same chunk tagged I_X (in). All the tokens that are not in chunks are tagged O (out). For example, consider the following string, shown with its character offsets:

John Jones Mary and Mr. Jones
01234567890123456789012345678
0         1         2         

This string can be tagged as:

John   B_PERSON
Jones  I_PERSON
Mary   B_PERSON
and    O
Mr     B_PERSON
.      I_PERSON
Jones  I_PERSON

The corresponding chunks will be:

0-10 "John Jones" PERSON
11-15 "Mary" PERSON
20-29 "Mr. Jones" PERSON
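
The mapping from the tagging to the chunks above can be sketched in plain Java. This is only an illustration of the decoding rule, not LingPipe's own codec: the class names BioDecoder and Chunk are hypothetical, and the token offsets are supplied by hand on the assumption that the tokenizer splits "Mr" and "." into separate tokens, as in the tagging shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

public class BioDecoder {

    /** A chunk covering characters [start, end) with a type. Hypothetical helper class. */
    public static final class Chunk {
        public final int start;
        public final int end;
        public final String type;

        Chunk(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return start + "-" + end + " " + type;
        }
    }

    /**
     * Decodes parallel arrays of token start offsets, token end offsets,
     * and BIO tags (B_X, I_X, O) into chunks.
     */
    public static List<Chunk> decode(int[] starts, int[] ends, String[] tags) {
        List<Chunk> chunks = new ArrayList<>();
        int chunkStart = -1;
        int chunkEnd = -1;
        String type = null; // type of the currently open chunk, or null if none

        for (int i = 0; i < tags.length; i++) {
            String tag = tags[i];

            // I_X continues the open chunk only if the types match.
            if (type != null && tag.equals("I_" + type)) {
                chunkEnd = ends[i];
                continue;
            }

            // Any other tag closes the open chunk, if there is one.
            if (type != null) {
                chunks.add(new Chunk(chunkStart, chunkEnd, type));
                type = null;
            }

            if (tag.startsWith("B_")) {
                // B_X opens a new chunk of type X.
                chunkStart = starts[i];
                chunkEnd = ends[i];
                type = tag.substring(2);
            } else if (!tag.equals("O")) {
                // An I_X without a matching open chunk is an inconsistent tagging.
                throw new IllegalArgumentException("Illegal tag at token " + i + ": " + tag);
            }
        }

        // Close a chunk that runs to the last token.
        if (type != null) {
            chunks.add(new Chunk(chunkStart, chunkEnd, type));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "John Jones Mary and Mr. Jones";
        // Hand-written token offsets for: John, Jones, Mary, and, Mr, ., Jones
        int[] starts = {0, 5, 11, 16, 20, 22, 24};
        int[] ends = {4, 10, 15, 19, 22, 23, 29};
        String[] tags = {"B_PERSON", "I_PERSON", "B_PERSON", "O",
                         "B_PERSON", "I_PERSON", "I_PERSON"};

        for (Chunk c : decode(starts, ends, tags)) {
            System.out.println(c.start + "-" + c.end
                    + " \"" + text.substring(c.start, c.end) + "\" " + c.type);
        }
    }
}
```

Because "Mary" carries a B_PERSON tag rather than I_PERSON, it starts its own chunk even though it directly follows another PERSON chunk. Distinguishing adjacent chunks of the same type is the reason the scheme needs a separate begin tag.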

How to do it…

The program will show...