HMM-based NER


HmmChunker uses an HMM to perform chunking over tokenized character sequences. Instances contain an HMM decoder for the model and a tokenizer factory. The chunker requires the states of the HMM to conform to a token-by-token encoding of a chunking. It uses the tokenizer factory to break the chunks down into sequences of tokens and tags. Refer to the Hidden Markov Models (HMM) – part-of-speech recipe in Chapter 4, Tagging Words and Tokens.
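
To see those pieces together, here is a minimal decoding sketch rather than the book's own listing: it assumes you already have a serialized HiddenMarkovModel on disk (the models/spanishNer.hmm path is a placeholder) and that the model was trained with LingPipe's IndoEuropeanTokenizerFactory:

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.HmmChunker;
import com.aliasi.hmm.HiddenMarkovModel;
import com.aliasi.hmm.HmmDecoder;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class HmmChunkerDecode {
    public static void main(String[] args) throws Exception {
        // Placeholder path to a serialized HiddenMarkovModel.
        File modelFile = new File("models/spanishNer.hmm");

        // Deserialize the model and wrap it in a decoder.
        HiddenMarkovModel hmm =
            (HiddenMarkovModel) AbstractExternalizable.readObject(modelFile);
        HmmDecoder decoder = new HmmDecoder(hmm);

        // The tokenizer factory must match the one used at training time.
        TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmChunker chunker = new HmmChunker(tokenizerFactory, decoder);

        // Chunk a sentence and print each recovered entity span with its type.
        String text = "El Gobierno de España se reunió ayer en Madrid.";
        Chunking chunking = chunker.chunk(text);
        Set<Chunk> chunks = chunking.chunkSet();
        for (Chunk chunk : chunks) {
            String span = text.substring(chunk.start(), chunk.end());
            System.out.println(chunk.type() + ": " + span);
        }
    }
}

Note that the tokenizer factory passed to the constructor has to be the same one used when the model was trained; otherwise, the token sequences will not line up with the HMM states.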

We'll look at training an HmmChunker and using it for the CoNLL 2002 Spanish task. You can and should use your own data, but this recipe assumes that the training data is in the CoNLL 2002 format.
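
For orientation, CoNLL 2002 data is plain text with one token per line followed by its entity tag (B- and I- prefixed PER, LOC, ORG, and MISC, plus O for tokens outside any entity), with blank lines separating sentences. Lines resembling the Spanish training data look like this:

Sao B-LOC
Paulo I-LOC
( O
Brasil B-LOC
) O
, O
23 O
may O
( O
EFECOM B-ORG
) O
. O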

Training is done using an ObjectHandler, which supplies the training instances.
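
The sketch below shows that handler interface in isolation; it is not the recipe's full listing. The trainable CharLmHmmChunker, the HmmCharLmEstimator settings, the single hand-built Chunking, and the output file name are all illustrative stand-ins; in practice every Chunking parsed from the CoNLL 2002 training files would be passed to handle():

import java.io.File;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class TrainHmmChunkerSketch {
    public static void main(String[] args) throws Exception {
        // Character language model settings: n-gram length, alphabet size,
        // and interpolation parameter; these values are illustrative.
        int maxNGram = 8;
        int numChars = 256;
        double lambdaFactor = 8.0;

        TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator estimator =
            new HmmCharLmEstimator(maxNGram, numChars, lambdaFactor);

        // CharLmHmmChunker is the trainable chunker; it implements
        // ObjectHandler<Chunking>, so each training Chunking goes to handle().
        CharLmHmmChunker chunker = new CharLmHmmChunker(tokenizerFactory, estimator);

        // One hand-built training instance; real training would pass every
        // Chunking produced from the CoNLL 2002 files to the same handler.
        String sentence = "El Gobierno de España protestó.";
        ChunkingImpl chunking = new ChunkingImpl(sentence);
        // "España" spans characters 15 to 21 (end exclusive) and is a location.
        chunking.add(ChunkFactory.createChunk(15, 21, "LOC"));
        chunker.handle(chunking);

        // Compile the trained chunker to disk; the file name is a placeholder.
        AbstractExternalizable.compileTo(chunker, new File("spanishNer.HmmChunker"));
    }
}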

Getting ready

As we want to train this chunker, we need to either label some data using the Conference on Computational Natural Language Learning (CoNLL) schema or use data that is already publicly available. For speed, we'll use the corpus made available for the CoNLL 2002 task.

Note

The...