HmmChunker uses an HMM to perform chunking over tokenized character sequences. Instances contain an HMM decoder for the model and a tokenizer factory. The chunker requires the states of the HMM to conform to a token-by-token encoding of a chunking. It uses the tokenizer factory to break the chunks down into sequences of tokens and tags. Refer to the Hidden Markov Models (HMM) – part of speech recipe in Chapter 4, Tagging Words and Tokens.
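To make the decoding side concrete, the following sketch loads a previously compiled HmmChunker and applies it to a sentence. The model path and the sample sentence are assumptions for illustration; the class and method names (AbstractExternalizable.readObject(), Chunker.chunk(), Chunking.chunkSet()) are LingPipe's.

```java
import java.io.File;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class HmmChunkerDecode {

    public static void main(String[] args) throws Exception {
        // Hypothetical path to a compiled HmmChunker model file.
        File modelFile = new File("models/spanishNer.HmmChunker");
        Chunker chunker
            = (Chunker) AbstractExternalizable.readObject(modelFile);

        // Example Spanish sentence; any text may be chunked.
        String text = "El Gobierno de Madrid anuncia nuevas medidas.";
        Chunking chunking = chunker.chunk(text);

        // Each chunk carries character offsets and an entity type.
        for (Chunk chunk : chunking.chunkSet()) {
            String phrase = text.substring(chunk.start(), chunk.end());
            System.out.println(chunk.type() + ": " + phrase);
        }
    }
}
```

Note that the chunker works over character offsets into the original text, so the tokenizer factory compiled into the model determines how those offsets line up with tokens.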
We'll look at training HmmChunker and using it for the CoNLL2002 Spanish task. You can and should use your own data, but this recipe assumes that the training data will be in the CoNLL2002 format.
Training is done using an ObjectHandler, which supplies the training instances.
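As a minimal sketch of that training setup, the trainable subclass CharLmHmmChunker implements ObjectHandler<Chunking>, so each training instance is a Chunking passed to its handle() method. The estimator parameters (n-gram length 8, 256 characters, interpolation 8.0), the hand-built sentence, and the output file name below are illustrative assumptions; in the recipe proper the Chunking instances would come from a CoNLL2002 parser rather than being constructed by hand.

```java
import java.io.File;

import com.aliasi.chunk.CharLmHmmChunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.corpus.ObjectHandler;
import com.aliasi.hmm.HmmCharLmEstimator;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

public class TrainHmmChunker {

    public static void main(String[] args) throws Exception {
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        // Illustrative estimator settings: n-gram 8, 256 chars, 8.0 interpolation.
        HmmCharLmEstimator estimator = new HmmCharLmEstimator(8, 256, 8.0);
        CharLmHmmChunker trainer = new CharLmHmmChunker(factory, estimator);

        // The trainer is an ObjectHandler<Chunking>; handle() supplies
        // one training instance. This Chunking is built by hand here,
        // standing in for instances parsed from CoNLL2002 data.
        ChunkingImpl chunking = new ChunkingImpl("Juan vive en Madrid .");
        chunking.add(ChunkFactory.createChunk(0, 4, "PER"));
        chunking.add(ChunkFactory.createChunk(13, 19, "LOC"));
        ObjectHandler<Chunking> handler = trainer;
        handler.handle(chunking);

        // Compile the trained model to disk for later decoding.
        AbstractExternalizable.compileTo(trainer,
                                        new File("spanishNer.HmmChunker"));
    }
}
```

With real data you would repeat handle() over every chunking the parser produces before compiling the model.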