At the end of the previous chapter, Chapter 4, Part-of-speech Tagging, we introduced NLTK-Trainer and the train_tagger.py script. In this recipe, we will cover the script for training chunkers: train_chunker.py.
Note
You can find NLTK-Trainer at https://github.com/japerk/nltk-trainer and the online documentation at http://nltk-trainer.readthedocs.org/.
As with train_tagger.py, the only required argument to train_chunker.py is the name of a corpus. In this case, we need a corpus that provides a chunked_sents() method, such as treebank_chunk. Here's an example of running train_chunker.py on treebank_chunk:
$ python train_chunker.py treebank_chunk
loading treebank_chunk
4009 chunks, training on 4009
training ub TagChunker
evaluating TagChunker
ChunkParse score:
    IOB Accuracy: 97.0%
    Precision: 90.8%
    Recall: 93.9%
    F-Measure: 92.3%
dumping TagChunker to /Users/jacob/nltk_data/chunkers/treebank_chunk_ub.pickle
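To see how the reported scores relate to each other, here is a minimal sketch that recomputes the F-Measure from the precision and recall shown in the output. It assumes the score is the balanced harmonic mean of the two (equal weight on precision and recall); the helper name f_measure is hypothetical and is not part of train_chunker.py.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall.

    Assumption: precision and recall are weighted equally.
    Both arguments are fractions in the range 0.0 to 1.0.
    """
    if precision + recall == 0:
        # Avoid dividing by zero when nothing was predicted correctly.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the train_chunker.py output above.
score = f_measure(0.908, 0.939)
print("F-Measure: {:.1f}%".format(score * 100))
```

Because the harmonic mean is pulled toward the lower of the two values, the F-Measure sits between precision and recall but closer to precision.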
Just...