Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Training a tagger-based chunker


Training a chunker can be a great alternative to manually specifying regular expression chunk patterns. Instead of a painstaking process of trial and error to get the patterns exactly right, we can use existing corpus data to train chunkers, much like we did for part-of-speech tagging in the previous chapter.

How to do it...

As with part-of-speech tagging, we'll use the treebank corpus data for training. This time, however, we'll use the treebank_chunk corpus, which is specifically formatted to produce chunked sentences in the form of trees. The sentences returned by its chunked_sents() method will be used by a TagChunker class to train a tagger-based chunker. The TagChunker class uses a helper function, conll_tag_chunks(), to extract lists of (pos, iob) tuples, one list per sentence, from a list of Trees. These (pos, iob) tuples are then used to train a tagger in the same way (word, pos) tuples were used in Chapter 4, Part-of-speech Tagging, to train part-of-speech taggers. But instead of learning part-of...
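The idea described above can be sketched roughly as follows. This is a minimal illustration, not the book's listing: the names conll_tag_chunks() and TagChunker come from the text, but the function bodies, the explicit UnigramTagger/BigramTagger backoff chain, and the two hand-built training trees are assumptions made so the example runs without downloading the treebank_chunk corpus. With the corpus installed, the trees returned by treebank_chunk.chunked_sents() would presumably take the place of train_trees.

```python
from nltk.tree import Tree
from nltk.chunk import ChunkParserI, tree2conlltags, conlltags2tree
from nltk.tag import UnigramTagger, BigramTagger


def conll_tag_chunks(chunk_sents):
    """Convert chunk Trees into sentences of (pos, iob) tuples."""
    # tree2conlltags yields (word, pos, iob) triples; the word is dropped
    # so that the POS tag plays the role of the "word" being tagged.
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(pos, iob) for (_word, pos, iob) in sent] for sent in tagged_sents]


class TagChunker(ChunkParserI):
    """Chunker that tags POS sequences with IOB chunk tags (assumed design)."""

    def __init__(self, train_chunks):
        train_sents = conll_tag_chunks(train_chunks)
        # Assumption: a bigram tagger backing off to a unigram tagger.
        unigram = UnigramTagger(train_sents)
        self.tagger = BigramTagger(train_sents, backoff=unigram)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None
        words, tags = zip(*tagged_sent)
        # The tagger returns (pos, iob) pairs, one per input POS tag.
        chunks = self.tagger.tag(tags)
        triples = [(w, t, c) for w, (t, c) in zip(words, chunks)]
        return conlltags2tree(triples)


# Hypothetical toy training data standing in for treebank_chunk.chunked_sents().
train_trees = [
    Tree('S', [Tree('NP', [('the', 'DT'), ('cat', 'NN')]), ('sat', 'VBD')]),
    Tree('S', [Tree('NP', [('a', 'DT'), ('dog', 'NN')]), ('barked', 'VBD')]),
]

chunker = TagChunker(train_trees)
tree = chunker.parse([('the', 'DT'), ('bird', 'NN'), ('sang', 'VBD')])
print(tree)
```

Because the tagger only ever sees part-of-speech tags, the unseen words "bird" and "sang" do not matter here; the DT NN sequence is grouped into an NP chunk based on the tag patterns observed in training.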