Some of the included corpora contain parsed sentences, which are often deep trees of nested phrases. Unfortunately, these trees are too deep to use for training a chunker, since IOB tag parsing is not designed for nested chunks. To make these trees usable for chunker training, we must flatten them.
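To make the limitation concrete, here is a minimal sketch using small hand-written trees rather than corpus data. It assumes NLTK's tree2conlltags function from nltk.chunk, which converts a chunk tree into (word, tag, IOB tag) triples; the sentences and variable names are illustrative only.

```python
from nltk.tree import Tree
from nltk.chunk import tree2conlltags

# A flat chunk tree: chunks sit directly under S and hold (word, tag) leaves.
flat = Tree('S', [
    Tree('NP', [('the', 'DT'), ('dog', 'NN')]),
    ('barked', 'VBD'),
])
flat_tags = tree2conlltags(flat)
print(flat_tags)

# A nested parse tree (hand-written): each tag is its own preterminal node,
# so every phrase contains further subtrees instead of (word, tag) leaves.
nested = Tree.fromstring('(S (NP (DT the) (NN dog)) (VP (VBD barked)))')
try:
    tree2conlltags(nested)
    nested_ok = True
except ValueError as err:
    nested_ok = False
    print('cannot convert nested tree:', err)
```

The flat tree converts cleanly because each chunk is one level deep, while the nested tree is rejected, which is why flattening is needed before chunker training.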
We're going to use the first parsed sentence of the treebank corpus as our example. Here's a diagram showing how deeply nested this tree is:
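Nesting depth can also be checked in code. The sketch below uses a hand-written stand-in tree so that it runs without corpus data; with the treebank data installed, treebank.parsed_sents()[0] from nltk.corpus should give the actual example sentence. The stand-in sentence and variable names are assumptions made for illustration.

```python
from nltk.tree import Tree

# Hand-written stand-in for a parsed sentence (not the real treebank sentence).
tree = Tree.fromstring(
    '(S (NP (DT the) (NN dog))'
    ' (VP (VBD barked) (PP (IN at) (NP (DT the) (NN cat)))))'
)

# height() counts levels from this node down to the deepest leaf, inclusive.
depth = tree.height()

# Height of each immediate child phrase, to see where the nesting comes from.
sub_heights = [(sub.label(), sub.height()) for sub in tree]

print(depth)
print(sub_heights)
```

A subtree of height 2 is a preterminal (a tag directly over a word), so anything taller than 3 contains phrases nested inside phrases.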
You may notice that the part-of-speech tags are part of the tree structure instead of being included with the word. This will be handled later using the Tree.pos() method, which was designed specifically for combining words with preterminal Tree labels such as part-of-speech tags.
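As a quick illustration of what Tree.pos() returns, here is a short sketch on a hand-written tree (an assumption for the example, not the treebank sentence). It contrasts leaves(), which yields bare words, with pos(), which pairs each word with the label of its preterminal node.

```python
from nltk.tree import Tree

# Hand-written tree in which each part-of-speech tag is a preterminal node.
tree = Tree.fromstring('(S (NP (DT the) (NN dog)) (VP (VBD barked)))')

# leaves() returns only the words.
words = tree.leaves()

# pos() pairs each word with its preterminal label.
tagged = tree.pos()

print(words)
print(tagged)
```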