You can train your own named entity chunker using the ieer corpus, which stands for Information Extraction: Entity Recognition. It takes a bit of extra work, though, because the ieer corpus has chunk trees but no part-of-speech tags for words.
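To make the chunk-tree side of this concrete, here is a minimal sketch in plain Python of how (word, entity label) pairs can be turned into IOB chunk tags. The sample pairs, the `ROOT` label, and the `to_iob` helper are hypothetical names used only for illustration. The pairs are shaped like what `tree.pos()` returns on a chunk tree, where words outside any entity carry the root label. Part-of-speech tags are not produced here and would still have to come from a separate tagger.

```python
# Hypothetical (word, label) pairs, shaped like the output of tree.pos()
# on a chunk tree whose root label is 'DOCUMENT'. Words that are not
# inside any entity subtree carry the root label.
ROOT = 'DOCUMENT'
pairs = [
    ('Mr.', 'PERSON'), ('Smith', 'PERSON'), ('visited', ROOT),
    ('New', 'LOCATION'), ('York', 'LOCATION'), ('yesterday', ROOT),
]


def to_iob(pairs, root_label):
    """Map each (word, label) pair to an IOB chunk tag."""
    iobs = []
    prev = None  # label of the entity we are currently inside, if any
    for _, label in pairs:
        if label == root_label:
            # Word is outside any entity.
            iobs.append('O')
            prev = None
        elif label == prev:
            # Same label as the previous word: continue the entity.
            iobs.append('I-%s' % label)
        else:
            # A new entity begins here.
            iobs.append('B-%s' % label)
            prev = label
    return iobs


iob_tags = to_iob(pairs, ROOT)
print(list(zip([word for word, _ in pairs], iob_tags)))
```

Because only the labels are compared, two adjacent entities with the same label would be tagged as a single chunk by this logic.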
Using the ieertree2conlltags() and ieer_chunked_sents() functions in chunkers.py, we can create named entity chunk trees from the ieer corpus to train the ClassifierChunker class created in the Classification-based chunking recipe:
import nltk.tag
from nltk.chunk.util import conlltags2tree
from nltk.corpus import ieer

def ieertree2conlltags(tree, tag=nltk.tag.pos_tag):
    words, ents = zip(*tree.pos())
    iobs = []
    prev = None
    for ent in ents:
        if ent == tree.label():
            iobs.append('O')
            prev = None
        elif prev == ent:
            iobs.append('I-%s' % ent)
        else:
            iobs.append('B-%s' % ent)
            prev = ent
    words, tags = zip(*tag(words))
    return zip(words, tags, iobs)

def ieer_chunked_sents...