As previously mentioned in the Training a unigram part-of-speech tagger recipe, you should only use a custom model with the UnigramTagger class if you know exactly what you're doing. In this recipe, we're going to create a model for the most common words, most of which always have the same tag no matter what.
To find the most common words, we can use nltk.probability.FreqDist to count word frequencies in the treebank corpus. Then, we can create a ConditionalFreqDist instance for the tagged words, which counts the frequency of every tag for every word. Using these counts, we can construct a model whose keys are the 200 most frequent words, with the most frequent tag for each word as the value. Here's the model creation function defined in tag_util.py:
```python
from nltk.probability import FreqDist, ConditionalFreqDist

def word_tag_model(words, tagged_words, limit=200):
    # Count how often each word occurs.
    fd = FreqDist(words)
    # For each word, count how often each tag occurs with it.
    cfd = ConditionalFreqDist(tagged_words)
    # Take the `limit` most frequent words.
    most_freq = (word for word, count in fd.most_common(limit))
    # Map each frequent word to its single most frequent tag.
    return dict((word, cfd[word].max()) for word in most_freq)
```
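To show how the pieces fit together without requiring the treebank corpus to be downloaded, here is a small sketch that builds a model from a toy tagged corpus (the sample sentences are illustrative assumptions, not treebank data) and passes it to a UnigramTagger via its model keyword argument:

```python
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.tag import UnigramTagger

# Toy tagged corpus standing in for treebank.tagged_words() (assumed data).
tagged_words = [
    ('the', 'DT'), ('dog', 'NN'), ('the', 'DT'), ('barks', 'VBZ'),
    ('a', 'DT'), ('dog', 'NN'), ('the', 'DT'), ('cat', 'NN'),
]
words = [word for word, tag in tagged_words]

def word_tag_model(words, tagged_words, limit=200):
    fd = FreqDist(words)
    cfd = ConditionalFreqDist(tagged_words)
    most_freq = (word for word, count in fd.most_common(limit))
    return dict((word, cfd[word].max()) for word in most_freq)

# A limit large enough to cover every word in this tiny corpus.
model = word_tag_model(words, tagged_words, limit=10)
tagger = UnigramTagger(model=model)
print(tagger.tag(['the', 'dog', 'barks']))
# → [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```

With the real treebank corpus you would pass treebank.words() and treebank.tagged_words() instead, and the default limit of 200 would select only the most frequent words.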