The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let's filter out stopwords and punctuation:
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]
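The filtering step can be tried in isolation with a minimal sketch like the one below. The token list and stopword set here are small hypothetical stand-ins, not the NLTK corpora, and filter_tokens is an illustrative helper name rather than part of NLTK:

```python
import string

# Hypothetical stand-ins for the play's tokens and the stopword set;
# the real values would come from the NLTK corpora.
words = ["Caesar", ",", "the", "noble", "Brutus", "!", "The", "Senate"]
sw = {"the", "a", "and", "of"}

punctuation = set(string.punctuation)


def filter_tokens(tokens, stopwords, punct):
    """Lowercase tokens, then drop stopwords and single punctuation marks."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in stopwords and t not in punct]


filtered = filter_tokens(words, sw, punctuation)
print(filtered)
```

Lowercasing each token once, before the membership tests, avoids calling lower() three times per word as the one-line comprehension does. Note that set(string.punctuation) only matches single-character tokens, so multi-character tokens such as '--' would pass through.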
Create a FreqDist object and print associated keys and values with highest frequency:
fd = nltk.FreqDist(filtered)
print "Words", fd.keys()[:5]
print "Counts", fd.values()[:5]
The keys and values are printed as follows:
Words ['d', 'caesar', 'brutus', 'bru', 'haue']
Counts [215, 190, 161, 153, 148]
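Slicing keys() and values() relies on the FreqDist returning its entries sorted by frequency. In NLTK 3, FreqDist is, to our knowledge, a subclass of collections.Counter, whose keys are not sorted that way, so most_common() is the safer call there. The following sketch uses Counter directly on a small hypothetical token list, so it runs without NLTK or the corpus:

```python
from collections import Counter

# Hypothetical filtered tokens; the real list comes from the play.
filtered = ["caesar", "brutus", "caesar", "d", "brutus", "caesar", "rome"]

counts = Counter(filtered)

# most_common(n) returns (word, count) pairs, highest count first.
top = counts.most_common(2)
print("Words", [word for word, _ in top])
print("Counts", [count for _, count in top])
```

Assuming the NLTK 3 behavior described above, the same most_common() call should work unchanged on a FreqDist object.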
The first word in this list is of course not an English word, so we may need to add the heuristic that words have a minimum of two characters.
The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the most frequent word and its count:
print "Max", fd.max()
print "Count", fd['d...