Book Image

Python Data Analysis

By : Ivan Idris
Book Image

Python Data Analysis

By: Ivan Idris

Overview of this book

Table of Contents (22 chapters)
Python Data Analysis
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Key Concepts
Online Resources
Index

Analyzing word frequencies


The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let's filter out stopwords and punctuation:

punctuation = set(string.punctuation)
filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]

Create a FreqDist object and print associated keys and values with highest frequency:

fd = nltk.FreqDist(filtered)
print "Words", fd.keys()[:5]
print "Counts", fd.values()[:5]

The keys and values are printed as follows:

Words ['d', 'caesar', 'brutus', 'bru', 'haue']
Counts [215, 190, 161, 153, 148]

The first word in this list is of course not an English word, so we may need to add the heuristic that words have a minimum of two characters. The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the word with the most frequent word and the related count:

print "Max", fd.max()
print "Count", fd['d...