
Mastering Python for Data Science

By: Samir Madhavan


Preprocessing data


We'll use reviews of Mad Max: Fury Road published online by the BBC, Forbes, The Guardian, and Movie Pilot.

We'll make extensive use of Python's Natural Language Toolkit (NLTK) package for text mining in this chapter. You can install it by following the instructions at http://www.nltk.org/install.html

We'll perform the following actions on the data:

  • Removing punctuation

  • Removing numbers

  • Converting text to lowercase

  • Removing stop words, the most common words in the English language, such as "be", "the", and "on"
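
As a rough sketch, the four steps listed above can be combined into one small function. This version uses only the standard library so that it runs without any downloads. The function name `preprocess`, the sample sentence, and the tiny `STOP_WORDS` set are hypothetical and exist only for illustration; NLTK provides a much fuller English stop word list.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# NLTK ships a much fuller English list.
STOP_WORDS = {"be", "the", "on", "a", "an", "and", "of", "is", "in", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Return the list of cleaned tokens for a piece of text."""
    # Remove punctuation by replacing each punctuation character with a space
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)
    # Remove numbers
    text = re.sub(r"\d+", " ", text)
    # Convert text to lowercase
    text = text.lower()
    # Split on whitespace and drop stop words
    return [word for word in text.split() if word not in stop_words]


# Made-up sample sentence, not taken from any of the reviews
sample = "The 2015 film Mad Max: Fury Road is a wild ride on the road!"
tokens = preprocess(sample)
print(tokens)
```

Replacing punctuation with a space rather than deleting it keeps words such as "Max:Fury" from being glued together when the source text omits a space.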

Let's start by loading the data:

>>> data = {}
>>> # Read each review into the dictionary, keyed by its source
>>> for source in ['bbc', 'forbes', 'guardian', 'moviepilot']:
...     # The with block closes each file once it has been read
...     with open('./Data/madmax_review/' + source + '.txt', 'r') as f:
...         data[source] = f.read()
...

>>> # We'll convert the...