We'll use the reviews of Mad Max: Fury Road from the online portals of BBC, Forbes, Guardian, and Movie Pilot.
We'll use Python's Natural Language Toolkit (NLTK) package extensively in this chapter for text mining. You can install it by following the instructions at http://www.nltk.org/install.html.
We'll perform the following preprocessing actions on the data:
Removing punctuation
Removing numbers
Converting text to lowercase
Removing the most common words in the English language, called stop words, such as be, the, on, and so on.
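The four cleaning steps listed above can be sketched as a single function. This is only a minimal illustration using the Python standard library, not the chapter's own implementation: the function name clean_text and the tiny STOP_WORDS set are hypothetical, and in practice the fuller English list from NLTK's stopwords corpus (nltk.corpus.stopwords.words('english')) would take the place of the hard-coded set.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list for illustration only;
# NLTK's stopwords corpus provides a much fuller English list.
STOP_WORDS = {"a", "an", "and", "be", "in", "is", "of", "on", "the", "to"}


def clean_text(text, stop_words=STOP_WORDS):
    """Strip punctuation and digits, lowercase the text, and drop stop words."""
    # Replace every punctuation character with a space so that words
    # joined by punctuation (e.g. "road,and") stay separate.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Remove runs of digits.
    text = re.sub(r"\d+", " ", text)
    # Convert to lowercase so "The" and "the" are treated alike.
    text = text.lower()
    # Split on whitespace and keep only the words that are not stop words.
    return [token for token in text.split() if token not in stop_words]


sample = "The 2 cars raced on the Fury Road, and 1 of them won!"
print(clean_text(sample))
# ['cars', 'raced', 'fury', 'road', 'them', 'won']
```

Replacing punctuation with spaces rather than deleting it is a design choice made here to avoid gluing neighboring words together; it does split contractions such as "don't" into two tokens.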
Let's start by loading the data first:
>>> data = {}
>>> data['bbc'] = open('./Data/madmax_review/bbc.txt','r').read()
>>> data['forbes'] = open('./Data/madmax_review/forbes.txt','r').read()
>>> data['guardian'] = open('./Data/madmax_review/guardian.txt','r').read()
>>> data['moviepilot'] = open('./Data/madmax_review/moviepilot.txt','r').read()
>>> # We'll convert the...