Hadoop can be downloaded and installed from https://hadoop.apache.org/. We'll use the Hadoop Streaming API to run our Python MapReduce programs in Hadoop. Hadoop Streaming lets any program that reads from standard input and writes to standard output act as the mapper or reducer of a MapReduce job.
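The data flow that Hadoop Streaming provides can be approximated locally to build intuition. The following is a minimal sketch, not Hadoop's implementation: it assumes the default convention that each output line is a key and a value separated by a tab, and that mapper output reaches the reducer sorted by key. The names simulate_streaming, length_mapper, and count_reducer are hypothetical and exist only for this illustration.

```python
from itertools import groupby


def key_of(kv_line):
    # The key is everything before the first tab.
    return kv_line.split("\t", 1)[0]


def simulate_streaming(mapper, reducer, lines):
    """Emulate map -> sort -> reduce locally on an iterable of text lines."""
    # Map phase: each input line yields zero or more "key\tvalue" lines.
    mapped = [out for line in lines for out in mapper(line)]
    # Sort phase: the reducer sees mapper output ordered by key.
    mapped.sort(key=key_of)
    # Reduce phase: consecutive lines sharing a key are processed together.
    results = []
    for key, group in groupby(mapped, key=key_of):
        values = [kv.split("\t", 1)[1] for kv in group]
        results.append(reducer(key, values))
    return results


def length_mapper(line):
    # Hypothetical mapper: the key is the length of the stripped line.
    yield "%d\t1" % len(line.strip())


def count_reducer(key, values):
    # Hypothetical reducer: sum the counts emitted for one key.
    return "%s\t%d" % (key, sum(int(v) for v in values))


sample = ["good", "bad", "nice", "ok!"]
print(simulate_streaming(length_mapper, count_reducer, sample))
```

In a real streaming job the reducer is a separate program that reads the sorted lines from standard input and must detect key boundaries itself; the grouping here is done in one place only to keep the sketch short.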
We'll write three MapReduce programs in Python:
A basic word count
Getting the sentiment score of each review
Getting the overall sentiment score from all the reviews
We'll start with the word count MapReduce program. Save the following code in a file named word_mapper.py:
import sys

for l in sys.stdin:
    # Remove leading and trailing whitespace
    l = l.strip()
    # Split the line into words
    word_tokens = l.split()
    # Output a tab-separated key-value pair for each word
    for w in word_tokens:
        print('%s\t%s' % (w, 1))
In the preceding mapper code, leading and trailing whitespace is stripped from each line of the file. The line is then split into word tokens and then...
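To check what the mapper emits without a Hadoop cluster, the same logic can be wrapped in a function and fed an in-memory stream. This is a minimal sketch: the function name map_words and the sample text are hypothetical, and io.StringIO stands in for sys.stdin.

```python
import io


def map_words(stream):
    """Yield one 'word<TAB>1' line per whitespace-separated token."""
    for line in stream:
        # strip() removes surrounding whitespace; split() breaks on any run of whitespace.
        for word in line.strip().split():
            yield "%s\t%s" % (word, 1)


# Hypothetical sample input standing in for the lines Hadoop would pipe to stdin.
sample = io.StringIO(u"the food was good\n  the service was slow  \n")
for pair in map_words(sample):
    print(pair)
```

Each word appears on its own line followed by a tab and the count 1, and repeated words such as "the" are emitted once per occurrence; the mapper does no aggregation of its own.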