Pig is a platform with an expressive language for data transformation and querying. Pig code is written as a script, which is compiled into MapReduce programs that execute on Hadoop. The following image is the logo of Pig Latin:
Pig hides much of the complexity of writing raw MapReduce programs and lets the user express transformations quickly.
Pig Latin is Pig's textual language; its reference manual is available at http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html.
We'll cover how to find the 10 most frequently occurring words with Pig, and then we'll see how to create a function in Python that can be used in Pig.
Let's start with the word count. Here is the Pig Latin code, which you can save in the pig_wordcount.py file:
data = load '/tmp/moby_dick/';
word_token = foreach data generate flatten(TOKENIZE((chararray)$0)) as word;
group_word_token = group word_token by word;
count_word_token = foreach group_word_token generate...
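To make the steps of the script concrete (tokenize each line, group identical words, count each group, keep the top 10), the same logic can be sketched in plain Python. This is only an analogue for illustration, not Pig itself: the names tokenize and top_words are hypothetical, and the delimiter set is an assumption that approximates the separators used by Pig's TOKENIZE.

```python
import re
from collections import Counter

# Assumption: approximates TOKENIZE's separators
# (whitespace, double quote, comma, parentheses, asterisk).
_DELIMS = re.compile(r'[\s",()*]+')


def tokenize(line):
    """Split one line into words, dropping empty tokens."""
    return [token for token in _DELIMS.split(line) if token]


def top_words(lines, n=10):
    """Count every word across all lines and return the n most common."""
    counts = Counter()
    for line in lines:
        # Equivalent in spirit to: flatten(TOKENIZE(...)), then group and count
        counts.update(tokenize(line))
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample input standing in for the files under /tmp/moby_dick/
    sample = ["call me ishmael", "call me, call (me)"]
    for word, count in top_words(sample):
        print(word, count)
```

On a single machine a Counter does the grouping and counting in memory; Pig expresses the same grouping declaratively so that it can be distributed across a Hadoop cluster.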