Mastering Python for Data Science

By: Samir Madhavan

Pig


Pig is a platform with an expressive language for data transformation and querying. Pig code is written as a script, which is compiled into MapReduce programs that execute on Hadoop. The following image shows the Pig logo:

The Pig logo

Pig reduces the complexity of writing raw MapReduce programs and enables the user to perform transformations quickly.
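To give a sense of what a raw MapReduce program has to spell out, here is a minimal sketch of the map, shuffle, and reduce phases of a word count in plain Python. It is only an illustration of the programming model: it does not use Hadoop or any Hadoop API, and the function names and the sample input are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle_phase(pairs):
    """Group the emitted values by key, the step a framework runs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical two-line input, used only for illustration.
sample_lines = ["call me ishmael", "call me"]
counts = reduce_phase(shuffle_phase(map_phase(sample_lines)))
print(counts)
```

In Pig, each of these phases is expressed as a single statement, and the compiler generates the MapReduce jobs.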

Pig Latin is the textual language in which Pig scripts are written. You can learn it from the reference at http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html.

We'll cover how to find the 10 most frequently occurring words with Pig, and then we'll see how you can create a function in Python that can be used in Pig.
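As a point of comparison for the top-10 task, the same dataflow (tokenize, group, count, order, limit) can be sketched in plain Python on a single machine. This is not the Pig implementation. The `top_words` helper is a hypothetical name, and it splits on whitespace only, which is likely simpler than Pig's own tokenizer.

```python
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    # Tokenize every line and flatten the result into one stream of words.
    tokens = (word for line in lines for word in line.split())
    # Group identical words and count the size of each group.
    counts = Counter(tokens)
    # Order by count, highest first, and keep the first n entries.
    return counts.most_common(n)


# Hypothetical input, used only for illustration.
sample = ["a b a", "c a b"]
print(top_words(sample, n=2))
```

Unlike this sketch, which holds all counts in memory, the Pig script runs as MapReduce jobs on Hadoop.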

Let's start with the word count. Here is the Pig Latin code, which you can save in the pig_wordcount.py file:

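-- Load the input text; with the default loader, each line is one record
data = load '/tmp/moby_dick/';
-- Split the first field of each record into words, one word per row
word_token = foreach data generate flatten(TOKENIZE((chararray)$0)) as word;
-- Collect identical words into groups
group_word_token = group word_token by word;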
count_word_token = foreach group_word_token generate...