Book Image

Learning Hadoop 2

Book Image

Learning Hadoop 2

Overview of this book

Table of Contents (18 chapters)
Learning Hadoop 2
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Fundamentals of Apache Pig


The primary interface to program Apache Pig is Pig Latin, a procedural language that implements ideas of the dataflow paradigm.

Pig Latin programs are generally organized as follows:

  • A LOAD statement reads data from HDFS

  • A series of statements aggregates and manipulates data

  • A STORE statement writes output to the filesystem

  • Alternatively, a DUMP statement displays the output to the terminal

The following example shows a sequence of statements that outputs the top 10 hashtags ordered by the frequency, extracted from the dataset of tweets:

tweets = LOAD 'tweets.json' 
  USING JsonLoader('created_at:chararray, 
    id:long, 
    id_str:chararray, 
    text:chararray');

hashtags = FOREACH tweets {
  GENERATE FLATTEN(
    REGEX_EXTRACT(
      text, 
      '(?:\\s|\\A|^)[##]+([A-Za-z0-9-_]+)', 1)
    ) as tag;
}

hashtags_grpd = GROUP hashtags BY tag;
hashtags_count = FOREACH hashtags_grpd {
  GENERATE 
    group, 
    COUNT(hashtags) as occurrencies; 
}
hashtags_count_sorted...