Learning Hadoop 2


In this chapter, we introduced Apache Pig, a platform for large-scale data analysis on Hadoop. In particular, we covered the following topics:

  • The goals of Pig as a way of providing a dataflow-like abstraction that does not require hands-on MapReduce development

  • How Pig's approach to processing data compares to SQL, where Pig is procedural while SQL is declarative

  • How to get started with Pig, an easy task because it is a library that generates custom code and does not require additional services

  • An overview of the data types, core functions, and extension mechanisms provided by Pig

  • Detailed examples of applying Pig to analyze the Twitter dataset, which demonstrated its ability to express complex concepts very concisely

  • How libraries such as Piggybank, Elephant Bird, and DataFu provide repositories for numerous useful prewritten Pig functions

In the next chapter, we will revisit the SQL comparison by exploring tools that expose a SQL-like abstraction over data stored in HDFS.