Learning Hadoop 2

Apache Crunch


Apache Crunch (http://crunch.apache.org) is a Java and Scala library for creating pipelines of MapReduce jobs. It is based on Google's FlumeJava (http://dl.acm.org/citation.cfm?id=1806638) paper and library. The project's goal is to make writing MapReduce jobs as straightforward as possible for anybody familiar with the Java programming language, by exposing a number of patterns that implement operations such as aggregating, joining, filtering, and sorting records.
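
To make the programming model more concrete, the following is a minimal, in-memory sketch of the idea: an immutable collection on which filtering and aggregating are applied through user-defined functions, each step returning a new collection. It uses only the Java standard library and is not the Crunch API; `ToyCollection`, `ToyWordCount`, and all method names are hypothetical and chosen only for illustration. The sketch also runs eagerly on a single machine, whereas Crunch compiles a pipeline into MapReduce jobs before running it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical, in-memory stand-in for an immutable collection of records.
// This is an illustration only, NOT a Crunch class.
final class ToyCollection<T> {
    private final List<T> items;

    private ToyCollection(List<T> items) {
        this.items = Collections.unmodifiableList(items);
    }

    // Copy the input so later changes to the caller's list cannot leak in.
    static <T> ToyCollection<T> of(List<T> items) {
        return new ToyCollection<>(new ArrayList<>(items));
    }

    // Apply a user-defined function that emits zero or more outputs per input.
    // The receiver is left untouched; a new collection is returned.
    <R> ToyCollection<R> parallelDo(Function<T, List<R>> fn) {
        List<R> out = new ArrayList<>();
        for (T item : items) {
            out.addAll(fn.apply(item));
        }
        return new ToyCollection<>(out);
    }

    // Keep only the records accepted by the user-defined predicate.
    ToyCollection<T> filter(Predicate<T> keep) {
        List<T> out = new ArrayList<>();
        for (T item : items) {
            if (keep.test(item)) {
                out.add(item);
            }
        }
        return new ToyCollection<>(out);
    }

    // Aggregate: count occurrences of each distinct record,
    // in order of first appearance.
    Map<T, Long> count() {
        Map<T, Long> counts = new LinkedHashMap<>();
        for (T item : items) {
            counts.merge(item, 1L, Long::sum);
        }
        return counts;
    }
}

final class ToyWordCount {
    // Word count expressed as a chain of operations on immutable collections.
    static Map<String, Long> run(List<String> lines) {
        return ToyCollection.of(lines)
                .parallelDo(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        // Prints {hadoop=2, and=1, crunch=2, on=1}
        System.out.println(run(Arrays.asList("Hadoop and Crunch", "crunch on hadoop")));
    }
}
```

The point of the sketch is the shape of the code: the word count reads as a sequence of transformations on collections rather than as explicit map and reduce classes.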

Like tools such as Pig, Crunch builds pipelines by composing immutable, distributed data structures and running all processing operations on those structures; the operations themselves are expressed and implemented as user-defined functions. Pipelines are compiled into a DAG of MapReduce jobs, whose execution is managed by the library's planner. Crunch allows us to write iterative code and abstracts away the complexity of thinking in terms of map and reduce operations, while at the same time avoiding the need of an ad...