Learning Hadoop 2

Apache Spark


Apache Spark (https://spark.apache.org/) is a data processing framework based on a generalization of MapReduce. It was originally developed by the AMPLab at UC Berkeley (https://amplab.cs.berkeley.edu/). Like Tez, Spark acts as an execution engine that models data transformations as DAGs and strives to eliminate the I/O overhead of MapReduce in order to perform iterative computation at scale. While Tez's main goal was to provide a faster execution engine for MapReduce on Hadoop, Spark has been designed both as a standalone framework and as an API for application development. The system is designed to perform general-purpose in-memory data processing and streaming workflows, as well as interactive and iterative computation.
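To make the idea of modeling transformations as a DAG and keeping intermediate results in memory more concrete, the following is a minimal sketch in plain Python. It is a toy model and not Spark's API: the `Dataset` class, its `map`, `filter`, `cache`, and `collect` methods, and the `computations` counter are all hypothetical names introduced here for illustration. The sketch assumes a single process and data that fits in a list.

```python
# Toy model only: this is NOT Spark's API. Every name below is hypothetical.

class Dataset:
    """A node in a DAG of lazily evaluated transformations."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # input rows, set only on a root node
        self._parent = parent    # upstream node in the DAG
        self._fn = fn            # transformation applied to the parent's rows
        self._keep = False       # whether to hold the result in memory
        self._cached = None      # materialized rows, once computed and kept
        self.computations = 0    # how many times this node was evaluated

    def map(self, f):
        # Record the transformation; nothing is computed yet.
        return Dataset(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        # Record the transformation; nothing is computed yet.
        return Dataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        # Ask for this node's result to be kept in memory after first use.
        self._keep = True
        return self

    def collect(self):
        # Walk the DAG upstream, reusing an in-memory result when available.
        if self._cached is not None:
            return self._cached
        self.computations += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._fn(self._parent.collect())
        if self._keep:
            self._cached = rows
        return rows


base = Dataset(source=range(10))
squares = base.map(lambda x: x * x).cache()    # intermediate result kept in memory
evens = squares.filter(lambda x: x % 2 == 0)   # downstream step, recomputed each pass

total = 0
for _ in range(3):                             # an "iterative" job reusing squares
    total += sum(evens.collect())

print(total)                 # 360
print(squares.computations)  # 1: later iterations read the in-memory copy
```

In this sketch, the three iterations evaluate `squares` only once, which is the kind of repeated work that an in-memory engine tries to avoid; a chain of MapReduce jobs would typically write and re-read such an intermediate result from disk between steps.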

Spark is implemented in Scala, a statically typed programming language for the Java VM, and exposes native programming interfaces for Java and Python in addition to Scala itself. Note that though Java code can call the Scala interface directly, there are some...