Learning Hadoop 2

The Spark ecosystem


Apache Spark powers a number of tools, both as a library and as an execution engine.

Spark Streaming

Spark Streaming (documented at http://spark.apache.org/docs/latest/streaming-programming-guide.html) is an extension of the core Spark API that allows data ingestion from streaming sources such as Kafka, Flume, Twitter, ZeroMQ, and raw TCP sockets.

Spark Streaming receives live input data streams and divides the data into batches (micro-batches covering fixed-size time windows), which are then processed by the Spark core engine to generate the final stream of results, also in batches. This high-level abstraction is called a DStream (org.apache.spark.streaming.dstream.DStream) and is implemented as a sequence of RDDs. A DStream supports two kinds of operations: transformations and output operations. Transformations operate on one or more DStreams to create new DStreams. As the final step in a chain of transformations, data can be persisted either to a storage layer (such as HDFS) or to an output channel. Spark Streaming allows for transformations...
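The micro-batch model described above can be illustrated with a small sketch. The following is plain Scala, not the Spark Streaming API: the object, case class, and method names are invented for illustration. It simulates how a live stream is chopped into fixed-interval batches, each of which is then transformed independently, mirroring how a DStream is processed as a sequence of RDDs:

```scala
// A simplified sketch (plain Scala, not the Spark API) of the micro-batch
// model behind DStreams. All names here are hypothetical.
object MicroBatchSketch {
  // An event carries an arrival timestamp (in milliseconds) and a payload.
  case class Event(timeMs: Long, word: String)

  // Group events into consecutive batches of `batchMs` milliseconds each,
  // analogous to Spark Streaming turning a DStream into a sequence of RDDs.
  def batch(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(e => e.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  // A per-batch "transformation": count words within each batch, analogous
  // to applying map/reduceByKey on every RDD backing a DStream.
  def wordCountPerBatch(batches: Seq[Seq[Event]]): Seq[Map[String, Int]] =
    batches.map(_.groupBy(_.word).map { case (w, es) => (w, es.size) })

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100, "spark"), Event(900, "spark"), Event(950, "hadoop"),
      Event(1200, "spark"), Event(1800, "hadoop"))
    // With a 1-second batch interval, events fall into two batches.
    val results = wordCountPerBatch(batch(events, 1000L))
    results.zipWithIndex.foreach { case (counts, i) =>
      println(s"batch $i: $counts")
    }
  }
}
```

In the real API the batch interval is fixed when the StreamingContext is created, and the output operation (the analogue of the println above) is what triggers execution of the transformation chain for each batch.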