Apache Spark is an open-source framework for fast, large-scale data processing, with support for streaming, SQL, machine learning, and graph processing. The framework is implemented in Scala and offers APIs in Java, Scala, and Python. Its performance is up to 10x to 20x that of the traditional Hadoop stack. Spark is a general-purpose framework that supports interactive programming as well as streaming. It can run standalone or alongside Hadoop, and can read any Hadoop-supported data format, such as SequenceFiles or custom InputFormats. Supported storage systems include local file systems, Hive, HBase, Cassandra, and Amazon S3, among others.
We will use Spark 1.2.0 for all the examples throughout this book.
The following figure depicts the core modules of Apache Spark:
Some of the basic functions of the Spark framework include task scheduling, interaction with storage systems, fault tolerance, and memory management. Spark follows a programming paradigm built around Resilient Distributed Datasets (RDDs).