Book Image

Learning Apache Spark 2

Book Image

Learning Apache Spark 2

Overview of this book

Apache Spark has seen an unprecedented growth in terms of its adoption over the last few years, mainly because of its speed, diversity and real-time data processing capabilities. It has quickly become the preferred choice of tool for many Big Data professionals looking to find quick insights from large chunks of data. This book introduces you to the Apache Spark framework, and familiarizes you with all the latest features and capabilities introduced in Spark 2. Starting with a detailed introduction to Spark’s architecture and the installation procedure, this book covers everything you need to know about the Spark framework in the most practical manner. You will learn how to perform the basic ETL activities using Spark, and work with different components of Spark such as Spark SQL, as well as the Dataset and DataFrame APIs for manipulating your data. Then, you will perform machine learning using Spark MLlib, as well as perform streaming analytics and graph processing using the Spark Streaming and GraphX modules respectively. The book also gives special emphasis on deploying your Spark models, and how they can be operated in a clustered mode. During the course of the book, you will come across implementations of different real-world use-cases and examples, giving you the hands-on knowledge you need to use Apache Spark in the best possible manner.
Table of Contents (18 chapters)
Learning Apache Spark 2
Credits
About the Author
About the Reviewers
www.packtpub.com
Customer Feedback
Preface

Spark architecture


Let's have a look at Apache Spark architecture, including a high level overview and a brief description of some of the key software components.

High level overview

At the high level, Apache Spark application architecture consists of the following key software components and it is important to understand each one of them to get to grips with the intricacies of the framework:

  • Driver program
  • Master node
  • Worker node
  • Executor
  • Tasks
  • SparkContext
  • SQL context
  • Spark session

Here's an overview of how some of these software components fit together within the overall architecture:

Figure 1.15: Apache Spark application architecture - Standalone mode

Driver program

Driver program is the main program of your Spark application. The machine where the Spark application process (the one that creates SparkContext and Spark Session) is running is called the Driver node, and the process is called the Driver process. The driver program communicates with the Cluster Manager to distribute tasks to the executors.

Cluster Manager

A cluster manager as the name indicates manages a cluster, and as discussed earlier Spark has the ability to work with a multitude of cluster managers including YARN, Mesos and a Standalone cluster manager. A standalone cluster manager consists of two long running daemons, one on the master node, and one on each of the worker nodes. We'll talk more about the cluster managers and deployment models in Chapter 8, Operating in Clustered Mode.

Worker

If you are familiar with Hadoop, a Worker Node is something similar to a slave node. Worker machines are the machines where the actual work is happening in terms of execution inside Spark executors. This process reports the available resources on the node to the master node. Typically each node in a Spark cluster except the master runs a worker process. We normally start one spark worker daemon per worker node, which then starts and monitors executors for the applications.

Executors

The master allocates the resources and uses the workers running across the cluster to create Executors for the driver. The driver can then use these executors to run its tasks. Executors are only launched when a job execution starts on a worker node. Each application has its own executor processes, which can stay up for the duration of the application and run tasks in multiple threads. This also leads to the side effect of application isolation and non-sharing of data between multiple applications. Executors are responsible for running tasks and keeping the data in memory or disk storage across them.

Tasks

A task is a unit of work that will be sent to one executor. Specifically speaking, it is a command sent from the driver program to an executor by serializing your Function object. The executor deserializes the command (as it is part of your JAR that has already been loaded) and executes it on a partition.

A partition is a logical chunk of data distributed across a Spark cluster. In most cases Spark would be reading data out of a distributed storage, and would partition the data in order to parallelize the processing across the cluster. For example, if you are reading data from HDFS, a partition would be created for each HDFS partition. Partitions are important because Spark will run one task for each partition. This therefore implies that the number of partitions are important. Spark therefore attempts to set the number of partitions automatically unless you specify the number of partitions manually e.g. sc.parallelize (data,numPartitions).

SparkContext

SparkContext is the entry point of the Spark session. It is your connection to the Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster. It is preferable to have one SparkContext active per JVM, and hence you should call stop() on the active SparkContext before you create a new one. You might have noticed previously that in the local mode, whenever we start a Python or Scala shell we have a SparkContext object created automatically and the variable sc refers to the SparkContext object. We did not need to create the SparkContext, but instead started using it to create RDDs from text files.

Spark Session

Spark session is the entry point to programming with Spark with the dataset and DataFrame API.