Mastering Apache Spark 2.x - Second Edition

Overview of this book

Apache Spark is an in-memory, cluster-based Big Data processing system that provides a wide range of functionality such as graph processing, machine learning, stream processing, and more. This book will take your knowledge of Apache Spark to the next level by teaching you how to expand Spark’s functionality and build your data flows and machine/deep learning programs on top of the platform. The book starts with a quick overview of the Apache Spark ecosystem and introduces you to the new features and capabilities in Apache Spark 2.x. You will then work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and Datasets effectively, streaming analytics with Spark Streaming, and performing machine learning and deep learning on Spark using MLlib and external tools such as H2O and Deeplearning4j. The book also contains chapters on efficient graph processing, memory management, and using Apache Spark on the cloud. By the end of this book, you will have all the necessary information to master Apache Spark and use it efficiently for Big Data processing and analytics.

Apache Spark is an in-memory, cluster-based, parallel processing system that provides a wide range of functionality such as graph processing, machine learning, stream processing, and SQL. This book aims to take your knowledge of Spark to the next level by teaching you how to extend Spark's functionality. The book opens with an overview of the Spark ecosystem and then introduces you to Project Catalyst and Project Tungsten. You will understand how Memory Management and Binary Processing, Cache-aware Computation, and Code Generation are used to speed things up dramatically. The book goes on to show how to incorporate H2O and Deeplearning4j for machine learning, and Jupyter Notebooks, Zeppelin, Docker, and Kubernetes for cloud-based Spark. During the course of the book, you will also learn about the latest enhancements in Apache Spark 2.2, such as using the DataFrame and Dataset APIs exclusively, building advanced, fully automated Machine Learning pipelines with SparkML, and performing graph analysis using the new GraphFrames API.

What this book covers

Chapter 1, A First Taste and What's New in Apache Spark V2, provides an overview of Apache Spark, the functionality that is available within its modules, and how it can be extended. It covers the tools available in the Apache Spark ecosystem outside the standard Apache Spark modules for processing and storage. It also provides tips on performance tuning.

Chapter 2, Apache Spark SQL, shows how to create a schema in Spark SQL, how data can be queried efficiently using the relational API on DataFrames and Datasets, and how to explore data with SQL.

Chapter 3, The Catalyst Optimizer, explains what a cost-based optimizer in database systems is and why it is necessary. You will master the features and limitations of the Catalyst Optimizer in Apache Spark.

Chapter 4, Project Tungsten, explains why Project Tungsten is essential for Apache Spark and also goes on to explain how Memory Management, Cache-aware Computation, and Code Generation are used to speed things up dramatically.

Chapter 5, Apache Spark Streaming, talks about continuous applications using Apache Spark Streaming. You will learn how to incrementally process data and create actionable insights.

Chapter 6, Structured Streaming, talks about Structured Streaming – a new way of defining continuous applications using the DataFrame and Dataset APIs.

Chapter 7, Classical MLlib, introduces you to MLlib, the de facto standard for machine learning when using Apache Spark.

Chapter 8, Apache SparkML, introduces you to the DataFrame-based machine learning library of Apache Spark: the new first-class citizen when it comes to high performance and massively parallel machine learning.

Chapter 9, Apache SystemML, introduces you to Apache SystemML, another machine learning library capable of running on top of Apache Spark and incorporating advanced features such as a cost-based optimizer, hybrid execution plans, and low-level operator re-writes.

Chapter 10, Deep Learning on Apache Spark using H2O and DeepLearning4j, explains that deep learning is currently outperforming one traditional machine learning discipline after another. We have three open source first-class deep learning libraries running on top of Apache Spark, which are H2O, DeepLearning4j, and Apache SystemML. Let's understand what deep learning is and how to use it on top of Apache Spark using these libraries.

Chapter 11, Apache Spark GraphX, talks about graph processing with Scala using GraphX. You will learn basic and advanced graph algorithms and how to use GraphX to execute them.

Chapter 12, Apache Spark GraphFrames, discusses graph processing with Scala using GraphFrames. You will learn basic and advanced graph algorithms and how GraphFrames differ from GraphX in execution.

Chapter 13, Apache Spark with Jupyter Notebooks on IBM Data Science Experience, introduces a Platform as a Service offering from IBM, which is completely based on an open source stack and on open standards. The main advantage is that you have no vendor lock-in. Everything you learn here can be installed and used in other clouds, in a local datacenter, or on your local laptop or PC.

Chapter 14, Apache Spark on Kubernetes, explains that Platform as a Service cloud providers completely manage the operations part of an Apache Spark cluster for you. This is an advantage, but sometimes you have to access individual cluster nodes for debugging and tweaking, and you don't want to deal with the complexity that maintaining a real cluster on bare-metal or virtual systems entails. Here, Kubernetes might be the best solution. Therefore, in this chapter, we explain what Kubernetes is and how it can be used to set up an Apache Spark cluster.

What you need for this book

You will need the following to work with the examples in this book:

  • A laptop or PC with at least 6 GB main memory running Windows, macOS, or Linux
  • VirtualBox 5.1.22 or above
  • Hortonworks HDP Sandbox V2.6 or above
  • Eclipse Neon or above
  • Maven
  • Eclipse Maven Plugin
  • Eclipse Scala Plugin
  • Eclipse Git Plugin

Who this book is for

This book is an extensive guide to Apache Spark from the programmer's and data scientist's perspective. It covers Apache Spark in depth, but also supplies practical working examples for different domains. Operational aspects are explained in sections on performance tuning and cloud deployments. All the chapters have working examples, which can be replicated easily.


Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The next lines of code read the link and assign it to the BeautifulSoup function."

A block of code is set as follows:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

Any command-line input or output is written as follows:

[hadoop@hc2nn ~]# sudo su -
[root@hc2nn ~]# cd /tmp

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."


Warnings or important notes appear in a box like this.


Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at If you purchased this book elsewhere, you can visit and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book.
  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at We also have other code bundles from our rich catalog of books and videos available at Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to and enter the name of the book in the search field. The required information will appear under the Errata section.


Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.


Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.