
Deploying Spark using Amazon EMR


There is a reason why deploying Spark on Amazon EMR appears as one of the first recipes in this edition of the book. The majority of production Spark deployments happen on EMR (in fact, an increasing majority of big data deployments in general happen on EMR). If you understand this recipe well, you may skip the rest of the recipes in this chapter, unless you are doing an on-premises deployment.

Note

Since this topic is of paramount importance in the current context, a lot more theory is provided here than a typical cookbook would have. You can skip the theory and go directly to the How to do it... section, but I encourage you not to do so.

What it represents is much bigger than it looks

What EMR represents is far more than meets the eye. Most enterprise workloads are migrating to public clouds at an accelerated pace. Once migrated, these workloads get rearchitected to leverage cloud-based services, as opposed to simply using the cloud as Infrastructure as a Service (IaaS). EC2 is the IaaS compute service of AWS, while EMR is AWS's leading Platform as a Service (PaaS) offering, with more big data workloads running on EMR than on the alternatives combined.

EMR's architecture

Hadoop's core feature is data locality, that is, taking compute to where the data is. AWS disrupts this concept by separating storage and compute. AWS has multiple storage options, including the following:

  • Amazon S3: S3 is general-purpose object storage.
  • Amazon Redshift: This is a distributed cloud data warehouse.
  • Amazon DynamoDB: This is a NoSQL database.
  • Amazon Aurora: This is a cloud-based relational database.

Amazon S3 is the cheapest and most reliable cloud storage available, which makes it the first choice unless there is a compelling reason not to use it. EMR also supports attaching Elastic Block Store (EBS) volumes to compute instances (EC2) in order to provide a lower-latency option.

Which option to choose depends upon what type of cluster is being created. There are two types of clusters:

  • Persistent cluster: This runs 24x7. Here, there is continuous analysis of data for use cases such as fraud detection in the financial industry or clickstream analytics in ad tech. For these purposes, HDFS mounted on EBS is a good choice.
  • Transient cluster: Here, workloads run intermittently, for example, genome sequencing or a holiday surge in retail. In this case, the cluster is spawned only when needed, making the EMR File System (EMRFS), which is backed by S3, a better choice (the sketch after this list shows how the choice surfaces in Spark code).
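
The storage choice surfaces in Spark code as little more than a path scheme. The following is a minimal PySpark sketch, assuming hypothetical bucket and directory names: a persistent cluster reads from HDFS, while a transient cluster points the same code at S3 via EMRFS.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-choice").getOrCreate()

# Persistent cluster: data lives on HDFS, mounted on EBS volumes,
# and disappears with the cluster.
persistent_df = spark.read.text("hdfs:///data/events")

# Transient cluster: data lives on S3 and is accessed through EMRFS,
# so it outlives the cluster. The bucket name is hypothetical.
transient_df = spark.read.text("s3://my-bucket/data/events")

print(persistent_df.count(), transient_df.count())
```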

How to do it...

  1. Log in to https://aws.amazon.com with your credentials.
  2. Click on Services and select/search for EMR:
  3. Click on Create cluster and select the last option in the Applications option box:
  4. Click on Create Cluster and the cluster will start as follows:
  5. Once the cluster is created with the given configuration, the cluster's status will change to Waiting, as shown in the following screenshot:
  6. Now add a step that selects the JAR file; it takes the input file from the S3 location, produces the output file, and stores it in the desired S3 bucket:
  7. The wordcount step's status will change to Completed, indicating successful completion of the step, as shown in the following screenshot:
  8. The output will be created and stored in the given S3 location. Here, it is in the output folder under the io.playground bucket:
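
If you prefer scripting to console clicks, steps 6 and 7 can also be driven through the EMR API. The following is a minimal sketch using Python and boto3, assuming a hypothetical cluster ID, region, and JAR path; the io.playground bucket is the one from the example above.

```python
import boto3

# The region and cluster ID below are hypothetical placeholders.
emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-XXXXXXXXXXXXX"

# Submit the wordcount step: the JAR reads from the input location
# and writes the result to the output folder of the io.playground bucket.
response = emr.add_job_flow_steps(
    JobFlowId=cluster_id,
    Steps=[{
        "Name": "wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://io.playground/jars/wordcount.jar",  # hypothetical path
            "Args": ["s3://io.playground/input/", "s3://io.playground/output/"],
        },
    }],
)

# Check the step's state; it moves from PENDING through RUNNING to COMPLETED.
step_id = response["StepIds"][0]
step = emr.describe_step(ClusterId=cluster_id, StepId=step_id)
print(step_id, step["Step"]["Status"]["State"])
```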

How it works...

Let's look at the options shown in step 3 (the sketch after this list maps them to the EMR API):

  • Cluster name: This is where you provide an appropriate name for the cluster.
  • S3 folder: This is the S3 bucket and folder where this cluster's logs will go.
  • Launch mode
    • Cluster: The cluster will continue to run until you terminate it.
    • Step execution: The cluster runs the steps you add at launch and then terminates automatically.
  • Software configuration:
    • Vendor: Here you choose between Amazon's EMR distribution with open source Hadoop and MapR's distribution.
    • Release: This is self-evident.
    • Applications
      • Core Hadoop: This is focused on the SQL interface.
      • HBase: This is focused on NoSQL-oriented workloads.
      • Presto: This is focused on ad-hoc query processing.
      • Spark: This is focused on Spark.
  • Hardware configuration
    • Instance type: This topic will be covered in detail in the next section.
    • Number of instances: This refers to the number of nodes in the cluster. One of them will be the master node and the rest will be slave nodes.
  • Security and access:
    • EC2 key pair: You can associate an EC2 key pair with the cluster that you can use to connect to it via SSH.
    • Permissions: You can allow other users besides the default Hadoop user to submit jobs.
    • EMR role: This allows EMR to call other AWS services, such as EC2, on your behalf.
    • EC2 instance profile: This provides access to other AWS services, such as S3 and DynamoDB, via the EC2 instances that are launched by EMR.
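
These console options map almost one-to-one onto the run_job_flow call of the EMR API. The following boto3 sketch annotates the mapping; the release label, instance type, and key pair name are illustrative assumptions.

```python
import boto3

emr = boto3.client("emr")

# Each parameter is commented with the console option it corresponds to.
emr.run_job_flow(
    Name="My cluster",                        # Cluster name
    LogUri="s3://io.playground/logs/",        # S3 folder for the cluster's logs
    ReleaseLabel="emr-5.5.0",                 # Release (illustrative)
    Applications=[{"Name": "Spark"}],         # Applications
    Instances={
        "MasterInstanceType": "m4.large",     # Instance type (illustrative)
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,                   # Number of instances (1 master + 2 slaves)
        "Ec2KeyName": "my-key-pair",          # EC2 key pair (hypothetical name)
        "KeepJobFlowAliveWhenNoSteps": True,  # Launch mode: Cluster, not Step execution
    },
    ServiceRole="EMR_DefaultRole",            # EMR role
    JobFlowRole="EMR_EC2_DefaultRole",        # EC2 instance profile
)
```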

EC2 instance types

EC2 instances are the most expensive part of a company's AWS bill, so selecting the right instance type is key to optimizing that bill. The following section is a quick overview of the different instance types. Instance types, both in the cloud and on premises, are defined by four factors:

  • Number of cores
  • Memory
  • Storage (size and type)
  • Network performance

To see a quick illustration of how these factors affect each other, visit http://youtube.com/infoobjects.

In the EC2 world, these factors are modified slightly: the number of cores is replaced by the vCPU, a virtualized unit of compute. The factors therefore become:

  • vCPUs
  • Memory
  • Storage (size and type)
  • Network performance

Instance type families are defined by the ratio of these factors, especially vCPU to memory. In a given family, this ratio remains unchanged (T2 excluded). Different instance families serve different purposes, almost like different types of automobiles. In fact, we are going to use the automobile metaphor in this section to illustrate these families.
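
To make the ratio idea concrete, here is a small sketch that computes memory per vCPU using values taken from the tables later in this section; note how the ratio holds within a family but differs across families.

```python
# (vCPUs, memory in GiB) taken from the instance tables in this section.
instances = {
    "m4.large": (2, 8),     "m4.xlarge": (4, 16),    "m4.2xlarge": (8, 32),
    "c4.large": (2, 3.75),  "c4.xlarge": (4, 7.5),   "c4.2xlarge": (8, 15),
    "r4.large": (2, 15.25), "r4.xlarge": (4, 30.5),  "r4.2xlarge": (8, 61),
}

for name, (vcpus, mem) in instances.items():
    # ~4 GiB/vCPU for M4, ~1.9 for C4, ~7.6 for R4
    print(f"{name}: {mem / vcpus:.2f} GiB per vCPU")
```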

T2 - Free Tier Burstable (EBS only)

The T2 instance type is a gateway drug in the AWS world, the reason being that it belongs to the Free Tier: developers who sign up for AWS get this instance type free for up to a year. This tier has six subtypes:

| Instance Type | vCPUs | CPU Credits/Hr | Memory (GiB) |
| --- | --- | --- | --- |
| t2.micro | 1 | 6 | 1 |
| t2.small | 1 | 12 | 2 |
| t2.medium | 2 | 24 | 4 |
| t2.large | 2 | 36 | 8 |
| t2.xlarge | 4 | 54 | 16 |
| t2.2xlarge | 8 | 81 | 32 |
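
One CPU credit buys one vCPU running at 100% utilization for one minute, so the credit rate implies a sustained baseline. A quick back-of-the-envelope calculation from the table above:

```python
def baseline_share(vcpus, credits_per_hour):
    # One credit = one vCPU-minute at full utilization, so the hourly
    # credit rate converts directly into sustainable vCPU-minutes.
    sustained_vcpus = credits_per_hour / 60.0
    return sustained_vcpus / vcpus  # fraction of the whole instance

# t2.medium earns 24 credits/hr on 2 vCPUs => 20% sustained baseline.
print(f"{baseline_share(2, 24):.0%}")
```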

M4 - General purpose (EBS only)

M4 is the instance type to use when in doubt. This tier has six subtypes:

| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| m4.large | 2 | 8 | 450 Mbps |
| m4.xlarge | 4 | 16 | 750 Mbps |
| m4.2xlarge | 8 | 32 | 1,000 Mbps |
| m4.4xlarge | 16 | 64 | 2,000 Mbps |
| m4.10xlarge | 40 | 160 | 4,000 Mbps |
| m4.16xlarge | 64 | 256 | 10,000 Mbps |

C4 - Compute optimized

This tier has five subtypes:

| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| c4.large | 2 | 3.75 | 500 Mbps |
| c4.xlarge | 4 | 7.5 | 750 Mbps |
| c4.2xlarge | 8 | 15 | 1,000 Mbps |
| c4.4xlarge | 16 | 30 | 2,000 Mbps |
| c4.8xlarge | 36 | 60 | 4,000 Mbps |

X1 - Memory optimized

This tier has two subtypes:

| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| x1.16xlarge | 64 | 976 | 7,000 Mbps |
| x1.32xlarge | 128 | 1,952 | 14,000 Mbps |

R4 - Memory optimized

This tier has six subtypes:

| Instance Type | vCPUs | Memory (GiB) | Dedicated Bandwidth |
| --- | --- | --- | --- |
| r4.large | 2 | 15.25 | 10 Gbps |
| r4.xlarge | 4 | 30.5 | 10 Gbps |
| r4.2xlarge | 8 | 61 | 10 Gbps |
| r4.4xlarge | 16 | 122 | 10 Gbps |
| r4.8xlarge | 32 | 244 | 10 Gbps |
| r4.16xlarge | 64 | 488 | 20 Gbps |

P2 - General purpose GPU

This tier has three subtypes:

| Instance Type | vCPUs | Memory (GiB) | GPUs | GPU Memory (GiB) |
| --- | --- | --- | --- | --- |
| p2.xlarge | 4 | 61 | 1 | 12 |
| p2.8xlarge | 32 | 488 | 8 | 96 |
| p2.16xlarge | 64 | 732 | 16 | 192 |

I3 - Storage optimized

This tier has six subtypes:

| Instance Type | vCPUs | Memory (GiB) | Storage (GB) |
| --- | --- | --- | --- |
| i3.large | 2 | 15.25 | 475 NVMe SSD |
| i3.xlarge | 4 | 30.5 | 950 NVMe SSD |
| i3.2xlarge | 8 | 61 | 1,900 NVMe SSD |
| i3.4xlarge | 16 | 122 | 2 x 1,900 NVMe SSD |
| i3.8xlarge | 32 | 244 | 4 x 1,900 NVMe SSD |
| i3.16xlarge | 64 | 488 | 8 x 1,900 NVMe SSD |

D2 - Storage optimized

This tier is for workloads such as massively parallel processing (MPP) and data warehousing. It has four subtypes:

| Instance Type | vCPUs | Memory (GiB) | Storage (GB) |
| --- | --- | --- | --- |
| d2.xlarge | 4 | 30.5 | 3 x 2,000 HDD |
| d2.2xlarge | 8 | 61 | 6 x 2,000 HDD |
| d2.4xlarge | 16 | 122 | 12 x 2,000 HDD |
| d2.8xlarge | 36 | 244 | 24 x 2,000 HDD |