Optimizing the level of parallelism is very important to fully utilize the cluster capacity. When an RDD is loaded from HDFS, the number of partitions equals the number of InputSplits, which in most cases is the same as the number of blocks.
In this recipe, we will cover different ways to optimize the number of partitions.
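As a quick illustration of the default behavior, loading a file without specifying a partition count lets you verify this mapping, since the resulting partition count mirrors the number of InputSplits (words is a hypothetical value name; the path is the one used throughout this recipe):
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")
scala> words.partitions.length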
Specify the number of partitions when loading a file into an RDD with the following steps:
Start the Spark shell:
$ spark-shell
Load the RDD, passing a custom number of partitions as the second parameter:
scala> sc.textFile("hdfs://localhost:9000/user/hduser/words",10)
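To confirm that the custom setting took effect, a minimal sketch is to assign the RDD to a value (words here is a hypothetical name) and inspect its partitions array, which should report 10:
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words",10)
scala> words.partitions.length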
Another approach is to change the default parallelism by performing the following steps:
Start the Spark shell with the new value of default parallelism:
$ spark-shell --conf spark.default.parallelism=10
Check the default value of parallelism:
scala> sc.defaultParallelism
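This default is picked up by operations that do not receive an explicit partition count. As a minimal sketch, parallelize called without a numSlices argument creates as many partitions as defaultParallelism (nums is a hypothetical value name):
scala> val nums = sc.parallelize(1 to 100)
scala> nums.partitions.length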
You can also reduce the number of partitions using the RDD method coalesce(numPartitions), which merges existing partitions and by default avoids a full shuffle.
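As a short sketch, reusing the hypothetical words RDD from the earlier step, coalesce shrinks it from 10 partitions to 2:
scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words",10)
scala> val fewer = words.coalesce(2)
scala> fewer.partitions.length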