Spark Cookbook

By: Rishi Yadav

Introduction


Before looking into the various ways to optimize Spark, it is a good idea to look at the Spark internals. So far, we have looked at Spark at a higher level, where the focus was on the functionality provided by the various libraries.

Let's start with redefining an RDD. Externally, an RDD is a distributed immutable collection of objects. Internally, it consists of the following five parts:

  • Set of partitions (rdd.getPartitions)

  • List of dependencies on parent RDDs (rdd.dependencies)

  • Function to compute a partition, given its parents

  • Partitioner (optional) (rdd.partitioner)

  • Preferred location of each partition (optional) (rdd.preferredLocations)

The first three are needed to recompute an RDD in case its data is lost. Together, they are called the lineage. The last two parts are optimizations.
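To make the five parts and the idea of lineage concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the names `ToyRDD`, `source`, and `map_rdd` are hypothetical, and it assumes only one-to-one dependencies (partition i of a child is computed from partition i of each parent). The point is only to show that partitions, dependencies, and a compute function are enough to rebuild lost data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD; field names are illustrative, not Spark's API."""
    num_partitions: int                          # 1. set of partitions
    parents: List["ToyRDD"]                      # 2. dependencies on parent RDDs
    compute: Callable[[int, List[list]], list]   # 3. computes one partition from parent data
    partitioner: Optional[Callable[[object], int]] = None             # 4. optional
    preferred_locations: Optional[Callable[[int], List[str]]] = None  # 5. optional
    cache: dict = field(default_factory=dict)    # materialized partitions (may be lost)

    def partition(self, i: int) -> list:
        # If partition i is not materialized, rebuild it from the lineage:
        # fetch the matching parent partitions, then apply the compute function.
        if i not in self.cache:
            parent_data = [p.partition(i) for p in self.parents]
            self.cache[i] = self.compute(i, parent_data)
        return self.cache[i]

    def collect(self) -> list:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


def source(chunks: List[list]) -> ToyRDD:
    # A root RDD has no parents; each chunk plays the role of one partition.
    return ToyRDD(len(chunks), [], lambda i, _parents: list(chunks[i]))


def map_rdd(parent: ToyRDD, f: Callable) -> ToyRDD:
    # A mapped RDD depends on exactly one parent, partition by partition.
    return ToyRDD(parent.num_partitions, [parent],
                  lambda i, parents: [f(x) for x in parents[0]])


base = source([[1, 2], [3, 4]])
doubled = map_rdd(base, lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6, 8]

# Simulate losing the materialized data everywhere, then recompute from lineage.
doubled.cache.clear()
base.cache.clear()
print(doubled.collect())   # [2, 4, 6, 8]
```

Nothing in the recomputation touches `partitioner` or `preferred_locations`, which is why those two parts can be treated as optional optimizations in this sketch.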

A set of partitions describes how the data is divided across nodes. In the case of HDFS, partitions correspond to InputSplits, which are mostly the same as blocks (except when a record crosses block boundaries; in that case, it will be slightly...