Spark computations are typically in-memory and may be bottlenecked by the resources in the cluster: CPU, network bandwidth, or memory. In addition, even when the data fits in memory, network bandwidth can be a challenge.
Note
Tuning Spark applications is a necessary step to reduce both the number and size of data transfers over the network and/or to reduce the overall memory footprint of the computations.
In this chapter, we will focus our attention on Spark SQL's Catalyst optimizer because it is key to deriving benefits across a whole set of application components.
Spark SQL is at the heart of recent enhancements to Spark, including ML Pipelines, Structured Streaming, and GraphFrames. The following figure illustrates the role Spark SQL plays between Spark Core and the higher-level APIs built on top of it:
In the next several sections, we will cover the fundamental understanding required for tuning Spark SQL applications. We will start with the DataFrame/Dataset APIs.