In this chapter, we will focus on the performance tuning aspects of Spark SQL-based components. The Spark SQL Catalyst optimizer is central to the efficient execution of many, if not all, Spark applications, including ML Pipelines, Structured Streaming, and GraphFrames-based applications. We will first explain the key foundational aspects of serialization/deserialization using encoders, along with the logical and physical plans associated with query execution, and then present the details of the cost-based optimization (CBO) feature released in Spark 2.2. Throughout the chapter, we will also present tips and tricks that developers can use to improve the performance of their applications.
More specifically, in this chapter, you will learn the following:
- Basic concepts essential to understanding performance tuning
- Understanding Spark internals that drive performance
- Understanding cost-based optimizations
- Understanding the performance...