Summary
In this chapter, we presented the foundational concepts of tuning a Spark application, including data serialization using encoders. We also covered the key aspects of the cost-based optimizer, introduced in Spark 2.2 to optimize Spark SQL execution automatically. Finally, we presented some examples of JOIN operations and the improvements in execution times that result from whole-stage code generation.
In the next chapter, we will explore application architectures that leverage Spark modules and Spark SQL in real-world applications. We will also describe the deployment of some of the main processing models used for batch processing, streaming applications, and machine learning pipelines.