Understanding Catalyst optimizations
We briefly explored the optimizer in Chapter 1, Getting Started with Spark SQL. Essentially, Catalyst maintains an internal representation of the user's program, called the query plan. A set of transformations is applied to the initial query plan to yield an optimized query plan. Finally, through Spark SQL's code generation mechanism, the optimized query plan is converted into a DAG of RDDs, ready for execution. At its core, the Catalyst optimizer represents users' programs as trees and defines transformations from one tree to another.
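These trees are easy to inspect from user code. The following is a minimal sketch (the object name, sample data, and column names are illustrative assumptions, not from the text): explain(true) prints the parsed, analyzed, and optimized logical plans along with the physical plan, and the individual trees can also be read off the QueryExecution object.

import org.apache.spark.sql.SparkSession

object CatalystPlansDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CatalystPlansDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")
    val adults = people.filter($"age" > 30).select($"name")

    // Prints the parsed, analyzed, and optimized logical plans plus the physical plan
    adults.explain(true)

    // The optimized tree is also accessible directly through QueryExecution
    println(adults.queryExecution.optimizedPlan)

    spark.stop()
  }
}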
To take advantage of optimization opportunities, we need an optimizer that automatically finds the most efficient plan for the data operations specified in the user's program. In the context of this chapter, Spark SQL's Catalyst optimizer acts as the interface between the user's high-level programming constructs and the low-level execution plans.
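One way to see this interface role is to express the same query through two different high-level constructs and compare the plans Catalyst produces. The sketch below assumes a small in-memory table registered as a temporary view (the names sales, country, and amount are ours); both the DataFrame API version and the SQL version are reduced to essentially the same optimized logical plan.

import org.apache.spark.sql.SparkSession

object CatalystInterfaceDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CatalystInterfaceDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(("US", 100), ("DE", 80), ("US", 50)).toDF("country", "amount")
    sales.createOrReplaceTempView("sales")

    // The same logical query expressed through two high-level interfaces
    val viaApi = sales.filter($"country" === "US").groupBy($"country").sum("amount")
    val viaSql = spark.sql(
      "SELECT country, SUM(amount) FROM sales WHERE country = 'US' GROUP BY country")

    // Catalyst lowers both to (essentially) the same optimized logical plan
    println(viaApi.queryExecution.optimizedPlan)
    println(viaSql.queryExecution.optimizedPlan)

    spark.stop()
  }
}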
Understanding the Dataset/DataFrame API
A Dataset or a...