Understanding performance improvements using whole-stage code generation
In this section, we first present a high-level overview of whole-stage code generation in Spark SQL, followed by a set of examples showing the improvements it brings to various JOINs through Catalyst's code generation feature.
After we have an optimized query plan, it needs to be converted to a DAG of RDDs for execution on the cluster. We use this example to explain the basic concepts of Spark SQL whole-stage code generation:
scala> sql("select count(*) from orders where customer_id = 26333955").explain() == Optimized Logical Plan == Aggregate [count(1) AS count(1)#45L] +- Project +- Filter (isnotnull(customer_id#42L) && (customer_id#42L = 26333955)) +- Relation[customer_id#42L,good_id#43L] parquet
The preceding optimized logical plan can be viewed as a sequence of Scan, Filter, Project, and Aggregate operations, as shown in the following figure:
Traditional databases will typically execute the preceding plan using the Volcano iterator model, in which each operator repeatedly pulls one row at a time from its child through a generic next() interface. The per-row virtual function calls and intermediate row materialization of that model are precisely the overheads that whole-stage code generation removes by collapsing all the operators of a stage into a single generated function.
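To make the contrast concrete, the following hand-written Scala sketch is purely an illustration (it is not the code Spark actually generates, which is Java compiled at runtime): it shows the difference between evaluating Filter, Project, and Aggregate as a chain of per-row operator calls and collapsing them into one fused loop, which is essentially what whole-stage code generation does for this stage. Rows are assumed to be (customer_id, good_id) pairs, mirroring the orders relation above.

// Operator-at-a-time (iterator) style: each step is a separate per-row call.
def countIteratorStyle(rows: Iterator[(Long, Long)]): Long =
  rows.filter { case (customerId, _) => customerId == 26333955L } // Filter
      .map(_ => 1L)                                               // Project (count(1))
      .sum                                                        // Aggregate

// Fused, "generated" style: one tight loop, no per-row virtual function calls.
def countFusedStyle(rows: Array[(Long, Long)]): Long = {
  var count = 0L
  var i = 0
  while (i < rows.length) {
    if (rows(i)._1 == 26333955L) count += 1 // Filter + Aggregate in the same loop body
    i += 1
  }
  count
}

In the fused version, the filter predicate and the count accumulation live in the same loop body, so the running count can stay in a CPU register and no intermediate rows are materialized; this is the main source of the speedups demonstrated in the JOIN examples that follow.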