Understanding multi-way JOIN ordering optimization
Spark SQL optimizer's heuristics can transform a SELECT
statement into a query plan with the following characteristics:
- The filter operator and project operator are pushed down below the join operator, that is, both the filter and project operators are executed before the join operator
- Without subquery block, the join operator is pushed down below the aggregate operator for a select statement, that is, a join operator is usually executed before the aggregate operator
With this observation, the biggest benefit we can get from CBO is multi-way join ordering optimization. Using a dynamic programming technique, we try to get globally optimal join order for a multi-way join query.
Note
For more details on multi-way join reordering in Spark 2.2, refer to https://spark-summit.org/2017/events/cost-based-optimizer-in-apache-spark-22/.
Clearly, the join cost is the dominant factor in choosing the best join order. The cost formula is dependent on the implementation...