In this section, we will redesign the job that performed the join on non-partitioned data, changing the design of jobs with wide dependencies.
In this section, we will cover the following topics:
- Repartitioning DataFrames using a common partition key
- Understanding a join with pre-partitioned data
- Understanding how the shuffle was avoided
We will use the repartition method on the DataFrame with a common partition key. We saw earlier that when a join is issued, Spark repartitions the data underneath. Often, however, we want to execute multiple operations on the same DataFrame; every time we join it with another dataset, hashPartitioning would need to be executed again. If we instead partition the data once, at the beginning when it is loaded, we avoid repartitioning on every subsequent join.
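To see why co-partitioning removes the need for a shuffle, here is a minimal sketch in plain Python (not Spark; the helpers `hash_partition` and `copartitioned_join` are hypothetical names for illustration, not Spark APIs). Both datasets are hash-partitioned by the same key with the same number of partitions, so matching keys land in the same partition and the join can proceed partition-by-partition with no further data movement:

```python
def hash_partition(rows, key_index, num_partitions):
    """Assign each row to a partition by hashing its key (mimics hashPartitioning)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions

def copartitioned_join(left_parts, right_parts):
    """Join two co-partitioned datasets partition-by-partition (no shuffle)."""
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        index = {}
        for key, value in rp:
            index.setdefault(key, []).append(value)
        for key, value in lp:
            for right_value in index.get(key, []):
                joined.append((key, value, right_value))
    return joined

users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (3, "ink")]

# Partition both datasets once, with the same key and partition count...
left_parts = hash_partition(users, 0, 4)
right_parts = hash_partition(orders, 0, 4)

# ...then every join on that key reuses the layout instead of re-shuffling.
result = sorted(copartitioned_join(left_parts, right_parts))
print(result)  # [(1, 'alice', 'book'), (3, 'carol', 'ink'), (3, 'carol', 'pen')]
```

In Spark terms, calling `repartition` on the DataFrame with the join key up front plays the role of the one-time `hash_partition` step here; later joins on the same key can then reuse that partitioning.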
Here, we have our example test case,...