In this section, we will test the operations that cause a shuffle in Apache Spark. We will cover the following topics:
- Using join for two DataFrames
- Using two DataFrames that are partitioned differently
- Testing a join that causes a shuffle
A join is a classic operation that causes a shuffle, and we will use it to combine our two DataFrames. We will first check whether it triggers a shuffle, and then we will see how to avoid it. To understand this, we will join two DataFrames that are partitioned differently, or not partitioned at all. Joining such datasets forces a shuffle because records that share a join key may reside on different physical machines; Spark has to move those records to the same executor before they can be matched.
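To see why co-location by key matters, here is a minimal pure-Python sketch (not Spark's actual implementation) of what a shuffle accomplishes: both sides are hash-partitioned by the join key, so matching keys land in the same partition and can be joined partition-by-partition. The helper names (`partition_by_key`, `join_partitions`) and the sample data are illustrative assumptions, not Spark APIs.

```python
NUM_PARTITIONS = 4

def partition_by_key(records, num_partitions=NUM_PARTITIONS):
    """Assign each (key, value) record to a partition by hashing the key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

def join_partitions(left_parts, right_parts):
    """Join partition-by-partition; this is only valid because both sides
    were partitioned with the same key and the same partition count."""
    joined = []
    for left, right in zip(left_parts, right_parts):
        right_by_key = {}
        for key, value in right:
            right_by_key.setdefault(key, []).append(value)
        for key, left_value in left:
            for right_value in right_by_key.get(key, []):
                joined.append((key, left_value, right_value))
    return joined

# Illustrative data: users and orders keyed by user id.
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (3, "pen"), (1, "lamp")]

left_parts = partition_by_key(users)
right_parts = partition_by_key(orders)
result = sorted(join_partitions(left_parts, right_parts))
print(result)  # [(1, 'alice', 'book'), (1, 'alice', 'lamp'), (3, 'carol', 'pen')]
```

If the two sides were partitioned with different keys or partition counts, the partition-by-partition join above would silently miss matches; that is exactly the situation in which Spark must repartition (shuffle) one or both sides first.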
Before we join the datasets, we need to send them to the same physical machine...