In this section, we will learn how to detect a shuffle in a process.
We will cover the following topics:
- Loading randomly partitioned data
- Issuing repartition using a meaningful partition key
- Understanding how shuffle occurs by explaining a query
We will first load randomly partitioned data to see how and where the data is loaded. Next, we will issue a repartition using a meaningful partition key, which moves the data to the proper executors in a deterministic way. Finally, we will inspect our queries with the explain() method to understand where the shuffle occurs. We illustrate this with a very simple test.
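As a rough illustration of the idea, the difference between random placement and key-based placement can be sketched in plain Python. This is not Spark code and does not call any Spark API. The record values, the function names, and the use of `zlib.crc32` as a stand-in hash are all assumptions made for this sketch; Spark applies its own hash function when it repartitions by a key.

```python
import random
import zlib
from collections import defaultdict

# Hypothetical records in the spirit of the text: (uid, user_id) pairs.
RECORDS = [
    ("uid-a", "user_1"),
    ("uid-b", "user_1"),
    ("uid-c", "user_2"),
]


def random_partition(records, num_partitions, seed=0):
    """Place each record in an arbitrary partition, as if loaded without a key."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for record in records:
        partitions[rng.randrange(num_partitions)].append(record)
    return dict(partitions)


def key_partition(records, num_partitions):
    """Place each record by a deterministic hash of its user_id.

    crc32 is only a stand-in for illustration; it is not Spark's hash.
    """
    partitions = defaultdict(list)
    for record in records:
        user_id = record[1]
        index = zlib.crc32(user_id.encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return dict(partitions)


def records_moved(before, after):
    """Count records whose partition index differs between two layouts.

    This is a crude proxy for the data a shuffle would have to move.
    """
    old_home = {rec: idx for idx, recs in before.items() for rec in recs}
    return sum(
        1
        for idx, recs in after.items()
        for rec in recs
        if old_home[rec] != idx
    )


if __name__ == "__main__":
    loaded = random_partition(RECORDS, num_partitions=4)
    repartitioned = key_partition(RECORDS, num_partitions=4)
    print("random layout:   ", loaded)
    print("key-based layout:", repartitioned)
    print("records moved:   ", records_moved(loaded, repartitioned))
```

After key-based placement, every record of a given user sits in the same partition, whatever the random layout was. Any record that had to change partitions to get there represents data movement, which is what a shuffle is.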
In the test, we create a DataFrame with some data. For example, we create one InputRecord with a random UID for user_1, another InputRecord with another random UID, also for user_1, and a last record for user_2. Let's imagine that...