In this section, we will use a custom partitioner to reduce the amount of shuffling. We will cover the following topics:
- Implementing a custom partitioner
- Using our partitioner with Spark's partitionBy method
- Validating that our data was partitioned properly
We will implement a custom partitioner that encapsulates our partitioning logic. It tells Spark which partition, and therefore which executor, each record should land on. We will then apply it using Spark's partitionBy method and, finally, validate that our data was partitioned properly. For the purposes of this test, we assume that we have two executors:
import com.tomekl007.UserTransaction
import org.apache.spark.sql.SparkSession
import org.apache.spark.{Partitioner, SparkContext}
import org.scalatest.FunSuite
import org.scalatest.Matchers._
class CustomPartitioner extends FunSuite {
val...
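The listing above is truncated, so here is a minimal sketch of the idea it describes: a custom Partitioner that routes records by key, applied via partitionBy, followed by a check of where each key landed. The class and variable names below (TwoExecutorPartitioner, placement) are illustrative, not necessarily the ones used in the book's full listing.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// A custom partitioner: Spark calls getPartition for every record's key
// and places the record in the returned partition. We assume two
// partitions, one per executor, as stated in the text.
class TwoExecutorPartitioner extends Partitioner {
  override def numPartitions: Int = 2

  // floorMod keeps the result non-negative even for negative hash codes,
  // so the same key always maps to the same partition.
  override def getPartition(key: Any): Int =
    Math.floorMod(key.hashCode, numPartitions)
}

object TwoExecutorPartitionerDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("custom-partitioner"))

    val data = sc.parallelize(
      List(("user-1", 10), ("user-2", 20), ("user-1", 30)))

    // partitionBy shuffles once, up front; later key-based operations that
    // reuse the same partitioner avoid further shuffles.
    val partitioned = data.partitionBy(new TwoExecutorPartitioner)

    // Validate the placement: every record with the same key should sit
    // in the same partition.
    val placement = partitioned
      .mapPartitionsWithIndex((idx, it) => it.map { case (k, _) => (k, idx) })
      .distinct()
      .collect()
    println(placement.toList)

    sc.stop()
  }
}
```

The key property worth asserting in a test is determinism: getPartition must return the same partition for equal keys, and the result must always fall in the range [0, numPartitions).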