In this section, we will explore why we use aggregateByKey instead of groupByKey.
We will cover the following topics:
- Why we should avoid the use of groupByKey
- What aggregateByKey gives us
- Implementing logic using aggregateByKey
First, we will create our array of user transactions, as shown in the following example:
val keysWithValuesList = Array(
  UserTransaction("A", 100),
  UserTransaction("B", 4),
  UserTransaction("A", 100001),
  UserTransaction("B", 10),
  UserTransaction("C", 10)
)
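The UserTransaction type is defined elsewhere in the source; a minimal sketch of what it might look like, assuming it is a simple case class holding a user ID and an amount:

```scala
// Hypothetical definition: the original defines UserTransaction elsewhere;
// the field names userId and amount match how they are used in this section.
case class UserTransaction(userId: String, amount: Int)
```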
We will then use parallelize to create an RDD and key it by user ID, since we want our data organized as key-value pairs. This is shown in the following example:
val data = spark.parallelize(keysWithValuesList)
val keyed = data.keyBy(_.userId)
In the preceding code, we invoked keyBy with userId, so each record becomes a (userId, UserTransaction) pair keyed by the payer's user ID.
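Continuing from the keyed RDD above, a sketch of how the per-user aggregation could be implemented with aggregateByKey, assuming the goal is to sum each user's transaction amounts (the variable name amountsPerUser is illustrative):

```scala
// aggregateByKey takes a zero value plus two functions:
// - seqOp runs within each partition, folding one record into the accumulator
// - combOp merges partial results across partitions
// Unlike groupByKey, values are combined before the shuffle,
// so full value lists are never moved over the network.
val amountsPerUser = keyed.aggregateByKey(0)(
  (acc, tx) => acc + tx.amount, // seqOp: add one transaction to the running sum
  (a, b) => a + b               // combOp: merge two partition-local sums
)
amountsPerUser.collect().foreach(println)
```

With the sample data above, this would yield one summed amount per key for users A, B, and C.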