Spark performs its data processing using RDDs. Data is read from relevant data sources, such as text files and NoSQL data stores, to form the RDDs. Various transformations are then applied to such an RDD and, finally, the result is collected. To be precise, Spark provides transformations and actions that act upon RDDs. Let us take the following RDD capturing a list of retail banking transactions, which is of the type RDD[(String, String, Double)]:
| AccountNo | TranNo | TranAmount |
|---|---|---|
| SB001 | TR001 | 250.00 |
| SB002 | TR004 | 450.00 |
| SB003 | TR010 | 120.00 |
| SB001 | TR012 | -120.00 |
| SB001 | TR015 | -10.00 |
| SB003 | TR020 | 100.00 |
To calculate the account level summary of the transactions from the RDD of the form (AccountNo, TranNo, TranAmount), two steps are needed. First, the RDD has to be transformed into key-value pairs of the form (AccountNo, TranAmount), where AccountNo is the key; there will be multiple elements with the same key. Then, a summation operation is performed on the TranAmount values for each key, which results in one element per AccountNo holding the total of its transaction amounts.
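The following is a minimal sketch of these two steps in a spark-shell session, assuming sc is the available SparkContext; the variable names and the use of reduceByKey for the per-key summation are illustrative choices, not prescribed by the text above:

```scala
// Build the sample RDD of (AccountNo, TranNo, TranAmount) from the table above
val acTransList = sc.parallelize(Seq(
  ("SB001", "TR001", 250.00),
  ("SB002", "TR004", 450.00),
  ("SB003", "TR010", 120.00),
  ("SB001", "TR012", -120.00),
  ("SB001", "TR015", -10.00),
  ("SB003", "TR020", 100.00)
))

// Transformation: project to key-value pairs of (AccountNo, TranAmount)
val acKeyVal = acTransList.map(trans => (trans._1, trans._3))

// Transformation: sum the TranAmount values for each AccountNo key
val accSummary = acKeyVal.reduceByKey(_ + _)

// Action: bring the account level summary back to the driver and print it
accSummary.collect().sortBy(_._1).foreach(println)
// Expected output:
// (SB001,120.0)
// (SB002,450.0)
// (SB003,220.0)
```

Note that nothing is computed until the collect() action is invoked; the map and reduceByKey calls only build up the lineage of transformations on the RDD.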