HDFS is a distributed file system designed for storing massive volumes of data. In our case, we start with the 3-year banking transaction history of a fictitious bank customer. Our dataset includes 2,191 transactions, each of which transferred money from the customer's account to another account. These transactions were made using a variety of methods, such as payments at a POS terminal, direct debits, and transfers from internet banking. In every case, money leaves the customer's account and is credited to another account. At all times, the customer's bank wants to ensure that money leaves the account only when the customer has authorized it. Otherwise, the transaction is fraudulent and must be stopped.
Storing and processing 2,191 records might seem like a trivial task for HDFS and Spark. However, if a bank has 10 million customers, then to build a fraud model for...