We did not treat fraud detection as a supervised learning problem, so we did not use random forests, decision trees, or logistic regression (LR). Instead, we used the probability density function of the Gaussian distribution to build an algorithm that, although it ends up labeling each transaction, is really performing anomaly detection. The importance of picking an appropriate Epsilon, the threshold below which a sample's probability density marks it as anomalous, cannot be overstated. If Epsilon is set too high, the algorithm goes off the mark and labels non-fraudulent examples as anomalies, that is, as outliers that indicate a fraudulent transaction; if it is set too low, genuinely fraudulent transactions slip through. The point is that tuning the Epsilon parameter directly improves the fraud detection process.
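To make the role of Epsilon concrete, here is a minimal sketch of Gaussian-based anomaly detection in plain Python rather than Spark. It assumes the features are independent, so the density of a sample is the product of per-feature Gaussian densities; the function names (`fit_gaussian`, `density`, `is_anomaly`), the sample transactions, and the Epsilon value are all hypothetical and chosen only for illustration.

```python
import math


def fit_gaussian(rows):
    """Estimate the per-feature mean and variance from equal-length rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [
        sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)
    ]
    return means, variances


def density(row, means, variances):
    """Product of univariate Gaussian densities (independence is assumed)."""
    p = 1.0
    for x, mu, var in zip(row, means, variances):
        p *= math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(
            2.0 * math.pi * var
        )
    return p


def is_anomaly(row, means, variances, epsilon):
    """A sample is flagged when its density falls below the threshold."""
    return density(row, means, variances) < epsilon


# Hypothetical training data: [amount, normalized frequency] per transaction.
transactions = [[10.0, 1.0], [12.0, 1.2], [11.0, 0.9], [9.0, 1.1], [10.5, 1.0]]
means, variances = fit_gaussian(transactions)

epsilon = 1e-3  # illustrative value only, not a tuned threshold
typical_flagged = is_anomaly([10.5, 1.0], means, variances, epsilon)
extreme_flagged = is_anomaly([100.0, 9.0], means, variances, epsilon)
```

With this data, the transaction close to the mean is not flagged, while the one far from the mean is. Raising `epsilon` far enough would eventually flag the typical transaction too, which is the mislabeling risk described above.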
A good part of the computational power required was devoted to finding the so-called best Epsilon. Computing the best Epsilon was one key part; the other, of course, was the algorithm itself. This is where Spark helped a great deal. The Spark ecosystem provided us with a powerful environment...
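One common way to search for a best Epsilon is to sweep candidate thresholds over a labeled validation set and keep the one with the highest F1 score. The sketch below illustrates that idea in plain Python; the F1 criterion, the function names (`f1_score`, `select_epsilon`), the `steps` parameter, and the sample densities are assumptions made for illustration, not necessarily the procedure used here. The repeated scoring inside the loop also hints at why this search accounts for much of the computation.

```python
def f1_score(predicted, actual):
    """F1 for boolean predictions, where True means 'flagged as fraud'."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_epsilon(densities, labels, steps=1000):
    """Sweep thresholds between the min and max density; keep the best F1."""
    lo, hi = min(densities), max(densities)
    step = (hi - lo) / steps
    best_eps, best_f1 = lo, 0.0
    for i in range(1, steps + 1):
        eps = lo + i * step
        predicted = [d < eps for d in densities]
        score = f1_score(predicted, labels)
        if score > best_f1:  # strict '>' keeps the smallest winning threshold
            best_eps, best_f1 = eps, score
    return best_eps, best_f1


# Hypothetical validation densities and ground-truth fraud labels.
val_densities = [0.30, 0.28, 0.25, 0.001, 0.27, 0.002]
val_labels = [False, False, False, True, False, True]

best_eps, best_f1 = select_epsilon(val_densities, val_labels)
```

Each candidate threshold requires a full pass over the validation densities, so the cost grows with both the number of candidates and the size of the data set.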