Scala Machine Learning Projects

Scala Machine Learning Projects

Overview of this book

Machine learning has had a huge impact on academia and industry by turning data into actionable information. Scala has seen a steady rise in adoption over the past few years, especially in the fields of data science and analytics. This book is for data scientists, data engineers, and deep learning enthusiasts who have a background in complex numerical computing and want to know more hands-on machine learning application development. If you're well versed in machine learning concepts and want to expand your knowledge by delving into the practical implementation of these concepts using the power of Scala, then this book is what you need! Through 11 end-to-end projects, you will be acquainted with popular machine learning libraries such as Spark ML, H2O, DeepLearning4j, and MXNet. At the end, you will be able to use numerical computing and functional programming to carry out complex numerical tasks to develop, build, and deploy research or commercial projects in a production-ready environment.

Title Page

Packt Upsell

Contributors

Preface

Free Chapter

Analyzing Insurance Severity Claims

Machine learning and learning workflow

Hyperparameter tuning and cross-validation

Analyzing and predicting insurance severity claims

LR for predicting insurance severity claims

GBT regressor for predicting insurance severity claims

Boosting the performance using random forest regressor

Comparative analysis and model deployment

Summary

Analyzing and Predicting Telecommunication Churn

Why do we perform churn analysis, and how do we do it?

Developing a churn analytics pipeline

LR for churn prediction

SVM for churn prediction

DTs for churn prediction

Random Forest for churn prediction

Selecting the best model for deployment

Summary

High Frequency Bitcoin Price Prediction from Historical and Live Data

Bitcoin, cryptocurrency, and online trading

High-level data pipeline of the prototype

Historical and live-price data collection

Model training for prediction

Scala Play web service

Predicting prices and evaluating the model

Demo prediction using Scala Play framework

Summary

Population-Scale Clustering and Ethnicity Prediction

Population scale clustering and geographic ethnicity

1000 Genomes Projects dataset description

Algorithms, tools, and techniques

Configuring programming environment

Data pre-processing and feature engineering

Summary

Topic Modeling - A Better Insight into Large-Scale Texts

Topic modeling and text clustering

Topic modeling with Spark MLlib and Stanford NLP

Other topic models versus the scalability of LDA

Deploying the trained LDA model

Summary

Developing Model-based Movie Recommendation Engines

Recommendation system

Spark-based movie recommendation systems

Selecting and deploying the best model

Summary

Options Trading Using Q-learning and Scala Play Framework

Reinforcement versus supervised and unsupervised learning

A simple Q-learning implementation

Developing an options trading web app using Q-learning

Summary

Clients Subscription Assessment for Bank Telemarketing using Deep Neural Networks

Client subscription assessment through telemarketing

Summary

Fraud Analytics Using Autoencoders and Anomaly Detection

Outlier and anomaly detection

Autoencoders and unsupervised learning

Developing a fraud analytics model

Hyperparameter tuning and feature selection

Summary

Human Activity Recognition using Recurrent Neural Networks

Working with RNNs

Human activity recognition using the LSTM model

Implementing an LSTM model for HAR

Tuning LSTM hyperparameters and GRU

Summary

Image Classification using Convolutional Neural Networks

Image classification and drawbacks of DNNs

CNN architecture

Large-scale image classification using CNN

Tuning and optimizing CNN hyperparameters

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Boosting the performance using random forest regressor

In the previous sections, we did not experience the expected MAE value although we got predictions of the severity loss in each instance. In this section, we will develop a more robust predictive analytics model for the same purpose but use an random forest regressor. However, before diving into its formal implementation, a short overview of the random forest algorithm is needed.

Random Forest for classification and regression

Random Forest is an ensemble learning technique used for solving supervised learning tasks, such as classification and regression. An advantageous feature of Random Forest is that it can overcome the overfitting problem across its training dataset. A forest in Random Forest usually consists of hundreds of thousands of trees. These trees are actually trained on different parts of the same training set.

More technically, an individual tree that grows very deep tends to learn from highly unpredictable patterns. This creates overfitting problems on the training sets. Moreover, low biases make the classifier a low performer even if your dataset quality is good in terms of the features presented. On the other hand, an Random Forest helps to average multiple decision trees together with the goal of reducing the variance to ensure consistency by computing proximities between pairs of cases.

Note

GBT or Random Forest? Although both GBT and Random Forest are ensembles of trees, the training processes are different. There are several practical trade-offs that exist, which often poses the dilemma of which one to choose. However, Random Forest would be the winner in most cases. Here are some justifications:

GBTs train one tree at a time, but Random Forest can train multiple trees in parallel. So the training time is lower for RF. However, in some special cases, training and using a smaller number of trees with GBTs is easier and quicker.
RFs are less prone to overfitting in most cases, so it reduces the likelihood of overfitting. In other words, Random Forest reduces variance with more trees, but GBTs reduce bias with more trees.
Finally, Random Forest can be easier to tune since performance improves monotonically with the number of trees, but GBT performs badly with an increased number of trees.

However, this slightly increases bias and makes it harder to interpret the results. But eventually, the performance of the final model increases dramatically. While using the Random Forest as a classifier, there are some parameter settings:

If the number of trees is 1, then no bootstrapping is used at all; however, if the number of trees is > 1, then bootstrapping is needed. The supported values are auto, all, sqrt, log2, and onethird.
The supported numerical values are (0.0-1.0) and [1-n]. However, if featureSubsetStrategy is chosen as auto, the algorithm chooses the best feature subset strategy automatically.
If the numTrees == 1, the featureSubsetStrategy is set to be all. However, if the numTrees > 1 (that is, forest), the featureSubsetStrategy is set to be sqrt for classification.
Moreover, if a real value n is set in the range of (0, 1.0), n*number_of_features will be used. However, if an integer value n is in the range (1, the number of features) is set, only n features are used alternatively.
The parameter categoricalFeaturesInfo is a map used for storing arbitrary or of categorical features. An entry (n -> k) indicates that feature n is categorical with I categories indexed from 0: (0, 1,...,k-1).
The impurity criterion is used for information gain calculation. The supported values are gini and variance for classification and regression respectively.
The maxDepth is the maximum depth of the tree (for example, depth 0 means one leaf node, depth 1 means one internal node plus two leaf nodes).
The maxBins signifies the maximum number of bins used for splitting the features, where the suggested value is 100 to get better results.
Finally, the random seed is used for bootstrapping and choosing feature subsets to avoid the random nature of the results.

As already mentioned, since Random Forest is fast and scalable enough for a large-scale dataset, Spark is a suitable technology to implement the RF, and to implement this massive scalability. However, if the proximities are calculated, storage requirements also grow exponentially.

Well, that's enough about RF. Now it's time to get our hands dirty, so let's get started. We begin with importing required libraries:

import org.apache.spark.ml.regression.{RandomForestRegressor, RandomForestRegressionModel} 
import org.apache.spark.ml.{ Pipeline, PipelineModel } 
import org.apache.spark.ml.evaluation.RegressionEvaluator 
import org.apache.spark.ml.tuning.ParamGridBuilder 
import org.apache.spark.ml.tuning.CrossValidator 
import org.apache.spark.sql._ 
import org.apache.spark.sql.functions._ 
import org.apache.spark.mllib.evaluation.RegressionMetrics

Then we create an active Spark session and import implicits:

val spark = SparkSessionCreate.createSession() 
import spark.implicits._

Then we define some hyperparameters, such as the number of folds for cross-validation, number of maximum iterations, the value of regression parameters, value of tolerance, and elastic network parameters, as follows:

val NumTrees = Seq(5,10,15)  
val MaxBins = Seq(23,27,30)  
val numFolds = 10  
val MaxIter: Seq[Int] = Seq(20) 
val MaxDepth: Seq[Int] = Seq(20)

Note that for an Random Forest based on a decision tree, we require maxBins to be at least as large as the number of values in each categorical feature. In our dataset, we have 110 categorical features with 23 distinct values. Considering this, we have to set MaxBins to at least 23. Nevertheless, feel free to play with the previous parameters too. Alright, now it's time to create an LR estimator:

val model = new RandomForestRegressor().setFeaturesCol("features").setLabelCol("label")

Now let's build a pipeline estimator by chaining the transformer and the LR estimator:

println("Building ML pipeline") 
val pipeline = new Pipeline().setStages((Preproessing.stringIndexerStages :+ Preproessing.assembler) :+ model)

Before we start performing the cross-validation, we need to have a paramgrid. So let's start creating the paramgrid by specifying the number of trees, a number for maximum tree depth, and the number of maximum bins parameters, as follows:

val paramGrid = new ParamGridBuilder() 
      .addGrid(model.numTrees, NumTrees) 
      .addGrid(model.maxDepth, MaxDepth) 
      .addGrid(model.maxBins, MaxBins) 
      .build()

Now, for better and stable performance, let's prepare the K-fold cross-validation and grid search as a part of model tuning. As you can probably guess, I am going to perform 10-fold cross-validation. Feel free to adjust the number of folds based on your settings and dataset:

println("Preparing K-fold Cross Validation and Grid Search: Model tuning") 
val cv = new CrossValidator() 
      .setEstimator(pipeline) 
      .setEvaluator(new RegressionEvaluator) 
      .setEstimatorParamMaps(paramGrid) 
      .setNumFolds(numFolds)

Fantastic, we have created the cross-validation estimator. Now it's time to train the LR model:

println("Training model with Random Forest algorithm")  
val cvModel = cv.fit(Preproessing.trainingData)

Now that we have the fitted model, that means it is now capable of making predictions. So let's start evaluating the model on the train and validation set, and calculating RMSE, MSE, MAE, R-squared, and many more:

println("Evaluating model on train and validation set and calculating RMSE") 
val trainPredictionsAndLabels = cvModel.transform(Preproessing.trainingData).select("label", "prediction").map { case Row(label: Double, prediction: Double) => (label, prediction) }.rdd 
 
val validPredictionsAndLabels = cvModel.transform(Preproessing.validationData).select("label", "prediction").map { case Row(label: Double, prediction: Double) => (label, prediction) }.rdd 
 
val trainRegressionMetrics = new RegressionMetrics(trainPredictionsAndLabels) 
val validRegressionMetrics = new RegressionMetrics(validPredictionsAndLabels)

Great! We have managed to compute the raw prediction on the train and the test set. Let's hunt for the best model:

val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]

As already stated, by using RF, it is possible to measure the feature importance so that at a later stage, we can decide which features should be used and which ones are to be dropped from the DataFrame. Let's find the feature importance from the best model we just created for all features in ascending order, as follows:

val featureImportances = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel].featureImportances.toArray 
val FI_to_List_sorted = featureImportances.toList.sorted.toArray

Once we have the best fitted and cross-validated model, we can expect a good prediction accuracy. Now let's observe the results on the train and the validation set:

val output = "n=====================================================================n" + s"Param trainSample: ${Preproessing.trainSample}n" + 
      s"Param testSample: ${Preproessing.testSample}n" + 
      s"TrainingData count: ${Preproessing.trainingData.count}n" + 
      s"ValidationData count: ${Preproessing.validationData.count}n" + 
      s"TestData count: ${Preproessing.testData.count}n" +      "=====================================================================n" +   s"Param maxIter = ${MaxIter.mkString(",")}n" + 
      s"Param maxDepth = ${MaxDepth.mkString(",")}n" + 
      s"Param numFolds = ${numFolds}n" +      "=====================================================================n" +   s"Training data MSE = ${trainRegressionMetrics.meanSquaredError}n" + 
      s"Training data RMSE = ${trainRegressionMetrics.rootMeanSquaredError}n" + 
      s"Training data R-squared = ${trainRegressionMetrics.r2}n" + 
      s"Training data MAE = ${trainRegressionMetrics.meanAbsoluteError}n" + 
      s"Training data Explained variance = ${trainRegressionMetrics.explainedVariance}n" +      "=====================================================================n" +   s"Validation data MSE = ${validRegressionMetrics.meanSquaredError}n" + 
      s"Validation data RMSE = ${validRegressionMetrics.rootMeanSquaredError}n" + 
      s"Validation data R-squared = ${validRegressionMetrics.r2}n" + 
      s"Validation data MAE = ${validRegressionMetrics.meanAbsoluteError}n" + 
      s"Validation data Explained variance =
${validRegressionMetrics.explainedVariance}n" +      "=====================================================================n" +   s"CV params explained: ${cvModel.explainParams}n" + 
      s"RF params explained: ${bestModel.stages.last.asInstanceOf[RandomForestRegressionModel].explainParams}n" + 
      s"RF features importances:n ${Preproessing.featureCols.zip(FI_to_List_sorted).map(t => s"t${t._1} = ${t._2}").mkString("n")}n" +      "=====================================================================n"

Now, we print the preceding results as follows:

println(results)
>>>
Param trainSample: 1.0
 Param testSample: 1.0
 TrainingData count: 141194
 ValidationData count: 47124
 TestData count: 125546
 Param maxIter = 20
 Param maxDepth = 20
 Param numFolds = 10
 Training data MSE = 1340574.3409399686
 Training data RMSE = 1157.8317412042081
 Training data R-squared = 0.7642745310548124
 Training data MAE = 809.5917285994619
 Training data Explained variance = 8337897.224852404
 Validation data MSE = 4312608.024875177
 Validation data RMSE = 2076.6819749001475
 Validation data R-squared = 0.1369507149716651"
 Validation data MAE = 1273.0714382935894
 Validation data Explained variance = 8737233.110450774

So our predictive model shows an MAE of about 809.5917285994619 and 1273.0714382935894 for the training and test set respectively. The last result is important for understanding the feature importance (the preceding list is abridged to save space but you should receive the full list).

I have drawn both the categorical and continuous features, and their respective importance in Python, so I will not show the code here but only the graph. Let's see the categorical features showing feature importance as well as the corresponding feature number:

Figure 11: Random Forest categorical feature importance

From the preceding graph, it is clear that categorical features cat20, cat64, cat47, and cat69 are less important. Therefore, it would make sense to drop these features and retrain the Random Forest model to observe better performance.

Now let's see how the continuous features are correlated and contribute to the loss column. From the following figure, we can see that all continuous features are positively correlated with the loss column. This also signifies that these continuous features are not that important compared to the categorical ones we have seen in the preceding figure:

Figure 12: Correlations between the continuous features and the label

What we can learn from these two analyses is that we can naively drop some unimportant columns and train the Random Forest model to observe if there is any reduction in the MAE value for both the training and validation set. Finally, let's make a prediction on the test set:

println("Run prediction on the test set") 
cvModel.transform(Preproessing.testData) 
      .select("id", "prediction") 
      .withColumnRenamed("prediction", "loss") 
      .coalesce(1) // to get all the predictions in a single csv file                 
      .write.format("com.databricks.spark.csv") 
      .option("header", "true") 
      .save("output/result_RF.csv")

Also, similar to LR, you can stop the Spark session by invoking the stop() method. Now the generated result_RF.csv file should contain the loss against each ID, that is, claim.

Scala Machine Learning Projects

Scala Machine Learning Projects

Overview of this book

Related Content you might be interested in

Current Title:

Scala Machine Learning Projects

Java Deep Learning Projects

Predictive Analytics with TensorFlow

Deep Learning with TensorFlow

Boosting the performance using random forest regressor

Random Forest for classification and regression

Note