Machine Learning with Scala Quick Start Guide

By: Md. Rezaul Karim, Ajay Kumar N
Overview of this book

Scala is a highly scalable language that integrates object-oriented and functional programming concepts, making it easy to build complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to develop and train effective machine learning models in Scala. The book starts with an introduction to machine learning, covering both machine learning and deep learning basics. It then explains how to use Scala-based ML libraries to solve classification and regression problems using linear regression, generalized linear regression, logistic regression, support vector machine, and Naïve Bayes algorithms. It also covers tree-based ensemble techniques for solving both classification and regression problems. Moving ahead, it covers unsupervised learning techniques, such as dimensionality reduction, clustering, and recommender systems. Finally, it provides a brief overview of deep learning using a real-life example in Scala.

Gradient boosted trees for supervised learning

In this section, we'll see how to use GBTs to solve both regression and classification problems. In the previous two chapters, Chapter 2, Scala for Regression Analysis, and Chapter 3, Scala for Learning Classification, we solved the customer churn and insurance severity claim problems, which were classification and regression problems, respectively. In both cases, we used other classic models; now we'll see how to solve them with tree-based ensemble techniques. We'll use the GBT implementation from the Spark ML package in Scala.

Gradient boosted trees for classification

We know the customer churn prediction problem from Chapter 3, Scala for Learning Classification, and we know the data well. We already know the working principles of GBT, so let's start using the Spark-based implementation of GBT:

  1. Instantiate a GBTClassifier estimator by invoking the GBTClassifier() interface:
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(1234567L)
  2. We now have three transformers and an estimator ready. Let's chain them in a single pipeline, where each of them acts as a stage:
// Chain the indexers, the assembler, and the tree in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(ScalaClassification.PipelineConstruction.ipindexer,
    ScalaClassification.PipelineConstruction.labelindexer,
    ScalaClassification.PipelineConstruction.assembler,
    gbt))
  3. Define the paramGrid variable to perform a grid search over the hyperparameter space:
// Search through the GBT's maxDepth and maxBins parameters for the best model.
// Note: the gini/entropy impurity settings used for DT and RF classifiers are
// ignored by GBTs, whose trees are built using variance, so we drop that grid.
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, 3 :: 5 :: 10 :: Nil) // :: 15 :: 20 :: 25 :: 30 :: Nil)
  .addGrid(gbt.maxBins, 5 :: 10 :: 20 :: Nil) //10 :: 15 :: 25 :: 35 :: 45 :: Nil)
  .build()
  4. Define a BinaryClassificationEvaluator evaluator to evaluate the model:
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("prediction")

  5. We use a CrossValidator to perform 10-fold cross-validation for best model selection:
// Set up 10-fold cross-validation
val numFolds = 10
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
  6. Now let's call the fit method so that the complete predefined pipeline, including all feature preprocessing and the GBT classifier, is executed multiple times, each time with a different hyperparameter vector:
val cvModel = crossval.fit(Preprocessing.trainDF)
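
Since 10-fold cross-validation over the full grid trains many models, this step can be slow. As a side note (a minimal sketch, assuming Spark 2.3+), CrossValidator can evaluate independent parameter settings in parallel:

// Evaluate up to 4 parameter settings in parallel (Spark 2.3+)
crossval.setParallelism(4)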

Now it's time to evaluate the predictive power of the GBT model on the test dataset:

  1. Transform the test set with the model pipeline, which will map the features according to the same mechanism we described in the preceding feature engineering step:
val predictions = cvModel.transform(Preprocessing.testSet)
predictions.show(10)

This gives us a DataFrame showing the predicted labels against the actual labels, along with the raw probabilities.

However, it is difficult to judge the classification accuracy just by looking at the prediction DataFrame.

  2. In the second step, the evaluation is done using the BinaryClassificationEvaluator, as follows:
val accuracy = evaluator.evaluate(predictions)
println("Classification accuracy: " + accuracy)

This will give us the classification accuracy:

Accuracy: 0.869460802355539
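
Note that the value printed as accuracy here is BinaryClassificationEvaluator's default metric, which is actually the area under the ROC curve. As a minimal sketch, the metric can also be chosen explicitly (prEvaluator is a new name introduced here for illustration):

// Explicitly choose the evaluation metric (the default is "areaUnderROC")
val prEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("prediction")
  .setMetricName("areaUnderPR")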

So we get about 87% classification accuracy from our binary classification model. Just like with SVM and LR, we will observe the area under the precision-recall curve and the area under the ROC curve based on the following RDD, which contains the raw scores on the test set:

val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(row => (row(0).asInstanceOf[Double], row(1).asInstanceOf[Double]))

The preceding RDD can be used for computing the previously mentioned performance metrics:

val metrics = new BinaryClassificationMetrics(predictionAndLabels)
println("Area under the precision-recall curve: " + metrics.areaUnderPR)
println("Area under the receiver operating characteristic (ROC) curve: " + metrics.areaUnderROC)

This will print the areas under the precision-recall and ROC curves:

Area under the precision-recall curve: 0.7270259009251356
Area under the receiver operating characteristic (ROC) curve: 0.869460802355539

In this case, the evaluation returns 87% area under the ROC curve but only 73% area under the precision-recall curve, which is still much better than SVM and LR. Next, we calculate a few more metrics; the true and false positive and negative rates are also useful for evaluating the model's performance (note that each is computed here as a fraction of the total count):

val TC = predictions.count() // Total count

val tp = predictions.filter($"prediction" === 0.0).filter($"label" === $"prediction")
  .count() / TC.toDouble // True positive rate
val tn = predictions.filter($"prediction" === 1.0).filter($"label" === $"prediction")
  .count() / TC.toDouble // True negative rate
val fp = predictions.filter($"prediction" === 1.0).filter(not($"label" === $"prediction"))
  .count() / TC.toDouble // False positive rate
val fn = predictions.filter($"prediction" === 0.0).filter(not($"label" === $"prediction"))
  .count() / TC.toDouble // False negative rate

Additionally, we compute the Matthews correlation coefficient:

val MCC = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (fp + tn) * (tn + fn)) 

Let's print these rates, along with the MCC:

println("True positive rate: " + tp *100 + "%")
println("False positive rate: " + fp * 100 + "%")
println("True negative rate: " + tn * 100 + "%")
println("False negative rate: " + fn * 100 + "%")
println("Matthews correlation coefficient: " + MCC)

Now let's take a look at the true positive, false positive, true negative, and false negative rates. Additionally, we see the MCC:

True positive rate: 0.7781109445277361
False positive rate: 0.07946026986506746
True negative rate: 0.1184407796101949
False negative rate: 0.0239880059970015
Matthews correlation coefficient: 0.6481780577821629

These rates look promising: the positive MCC indicates a mostly positive correlation between the predicted and actual labels, suggesting a robust classifier. Now, similar to DTs, GBTs can be debugged after classification; to print the boosted trees and select the most important features, run the last few lines of code from the DT example, as sketched below. Note that we still confine the hyperparameter space by limiting maxBins and maxDepth to small values. Remember that bigger trees will most likely perform better. Therefore, feel free to play around with this code, add features, and use a bigger hyperparameter space, for instance, with bigger trees.
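
For instance, a minimal sketch of that debugging step, assuming the pipeline layout above (where the GBT classifier is the last stage), could look as follows:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.GBTClassificationModel

// Extract the best pipeline found by cross-validation and its GBT stage
val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val bestGbtModel = bestPipelineModel.stages.last.asInstanceOf[GBTClassificationModel]

// Print the boosted trees and the per-feature importance scores
println("Learned GBT classification model:\n" + bestGbtModel.toDebugString)
println("Feature importances: " + bestGbtModel.featureImportances)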

GBTs for regression

To minimize a loss function, GBTs iteratively train a sequence of DTs. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance, compares the predictions against the true labels, and fits the next tree to the remaining errors, as illustrated in the sketch below.
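
The following is a minimal, self-contained sketch of this idea for squared-error loss, using a trivial one-split "stump" in place of a real DT (all numbers are made up for illustration):

// Gradient boosting in miniature: repeatedly fit a weak learner to the
// pseudo-residuals (the negative gradient of squared error) and add its
// shrunken predictions to the ensemble.
object BoostingSketch extends App {
  val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
  val y = Array(1.3, 1.9, 3.2, 3.8, 5.1)
  val learningRate = 0.5

  // Start from a constant prediction: the mean label
  var pred = Array.fill(x.length)(y.sum / y.length)

  for (_ <- 1 to 20) {
    // Pseudo-residuals for squared error: y - current prediction
    val residuals = x.indices.map(i => y(i) - pred(i))
    // A trivial "stump": split at the median x, predict the mean residual per side
    val split = x(x.length / 2)
    def meanResidual(side: Int => Boolean): Double = {
      val idx = x.indices.filter(side)
      if (idx.isEmpty) 0.0 else idx.map(residuals).sum / idx.size
    }
    val leftValue = meanResidual(i => x(i) < split)
    val rightValue = meanResidual(i => x(i) >= split)
    // Ensemble update: add the new stump's shrunken prediction
    pred = x.indices.map { i =>
      pred(i) + learningRate * (if (x(i) < split) leftValue else rightValue)
    }.toArray
  }
  println("Boosted predictions: " + pred.mkString(", "))
}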

Similar to decision trees, GBTs:

  • Handle both categorical and numerical features
  • Can be used for both binary classification and regression (multiclass classification is not yet supported)
  • Do not require feature scaling
  • Capture non-linearity and feature interactions, even in very high-dimensional datasets

Suppose we have N data instances, where xi denotes the features of instance i and yi its label; f(xi) is then the GBT model's predicted label for instance i. A GBT tries to minimize one of the following losses:

Log loss: $2\sum_{i=1}^{N}\log\left(1+\exp\left(-2\,y_i f(x_i)\right)\right)$

Squared error: $\sum_{i=1}^{N}\left(y_i - f(x_i)\right)^2$

Absolute error: $\sum_{i=1}^{N}\left|y_i - f(x_i)\right|$

The first equation, log loss, is twice the binomial negative log-likelihood and is used for classification. The second, squared error, is commonly referred to as L2 loss and is the default loss for GBT-based regression tasks. The third, absolute error, is commonly referred to as L1 loss; it is recommended when the data points have many outliers, as it is more robust than squared error.
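
To make the outlier remark concrete, here is a tiny illustration (with made-up numbers) of how a single bad prediction affects the two regression losses:

// A single far-off prediction inflates the squared (L2) loss far more than
// the absolute (L1) loss, which is why L1 is more robust to outliers
val labels = Array(1.0, 2.0, 3.0, 4.0)
val preds = Array(1.1, 2.2, 2.9, 9.0) // the last prediction is an outlier
val l2 = labels.zip(preds).map { case (y, f) => (y - f) * (y - f) }.sum
val l1 = labels.zip(preds).map { case (y, f) => math.abs(y - f) }.sum
println(s"L2 loss: $l2, L1 loss: $l1") // L2 is about 25.06, L1 only 5.4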

Now that we know the basic working principles of the GBT regression algorithm, we can get started. Let's instantiate a GBTRegressor estimator by invoking the GBTRegressor() interface:

val gbtModel = new GBTRegressor().setFeaturesCol("features").setLabelCol("label")

We can set the max bins, the number of trees, the max depth, and the impurity when instantiating the preceding estimator. A minimal sketch of that direct-setter style follows:
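
// A sketch of configuring the GBTRegressor directly via setters; note that
// in Spark ML the number of boosted trees is set via maxIter
val gbtConfigured = new GBTRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setMaxBins(30)
  .setMaxDepth(10)
  .setMaxIter(10) // number of trees (boosting iterations)
  .setStepSize(0.1) // learning rate

Since we'll perform k-fold cross-validation, however, we'll set those parameters while creating the paramGrid variable instead: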

// Search through the GBT's parameters for the best model
val paramGrid = new ParamGridBuilder()
  .addGrid(gbtModel.impurity, "variance" :: Nil) // variance for regression
  .addGrid(gbtModel.maxBins, 25 :: 30 :: 35 :: Nil)
  .addGrid(gbtModel.maxDepth, 5 :: 10 :: 15 :: Nil)
  .addGrid(gbtModel.maxIter, 3 :: 5 :: 10 :: 15 :: Nil) // maxIter = number of trees
  .build()
Validation while training: Gradient boosting can overfit, especially when you train your model with more trees. In order to prevent this issue, it is useful to validate while carrying out the training, for example, using cross-validation or a held-out validation set, as sketched below.
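
As a lighter-weight sketch of that advice, Spark's TrainValidationSplit evaluates each parameter combination against a single held-out split instead of k folds:

import org.apache.spark.ml.tuning.TrainValidationSplit

// Hold out 20% of the training data for validation during model selection
val trainValSplit = new TrainValidationSplit()
  .setEstimator(gbtModel)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.8)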

For better and more stable performance, let's prepare the k-fold cross-validation and grid search as part of the model tuning. As you can guess, I am going to perform 10-fold cross-validation. Feel free to adjust the number of folds based on your settings and dataset:

println("Preparing K-fold Cross Validation and Grid Search: Model tuning")
val numFolds = 10 // 10-fold cross-validation
val cv = new CrossValidator()
.setEstimator(gbtModel)
.setEvaluator(new RegressionEvaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(numFolds)

Fantastic! We have created the cross-validation estimator. Now it's time to train the GBTRegressor model with cross-validation:

println("Training model with GBTRegressor algorithm")
val cvModel = cv.fit(trainingData)

Now that we have the fitted model, we can make predictions. Let's evaluate the model on the test set and calculate the RMSE, MSE, MAE, and R-squared error:

println("Evaluating the model on the test set and calculating the regression metrics")
val trainPredictionsAndLabels = cvModel.transform(testData).select("label", "prediction")
.map { case Row(label: Double, prediction: Double)
=> (label, prediction) }.rdd

val testRegressionMetrics = new RegressionMetrics(trainPredictionsAndLabels)

Once we have the best-fitted and cross-validated model, we can expect good prediction accuracy. Now let's observe the results on the test set:

val results = "\n=====================================================================\n" +
  s"TrainingData count: ${trainingData.count}\n" +
  s"TestData count: ${testData.count}\n" +
  "=====================================================================\n" +
  s"TestData MSE = ${testRegressionMetrics.meanSquaredError}\n" +
  s"TestData RMSE = ${testRegressionMetrics.rootMeanSquaredError}\n" +
  s"TestData R-squared = ${testRegressionMetrics.r2}\n" +
  s"TestData MAE = ${testRegressionMetrics.meanAbsoluteError}\n" +
  s"TestData explained variance = ${testRegressionMetrics.explainedVariance}\n" +
  "=====================================================================\n"
println(results)

The following output shows the MSE, RMSE, R-squared, MAE, and explained variance on the test set:

=====================================================================
TrainingData count: 80
TestData count: 55
=====================================================================
TestData MSE = 5.99847335425882
TestData RMSE = 2.4491780977011084
TestData R-squared = 0.4223425609926217
TestData MAE = 2.0564380367107646
TestData explained variance = 20.340666319995183
=====================================================================

Great! We have managed to compute the raw predictions on the test set, and we can see an improvement compared to the LR and DT regression models. Now let's extract the model that helped us achieve this accuracy:

val bestModel = cvModel.bestModel.asInstanceOf[GBTRegressionModel]

Additionally, we can see how the decisions were made by observing the DTs in the ensemble:

println("Decision tree from best cross-validated model: " + bestModel.toDebugString)

In the following output, the toDebugString method prints the tree's decision nodes and the final prediction outcomes at the leaves:

Decision tree from best cross-validated model with 10 trees
Tree 0 (weight 1.0):
  If (feature 0 <= 16.0)
    If (feature 2 <= 1.0)
      If (feature 15 <= 0.0)
        If (feature 13 <= 0.0)
          If (feature 16 <= 0.0)
            If (feature 0 <= 3.0)
              If (feature 3 <= 0.0)
                Predict: 6.128571428571427
              Else (feature 3 > 0.0)
                Predict: 3.3999999999999986
....
Tree 9 (weight 1.0):
  If (feature 0 <= 22.0)
    If (feature 2 <= 1.0)
      If (feature 1 <= 1.0)
        If (feature 0 <= 1.0)
          Predict: 3.4
...

As with random forests, it is possible to measure feature importance so that, at a later stage, we can decide which features to use and which ones to drop from the DataFrame. Let's find the feature importance of the best model we just created, with all the features arranged in ascending order, as follows:

val featureImportances = bestModel.featureImportances.toArray

val FI_to_List_sorted = featureImportances.toList.sorted.toArray
println("Feature importance generated by the best model: ")
for(x <- FI_to_List_sorted) println(x)
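
Note that sorting the raw importance values discards which feature each score belongs to. A small sketch that keeps the feature index attached to its score:

// Pair each importance with its feature index, then sort by importance (descending)
val indexedImportances = featureImportances.zipWithIndex
  .map { case (importance, idx) => (idx, importance) }
  .sortBy { case (_, importance) => -importance }
indexedImportances.foreach { case (idx, importance) =>
  println(s"Feature $idx: $importance")
}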

The following is the feature importance generated by the model:

Feature importance generated by the best model:
0.0
0.0
5.767724652714395E-4
0.001616872851121874
0.006381209526062637
0.008867810069950395
0.009420668763121653
0.01802097742361489
0.026755738338777407
0.02761531441902482
0.031208534172407782
0.033620224027091
0.03801721834820778
0.05263475066123412
0.05562565266841311
0.13221209076999635
0.5574261654957049

This last result is important for understanding feature importance. As you can see, the GBT model has ranked some features as more important than others; for example, the last two features are the most important, while the first two contribute almost nothing. We can drop the unimportant columns and retrain the model to observe whether there is any reduction in the R-squared and MAE values on the test set, as sketched below.
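
As a final hedged sketch (the column names here are hypothetical, since they depend on the dataset at hand), dropping features amounts to re-running the VectorAssembler with a reduced input list and retraining:

import org.apache.spark.ml.feature.VectorAssembler

// Keep only the columns that scored high in featureImportances (hypothetical names)
val importantCols = Array("top_feature_1", "top_feature_2", "top_feature_3")
val reducedAssembler = new VectorAssembler()
  .setInputCols(importantCols)
  .setOutputCol("reducedFeatures")

// Point a fresh estimator at the reduced feature vector and retrain
val reducedGbt = new GBTRegressor()
  .setFeaturesCol("reducedFeatures")
  .setLabelCol("label")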