In this section, we'll see how to use GBTs to solve both regression and classification problems. In the previous two chapters, Chapter 2, *Scala for Regression Analysis*, and Chapter 3, *Scala for Learning Classification*, we solved the insurance severity claim and customer churn problems, which were regression and classification problems, respectively. In both cases, we used other classic models. Here, we'll see how to solve the same problems with tree-based ensemble techniques, using the GBT implementation from the Spark ML package in Scala.

# Gradient boosted trees for supervised learning

# Gradient boosted trees for classification

We know the customer churn prediction problem from Chapter 3, *Scala for Learning Classification*, and we know the data well. Since we have already covered the working principles of GBTs, let's start using the Spark-based implementation:

- Instantiate a `GBTClassifier` estimator by invoking the `GBTClassifier()` interface:

```scala
val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(1234567L)
```

- We now have three transformers and an estimator ready. Chain them in a single pipeline so that each of them acts as a stage:

```scala
// Chain indexers, assembler, and tree in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(ScalaClassification.PipelineConstruction.ipindexer,
    ScalaClassification.PipelineConstruction.labelindexer,
    ScalaClassification.PipelineConstruction.assembler,
    gbt))
```

- Define the `paramGrid` variable to perform a grid search over the hyperparameter space. Note that, unlike DTs, Spark's GBT implementation does not accept the `gini` or `entropy` impurity settings, so we tune the number of boosting iterations (`maxIter`) instead:

```scala
// Search through GBT's maxDepth, maxIter, and maxBins parameters for the best model
val paramGrid = new ParamGridBuilder()
  .addGrid(gbt.maxDepth, 3 :: 5 :: 10 :: Nil)
  .addGrid(gbt.maxIter, 10 :: 20 :: Nil) // number of trees in the ensemble
  .addGrid(gbt.maxBins, 5 :: 10 :: 20 :: Nil)
  .build()
```

- Define a `BinaryClassificationEvaluator` evaluator to evaluate the model:

```scala
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("prediction")
```

- We use a `CrossValidator` to perform 10-fold cross-validation for best model selection:

```scala
// Set up 10-fold cross-validation
val numFolds = 10
val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
```

- Now let's call the `fit` method so that the complete predefined pipeline, including all feature preprocessing and the GBT classifier, is executed multiple times, each time with a different hyperparameter vector:

```scala
val cvModel = crossval.fit(Preprocessing.trainDF)
```

Now it's time to evaluate the predictive power of the GBT model on the test dataset:

- Transform the test set with the model pipeline, which will map the features according to the same mechanism we described in the preceding feature engineering step:

```scala
val predictions = cvModel.transform(Preprocessing.testSet)
predictions.show(10)
```

This will lead us to the following DataFrame showing the predicted labels against the actual labels. Additionally, it shows the raw probabilities:

However, after seeing the preceding prediction DataFrame, it is really difficult to guess the classification accuracy.

- In the second step, the evaluation is done using the `BinaryClassificationEvaluator`, as follows:

```scala
val accuracy = evaluator.evaluate(predictions)
println("Classification accuracy: " + accuracy)
```

This will give us the classification accuracy:

```
Classification accuracy: 0.869460802355539
```

So we get about 87% from our binary classification model; note that the default metric of `BinaryClassificationEvaluator` is actually the area under the ROC curve rather than plain accuracy. Just like with SVM and LR, we will also observe the area under the precision-recall curve and the area under the ROC curve based on the following RDD, which contains the raw scores on the test set:

```scala
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double]))
```

The preceding RDD can be used for computing the previously mentioned performance metrics:

```scala
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
println("Area under the precision-recall curve: " + metrics.areaUnderPR)
println("Area under the receiver operating characteristic (ROC) curve: " + metrics.areaUnderROC)
```

This gives the following values:

```
Area under the precision-recall curve: 0.7270259009251356
Area under the receiver operating characteristic (ROC) curve: 0.869460802355539
```
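To build intuition for what `areaUnderROC` measures, here is a standalone plain-Scala toy check (not part of the chapter's pipeline, with made-up scores): the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting half.

```scala
// Toy check: AUC as the probability that a random positive outscores a
// random negative (ties count 0.5).
val scored = Seq((0.9, 1.0), (0.8, 1.0), (0.7, 1.0), (0.85, 0.0), (0.3, 0.0)) // (score, label)
val pos = scored.collect { case (s, 1.0) => s }
val neg = scored.collect { case (s, 0.0) => s }
val comparisons = for (p <- pos; n <- neg)
  yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
val auc = comparisons.sum / comparisons.size
```

Here, one negative (0.85) outscores two of the three positives, so the AUC is 4/6 rather than a perfect 1.0.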

In this case, the evaluation returns an area under the ROC curve of 87% but an area under the precision-recall curve of only 73%, which is still better than SVM and LR. Next, we compute a few more metrics; the true and false positive and negative prediction rates are also useful for evaluating the model's performance:

```scala
val TC = tVSpDF.count() // Total count
val tp = tVSpDF.filter($"prediction" === 0.0).filter($"label" === $"prediction")
  .count() / TC.toDouble // True positive rate
val tn = tVSpDF.filter($"prediction" === 1.0).filter($"label" === $"prediction")
  .count() / TC.toDouble // True negative rate
val fp = tVSpDF.filter($"prediction" === 1.0).filter(not($"label" === $"prediction"))
  .count() / TC.toDouble // False positive rate
val fn = tVSpDF.filter($"prediction" === 0.0).filter(not($"label" === $"prediction"))
  .count() / TC.toDouble // False negative rate
```

Additionally, we compute the Matthews correlation coefficient:

```scala
val MCC = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (fp + tn) * (tn + fn))
```

Let's print these rates together with the MCC:

```scala
println("True positive rate: " + tp * 100 + "%")
println("False positive rate: " + fp * 100 + "%")
println("True negative rate: " + tn * 100 + "%")
println("False negative rate: " + fn * 100 + "%")
println("Matthews correlation coefficient: " + MCC)
```

Now let's take a look at the true positive, false positive, true negative, and false negative rates. Additionally, we see the MCC:

```
True positive rate: 0.7781109445277361
False positive rate: 0.07946026986506746
True negative rate: 0.1184407796101949
False negative rate: 0.0239880059970015
Matthews correlation coefficient: 0.6481780577821629
```
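As a sanity check, we can recompute the MCC by hand from the four printed rates. Since each rate shares the same denominator (the total count), the denominators cancel and the rates can be plugged straight into the MCC formula. This is a standalone Scala snippet, not part of the pipeline:

```scala
// Recompute the MCC from the printed tp/fp/tn/fn rates.
val tp = 0.7781109445277361
val fp = 0.07946026986506746
val tn = 0.1184407796101949
val fn = 0.0239880059970015
val mcc = (tp * tn - fp * fn) /
  math.sqrt((tp + fp) * (tp + fn) * (fp + tn) * (tn + fn))
// mcc matches the reported Matthews correlation coefficient of ~0.6482
```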

These rates look promising: the positive MCC indicates a mostly positive correlation between predictions and labels, suggesting a robust classifier. Now, similar to DTs, GBTs can be debugged during classification; to print the trees and select the most important features, run the last few lines of code from the DT example. Note that we still confined the hyperparameter space to small values of `maxDepth`, `maxBins`, and the number of iterations. Remember that bigger trees will most likely perform better. Therefore, feel free to play around with this code, add features, and use a bigger hyperparameter space, for instance, with bigger trees.

# GBTs for regression

To minimize a loss function, GBTs train many DTs iteratively: on each iteration, the algorithm uses the ensemble built so far to predict the label of each training instance, and then fits a new tree to correct the remaining errors.
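To make this loop concrete, here is a minimal plain-Scala sketch (a toy illustration, not Spark's actual implementation) that fits depth-1 stumps to the residuals of a tiny one-dimensional regression problem; each round's stump prediction is added to the ensemble, scaled by a learning rate, and the training error shrinks round by round:

```scala
// Toy gradient boosting for regression with squared loss (illustrative only).
// Weak learner: a depth-1 stump predicting the mean residual on each side
// of the best split over a single feature.
val data = Seq((1.0, 3.0), (2.0, 5.0), (3.0, 8.0), (4.0, 12.0)) // (feature, label)
val learningRate = 0.5
val rounds = 50

def fitStump(points: Seq[(Double, Double)]): Double => Double = {
  // Candidate thresholds halfway between consecutive distinct feature values
  val xs = points.map(_._1).distinct.sorted
  val candidates = xs.init.zip(xs.tail).map { case (a, b) => (a + b) / 2 }
  val best = candidates.minBy { t =>
    val (l, r) = points.partition(_._1 <= t)
    def sse(side: Seq[(Double, Double)]): Double = {
      val m = side.map(_._2).sum / side.size
      side.map { case (_, v) => (v - m) * (v - m) }.sum
    }
    sse(l) + sse(r)
  }
  val (l, r) = points.partition(_._1 <= best)
  val leftMean = l.map(_._2).sum / l.size
  val rightMean = r.map(_._2).sum / r.size
  x => if (x <= best) leftMean else rightMean
}

// Start from the mean label, then repeatedly fit a stump to the residuals.
val preds = Array.fill(data.size)(data.map(_._2).sum / data.size)
for (_ <- 1 to rounds) {
  val residuals = data.indices.map(i => (data(i)._1, data(i)._2 - preds(i)))
  val stump = fitStump(residuals)
  for (i <- data.indices) preds(i) += learningRate * stump(data(i)._1)
}

// Training MSE after boosting; it starts at 11.5 (predicting the mean)
// and drops close to zero as the stumps absorb the residuals.
val mse = data.indices.map(i => math.pow(data(i)._2 - preds(i), 2)).sum / data.size
```

Spark's implementation differs in many ways (deeper trees, subsampling, distributed split finding), but the additive fit-to-residuals structure is the same.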

Similar to decision trees, GBTs can do the following:

- Handle both categorical and numerical features
- Be used for both binary classification and regression (multiclass classification is not yet supported)
- Work without feature scaling
- Capture non-linearity and feature interactions from very high-dimensional datasets

Suppose we have *N* data instances, where *x_i* denotes the features of instance *i* and *y_i* denotes its label, and let *f(x_i)* be the GBT model's predicted label for instance *i*. A GBT tries to minimize any of the following losses:

- Log loss: 2 Σ_i log(1 + exp(-2 y_i f(x_i)))
- Squared error: Σ_i (y_i - f(x_i))²
- Absolute error: Σ_i |y_i - f(x_i)|

The first equation is called the *log* loss, which is twice the binomial negative *log* likelihood. The second, squared error, is commonly referred to as the *L2* loss and is the default loss for GBT-based regression tasks. Finally, the third, absolute error, is commonly referred to as the *L1* loss; it is recommended when the data points have many outliers, as it is more robust than squared error.
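To make the three losses concrete, the following standalone Scala snippet evaluates each on a few toy (label, prediction) pairs; the values are illustrative, and for the log loss the labels are assumed to be in {-1, +1}:

```scala
// Toy evaluation of the three GBT losses (illustrative values only).
// Log loss expects labels y in {-1, +1} and raw margins f(x):
val classPairs = Seq((1.0, 0.8), (-1.0, -1.2), (1.0, -0.3)) // (y, f(x))
val logLoss = classPairs.map { case (y, f) => 2.0 * math.log1p(math.exp(-2.0 * y * f)) }.sum

// L2 (squared error) and L1 (absolute error) on real-valued predictions:
val regPairs = Seq((3.0, 2.5), (5.0, 5.5), (8.0, 7.0)) // (y, f(x))
val l2 = regPairs.map { case (y, f) => (y - f) * (y - f) }.sum // 0.25 + 0.25 + 1.0
val l1 = regPairs.map { case (y, f) => math.abs(y - f) }.sum   // 0.5 + 0.5 + 1.0
```

Note how a large error is penalized quadratically by L2 but only linearly by L1, which is exactly why L1 is preferred when outliers are present.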

Now that we know the minimum working principle of the GBT regression algorithm, we can get started. Let's instantiate a `GBTRegressor` estimator by invoking the `GBTRegressor()` interface:

```scala
val gbtModel = new GBTRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
```

We can set the maximum number of bins, the number of boosting iterations (that is, the number of trees), the maximum depth, and the impurity when instantiating the preceding estimator. However, since we'll perform k-fold cross-validation, we can set those parameters while creating the `paramGrid` variable instead. Note that the GBT estimator exposes the number of trees through `maxIter`:

```scala
// Search through GBT's parameters for the best model
val paramGrid = new ParamGridBuilder()
  .addGrid(gbtModel.impurity, "variance" :: Nil) // variance is the only impurity for regression
  .addGrid(gbtModel.maxBins, 25 :: 30 :: 35 :: Nil)
  .addGrid(gbtModel.maxDepth, 5 :: 10 :: 15 :: Nil)
  .addGrid(gbtModel.maxIter, 3 :: 5 :: 10 :: 15 :: Nil) // number of boosting iterations (trees)
  .build()
```

**Validation while training**: Gradient boosting can overfit, especially when you train your model with more trees. In order to prevent this issue, it is useful to validate (for example, using cross-validation) while carrying out the training.

For a better and more stable performance, let's prepare the k-fold cross-validation and grid search as part of the model tuning. As you can guess, I am going to perform 10-fold cross-validation. Feel free to adjust the number of folds based on your settings and dataset:

```scala
println("Preparing K-fold Cross Validation and Grid Search: Model tuning")
val numFolds = 10 // 10-fold cross-validation
val cv = new CrossValidator()
  .setEstimator(gbtModel)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
```

Fantastic! We have created the cross-validation estimator. Now it's time to train the `GBTRegressor` model with cross-validation:

```scala
println("Training model with the GBTRegressor algorithm")
val cvModel = cv.fit(trainingData)
```

Now that we have the fitted model, we can make predictions. Let's evaluate the model on the test set and calculate the RMSE, MSE, MAE, and R-squared metrics:

```scala
println("Evaluating the model on the test set and calculating the regression metrics")
val testPredictionsAndLabels = cvModel.transform(testData).select("label", "prediction")
  .map { case Row(label: Double, prediction: Double) => (label, prediction) }.rdd
val testRegressionMetrics = new RegressionMetrics(testPredictionsAndLabels)
```

Once we have the best-fitted and cross-validated model, we can expect good prediction accuracy. Now, let's observe the results on the test set:

```scala
val results = "\n=====================================================================\n" +
  s"TrainingData count: ${trainingData.count}\n" +
  s"TestData count: ${testData.count}\n" +
  "=====================================================================\n" +
  s"TestData MSE = ${testRegressionMetrics.meanSquaredError}\n" +
  s"TestData RMSE = ${testRegressionMetrics.rootMeanSquaredError}\n" +
  s"TestData R-squared = ${testRegressionMetrics.r2}\n" +
  s"TestData MAE = ${testRegressionMetrics.meanAbsoluteError}\n" +
  s"TestData explained variance = ${testRegressionMetrics.explainedVariance}\n" +
  "=====================================================================\n"
println(results)
```

The following output shows the MSE, RMSE, R-squared, MAE and explained variance on the test set:

```
=====================================================================
TrainingData count: 80
TestData count: 55
=====================================================================
TestData MSE = 5.99847335425882
TestData RMSE = 2.4491780977011084
TestData R-squared = 0.4223425609926217
TestData MAE = 2.0564380367107646
TestData explained variance = 20.340666319995183
=====================================================================
```
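To see how `RegressionMetrics` arrives at such numbers, here is a standalone Scala recomputation of MSE, RMSE, MAE, and R-squared on a handful of toy (label, prediction) pairs; the values are illustrative, not taken from the actual test set:

```scala
// Recompute the regression metrics by hand on toy (label, prediction) pairs.
val pairs = Seq((3.0, 2.5), (5.0, 5.5), (8.0, 7.0), (12.0, 11.0))
val n = pairs.size.toDouble

val mse = pairs.map { case (y, f) => (y - f) * (y - f) }.sum / n // mean squared error
val rmse = math.sqrt(mse)                                        // RMSE = sqrt(MSE)
val mae = pairs.map { case (y, f) => math.abs(y - f) }.sum / n   // mean absolute error

// R-squared: 1 minus residual sum of squares over total sum of squares
val meanY = pairs.map(_._1).sum / n
val ssTot = pairs.map { case (y, _) => (y - meanY) * (y - meanY) }.sum
val ssRes = pairs.map { case (y, f) => (y - f) * (y - f) }.sum
val r2 = 1.0 - ssRes / ssTot
```

A quick consistency check on the reported output above: the printed RMSE is indeed the square root of the printed MSE.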

Great! We have computed the raw predictions on the test set, and we can see an improvement compared to the LR and DT regression models. Let's extract the model that helps us achieve this accuracy:

```scala
val bestModel = cvModel.bestModel.asInstanceOf[GBTRegressionModel]
```

Additionally, we can see how the decisions were made by observing the DTs in the ensemble:

```scala
println("Decision tree from best cross-validated model: " + bestModel.toDebugString)
```

In the following output, the `toDebugString` method prints the trees' decision nodes and the final prediction outcomes at the leaves:

```
Decision tree from best cross-validated model with 10 trees
Tree 0 (weight 1.0):
  If (feature 0 <= 16.0)
   If (feature 2 <= 1.0)
    If (feature 15 <= 0.0)
     If (feature 13 <= 0.0)
      If (feature 16 <= 0.0)
       If (feature 0 <= 3.0)
        If (feature 3 <= 0.0)
         Predict: 6.128571428571427
        Else (feature 3 > 0.0)
         Predict: 3.3999999999999986
...
Tree 9 (weight 1.0):
  If (feature 0 <= 22.0)
   If (feature 2 <= 1.0)
    If (feature 1 <= 1.0)
     If (feature 0 <= 1.0)
      Predict: 3.4
...
```

With GBT, it is possible to measure feature importance so that, at a later stage, we can decide which features to keep and which to drop from the DataFrame. Let's print the feature importances of the best model we just created, sorted in ascending order, as follows:

```scala
val featureImportances = bestModel.featureImportances.toArray
val FI_to_List_sorted = featureImportances.toList.sorted.toArray
println("Feature importance generated by the best model: ")
for (x <- FI_to_List_sorted) println(x)
```

Following is the feature importance generated by the model:

```
Feature importance generated by the best model:
0.0
0.0
5.767724652714395E-4
0.001616872851121874
0.006381209526062637
0.008867810069950395
0.009420668763121653
0.01802097742361489
0.026755738338777407
0.02761531441902482
0.031208534172407782
0.033620224027091
0.03801721834820778
0.05263475066123412
0.05562565266841311
0.13221209076999635
0.5574261654957049
```

This last result is important for understanding feature importance. As you can see, the GBT model ranks some features as much more important than others; for example, the last two features are the most important, while the first two contribute almost nothing. We could drop the unimportant columns, retrain the model, and observe whether the R-squared and MAE values on the test set change.
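One caveat with sorting the raw importance values, as above, is that the sort discards which feature each value belongs to. A small standalone Scala sketch (with made-up importance values) shows how pairing importances with their feature indices keeps that mapping, which is what we need when deciding which columns to drop:

```scala
// Pair each importance with its feature index before sorting, so we know
// WHICH features matter rather than just how the values are distributed.
val importances = Array(0.0, 0.5574, 0.0006, 0.1322, 0.0376) // toy values
val ranked = importances.zipWithIndex.sortBy(p => -p._1) // (importance, index), descending

val topFeatures = ranked.take(2).map(_._2)                // indices of the two most important
val dropCandidates = ranked.filter(_._1 < 1e-3).map(_._2) // indices with near-zero importance
```

The indices in `dropCandidates` can then be mapped back to column names to prune the DataFrame before retraining.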