To minimize a `loss` function, **Gradient Boosting Trees** (**GBTs**) iteratively train many decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance.

The raw predictions are then compared with the true labels, and the dataset is re-labeled to put more emphasis on training instances with poor predictions. Thus, in the next iteration, the new decision tree helps correct the previous mistakes.
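To make the re-labeling step concrete, here is a minimal, Spark-free sketch for the squared-error case, where re-labeling amounts to training each new tree on the current residuals. It is an illustration under stated assumptions rather than Spark's implementation: a single numeric feature, depth-1 trees, and hypothetical names (`BoostingSketch`, `Stump`, `fitStump`, `boost`).

```scala
// Illustrative only: all names in this object are hypothetical.
object BoostingSketch {
  // A depth-1 regression tree (a "stump") on a single numeric feature.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(xs: Seq[Double]): Double =
    if (xs.isEmpty) 0.0 else xs.sum / xs.size

  // Try every distinct feature value as a threshold and keep the split
  // with the smallest squared error against the targets.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val data = xs.zip(targets)
    val candidates = xs.distinct.sorted.map { t =>
      val (l, r) = data.partition(_._1 <= t)
      Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
    }
    candidates.minBy { s =>
      data.map { case (x, y) => math.pow(y - s.predict(x), 2) }.sum
    }
  }

  // Each iteration fits a stump to the current residuals (proportional to the
  // negative gradient of the squared loss) and adds it to the ensemble,
  // scaled by the step size.
  def boost(xs: Seq[Double], ys: Seq[Double],
            numIter: Int, stepSize: Double): Double => Double = {
    val base = mean(ys) // initial prediction: the mean label
    val stumps = (1 to numIter).foldLeft(Vector.empty[Stump]) { (ensemble, _) =>
      val residuals = xs.zip(ys).map { case (x, y) =>
        y - (base + stepSize * ensemble.map(_.predict(x)).sum)
      }
      ensemble :+ fitStump(xs, residuals)
    }
    x => base + stepSize * stumps.map(_.predict(x)).sum
  }

  // Mean squared error of a predictor on a dataset.
  def mse(f: Double => Double, xs: Seq[Double], ys: Seq[Double]): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - f(x), 2) })
}
```

With `numIter = 0` the predictor is just the mean label; adding boosting iterations should lower the training error from that baseline.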

Since we are talking about regression, it is more meaningful to discuss the regression strength of GBTs and how their loss is computed. Suppose we have the following settings:

- *N* data instances
- *y_{i}* = label of instance *i*
- *x_{i}* = features of instance *i*

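Then *F(x_{i})* is the model's predicted label for instance *i*, and training tries to minimize the error, that is, the loss, between these predictions and the true labels. For regression, Spark's GBT supports the squared and absolute losses.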
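For reference, the two regression losses that Spark's GBT accepts through its `lossType` parameter (`squared` and `absolute`) can be written with the symbols defined above. This is the standard form of each loss, given here as a sketch rather than a quotation of the implementation:

$$
L_{\text{squared}} = \sum_{i=1}^{N} \bigl(y_i - F(x_i)\bigr)^2,
\qquad
L_{\text{absolute}} = \sum_{i=1}^{N} \bigl|\,y_i - F(x_i)\,\bigr|
$$

The squared loss penalizes large errors more heavily, while the absolute loss is more robust to outliers.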

Now, similar to decision trees, GBTs also:

- Handle categorical features (and of course numerical features too)
- Perform both binary classification and regression (multiclass classification is not yet supported)
- Do not require feature scaling
- Capture non-linearity and feature interactions, which linear models such as LR largely miss

### Note

**Validation while training**: Gradient boosting can overfit, especially when the model is trained with more trees. To prevent this, it is useful to validate on held-out data while training.
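The idea behind validating while training can be sketched in plain Scala. This is a simplified illustration with hypothetical names (`EarlyStoppingSketch`, `treesToKeep`), not Spark's implementation: given the validation error measured after each boosting iteration, keep adding trees only while the error improves by more than a tolerance.

```scala
// Illustrative only: the object and method names are hypothetical.
object EarlyStoppingSketch {
  // validationErrors(i) is the validation error after i + 1 trees.
  // Returns how many trees to keep: boosting stops at the first iteration
  // that fails to improve on the previous one by more than `tol`.
  def treesToKeep(validationErrors: Seq[Double], tol: Double): Int = {
    require(validationErrors.nonEmpty, "need at least one validation error")
    val improvingSteps = validationErrors.sliding(2).takeWhile {
      case Seq(prev, next) => prev - next > tol
      case _               => false // a single error: nothing to compare
    }.size
    improvingSteps + 1
  }
}
```

For instance, with errors `10.0, 8.0, 7.0, 6.95, 7.5` and a tolerance of `0.1`, the third tree is the last one that helps enough, so three trees are kept.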

Since we have already prepared our dataset, we can directly jump into implementing a GBT-based predictive model for predicting insurance severity claims. Let's start with importing the necessary packages and libraries:

```scala
import org.apache.spark.ml.regression.{GBTRegressor, GBTRegressionModel}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.evaluation.RegressionMetrics
```

Now let's define and initialize the hyperparameters needed to train the GBTs, such as the number of trees, the maximum number of bins, the number of folds used during cross-validation, the maximum number of training iterations, and finally the maximum tree depth:

```scala
val NumTrees = Seq(5, 10, 15)
val MaxBins = Seq(5, 7, 9)
val numFolds = 10
val MaxIter: Seq[Int] = Seq(10)
val MaxDepth: Seq[Int] = Seq(10)
```

Then, again we instantiate a Spark session and implicits as follows:

```scala
val spark = SparkSessionCreate.createSession()
import spark.implicits._
```

Now we create an estimator algorithm, that is, GBT:

```scala
val model = new GBTRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
```

Now, we build the pipeline by chaining the transformations and predictor together as follows:

```scala
val pipeline = new Pipeline()
  .setStages((Preproessing.stringIndexerStages :+ Preproessing.assembler) :+ model)
```

Before we start performing the cross-validation, we need a parameter grid. So let's create it by specifying the maximum number of iterations, the maximum tree depth, and the maximum number of bins as follows:

```scala
val paramGrid = new ParamGridBuilder()
  .addGrid(model.maxIter, MaxIter)
  .addGrid(model.maxDepth, MaxDepth)
  .addGrid(model.maxBins, MaxBins)
  .build()
```

Now, for better and more stable performance, let's prepare the K-fold cross-validation and grid search as part of model tuning. As you can guess, I am going to perform 10-fold cross-validation. Feel free to adjust the number of folds based on your settings and dataset:

```scala
println("Preparing K-fold Cross Validation and Grid Search")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
```
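As a quick sanity check on the cost of this setup: `CrossValidator` fits the pipeline once per parameter combination per fold, and then refits the best combination on the full training data. The arithmetic for the grid above can be sketched in plain Scala. The object name `GridSizeSketch` is hypothetical, and the values are copied from the hyperparameters defined earlier:

```scala
// Illustrative only: counts the model fits implied by the grid and fold settings.
object GridSizeSketch {
  val MaxBins = Seq(5, 7, 9)
  val MaxIter = Seq(10)
  val MaxDepth = Seq(10)
  val numFolds = 10

  // One parameter map per element of the Cartesian product of the grids.
  val combinations: Int = MaxIter.size * MaxDepth.size * MaxBins.size

  // Every combination is trained once on each fold during cross-validation.
  val cvFits: Int = combinations * numFolds

  // Plus one final refit of the best combination on all the training data.
  val totalFits: Int = cvFits + 1
}
```

Only `maxBins` varies here, so the grid has three combinations; widening `MaxIter` or `MaxDepth` multiplies the training time accordingly.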

Fantastic, we have created the cross-validation estimator. Now it's time to train the GBT model:

```scala
println("Training model with GradientBoostedTrees algorithm ")
val cvModel = cv.fit(Preproessing.trainingData)
```

Now that we have the fitted model, that means it is now capable of making predictions. So let's start evaluating the model on the train and validation set, and calculating RMSE, MSE, MAE, R-squared, and so on:

```scala
println("Evaluating model on train and test data and calculating RMSE")

// RegressionMetrics expects (prediction, observation) pairs, so the prediction goes first.
// MSE, RMSE, and MAE are symmetric, but R-squared and explained variance depend on the order.
val trainPredictionsAndLabels = cvModel.transform(Preproessing.trainingData)
  .select("label", "prediction")
  .map { case Row(label: Double, prediction: Double) => (prediction, label) }
  .rdd

val validPredictionsAndLabels = cvModel.transform(Preproessing.validationData)
  .select("label", "prediction")
  .map { case Row(label: Double, prediction: Double) => (prediction, label) }
  .rdd

val trainRegressionMetrics = new RegressionMetrics(trainPredictionsAndLabels)
val validRegressionMetrics = new RegressionMetrics(validPredictionsAndLabels)
```
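To show what these metrics measure, here is a small Spark-free sketch that computes MSE, RMSE, MAE, and R-squared from `(prediction, label)` pairs using their textbook definitions. The object and method names (`MetricsSketch` and so on) are hypothetical, and the sketch is not meant to reproduce every detail of `RegressionMetrics`:

```scala
// Illustrative only: textbook definitions over (prediction, label) pairs.
object MetricsSketch {
  type Pairs = Seq[(Double, Double)] // (prediction, label)

  // Mean of the squared differences between label and prediction.
  def mse(pairs: Pairs): Double =
    pairs.map { case (p, y) => (y - p) * (y - p) }.sum / pairs.size

  // Square root of the MSE, in the same unit as the label.
  def rmse(pairs: Pairs): Double = math.sqrt(mse(pairs))

  // Mean of the absolute differences between label and prediction.
  def mae(pairs: Pairs): Double =
    pairs.map { case (p, y) => math.abs(y - p) }.sum / pairs.size

  // 1 - (residual sum of squares / total sum of squares around the mean label).
  def r2(pairs: Pairs): Double = {
    val meanLabel = pairs.map(_._2).sum / pairs.size
    val ssRes = pairs.map { case (p, y) => (y - p) * (y - p) }.sum
    val ssTot = pairs.map { case (_, y) => (y - meanLabel) * (y - meanLabel) }.sum
    1.0 - ssRes / ssTot
  }
}
```

Note that MSE, RMSE, and MAE do not change if prediction and label are swapped, whereas R-squared does, because the total sum of squares is computed from the labels only.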

Great! We have managed to compute the raw predictions on the training and validation sets. Let's hunt for the best model:

```scala
val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel]
```

As already stated, by using GBT it is possible to measure feature importance so that at a later stage we can decide which features are to be used and which ones are to be dropped from the DataFrame. Let's find the feature importance of the best model we just created previously, for all features in ascending order as follows:

```scala
val featureImportances = bestModel.stages.last
  .asInstanceOf[GBTRegressionModel]
  .featureImportances
  .toArray

val FI_to_List_sorted = featureImportances.toList.sorted.toArray
```
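One caveat when ranking features: sorting the importance values on their own discards which feature each value belongs to. A minimal sketch of a ranking that keeps names and values together is shown below. The names `ImportanceSketch` and `rank` are hypothetical, and the sketch assumes the names are given in the same order as the importance vector:

```scala
// Illustrative only: pair each feature name with its importance before sorting.
object ImportanceSketch {
  // Returns (name, importance) pairs in ascending order of importance.
  def rank(names: Seq[String], importances: Seq[Double]): Seq[(String, Double)] =
    names.zip(importances).sortBy(_._2)
}
```

With names `a, b, c` and importances `0.5, 0.0, 0.2`, this yields `b`, `c`, `a`, whereas zipping the names with the already sorted values would wrongly report `a` as the least important feature.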

Once we have the best fitted and cross-validated model, we can expect good prediction accuracy. Now let's observe the results on the train and the validation set:

```scala
// Build a plain-text report; "\n" and "\t" are the newline and tab escapes
val output = "\n=====================================================================\n" +
  s"Param trainSample: ${Preproessing.trainSample}\n" +
  s"Param testSample: ${Preproessing.testSample}\n" +
  s"TrainingData count: ${Preproessing.trainingData.count}\n" +
  s"ValidationData count: ${Preproessing.validationData.count}\n" +
  s"TestData count: ${Preproessing.testData.count}\n" +
  "=====================================================================\n" +
  s"Param maxIter = ${MaxIter.mkString(",")}\n" +
  s"Param maxDepth = ${MaxDepth.mkString(",")}\n" +
  s"Param numFolds = ${numFolds}\n" +
  "=====================================================================\n" +
  s"Training data MSE = ${trainRegressionMetrics.meanSquaredError}\n" +
  s"Training data RMSE = ${trainRegressionMetrics.rootMeanSquaredError}\n" +
  s"Training data R-squared = ${trainRegressionMetrics.r2}\n" +
  s"Training data MAE = ${trainRegressionMetrics.meanAbsoluteError}\n" +
  s"Training data Explained variance = ${trainRegressionMetrics.explainedVariance}\n" +
  "=====================================================================\n" +
  s"Validation data MSE = ${validRegressionMetrics.meanSquaredError}\n" +
  s"Validation data RMSE = ${validRegressionMetrics.rootMeanSquaredError}\n" +
  s"Validation data R-squared = ${validRegressionMetrics.r2}\n" +
  s"Validation data MAE = ${validRegressionMetrics.meanAbsoluteError}\n" +
  s"Validation data Explained variance = ${validRegressionMetrics.explainedVariance}\n" +
  "=====================================================================\n" +
  s"CV params explained: ${cvModel.explainParams}\n" +
  s"GBT params explained: ${bestModel.stages.last.asInstanceOf[GBTRegressionModel].explainParams}\n" +
  s"GBT features importances:\n ${Preproessing.featureCols.zip(FI_to_List_sorted).map(t => s"\t${t._1} = ${t._2}").mkString("\n")}\n" +
  "=====================================================================\n"
```

Now, we print the preceding results as follows:

```scala
println(output)
```

This produces the following output:

```
=====================================================================
Param trainSample: 1.0
Param testSample: 1.0
TrainingData count: 141194
ValidationData count: 47124
TestData count: 125546
=====================================================================
Param maxIter = 10
Param maxDepth = 10
Param numFolds = 10
=====================================================================
Training data MSE = 2711134.460296872
Training data RMSE = 1646.5522950385973
Training data R-squared = 0.4979619968485668
Training data MAE = 1126.582534126603
Training data Explained variance = 8336528.638733303
=====================================================================
Validation data MSE = 4796065.983773314
Validation data RMSE = 2189.9922337244293
Validation data R-squared = 0.13708582379658474
Validation data MAE = 1289.9808960385383
Validation data Explained variance = 8724866.468978886
=====================================================================
CV params explained: estimator: estimator for selection (current: pipeline_9889176c6eda)
estimatorParamMaps: param maps for the estimator (current: [Lorg.apache.spark.ml.param.ParamMap;@87dc030)
evaluator: evaluator used to select hyper-parameters that maximize the validated metric (current: regEval_ceb3437b3ac7)
numFolds: number of folds for cross validation (>= 2) (default: 3, current: 10)
seed: random seed (default: -1191137437)
GBT params explained: cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. (default: false)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations (default: 10)
featuresCol: features column name (default: features, current: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: variance (default: variance)
labelCol: label column name (default: label, current: label)
lossType: Loss function which GBT tries to minimize (case-insensitive). Supported options: squared, absolute (default: squared)
maxBins: Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature. (default: 32)
maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 5, current: 10)
maxIter: maximum number of iterations (>= 0) (default: 20, current: 10)
maxMemoryInMB: Maximum memory in MB allocated to histogram aggregation. (default: 256)
minInfoGain: Minimum information gain for a split to be considered at a tree node. (default: 0.0)
minInstancesPerNode: Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1. (default: 1)
predictionCol: prediction column name (default: prediction)
seed: random seed (default: -131597770)
stepSize: Step size (a.k.a. learning rate) in interval (0, 1] for shrinking the contribution of each estimator. (default: 0.1)
subsamplingRate: Fraction of the training data used for learning each decision tree, in range (0, 1]. (default: 1.0)
GBT features importance:
 idx_cat1 = 0.0
 idx_cat2 = 0.0
 idx_cat3 = 0.0
 idx_cat4 = 3.167169394850417E-5
 idx_cat5 = 4.745749854188828E-5
 ...
 idx_cat111 = 0.018960701085054904
 idx_cat114 = 0.020609596772820878
 idx_cat115 = 0.02281267960792931
 cont1 = 0.023943087007850663
 cont2 = 0.028078353534251005
 ...
 cont13 = 0.06921704925937068
 cont14 = 0.07609111789104464
=====================================================================
```

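So our predictive model shows an MAE of about `1126.582534126603` and `1289.9808960385383` for the training and validation sets respectively. The last result is important for understanding the feature importance (the preceding list is abridged to save space, but you should receive the full list). In particular, the three lowest importance values are 0.0, so the corresponding features could be dropped from the DataFrame. Note, however, that the importance values were sorted on their own before being zipped with `featureCols`, so the name printed next to each value is not necessarily the feature it belongs to; check the unsorted `featureImportances` before dropping anything. We will provide more insight in the next section.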

Now finally, let us run the prediction over the test set and generate the predicted loss for each claim from the clients:

```scala
println("Run prediction over test dataset")
cvModel.transform(Preproessing.testData)
  .select("id", "prediction")
  .withColumnRenamed("prediction", "loss")
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("output/result_GBT.csv")
```

The preceding code should generate CSV output named `result_GBT.csv` (Spark writes it as a directory of that name, which holds a single part file because of `coalesce(1)`). If we open the file, we should observe the loss against each ID, that is, claim. We will see the contents for LR, RF, and GBT at the end of this chapter. Nevertheless, it is always a good idea to stop the Spark session by invoking the `spark.stop()` method.