Overview of this book

Scala integrates object-oriented and functional programming concepts in a way that makes it easy to build scalable and complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to develop and train effective machine learning models in Scala. The book starts with an introduction to machine learning, covering the basics of both machine learning and deep learning. It then explains how to use Scala-based ML libraries to solve classification and regression problems using linear regression, generalized linear regression, logistic regression, support vector machine, and Naïve Bayes algorithms. It also covers tree-based ensemble techniques for solving both classification and regression problems. Moving ahead, it covers unsupervised learning techniques, such as dimensionality reduction, clustering, and recommender systems. Finally, it provides a brief overview of deep learning using a real-life example in Scala.

Regression analysis algorithms

Numerous algorithms have been proposed for regression analysis. For example, linear regression (LR) tries to find relationships and dependencies between variables. It models the relationship between a continuous dependent variable, y (that is, a label or target), and one or more independent variables, x, using a linear function. Examples of regression algorithms include the following:

• Linear regression (LR)
• Generalized linear regression (GLR)
• Survival regression (SR)
• Isotonic regression (IR)
• Decision tree regressor (DTR)
• Random forest regression (RFR)
• Gradient boosted trees regression (GBTR)

We start by explaining regression with the simplest LR algorithm, which models the dependent variable, y, as a linear combination of the independent variables, x. For a single independent variable, this is as follows:

$$y = \beta_0 + \beta_1 x$$

In the preceding equation, $\beta_0$ and $\beta_1$ are two constants: the y-axis intercept and the slope of the line, respectively. LR is about learning a model that is a linear combination of the features of the input examples (data points).
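The intercept and slope described above can be estimated from data in several ways. The following is a minimal sketch in plain Scala that assumes the closed-form ordinary least squares solution for a single independent variable (slope = covariance of x and y divided by the variance of x). The object name `SimpleLinearRegression`, the `fit` function, and the sample points are hypothetical and chosen only for illustration; this is not necessarily the estimation method a given ML library uses.

```scala
object SimpleLinearRegression {
  /** Fits y = beta0 + beta1 * x by ordinary least squares; returns (beta0, beta1). */
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    require(xs.length == ys.length && xs.length >= 2, "need at least two paired points")
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    // Slope: sum of cross-deviations divided by sum of squared x-deviations.
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx != 0.0, "x values must not all be identical")
    val beta1 = sxy / sxx
    // Intercept: the fitted line passes through the point (meanX, meanY).
    val beta0 = meanY - beta1 * meanX
    (beta0, beta1)
  }

  def main(args: Array[String]): Unit = {
    // Made-up points that lie exactly on the line y = 1 + 2x.
    val xs = Seq(1.0, 2.0, 3.0, 4.0)
    val ys = Seq(3.0, 5.0, 7.0, 9.0)
    val (beta0, beta1) = fit(xs, ys)
    println(s"intercept = $beta0, slope = $beta1")
  }
}
```

Because the sample points lie exactly on a line, the recovered intercept and slope match that line; with noisy data, the same function returns the line that minimizes the sum of squared vertical distances.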

Take a look at the following graph and imagine that the red line is not there. We have a few blue dots (data points). Can we reasonably develop a machine learning (regression) model that captures the trend most of them follow? If we draw a straight line through the points so that most of them lie close to it, the trend is captured reasonably well, isn't it? In a classification setting, a line that separates two classes of data is called the decision boundary; in regression analysis, such a line (red in our case) is called the regression line (see the following example for more):

Suppose we are given a collection of labeled examples, say $\{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of samples in the dataset, $x_i$ is the $D$-dimensional feature vector of sample $i = 1, 2, \ldots, N$, and $y_i \in \mathbb{R}$ is the real-valued target variable (where $\mathbb{R}$ denotes the set of all real numbers); every feature in $x_i$ is also a real number. Combining these, the next step is to build the following mathematical model, $f$:

$$f_{w,b}(x) = w \cdot x + b$$

Here, $w$ is a $D$-dimensional vector of parameters and $b$ is a real number. The notation $f_{w,b}$ signifies that the model $f$ is parameterized by the values $w$ and $b$. Once we have a well-defined model, it can be used to predict the unknown $y$ for a given $x$, that is, $y \leftarrow f_{w,b}(x)$. However, there is an issue: two models parameterized by two different pairs of values $(w, b)$ will tend to produce two different predictions when applied to the same sample, so we need a way to decide which pair of values is best.
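To make the role of the parameters concrete, here is a small sketch in plain Scala of a model that computes the dot product of w and x plus b. The `LinearModel` case class and the parameter values are hypothetical, picked only to show that two different (w, b) pairs give two different predictions for the same sample.

```scala
/** A linear model f(x) = w . x + b for a D-dimensional feature vector x. */
final case class LinearModel(w: Vector[Double], b: Double) {
  def predict(x: Vector[Double]): Double = {
    require(w.length == x.length, "w and x must have the same dimension D")
    // Dot product of the weights and features, plus the bias term.
    w.zip(x).map { case (wj, xj) => wj * xj }.sum + b
  }
}

object LinearModelDemo {
  def main(args: Array[String]): Unit = {
    val x = Vector(2.0, 3.0)                        // one sample with D = 2
    val modelA = LinearModel(Vector(1.0, 0.5), 0.0) // hypothetical parameters
    val modelB = LinearModel(Vector(0.5, 1.0), 1.0) // a different (w, b) pair
    println(s"model A predicts ${modelA.predict(x)}")
    println(s"model B predicts ${modelB.predict(x)}")
  }
}
```

The two models disagree on the same input, which is why a criterion for choosing among parameter values is needed.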

Finding the parameters can therefore be framed as an optimization problem, where the objective is to find the optimal values of the parameters (that is, the values that minimize an objective function), so that the model makes more accurate predictions. In short, in the LR model, we intend to find the optimal values for $w$ and $b$ that minimize the following objective function:

$$\frac{1}{N} \sum_{i=1}^{N} \left(f_{w,b}(x_i) - y_i\right)^2$$

In the preceding equation, the expression $(f_{w,b}(x_i) - y_i)^2$ is called the loss function, which is a measure of the penalty (that is, the error or loss) for giving the wrong prediction for sample $i$. This loss function is in the form of squared error loss. However, other loss functions can be used too, as outlined in the following equations:

$$\left(f_{w,b}(x_i) - y_i\right)^2 \tag{1}$$

$$\left|f_{w,b}(x_i) - y_i\right| \tag{2}$$

The squared error (SE) in equation (1) is called L2 loss, which is the usual default loss function for regression analysis tasks. On the other hand, the absolute error (AE) in equation (2) is called L1 loss.

In cases where the dataset has many outliers, L1 loss is recommended over L2 loss, because L1 is more robust against outliers.

All model-based learning algorithms have a loss function associated with them, and we try to find the best model by minimizing the corresponding cost function. In our LR case, the cost function is the average loss (also called the empirical risk), that is, the average of all the penalties obtained by applying the model to the training data, which may contain many samples.
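The relationship between the per-sample loss and the average loss can be sketched as follows in plain Scala. The `Losses` object and the sample values are hypothetical; the data includes one deliberately extreme target to illustrate why the squared (L2) loss reacts much more strongly to an outlier than the absolute (L1) loss.

```scala
object Losses {
  /** Squared error (L2 loss) for a single prediction. */
  def squaredError(predicted: Double, actual: Double): Double = {
    val d = predicted - actual
    d * d
  }

  /** Absolute error (L1 loss) for a single prediction. */
  def absoluteError(predicted: Double, actual: Double): Double =
    math.abs(predicted - actual)

  /** Average loss (empirical risk) over all samples, for the given loss function. */
  def empiricalRisk(loss: (Double, Double) => Double)(
      predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.length == actual.length && predicted.nonEmpty,
      "need the same, non-zero number of predictions and targets")
    predicted.zip(actual).map { case (p, a) => loss(p, a) }.sum / predicted.length
  }

  def main(args: Array[String]): Unit = {
    // Made-up data: the last target is an outlier that the model misses by 96.
    val actual    = Seq(1.0, 2.0, 3.0, 100.0)
    val predicted = Seq(1.0, 2.0, 3.0, 4.0)
    println(s"average L2 loss = ${empiricalRisk(squaredError)(predicted, actual)}")
    println(s"average L1 loss = ${empiricalRisk(absoluteError)(predicted, actual)}")
  }
}
```

With these numbers, the single outlier contributes 96 squared to the L2 total but only 96 to the L1 total, so a model trained by minimizing the L2 cost is pulled much harder toward the outlier.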

Figure 4 shows an example of simple linear regression. Let's say the idea is to predict the amount of Savings from Age. So, in this case, we have one independent variable, x (that is, a set of 1D data points; in our case, the Age), and one dependent variable, y (the amount of Savings, in millions of \$). Once we have a trained regression model, we can use its line to predict the value of the target, yl, for a new unlabeled input example, xl. However, in the case of D-dimensional feature vectors, the model would be a plane (for 2D) or a hyperplane (for >=3D) instead of a line:

Figure 4: A regression line fitted to data points for Age versus the amount of Savings: i) the left model is fitted to the training data; ii) the right model predicts for an unknown observation

Now you see why it is important that the regression hyperplane lies as close to the training examples as possible: if the blue line in Figure 4 (the model on the right) is far away from the blue dots, the prediction yl is less likely to be correct. The best-fit line, which is expected to pass as close as possible to most of the data points, is the result of the regression analysis. However, in practice, it does not pass through all of the data points because of regression errors.

Regression error is the distance between a data point (the actual value) and the line (the predicted value).
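As a small illustration of this definition, the following plain Scala sketch computes the signed vertical distance (actual minus predicted) between each data point and a given line. The `Residuals` object, the line's coefficients, and the data points are hypothetical.

```scala
object Residuals {
  /** Signed regression errors (actual minus predicted) for the line y = beta0 + beta1 * x. */
  def residuals(beta0: Double, beta1: Double)(
      xs: Seq[Double], ys: Seq[Double]): Seq[Double] = {
    require(xs.length == ys.length, "xs and ys must have the same length")
    xs.zip(ys).map { case (x, y) => y - (beta0 + beta1 * x) }
  }

  def main(args: Array[String]): Unit = {
    // Made-up points scattered around the line y = 1 + 2x.
    val xs = Seq(1.0, 2.0, 3.0)
    val ys = Seq(3.5, 4.5, 7.0)
    residuals(1.0, 2.0)(xs, ys).zip(xs).foreach { case (r, x) =>
      println(s"x = $x, regression error = $r")
    }
  }
}
```

A positive value means the point lies above the line, a negative value means it lies below, and zero means the line passes through the point.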

Since solving a regression problem is itself an optimization problem, we want the errors to be as small as possible, because smaller errors contribute to higher predictive accuracy on unseen observations. Although an LR algorithm does not perform well in many cases, a nice property is that an LR model usually does not overfit, which is less often true of more complex models.

In the previous chapter, we discussed overfitting (a phenomenon whereby a model predicts very well during training but makes more errors when applied to the test set; that is, the training error is low and the validation error is high) and underfitting (the model is too simple to capture the underlying pattern, so the error is high even on the training data). These two phenomena are usually explained in terms of variance and bias: overfitting is associated with high variance and underfitting with high bias.

Performance metrics

Several metrics based on regression errors are used to measure the predictive performance of a regression model. They can be outlined as follows:

• Mean squared error (MSE): It is the average of the squared differences between the predicted and actual values, that is, a measure of how close a fitted line is to the data points. The smaller the MSE, the closer the fit is to the data.
• Root mean squared error (RMSE): It is the square root of the MSE, so it has the same units as the quantity plotted on the vertical axis (the target).
• R-squared: It is the coefficient of determination, which assesses how close the data is to the fitted regression line. It typically ranges between 0 and 1 (it can be negative when the model fits worse than simply predicting the mean). The higher the R-squared, the better the model fits your data.
• Mean absolute error (MAE): It is the average of the absolute differences between the predicted and actual values, without considering the direction of the errors. The smaller the MAE, the better the model fits your data.
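
The four metrics above can be computed directly from paired predictions and actual values. The following is a minimal sketch in plain Scala; the `SimpleRegressionMetrics` object and the sample values are hypothetical, and R-squared is computed here under the common definition of one minus the ratio of the residual sum of squares to the total sum of squares.

```scala
object SimpleRegressionMetrics {
  private def check(predicted: Seq[Double], actual: Seq[Double]): Unit =
    require(predicted.length == actual.length && predicted.nonEmpty,
      "need the same, non-zero number of predictions and targets")

  /** Mean squared error: average of squared differences. */
  def mse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    check(predicted, actual)
    predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }.sum / actual.length
  }

  /** Root mean squared error: same units as the target. */
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double =
    math.sqrt(mse(predicted, actual))

  /** Mean absolute error: average of absolute differences. */
  def mae(predicted: Seq[Double], actual: Seq[Double]): Double = {
    check(predicted, actual)
    predicted.zip(actual).map { case (p, a) => math.abs(p - a) }.sum / actual.length
  }

  /** R-squared: 1 - (residual sum of squares / total sum of squares). */
  def rSquared(predicted: Seq[Double], actual: Seq[Double]): Double = {
    check(predicted, actual)
    val mean = actual.sum / actual.length
    val ssRes = predicted.zip(actual).map { case (p, a) => (a - p) * (a - p) }.sum
    val ssTot = actual.map(a => (a - mean) * (a - mean)).sum
    require(ssTot != 0.0, "R-squared is undefined when all actual values are equal")
    1.0 - ssRes / ssTot
  }

  def main(args: Array[String]): Unit = {
    // Made-up actual targets and model predictions.
    val actual    = Seq(3.0, 5.0, 7.0, 9.0)
    val predicted = Seq(2.0, 5.0, 8.0, 9.0)
    println(s"MSE  = ${mse(predicted, actual)}")
    println(s"RMSE = ${rmse(predicted, actual)}")
    println(s"MAE  = ${mae(predicted, actual)}")
    println(s"R2   = ${rSquared(predicted, actual)}")
  }
}
```

Writing the metrics as small pure functions keeps them independent of any particular ML library, so the same predictions can be scored regardless of which regression algorithm produced them.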

Now that we know how a regression algorithm works and how to evaluate the performance using several metrics, the next important task is to apply this knowledge to solve a real-life problem.