#### Overview of this book

Scala combines object-oriented and functional programming in a highly scalable language, which makes it well suited to building complex big data applications. This book is a handy guide for machine learning developers and data scientists who want to develop and train effective machine learning models in Scala. The book starts with an introduction to machine learning, covering the basics of machine learning and deep learning. It then explains how to use Scala-based ML libraries to solve classification and regression problems with linear regression, generalized linear regression, logistic regression, support vector machine, and Naïve Bayes algorithms. It also covers tree-based ensemble techniques for solving both classification and regression problems. Moving ahead, it covers unsupervised learning techniques, such as dimensionality reduction, clustering, and recommender systems. Finally, it provides a brief overview of deep learning using a real-life example in Scala.

# An overview of regression analysis

In the previous chapter, we gained a basic understanding of the machine learning (ML) process and saw the basic distinction between regression and classification. Regression analysis is a set of statistical processes for estimating the relationship between a dependent variable and one or more independent variables. The value of the dependent variable depends on the values of the independent variables.

A regression analysis technique helps us understand this dependency, that is, how the value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. For example, let's assume that someone's savings grow as they get older. Here, the amount of savings (in million \$) depends on age (in years):

| Age (years) | Savings (million \$) |
|-------------|----------------------|
| 40          | 1.5                  |
| 50          | 5.5                  |
| 60          | 10.8                 |
| 70          | 6.7                  |

So, we can plot these two variables on a 2D chart, with the dependent variable (Savings) on the y-axis and the independent variable (Age) on the x-axis. Once the data points are plotted, we can look for a correlation. If the chart indeed reflects the impact of getting older on savings, then we can say that the older someone gets, the more savings they tend to have in their bank account.

Now the question is: to what degree does age help someone accumulate more money in their bank account? To answer this question, we can draw a line through the middle of all of the data points on the chart. This line is called the regression line, and it can be calculated precisely using a regression analysis algorithm. A regression algorithm takes discrete or continuous (or both) input features and produces a continuous output value.
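As a minimal sketch of this idea, the regression line for the Age/Savings data above can be computed with ordinary least squares for a single feature (the object and method names here are illustrative, not from any particular library):

```scala
object RegressionLine {
  /** Fit y ≈ slope * x + intercept by ordinary least squares
    * for a single independent variable. */
  def fit(xs: Seq[Double], ys: Seq[Double]): (Double, Double) = {
    val meanX = xs.sum / xs.length
    val meanY = ys.sum / ys.length
    // slope = covariance(x, y) / variance(x)
    val slope =
      xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum /
        xs.map(x => (x - meanX) * (x - meanX)).sum
    val intercept = meanY - slope * meanX
    (slope, intercept)
  }

  def main(args: Array[String]): Unit = {
    val age     = Seq(40.0, 50.0, 60.0, 70.0)
    val savings = Seq(1.5, 5.5, 10.8, 6.7)
    val (slope, intercept) = fit(age, savings)
    println(f"savings ≈ $slope%.3f * age + $intercept%.2f")
  }
}
```

For these four points the fitted line is roughly `savings ≈ 0.209 * age - 5.37`, which quantifies the (positive) dependency of savings on age.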

A classification task predicts the label of the class attribute, while a regression task makes a numeric prediction of the class attribute.

Making predictions on new, unseen observations with such a regression model is like running a data pipeline with multiple components working together, where we observe an algorithm's performance in two stages: learning and inference. Throughout this process, and for the predictive model to be a successful one, data acts as the first-class citizen of every ML task.

# Learning

One of the important tasks at the learning stage is to prepare the data and convert it into feature vectors (vectors of numbers derived from each example's features). Training data in feature-vector format is fed into the learning algorithm to train a model, which can then be used for inference. Typically, and depending on the data size, running an algorithm may take hours (or even days) for the features to converge into a useful model, as shown in the following diagram:

Learning and training a predictive model: feature vectors are generated from the training data and fed to the learning algorithm, which produces a predictive model
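The feature-vector conversion described above can be sketched as follows. This is an illustrative example with hypothetical field names, not a specific library's API: a numeric field is kept as-is, and a categorical field is one-hot encoded so the whole record becomes a vector of numbers:

```scala
// A raw training record with one numeric and one categorical feature.
case class Record(age: Double, maritalStatus: String)

object Featurizer {
  // Known categories for one-hot encoding (order fixes the vector layout).
  val statuses: Seq[String] = Seq("single", "married", "divorced")

  /** Convert a record into a numeric feature vector:
    * [age, isSingle, isMarried, isDivorced]. */
  def toFeatureVector(r: Record): Array[Double] =
    Array(r.age) ++ statuses.map(s => if (r.maritalStatus == s) 1.0 else 0.0)
}
```

For example, `Featurizer.toFeatureVector(Record(40.0, "married"))` yields `Array(40.0, 0.0, 1.0, 0.0)`, a form that learning algorithms can consume directly.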

# Inferencing

In the inference stage, the trained model is put to intelligent use, such as making predictions on never-before-seen data, making recommendations, or deducing future rules. Inference typically takes much less time than the learning stage and can sometimes even run in real time. Thus, inferencing is all about testing the model against new (that is, unobserved) data and evaluating the performance of the model itself, as shown in the following diagram:

Inferencing from an existing model towards predictive analytics (feature vectors are generated from unknown data for making predictions)
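To make the inference stage concrete, here is a minimal sketch of applying a trained simple-regression model to unseen data. The slope and intercept below are assumed values taken from fitting the Age/Savings example earlier (roughly 0.209 and -5.37); they are illustrative, not authoritative:

```scala
object Inference {
  /** Apply a trained simple linear model to a new, unseen observation. */
  def predict(slope: Double, intercept: Double, age: Double): Double =
    slope * age + intercept

  def main(args: Array[String]): Unit = {
    // Assumed parameters from the earlier fit on the Age/Savings data.
    val (slope, intercept) = (0.209, -5.37)
    // Predict savings for an age not present in the training data.
    println(f"predicted savings at age 55: ${predict(slope, intercept, 55.0)}%.3f")
  }
}
```

With these assumed parameters, an unseen age of 55 yields a predicted savings of about 6.125 million \$; note that inference is just cheap function application, in contrast to the expensive learning stage.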

In summary, the goal of regression analysis is to predict a continuous target variable. Now that we know how to construct a basic workflow for a supervised learning task, a brief look at the available regression algorithms will make it more concrete how to apply them.