Numerous algorithms are available for regression analysis. For example, LR tries to find relationships and dependencies between variables: it models the relationship between a continuous dependent variable, *y* (that is, a label or target), and one or more independent variables, *x*, using a linear function. Examples of regression algorithms include the following:

- **Linear regression** (**LR**)
- **Generalized linear regression** (**GLR**)
- **Survival regression** (**SR**)
- **Isotonic regression** (**IR**)
- **Decision tree regressor** (**DTR**)
- **Random forest regression** (**RFR**)
- **Gradient boosted trees regression** (**GBTR**)

We start by explaining regression with the simplest LR algorithm, which models the relationship between a dependent variable, *y*, and an independent variable, *x*, as a linear combination:

*y = β_{0} + β_{1}x*

In the preceding equation, *β_{0}* and *β_{1}* are two constants denoting the *y*-axis intercept and the slope of the line, respectively. LR is about learning a model that is a linear combination of the features of the input example (data points).
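As a quick sketch of this idea (the data points here are made up for illustration), the intercept *β_{0}* and the slope *β_{1}* of a one-variable LR model can be computed in closed form with NumPy:

```python
import numpy as np

# Hypothetical 1D data: x (independent variable) and y (dependent variable)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Closed-form least-squares estimates:
# beta_1 = cov(x, y) / var(x), beta_0 = mean(y) - beta_1 * mean(x)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(beta_0, beta_1)  # the slope comes out close to 2
```

The same estimates are what any least-squares LR routine would return for this data; the closed form just makes the role of the two constants explicit.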

Take a look at the following graph and imagine that the red line is not there. We have a few dotted blue points (data points). Can we reasonably develop a machine learning (regression) model that fits most of them? If we draw a straight line through the data, it passes close to most of the points, doesn't it? In classification, such a line (red in our case) is called the decision boundary; in regression analysis, it is called the regression line (see the following example for more):

If we are given a collection of labeled examples, say *{(x_{i}, y_{i})}*, *i = 1, 2, …, N*, where *N* is the number of samples in the dataset, *x_{i}* is the *D*-dimensional feature vector of sample *i*, and *y_{i}* is a real-valued target variable (*y_{i} ∈ R*, where *R* denotes the set of all real numbers) with every feature of *x_{i}* being a real number, then the next step is to build the following mathematical model, *f*:

*f_{w,b}(x) = wx + b*

Here, *w* is a *D*-dimensional parameter vector and *b* is a real number. The notation *f_{w,b}* signifies that the model *f* is parameterized by the values *w* and *b*. Once we have a well-defined model, it can be used to predict the unknown *y* for a given *x*, that is, *y ← f_{w,b}(x)*. However, there is an issue: since the model is parameterized by the two values (*w*, *b*), different choices of these parameters produce different predictions when applied to the same sample, even when it comes from the same distribution.

This can literally be referred to as an optimization problem, where the objective is to find the optimal (that is, loss-minimizing) values of the parameters, so that the model makes more accurate predictions. In short, in the LR model, we intend to find the optimal values for *w* and *b* to minimize the following objective function:

*(1/N) Σ_{i=1..N} (f_{w,b}(x_{i}) - y_{i})^{2}*
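To make the objective concrete, here is a minimal sketch (with made-up data and an illustrative learning rate) that minimizes the average squared loss with plain gradient descent:

```python
import numpy as np

# Toy data generated from y = 2x + 1 (illustrative only)
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # N x D feature matrix
y = np.array([3.0, 5.0, 7.0, 9.0])           # targets

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.05  # learning rate

for _ in range(2000):
    pred = X @ w + b                     # f_{w,b}(x) = w.x + b
    err = pred - y
    w -= lr * 2.0 * (X.T @ err) / len(y) # gradient of the average loss w.r.t. w
    b -= lr * 2.0 * err.mean()           # gradient of the average loss w.r.t. b

print(w, b)  # approaches w ≈ [2], b ≈ 1
```

In practice a library solver (or the closed-form normal equations) would be used instead, but the loop shows exactly what "finding the optimal *w* and *b*" means.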

In the preceding equation, the expression *(f_{w,b}(x_{i}) - y_{i})^{2}* is called the **loss function**, which is a measure of the penalty (that is, error or loss) for giving the wrong prediction for sample *i*. This particular loss function is the squared error loss. However, other loss functions can be used too, as outlined in the following equations:

*SE = (f_{w,b}(x_{i}) - y_{i})^{2}* (1)

*AE = |f_{w,b}(x_{i}) - y_{i}|* (2)

The **squared error** (**SE**) in equation (1) is called *L_{2}* loss, which is the default loss function for regression analysis tasks. On the other hand, the **absolute error** (**AE**) in equation (2) is called *L_{1}* loss. In cases where the dataset has many outliers, using *L_{1}* loss is recommended over *L_{2}*, because *L_{1}* is more robust against outliers.

_{1}All model-based learning algorithms have a loss function associated with them. Then we try to find the best model by minimizing the cost function. In our LR case, the cost function is defined by the average loss (also called empirical risk), which can be formulated as the average of all penalties obtained by fitting the model to the training data, which may contain many samples.

*Figure 4* shows an example of simple linear regression. Let's say the idea is to predict the amount of **Savings** versus **Age**. So, in this case, we have one independent variable, *x* (that is, a set of 1D data points; in our case, the **Age**), and one dependent variable, *y* (the amount of **Savings (in millions $)**). Once we have a trained regression model, we can use this line to predict the value of the target *y_{l}* for a new unlabeled input example, *x_{l}*. However, in the case of *D*-dimensional feature vectors (for example, *2D* or *3D*), the model would be a plane (for *2D*) or a hyperplane (for *>=3D*):
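As a rough sketch of the 1D case (the **Age**/**Savings** numbers below are invented, since the data behind *Figure 4* is not given here), we can fit the regression line with NumPy and predict *y_{l}* for a new *x_{l}*:

```python
import numpy as np

# Hypothetical Age vs Savings (in millions $) data for illustration
age = np.array([25.0, 30.0, 35.0, 40.0, 45.0, 50.0])
savings = np.array([0.5, 1.0, 1.4, 2.1, 2.4, 3.0])

# Degree-1 polynomial fit is exactly the simple linear regression line
slope, intercept = np.polyfit(age, savings, deg=1)

# Predict savings y_l for a new, unlabeled example x_l (age 42)
y_l = slope * 42.0 + intercept
print(round(y_l, 2))  # a bit over 2 million $
```

With a *D*-dimensional input, `np.linalg.lstsq` on an *N × (D+1)* design matrix would play the same role, fitting the plane or hyperplane instead of a line.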

Now you can see why it is important that the regression hyperplane lies as close to the training examples as possible: if the blue line in *Figure 4* (the model on the right) is far away from the blue dots, the prediction *y_{l}* is less likely to be correct. The best-fit line, which is expected to pass through most of the data points, is the result of the regression analysis. However, in practice, it does not pass through all of the data points because of regression errors.

Since solving a regression problem is itself an optimization problem, we want the errors to be as small as possible, because smaller errors contribute towards higher predictive accuracy when predicting unseen observations. Although an LR algorithm is not very efficient in many cases, the nicest thing is that an LR model usually does not overfit, which cannot be said for more complex models.

In the previous chapter, we discussed overfitting (a phenomenon whereby a model predicts very well during training but makes more errors when applied to the test set; if your training error is low and your validation error is high, your model is most likely overfitting the training data) and underfitting (whereby the model performs poorly on both the training and the validation data, typically because it is too simple to capture the underlying relationship). Often, these two phenomena occur due to bias and variance.
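A minimal sketch of overfitting on synthetic data (the data, seed, and polynomial degrees are all illustrative): a degree-9 polynomial interpolates all ten noisy training points, driving the training error to nearly zero, while a straight line keeps some training error but matches the true underlying relationship:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten noisy samples from the true linear relationship y = 2x + 1
x = np.linspace(0, 1, 10)
y = 2 * x + 1 + rng.normal(0, 0.2, size=10)

def train_error(degree):
    """Mean squared error of a polynomial fit on its own training data."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# The degree-9 fit has near-zero training error (it memorizes the noise),
# which is exactly the overfitting symptom described above; on a held-out
# validation set it would typically do worse than the degree-1 fit.
print(train_error(1), train_error(9))
```

The gap between the two training errors is the noise that the flexible model has memorized; evaluating both fits on fresh data from the same distribution would reveal the resulting generalization penalty.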