In the previous chapter, we gained a basic understanding of the machine learning (ML) process and saw the basic distinction between regression and classification. Regression analysis is a set of statistical processes for estimating the relationship between a dependent variable and one or more independent variables. The value of the dependent variable depends on the values of the independent variables.
Regression analysis helps us understand this dependency, that is, how the value of the dependent variable changes when one of the independent variables is varied while the others are held fixed. For example, let's assume that people accumulate more savings in their bank accounts as they grow older. Here, the amount of Savings (in million $) depends on Age (in years):
Age (years) | Savings (million $) |
40 | 1.5 |
50 | 5.5 |
60 | 10.8 |
70 | 6.7 |
We can plot these values in a 2D chart, with the dependent variable (Savings) on the y-axis and the independent variable (Age) on the x-axis. Once the data points are plotted, we can look for a correlation between the two. If the chart indeed reflects the effect of age on savings, we can say that savings tend to grow as a person gets older.
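To put a number on the correlation visible in such a plot, one option is the Pearson correlation coefficient. The following is a minimal sketch in Python, assuming only the four Age/Savings pairs from the table; the helper name `pearson_r` is hypothetical and introduced here for illustration.

```python
import math

# Age/Savings pairs taken from the table above.
ages = [40, 50, 60, 70]
savings = [1.5, 5.5, 10.8, 6.7]


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(ages, savings)
print(f"Pearson r = {r:.2f}")
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 little linear relationship. With only four points, the result should be read as an illustration rather than as evidence.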
The next question is how to quantify the degree to which age affects the amount of savings. To answer it, we can draw a line through the middle of the data points on the chart. This line is called the regression line, and a regression analysis algorithm can calculate it precisely. Such an algorithm takes discrete or continuous input features (or both) and produces continuous output values.
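One common way to compute a regression line is ordinary least squares, which picks the slope and intercept that minimize the sum of squared vertical distances between the line and the data points. The sketch below applies it to the Age/Savings table; the function name `fit_line` is hypothetical, and the choice of ordinary least squares is an assumption, since the text does not name a specific algorithm.

```python
# Age/Savings pairs taken from the table above.
ages = [40, 50, 60, 70]
savings = [1.5, 5.5, 10.8, 6.7]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, savings)
print(f"Savings = {slope:.3f} * Age + ({intercept:.2f})")
```

The slope is the quantity of interest here: it estimates how much Savings (in million $) changes for each additional year of Age, according to the fitted line.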
Making a prediction with such a regression model on new, unseen observations is like building a data pipeline with multiple components working together, in which the algorithm's performance is observed in two stages: learning and inference. Throughout this process, data is the first-class citizen: as in all ML tasks, the success of the predictive model depends on it.
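The two stages can be sketched as a small model object with a `fit` method (learning) and a `predict` method (inference). This is a minimal illustration, not a full pipeline: the class name `SimpleLinearRegressor`, the use of ordinary least squares, and the unseen age of 55 are all assumptions made for the example.

```python
class SimpleLinearRegressor:
    """Hypothetical one-feature linear model with separate learning and inference steps."""

    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, xs, ys):
        """Learning stage: estimate slope and intercept from training data."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        s_xx = sum((x - mean_x) ** 2 for x in xs)
        self.slope = s_xy / s_xx
        self.intercept = mean_y - self.slope * mean_x
        return self

    def predict(self, x):
        """Inference stage: apply the learned parameters to a new observation."""
        if self.slope is None:
            raise RuntimeError("call fit() before predict()")
        return self.slope * x + self.intercept


# Learning: fit the model on the Age/Savings pairs from the table above.
model = SimpleLinearRegressor().fit([40, 50, 60, 70], [1.5, 5.5, 10.8, 6.7])

# Inference: predict Savings for an age that was not in the training data.
new_age = 55  # hypothetical unseen observation
predicted = model.predict(new_age)
print(f"Predicted savings at age {new_age}: {predicted:.3f} million $")
```

Keeping the two stages separate makes it explicit that the parameters are learned once from data and then reused for every new observation.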