The Data Science Workshop

By: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare
Overview of this book

You already know you want to learn data science, and a smarter way to learn data science is to learn by doing. The Data Science Workshop focuses on building up your practical skills so that you can understand how to develop simple machine learning models in Python or even build an advanced model for detecting potential bank fraud with effective modern data science. You'll learn from real examples that lead to real results. Throughout The Data Science Workshop, you'll take an engaging step-by-step approach to understanding data science. You won't have to sit through any unnecessary theory. If you're short on time, you can jump into a single exercise each day or spend an entire weekend training a model using scikit-learn. It's your choice. Learning on your terms, you'll build up and reinforce key skills in a way that feels rewarding. Every physical print copy of The Data Science Workshop unlocks access to the interactive edition. With videos detailing all exercises and activities, you'll always have a guided solution. You can also benchmark yourself against assessments, track progress, and receive content updates. You'll even earn a secure credential that you can share and verify online upon completion. It's a premium learning experience that's included with your printed copy. To redeem, follow the instructions located at the start of your data science book. Fast-paced and direct, The Data Science Workshop is the ideal companion for data science beginners. You'll learn about machine learning algorithms like a data scientist, learning along the way. This process means that you'll find that your new skills stick, embedded as best practice. A solid foundation for the years ahead.
Explaining the Results of Regression Analysis

A primary objective of regression analysis is to find a model that explains the variability observed in a dependent variable of interest. It is, therefore, very important to have a quantity that measures how well a regression model explains this variability. The statistic that does this is called R-squared (R²). Sometimes, it is also called the coefficient of determination. To understand what it actually measures, we need to take a look at some other definitions.

The first of these is called the Total Sum of Squares (TSS). TSS measures the total variability of the dependent variable around its mean value: it is the sum of the squared differences between each observed value and the mean.

The next quantity is called the Regression Sum of Squares (RSS). It measures the amount of variability in the dependent variable that our model explains. If we could create a perfect model with no errors in prediction, RSS would be equal to TSS: the model would account for all the variability we see in the dependent variable around its mean value. In practice, this rarely happens. Our models are not perfect, so RSS is less than TSS. The amount by which RSS falls short of TSS is the variability in the dependent variable that our regression model is not able to explain. That quantity is the Error Sum of Squares (ESS), which is the sum of the squared residuals of our model.

R-squared is the ratio of RSS to TSS. It therefore gives us the proportion (often quoted as a percentage) of the total variability in the dependent variable around its mean that our regression model is able to explain. R² becomes smaller when RSS becomes smaller, and larger when RSS becomes larger. In simple linear regression, where there is only one independent variable, R² is enough to judge the overall fit of the model to the data.
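The definitions above can be checked numerically. The following is a minimal sketch, assuming a simple linear regression fitted by ordinary least squares; the function name sums_of_squares and the five data points are made up for illustration and do not come from the book's dataset. It uses this chapter's naming, where RSS is the regression sum of squares and ESS is the error sum of squares.

```python
def sums_of_squares(x, y):
    """Fit y = b0 + b1 * x by least squares; return (tss, rss, ess, r_squared)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n

    # Closed-form least squares estimates for one independent variable.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    fitted = [b0 + b1 * xi for xi in x]

    # TSS: variability of the observations around their mean.
    tss = sum((yi - y_mean) ** 2 for yi in y)
    # RSS: variability of the fitted values around the mean (explained part).
    rss = sum((fi - y_mean) ** 2 for fi in fitted)
    # ESS: sum of squared residuals (unexplained part).
    ess = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return tss, rss, ess, rss / tss


# Illustrative data only (an assumption, not taken from the book).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

tss, rss, ess, r_squared = sums_of_squares(x, y)
print(f"TSS={tss:.3f}  RSS={rss:.3f}  ESS={ess:.3f}  R-squared={r_squared:.4f}")
```

For a least squares fit that includes an intercept, TSS equals RSS plus ESS, so the printed RSS and ESS should add up to the printed TSS.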

There is a problem, however, when it comes to multiple linear regression. R² is sensitive to the addition of extra independent variables to the model: adding an independent variable that is only slightly correlated with the dependent variable will still increase R². Relying on R² alone to choose between models defined for the same dependent variable therefore leads to chasing a complex model that has many independent variables in it. This complexity is not helpful in practice. In fact, it may lead to a modeling problem called overfitting.

To overcome this problem, the Adjusted R² (denoted Adj. R-squared on the output of statsmodels) is used to select between models defined for the same dependent variable. Adjusted R² will increase only when the addition of an independent variable to the model contributes to explaining the variability in the dependent variable in a meaningful way.
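To see how the adjustment penalizes extra variables, here is a small sketch using the usual textbook formula for Adjusted R², assuming a model with an intercept. The function name adjusted_r_squared and the two candidate models (their R² values, sample size, and variable counts) are hypothetical and chosen only to illustrate the effect.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept.

    n_obs is the number of observations and n_predictors is the number of
    independent variables (not counting the intercept).
    """
    df_resid = n_obs - n_predictors - 1
    if df_resid <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r_squared) * (n_obs - 1) / df_resid


# Two hypothetical models for the same dependent variable, 50 observations each.
small_model = adjusted_r_squared(0.800, n_obs=50, n_predictors=3)
large_model = adjusted_r_squared(0.802, n_obs=50, n_predictors=6)

print(f"3 variables: R-squared=0.800, adjusted={small_model:.4f}")
print(f"6 variables: R-squared=0.802, adjusted={large_model:.4f}")
```

In this made-up comparison, the larger model has the higher R² but the lower Adjusted R², because the three extra variables add very little explanatory power.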

In Activity 2.02, our model explained 88 percent of the variability in the transformed dependent variable, which is really good. We started with simple models and worked to improve the fit of the models using different techniques. All the exercises and activities done in this chapter have pointed out that the regression analysis workflow is iterative. You start by plotting to get a visual picture and follow from there to improve upon the model you finally develop by using different techniques. Once a good model has been developed, the next step is to validate the model statistically before it can be used for making a prediction or acquiring insight for decision making. Next, let's discuss what validating the model statistically means.

Regression Analysis Checks and Balances

In the preceding discussions, we used the R-squared and the Adjusted R-squared statistics to assess the goodness of fit of our models. While the R-squared statistic provides an estimate of the strength of the relationship between a model and the dependent variable(s), it does not provide a formal statistical hypothesis test for this relationship.

What do we mean by a formal statistical hypothesis test for a relationship between a dependent variable and some independent variable(s) in a model?

We must recall that, to say an independent variable has a relationship with a dependent variable in a model, the coefficient (β) of that independent variable in the regression model must not be zero (0). It is well and good to conduct a regression analysis with our Boston Housing dataset and find an independent variable (say the median value of owner-occupied homes) in our model to have a nonzero coefficient (β).

The question is: will we (or someone else) still find a nonzero coefficient (β) for the median value of owner-occupied homes if we repeat this analysis using a different sample of the Boston Housing dataset taken at different locations or times? Is the nonzero coefficient for the median value of owner-occupied homes, found in our analysis, specific to our sample dataset and zero for any other Boston Housing data sample that may be collected? Did we find the nonzero coefficient for the median value of owner-occupied homes by chance? These questions are what hypothesis tests seek to clarify. We cannot be a hundred percent sure whether the nonzero coefficient (β) of an independent variable arose by chance or not. But hypothesis testing gives us a framework for calculating the level of confidence with which we can say that the nonzero coefficient (β) found in our analysis is not due to chance. This is how it works.

We first agree on a level of risk (called the α-value or α-risk, which is the probability of making a Type I error) that the nonzero coefficient (β) may have been found by chance. The idea is that we are happy to live with this level of risk of making the mistake of claiming that the coefficient (β) is nonzero when in fact it is zero.

In most practical analyses, the α-value is set at 0.05, that is, 5%. When we subtract the α-risk from one (1-α), we have a measure of the level of confidence that the nonzero coefficient (β) found in our analysis did not come about by chance. So, our confidence level is 95% at a 5% α-value.

We then calculate a probability value (usually called the p-value) for the coefficient (β) of interest in our model. The p-value is the probability of obtaining a coefficient at least as far from zero as the one we found if the true coefficient were in fact zero. We compare the p-value to our chosen α-risk, and if the p-value is less than the agreed α-risk, we reject the idea that the nonzero coefficient (β) was found by chance. This is because the risk of making the mistake of claiming that the coefficient (β) is nonzero is within the acceptable limit we set for ourselves earlier.

Another way of stating that the nonzero coefficient (β) was NOT found by chance is to say that the coefficient (β) is statistically significant or that we reject the null hypothesis (the null hypothesis being that there is no relationship between the variables being studied). We apply these ideas of statistical significance to our models in two stages:

  1. In stage one, we validate the model as a whole statistically.
  2. In stage two, we validate the independent variables in our model individually for statistical significance.

The F-test

The F-test validates the overall statistical significance of the relationship between the dependent variable and the independent variables in a model. Its null hypothesis is that the coefficients of all the independent variables are zero. If the p-value for the F-test is less than the chosen α-level (0.05, in our case), we reject the null hypothesis and conclude that the model is statistically significant overall.

When we fit a regression model, we generate an F-value. This value can be used to determine whether the model as a whole is statistically significant. In general, for a fixed number of observations and independent variables, an increase in R² increases the F-value. This means that the larger the F-value, the better the chances of the overall statistical significance of a model.
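The link between R² and the F-value can be made concrete with the standard formula for the overall F-statistic of a linear regression that includes an intercept. The sketch below is illustrative only: the function name f_statistic and the sample size, variable count, and R² values are assumptions, not numbers taken from Figure 2.19.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of a linear regression with an intercept.

    F = (R^2 / p) / ((1 - R^2) / (n - p - 1)), where p is the number of
    independent variables and n is the number of observations.
    """
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    if df_model <= 0 or df_resid <= 0:
        raise ValueError("degrees of freedom must be positive")
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must be in the interval [0, 1)")
    return (r_squared / df_model) / ((1 - r_squared) / df_resid)


# Hypothetical setting: 100 observations and 4 independent variables.
for r2 in (0.10, 0.50, 0.90):
    print(f"R-squared={r2:.2f} -> F={f_statistic(r2, n_obs=100, n_predictors=4):.2f}")
```

With the sample size and the number of variables held fixed, the printed F-value grows as R² grows. Turning an F-value into a p-value additionally requires the F distribution with the same two degrees of freedom, which statsmodels reports as Prob (F-statistic) in its summary.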

A good F-value is expected to be larger than one. The model in Figure 2.19 has an F-statistic value of 261.5, which is larger than one, and a p-value (Prob (F-statistic)) of approximately zero. The risk of making a mistake and rejecting the null hypothesis when we should not (known as a Type I error in hypothesis testing), is less than the 5% limit we chose to live with at the beginning of the hypothesis test. Because the p-value is less than 0.05, we reject the null hypothesis about our model in Figure 2.19. Therefore, we state that the model is statistically significant at the chosen 95% confidence level.

The t-test

Once a model has been determined to be statistically significant globally, we can proceed to examine the significance of the individual independent variables in the model. In Figure 2.19, the p-values (denoted P>|t| in Section 2) for the independent variables are provided. The p-values were calculated using the t-values also given in the summary results. The process is no different from what was just discussed for the global case. We compare the p-values to the 0.05 α-level. If an independent variable has a p-value of less than 0.05, the independent variable is statistically significant in our model in explaining the variability in the dependent variable. If the p-value is 0.05 or higher, the particular independent variable (or term) in our model is not statistically significant. This means that the term does not contribute, in a statistically significant way, toward explaining the variability in our dependent variable.

A close inspection of Figure 2.19 shows that some of the terms have p-values larger than 0.05. These terms don't contribute in a statistically significant way to explaining the variability in our transformed dependent variable. To improve this model, those terms will have to be dropped and a new model tried. It is clear by this point that the process of building a regression model is truly iterative.
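The per-term decision rule described above can be sketched as a small helper. The function name split_by_significance, the term names, and the p-values below are all hypothetical and are not read from Figure 2.19; with statsmodels, a comparable mapping of term names to p-values would typically come from the pvalues attribute of the fitted results.

```python
def split_by_significance(p_values, alpha=0.05):
    """Split model terms into (significant, not_significant) at level alpha.

    A term is treated as significant only when its p-value is strictly
    less than alpha, matching the rule described in the text.
    """
    significant = [term for term, p in p_values.items() if p < alpha]
    not_significant = [term for term, p in p_values.items() if p >= alpha]
    return significant, not_significant


# Hypothetical p-values for four terms of a fitted model.
p_values = {"term_a": 0.001, "term_b": 0.030, "term_c": 0.050, "term_d": 0.412}

keep, drop = split_by_significance(p_values)
print("statistically significant:", keep)
print("candidates to drop before refitting:", drop)
```

After dropping terms, the model has to be refitted and checked again, because the remaining coefficients and their p-values change once terms are removed.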