Book Image

Training Systems using Python Statistical Modeling

By : Curtis Miller
Book Image

Training Systems using Python Statistical Modeling

By: Curtis Miller

Overview of this book

Python's ease-of-use and multi-purpose nature has made it one of the most popular tools for data scientists and machine learning developers. Its rich libraries are widely used for data analysis, and more importantly, for building state-of-the-art predictive models. This book is designed to guide you through using these libraries to implement effective statistical models for predictive analytics. You’ll start by delving into classical statistical analysis, where you will learn to compute descriptive statistics using pandas. You will focus on supervised learning, which will help you explore the principles of machine learning and train different machine learning models from scratch. Next, you will work with binary prediction models, such as data classification using k-nearest neighbors, decision trees, and random forests. The book will also cover algorithms for regression analysis, such as ridge and lasso regression, and their implementation in Python. In later chapters, you will learn how neural networks can be trained and deployed for more accurate predictions, and understand which Python libraries can be used to implement them. By the end of this book, you will have the knowledge you need to design, build, and deploy enterprise-grade statistical models for machine learning using Python and its rich ecosystem of libraries for predictive analytics.
Table of Contents (9 chapters)

Finding correlations

In the final section of this chapter, we will learn about computing correlations using pandas and SciPy. We will look at how to use pandas and SciPy to compute correlations in datasets, and also explore some statistical tests to detect correlation.

In this section, I have used Pearson's correlation coefficient, which quantifies how strongly two variables are linearly correlated. This is a unitless number that takes values between -1 and 1. The sign of the correlation coefficient indicates the direction of the relationship. A positive r indicates that as one variable increases, the other tends to increase; while a negative r indicates that as one variable increases, the other tends to decrease. The magnitude indicates the strength of the relationship. If r is close to 1 or -1, then the relationship is almost a perfect linear relationship, while an r that is close to 0 indicates no linear relationship. The NumPy corrcoef() function computes the number for two NumPy arrays.

So, here's the correlation coefficient's definition. We're going to work with what's known as the Boston housing price dataset. This is known as a toy dataset, and is often used for evaluating statistical learning methods. I'm only interested in looking at correlations between the variables in this dataset.

Note that we will be using a slightly modified version of the dataset, which can be found in the GitHub repository for this chapter.

We will use the following steps:

  1. We will first load up all the required libraries, as follows:
  1. We will then load in the dataset using the following code:
  1. Then, we will print the dataset information, so that we can have a look at it, as follows:

So, these attributes tell us what this dataset contains. It has a few different variables such as, for example, the crime rate by town, the proportion of residential land zones, non-retail business acres, and others. Therefore, we have a number of different attributes for you to play with.

  1. We will now load the actual dataset, as follows:

This is going to very useful to us. Now, let's take a look at the correlation between the two variables.

  1. We're going to import the corrcoef() function, and we're going to look at one of the columns from this dataset as if it were a NumPy array, as follows:

It started out as a NumPy array, but it's perfectly fine to just grab a pandas series and turn it into an array.

  1. Now we're going to take corrcoef(), and compute the correlation between the crime rate and the price:

We can see that the numbers in the matrix are the same.

The number in the off-diagonal entries corresponds to the correlation between the two variables. Therefore, there is a negative relationship between crime rate and price, which is not surprising. As the crime rate increases, you would expect the price to decrease. But this is not a very strong correlation. We often want correlations not just for two variables, but for many combinations of two variables. We can use the pandas DataFrame corr() method to compute what's known as the correlation matrix. This matrix contains correlations for every combination of variables in our dataset.

  1. So, let's go ahead and compute a correlation matrix for the Boston dataset, using the corr() function:

These are all correlations. Every entry off the diagonal will have a corresponding entry on the other side of the diagonal—this is a symmetric matrix. Now, while this matrix contains all the data that we might want, it is very difficult to read.

  1. So, let's use a package called seaborn, as this is going to allow us to plot a heatmap. A heatmap has colors with differing intensities for how large or small a number tends to be. So, we compute a heatmap and plot it as follows:

The preceding screenshot shows the resulting map. Here, black indicates correlations close to -1, and very light colors indicate correlations that are close to 1. From this heatmap, we can start to see some patterns.

Testing for correlation

SciPy contains the pearsonr() function in its statistics module, which not only computes the correlation between two variables but also allows for hypothesis testing to detect correlation. Rejecting the null hypothesis will signify that the correlation between the two variables, in general, is not 0. However, be aware that rejecting the null hypothesis does not automatically signify a meaningful relationship between the two variables. It's possible that the correlation is not 0, but still too small to be meaningful. This is especially true when performing the test on large samples.

We're going to do a statistical test using the following hypotheses:

We import the pearsonr() function, and we run it on crime and price to see whether these things are correlated with statistical significance:

Here, we have a very small p value, which suggests that yes, indeed, the crime rate and the price of a home appear to be correlated. If we had to take a guess, then we would say that it would be negatively correlated.