Training Systems using Python Statistical Modeling

By: Curtis Miller

Overview of this book

Python's ease of use and multi-purpose nature have made it one of the most popular tools for data scientists and machine learning developers. Its rich libraries are widely used for data analysis and, more importantly, for building state-of-the-art predictive models. This book is designed to guide you through using these libraries to implement effective statistical models for predictive analytics. You'll start by delving into classical statistical analysis, where you will learn to compute descriptive statistics using pandas. You will then focus on supervised learning, which will help you explore the principles of machine learning and train different machine learning models from scratch. Next, you will work with binary prediction models, such as data classification using k-nearest neighbors, decision trees, and random forests. The book also covers algorithms for regression analysis, such as ridge and lasso regression, and their implementation in Python. In later chapters, you will learn how neural networks can be trained and deployed for more accurate predictions, and understand which Python libraries can be used to implement them. By the end of this book, you will have the knowledge you need to design, build, and deploy enterprise-grade statistical models for machine learning using Python and its rich ecosystem of libraries for predictive analytics.

Classical inference for proportions

In classical statistical inference, we answer questions about a population, which is the hypothetical group of all possible values and data (including future ones). A sample, on the other hand, is a subset of the population whose values we actually observe. In classical statistical inference, we often seek to answer questions about a fixed, non-random, unknown population parameter.

Confidence intervals are computed from data and are expected to contain the unknown parameter, θ. We may refer to, say, a 95% confidence interval; that is, an interval that we are 95% confident contains θ, in the sense that there is a 95% chance that when we compute such an interval, we capture θ in it.

This section focuses on binary variables, where each observation is either a success or a failure, and successes occur in the population with a proportion, or probability, of p.

An example situation of this is tracking whether a visitor to a website clicked on an ad during their visit. Often, these variables are encoded numerically, with 1 for a success and 0 for a failure.
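As a quick illustration (the data here is hypothetical), the sample proportion is simply the mean of the encoded values:

```python
import numpy as np

# Hypothetical encoding of five visits: 1 = clicked the ad, 0 = did not
clicks = np.array([1, 0, 0, 1, 0])
print(clicks.mean())  # the sample proportion of successes, here 0.4
```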

In classical statistics, we assume that our data is a random sample drawn from a population with a fixed, yet unknown, proportion, p. We can construct a confidence interval based on the sample proportion, which gives us an idea of the population proportion. A 95% confidence interval captures the population proportion approximately 95% of the time. We can construct confidence intervals using the proportion_confint() function from the statsmodels package, which allows the easy computation of confidence intervals. Let's now see this in action!

Computing confidence intervals for proportions

The sample proportion is computed by counting the number of successes and dividing this by the total sample size. This can be better explained using the following formula:

$$\hat{p} = \frac{M}{N}$$

Here, N is the sample size and M is the number of successes; this gives you the sample proportion of successes, $\hat{p}$.

Now, we want to be able to make a statement about the population proportion, which is a fixed, yet unknown, quantity. We will construct a confidence interval for this proportion, using the following formula:

$$\hat{p} \pm z_{1-\frac{\alpha}{2}} \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}$$

Here, $z_p$ is the $100 \times p$-th percentile of the standard normal distribution, and $1-\alpha$ is the confidence level; for a 95% confidence interval, $\alpha = 0.05$, so $z_{0.975} \approx 1.96$.

Now, let's suppose that, on a certain website, out of 1,126 visitors, 310 clicked on a certain ad. Let's construct a confidence interval for the population proportion of visitors who clicked on the ad. This will allow us to predict future clicks. We will use the following steps to do so:

  1. Let's first load the statsmodels package and compute the sample proportion, which, in this case, is 310 out of 1,126:
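A minimal sketch of this step (the variable name p_hat is illustrative):

```python
# Sample proportion: number of successes divided by the sample size
p_hat = 310 / 1126
print(p_hat)  # approximately 0.2753
```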

You can see that approximately 28% of the visitors to the website clicked on the ad on that day.

  2. Our next step is to actually construct a confidence interval using the proportion_confint() function. We pass the number of successes as the count argument, the number of trials as the nobs argument, and the significance level as the alpha argument (alpha=0.05 corresponds to 95% confidence), as shown in the following code snippet:
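A minimal sketch, assuming statsmodels is installed:

```python
from statsmodels.stats.proportion import proportion_confint

# 95% confidence interval; alpha is the significance level, 1 - 0.95 = 0.05
low, high = proportion_confint(count=310, nobs=1126, alpha=0.05)
print(low, high)  # roughly 0.249 and 0.301
```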

As you can see here, with 95% confidence, the proportion is between approximately 25% and 30%.

  3. If we wanted a higher confidence level, that is, a 99% confidence interval, then we could specify a different alpha, as follows:
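Continuing the sketch, a smaller alpha produces a wider interval:

```python
from statsmodels.stats.proportion import proportion_confint

# 99% confidence interval; alpha = 1 - 0.99 = 0.01, so the interval widens
low, high = proportion_confint(count=310, nobs=1126, alpha=0.01)
print(low, high)  # roughly 0.241 and 0.310
```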

Hypothesis testing for proportions

With hypothesis testing, we attempt to decide between two competing hypotheses, which are statements about the value of the population proportion. These are referred to as the null hypothesis and the alternative hypothesis: the null hypothesis, H₀, states that the proportion equals some claimed value, p₀, while the alternative hypothesis, H₁, states that it differs from that value.

If the sample would be unlikely to occur were the null hypothesis true, then we reject the null hypothesis and conclude that the alternative hypothesis must be true. We measure how unlikely a sample is by computing a p value, using a test statistic. p values represent the probability of observing a test statistic that is at least as contradictory to the null hypothesis as the one computed. Small p values indicate stronger evidence against the null hypothesis. Statisticians often introduce a cutoff and say that if the p value is less than, say, 0.05, then we should reject the null hypothesis in favor of the alternative. We can choose any cutoff we want, depending on how strong we want the evidence against the null hypothesis to be before rejecting it, but I don't recommend making your cutoff greater than 0.05. So, let's examine this in action.

Let's say that the website's administrator claims that 30% of visitors to the website clicked on the advertisement; is this true? Well, the sample proportion will never exactly match this number, but we can still decide whether the sample proportion is evidence against it. So, we're going to test the null hypothesis that p = 0.3, which is what the website administrator claims, against the alternative hypothesis that p ≠ 0.3. So, now let's go ahead and compute the p value.

First, we're going to import the proportions_ztest() function. We give it how many successes there were in the data, the total number of observations, the value of p under the null hypothesis, and, additionally, we tell it what type of alternative hypothesis we're using:
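A minimal sketch of this call:

```python
from statsmodels.stats.proportion import proportions_ztest

# Two-sided test of H0: p = 0.3 against H1: p != 0.3
stat, pvalue = proportions_ztest(count=310, nobs=1126, value=0.3,
                                 alternative='two-sided')
print(stat, pvalue)  # test statistic about -1.85, p value about 0.0636
```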

We can see the result here; the first value is the test statistic and the second one is the p value. In this case, the p value is about 0.0636. Since this is greater than our cutoff of 0.05, we conclude that there is not enough statistical evidence to disagree with the website administrator.

Testing for common proportions

Now, let's move on to comparing the proportions between two samples. We want to know whether the samples were drawn from populations with the same proportion or not. This could show up in a context such as A/B testing, where a website wants to determine which of two types of ads generates more clicks.

We can still use the statsmodels function, proportions_ztest(), but we now pass NumPy arrays to the count and nobs arguments, which contain the relevant data for the two samples.

So, our website wants to conduct an experiment. The website will show some of its visitors different versions of an advertisement created by a sponsor. Users are randomly assigned by the server to either Version A or Version B. The website will track how often Version A was clicked on and how often Version B was clicked on by its users. On a particular day, 516 visitors saw Version A and 510 saw Version B. 108 of those who saw Version A clicked on it, and 144 of those who saw Version B clicked on it. So, we want to know which ad generated more clicks.

We're going to be testing the null hypothesis that the two versions have the same click proportion, H₀: p_A = p_B, against the two-sided alternative, H₁: p_A ≠ p_B.

Let's go ahead and import the numpy library; we're going to use NumPy arrays to contain our data, as follows:
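A minimal sketch of the imports:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
```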

We will then assign the arrays and define the alternative as two-sided, as follows:
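Continuing the sketch (using the imports above; the array names count and nobs match the function's arguments):

```python
# Clicks (successes) and impressions (trials) for Versions A and B
count = np.array([108, 144])
nobs = np.array([516, 510])

# Two-sample z-test of H0: p_A = p_B against a two-sided alternative
stat, pvalue = proportions_ztest(count=count, nobs=nobs,
                                 alternative='two-sided')
print(stat, pvalue)  # p value around 0.0066
```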

We end up with a p value of around 0.0066, which is smaller than our cutoff, so we reject the null hypothesis. It appears that the two ads do not have the same proportion of clicks. We have looked at hypothesis testing for proportions; we will now apply everything that we have learned to means.