# Statistical foundations

When we want to make observations about the data we are analyzing, we often, if not always, turn to statistics in some fashion. The data we have is referred to as the **sample**, which was observed from (and is a subset of) the **population**. Two broad categories of statistics are descriptive and inferential statistics. With **descriptive statistics**, as the name implies, we are looking to *describe* the sample. **Inferential statistics** involves using the sample statistics to *infer*, or deduce, something about the population, such as the underlying distribution.

Important note

Sample statistics are used as **estimators** of the population parameters, meaning that we have to quantify their bias and variance. There is a multitude of methods for this; some will make assumptions on the shape of the distribution (parametric) and others won't (non-parametric). This is all well beyond the scope of this book, but it is good to be aware of.

Often, the goal of an analysis is to create a story for the data; unfortunately, it is very easy to misuse statistics.

This is especially true of inferential statistics, which is used in many scientific studies and papers to show the significance of the researchers' findings. This is a more advanced topic and, since this isn't a statistics book, we will only briefly touch upon some of the tools and principles behind inferential statistics, which can be pursued further. We will focus on descriptive statistics to help explain the data we are analyzing.

## Sampling

There's an important thing to remember before we attempt any analysis: our sample must be a **random sample** that is representative of the population. This means that the data must be sampled without bias (for example, if we are asking people whether they like a certain sports team, we can't only ask fans of the team) and that we should have (ideally) members of all distinct groups from the population in our sample (in the sports team example, we can't just ask men).

When we discuss machine learning in *Chapter 9*, *Getting Started with Machine Learning in Python*, we will need to sample our data, which will be a sample to begin with. This is called **resampling**. Depending on the data, we will have to pick a different method of sampling. Often, our best bet is a **simple random sample**: we use a random number generator to pick rows at random. When we have distinct groups in the data, we want our sample to be a **stratified random sample**, which will preserve the proportion of the groups in the data. In some cases, we don't have enough data for the aforementioned sampling strategies, so we may turn to random sampling with replacement (**bootstrapping**); this is called a **bootstrap sample**. Note that our underlying sample needs to have been a random sample or we risk increasing the bias of the estimator (we could pick certain rows more often because they are in the data more often if it was a convenience sample, while in the true population these rows aren't as prevalent). We will see an example of bootstrapping in *Chapter 8*, *Rule-Based Anomaly Detection*.
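The three sampling strategies described above can be sketched with Python's standard `random` module. This is only an illustration under stated assumptions: the `rows` data and its 80/20 group split are hypothetical, and the 10% sampling fraction is an arbitrary choice.

```python
import random
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: (respondent_id, group) pairs with an 80/20 group split
rows = [(i, "A" if i < 80 else "B") for i in range(100)]

# Simple random sample: every row is equally likely and no row is picked twice
simple = random.sample(rows, k=10)

# Stratified random sample: draw 10% within each group to preserve the 80/20 split
by_group = defaultdict(list)
for row in rows:
    by_group[row[1]].append(row)

stratified = []
for members in by_group.values():
    stratified.extend(random.sample(members, k=len(members) // 10))

# Bootstrap sample: same size as the data, drawn with replacement,
# so some rows can repeat while others are left out
bootstrap = random.choices(rows, k=len(rows))
```

Note that the stratified sample always contains eight rows from group A and two from group B, whereas the simple random sample only matches that proportion on average.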

Important note

A thorough discussion of the theory behind bootstrapping and its consequences is well beyond the scope of this book, but watch this video for a primer: https://www.youtube.com/watch?v=gcPIyeqymOU.

You can read more about sampling methods, along with their strengths and weaknesses, at https://www.khanacademy.org/math/statistics-probability/designing-studies/sampling-methods-stats/a/sampling-methods-review.

## Descriptive statistics

We will begin our discussion of descriptive statistics with **univariate statistics**; univariate simply means that these statistics are calculated from one (**uni**) variable. Everything in this section can be extended to the whole dataset, but the statistics will be calculated per variable we are recording (meaning that if we had 100 observations of speed and distance pairs, we could calculate the averages across the dataset, which would give us the average speed and average distance statistics).

Descriptive statistics are used to describe and/or summarize the data we are working with. We can start our summarization of the data with a measure of **central tendency**, which describes where most of the data is centered, and a measure of **spread** or **dispersion**, which indicates how far apart values are.

### Measures of central tendency

Measures of central tendency describe the center of our distribution of data. There are three common statistics that are used as measures of center: mean, median, and mode. Each has its own strengths, depending on the data we are working with.

#### Mean

Perhaps the most common statistic for summarizing data is the average, or **mean**. The population mean is denoted by *μ* (the Greek letter *mu*), and the sample mean is written as $\bar{x}$ (pronounced *X-bar*). The sample mean is calculated by summing all the values and dividing by the count of values; for example, the mean of the numbers 0, 1, 1, 2, and 9 is 2.6 (`(0 + 1 + 1 + 2 + 9)/5`):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

We use $x_i$ to represent the *i*th observation of the variable *X*. Note how the variable as a whole is represented with a capital letter, while the specific observation is lowercase. *Σ* (the Greek capital letter *sigma*) is used to represent a summation, which, in the equation for the mean, goes from *1* to *n*, where *n* is the number of observations.

One important thing to note about the mean is that it is very sensitive to **outliers** (values created by a different generative process than our distribution). In the previous example, we were dealing with only five values; nevertheless, the 9 is much larger than the other numbers and pulled the mean higher than all but the 9. In cases where we suspect outliers to be present in our data, we may want to instead use the median as our measure of central tendency.

#### Median

Unlike the mean, the **median** is robust to outliers. Consider income in the US; the top 1% is much higher than the rest of the population, so this will skew the mean to be higher and distort the perception of the average person's income. However, the median will be more representative of the average income because it is the 50th percentile of our data; this means that 50% of the values are greater than the median and 50% are less than the median.

Tip

The *i*th percentile is the value at which *i*% of the observations are less than that value, so the 99th percentile is the value in *X* where 99% of the *x*'s are less than it.

The median is calculated by taking the middle value from an ordered list of values; in cases where we have an even number of values, we take the mean of the middle two values. If we take the numbers 0, 1, 1, 2, and 9 again, our median is 1. Notice that the mean and median for this dataset are different; however, depending on the distribution of the data, they may be the same.

#### Mode

The **mode** is the most common value in the data (if we, once again, have the numbers 0, 1, 1, 2, and 9, then 1 is the mode). In practice, we will often hear things such as *the distribution is bimodal or multimodal* (as opposed to unimodal) in cases where the distribution has two or more most popular values. This doesn't necessarily mean that each of them occurred the same number of times, but rather, they are more common than the other values by a significant amount. As shown in the following plots, a unimodal distribution has only one mode (at **0**), a bimodal distribution has two (at **-2** and **3**), and a multimodal distribution has many (at **-2**, **0.4**, and **3**):

Understanding the concept of the mode comes in handy when describing continuous distributions; however, most of the time when we're describing our continuous data, we will use either the mean or the median as our measure of central tendency. When working with categorical data, on the other hand, we will typically use the mode.
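As a quick check of the three measures of center on the running example, here is a minimal sketch using Python's built-in `statistics` module. The second list, `with_bigger_outlier`, is a hypothetical variation added only to show how the mean reacts to an outlier while the median does not.

```python
import statistics

values = [0, 1, 1, 2, 9]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most common value

# Hypothetical variation: make the outlier ten times larger
with_bigger_outlier = [0, 1, 1, 2, 90]
shifted_mean = statistics.mean(with_bigger_outlier)      # pulled toward the outlier
shifted_median = statistics.median(with_bigger_outlier)  # unchanged
```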

### Measures of spread

Knowing where the center of the distribution is only gets us partially to being able to summarize the distribution of our data—we need to know how values fall around the center and how far apart they are. Measures of spread tell us how the data is dispersed; this will indicate how thin (low dispersion) or wide (very spread out) our distribution is. As with measures of central tendency, we have several ways to describe the spread of a distribution, and which one we choose will depend on the situation and the data.

#### Range

The **range** is the distance between the smallest value (**minimum**) and the largest value (**maximum**). The units of the range will be the same units as our data. Therefore, unless two distributions of data are in the same units and measuring the same thing, we can't compare their ranges and say one is more dispersed than the other:

$$range = \max(X) - \min(X)$$

Just from the definition of the range, we can see why it wouldn't always be the best way to measure the spread of our data. It gives us upper and lower bounds on what we have in the data; however, if we have any outliers in our data, the range will be rendered useless.

Another problem with the range is that it doesn't tell us how the data is dispersed around its center; it really only tells us how dispersed the entire dataset is. This brings us to the variance.

#### Variance

The **variance** describes how far apart observations are spread out from their average value (the mean). The population variance is denoted as $\sigma^2$ (pronounced *sigma-squared*), and the sample variance is written as $s^2$. It is calculated as the average squared distance from the mean. Note that the distances must be squared so that distances below the mean don't cancel out those above the mean.

If we want the sample variance to be an unbiased estimator of the population variance, we divide by *n - 1* instead of *n* to account for using the sample mean instead of the population mean; this is called Bessel's correction (https://en.wikipedia.org/wiki/Bessel%27s_correction). Most statistical tools will give us the sample variance by default, since it is *very* rare that we would have data for the entire population:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The variance gives us a statistic with *squared* units. This means that if we started with data on income in dollars (\$), then our variance would be in dollars squared (\$²). This isn't really useful when we're trying to see how this describes the data; we can use the **magnitude** (size) itself to see how spread out something is (large values = large spread), but beyond that, we need a measure of spread with units that are the same as our data. For this purpose, we use the standard deviation.

#### Standard deviation

We can use the **standard deviation** to see how far from the mean data points are *on average*. A small standard deviation means that values are close to the mean, while a large standard deviation means that values are dispersed more widely. This is tied to how we would imagine the distribution curve: the smaller the standard deviation, the thinner the peak of the curve (**0.5**); the larger the standard deviation, the wider the peak of the curve (**2**):

The standard deviation is simply the square root of the variance. By performing this operation, we get a statistic in units that we can make sense of again (\$ for our income example):

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{s^2}$$

Note that the population standard deviation is represented as *σ*, and the sample standard deviation is denoted as *s*.
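To make the difference between the sample and population versions concrete, the following sketch computes both on the same five numbers used earlier, with Python's built-in `statistics` module. The variable names are illustrative only.

```python
import math
import statistics

values = [0, 1, 1, 2, 9]

sample_var = statistics.variance(values)       # divides by n - 1 (Bessel's correction)
population_var = statistics.pvariance(values)  # divides by n
sample_std = statistics.stdev(values)          # square root of the sample variance

# The same sample variance computed by hand from the definition
mean = sum(values) / len(values)
by_hand = sum((x - mean) ** 2 for x in values) / (len(values) - 1)
```

Because the sample version divides by a smaller number, it is always at least as large as the population version computed from the same data.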

#### Coefficient of variation

When we moved from variance to standard deviation, we were looking to get to units that made sense; however, if we then want to compare the level of dispersion of one dataset to another, we would need to have the same units once again. One way around this is to calculate the **coefficient of variation** (**CV**), which is unitless. The CV is the ratio of the standard deviation to the mean:

$$CV = \frac{s}{\bar{x}}$$

We will use this metric in *Chapter 7*, *Financial Analysis – Bitcoin and the Stock Market*; since the CV is unitless, we can use it to compare the volatility of different assets.
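The sketch below illustrates why a unitless measure helps. Both price lists are hypothetical and were chosen so that the two assets have the same standard deviation but very different means; `coefficient_of_variation` is a helper written for this example, not a library function.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical prices for two assets quoted on very different scales
asset_a = [100, 102, 98, 101, 99]
asset_b = [10.0, 12.0, 8.0, 11.0, 9.0]

std_a = statistics.stdev(asset_a)
std_b = statistics.stdev(asset_b)
cv_a = coefficient_of_variation(asset_a)
cv_b = coefficient_of_variation(asset_b)
```

The standard deviations match, yet the CV of `asset_b` is ten times larger because the same absolute swings are much bigger relative to its price.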

#### Interquartile range

So far, other than the range, we have discussed mean-based measures of dispersion; now, we will look at how we can describe the spread with the median as our measure of central tendency. As mentioned earlier, the median is the 50th percentile or the 2nd **quartile** (Q2). Percentiles and quartiles are both **quantiles**—values that divide data into equal groups each containing the same percentage of the total data. Percentiles divide the data into 100 parts, while quartiles do so into four (25%, 50%, 75%, and 100%).

Since quantiles neatly divide up our data, and we know how much of the data goes in each section, they are a perfect candidate for helping us quantify the spread of our data. One common measure for this is the **interquartile range** (**IQR**), which is the distance between the 3rd and 1st quartiles:

$$IQR = Q_3 - Q_1$$

The IQR gives us the spread of data around the median *and* quantifies how much dispersion we have in the middle 50% of our distribution. It can also be useful when checking the data for outliers, which we will cover in *Chapter 8*, *Rule-Based Anomaly Detection*. In addition, the IQR can be used to calculate a unitless measure of dispersion, which we will discuss next.

#### Quartile coefficient of dispersion

Just like we had the coefficient of variation when using the mean as our measure of central tendency, we have the **quartile coefficient of dispersion** when using the median as our measure of center. This statistic is also unitless, so it can be used to compare datasets. It is calculated by dividing the **semi-quartile range** (half the IQR) by the **midhinge** (midpoint between the first and third quartiles):

$$QCD = \frac{(Q_3 - Q_1) / 2}{(Q_1 + Q_3) / 2} = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

We will see this metric again in *Chapter 7*, *Financial Analysis – Bitcoin and the Stock Market*, when we assess stock volatility. For now, let's take a look at how we can use measures of central tendency and dispersion to summarize our data.
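Here is a minimal sketch of the quartile-based measures using `statistics.quantiles` (available in Python 3.8 and later). Quartiles can be computed under several conventions that give slightly different cut points; the choice of `method="inclusive"` here is an assumption made for the example.

```python
import statistics

values = [0, 1, 1, 2, 9]

# n=4 asks for the three cut points that split the data into quartiles
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

iqr = q3 - q1                         # spread of the middle 50% of the data
semi_quartile_range = iqr / 2
midhinge = (q1 + q3) / 2
qcd = semi_quartile_range / midhinge  # same as (q3 - q1) / (q3 + q1)
```

Note that `q2` is the median, and that the outlier 9 has no effect on the IQR.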

### Summarizing data

We have seen many examples of descriptive statistics that we can use to summarize our data by its center and dispersion; in practice, looking at the **5-number summary** and visualizing the distribution prove to be helpful first steps before diving into some of the other aforementioned metrics. The 5-number summary, as its name indicates, provides five descriptive statistics that summarize our data:

- Minimum
- First quartile (Q1), the 25th percentile
- Median (Q2), the 50th percentile
- Third quartile (Q3), the 75th percentile
- Maximum

A **box plot** (or box and whisker plot) is a visual representation of the 5-number summary. The median is denoted by a thick line in the box. The top of the box is Q3 and the bottom of the box is Q1. Lines (whiskers) extend from both sides of the box boundaries toward the minimum and maximum. Based on the convention our plotting tool uses, though, they may only extend to a certain statistic; any values beyond these statistics are marked as outliers (using points). For this book in general, the lower bound of the whiskers will be **Q1 - 1.5 × IQR** and the upper bound will be **Q3 + 1.5 × IQR**, which is called the **Tukey box plot**.
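The numbers behind a Tukey box plot can be computed directly. The sketch below is illustrative rather than how any particular plotting library does it, and it assumes the `"inclusive"` quartile convention of `statistics.quantiles`.

```python
import statistics

values = [0, 1, 1, 2, 9]

q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

five_number_summary = (min(values), q1, median, q3, max(values))

# Tukey's rule: points beyond 1.5 * IQR from the box are drawn as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_bound or v > upper_bound]
```

With this data, only the 9 falls outside the bounds, so it would be drawn as a point beyond the upper whisker.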

While the box plot is a great tool for getting an initial understanding of the distribution, we don't get to see how things are distributed inside each of the quartiles. For this purpose, we turn to **histograms** for **discrete** variables (for instance, the number of people or books) and **kernel density estimates** (**KDEs**) for **continuous** variables (for instance, heights or time). There is nothing stopping us from using KDEs on discrete variables, but it is easy to confuse people that way. Histograms work for both discrete and continuous variables; however, in both cases, we must keep in mind that the number of bins we choose to divide the data into can easily change the shape of the distribution we see.

To make a histogram, a certain number of equal-width bins are created, and then bars with heights for the number of values we have in each bin are added. The following plot is a histogram with 10 bins, showing the three measures of central tendency for the same data that was used to generate the box plot in *Figure 1.6*:

Important note

In practice, we need to play around with the number of bins to find the best value. However, we have to be careful as this can misrepresent the shape of the distribution.
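To see how the bin count changes the picture, here is a small sketch that counts values in equal-width bins. `histogram_counts` is a helper written for this example (plotting libraries provide their own binning), the data is hypothetical, and the helper assumes the data contains at least two distinct values.

```python
def histogram_counts(values, bins):
    """Count how many values fall in each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins  # assumes high > low
    counts = [0] * bins
    for v in values:
        # The maximum value belongs in the last bin rather than a new one
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

values = [1, 2, 2, 3, 3, 3, 7, 8, 8, 9]  # hypothetical data with a gap in the middle

two_bins = histogram_counts(values, bins=2)
four_bins = histogram_counts(values, bins=4)
```

With two bins, the data looks like two adjacent lumps; with four bins, an empty bin appears and reveals the gap between the two clusters.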

KDEs are similar to histograms, except rather than creating bins for the data, they draw a smoothed curve, which is an estimate of the distribution's **probability density function** (**PDF**). The PDF is for continuous variables and tells us how probability is distributed over the values. Higher values for the PDF indicate higher likelihoods:

When the distribution starts to get a little lopsided with long tails on one side, the mean measure of center can easily get pulled to that side. Distributions that aren't symmetric have some **skew** to them. A **left (negative) skewed distribution** has a long tail on the left-hand side; a **right (positive) skewed distribution** has a long tail on the right-hand side. In the presence of negative skew, the mean will be less than the median, while the reverse happens with a positive skew. When there is no skew, both will be equal:

Important note

There is also another statistic called **kurtosis**, which compares the density of the center of the distribution with the density at the tails. Both skewness and kurtosis can be calculated with the SciPy package.

Each column in our data is a **random variable**, because every time we observe it, we get a value according to the underlying distribution; it's not static. When we are interested in the probability of getting a value of *x* or less, we use the **cumulative distribution function** (**CDF**), which is the integral (area under the curve) of the PDF:

$$CDF = F(x) = \int_{-\infty}^{x} f(t)\,dt, \quad \text{where } f(t) \text{ is the PDF and } \int_{-\infty}^{\infty} f(t)\,dt = 1$$

The probability of the random variable *X* being less than or equal to the specific value of *x* is denoted as *P(X ≤ x)*. With a continuous variable, the probability of getting exactly *x* is 0. This is because the probability will be the integral of the PDF from *x* to *x* (area under a curve with zero width), which is 0:

$$P(X = x) = \int_{x}^{x} f(t)\,dt = 0$$

In order to visualize this, we can find an estimate of the CDF from the sample, called the **empirical cumulative distribution function** (**ECDF**). Since this is cumulative, at the point where the value on the *x*-axis is equal to *x*, the *y* value is the cumulative probability of *P(X ≤ x)*. Let's visualize **P(X ≤ 50)**, **P(X = 50)**, and **P(X > 50)** as an example:
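The ECDF itself is simple to compute: it is the fraction of observations at or below a given value. The sketch below uses a hypothetical sample, and `ecdf` is a helper defined for this example.

```python
def ecdf(sample, x):
    """Fraction of observations in the sample that are less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)

sample = [12, 35, 47, 49, 52, 63, 71, 88, 90, 95]  # hypothetical observations

p_le_50 = ecdf(sample, 50)  # estimate of P(X <= 50)
p_gt_50 = 1 - p_le_50       # estimate of P(X > 50), the complement
```

Since the two probabilities are complements, they always sum to 1.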

In addition to examining the distribution of our data, we may find the need to utilize probability distributions for uses such as simulation (discussed in *Chapter 8*, *Rule-Based Anomaly Detection*) or hypothesis testing (see the *Inferential statistics* section); let's take a look at a few distributions that we are likely to come across.

### Common distributions

While there are many probability distributions, each with specific use cases, there are some that we will come across often. The **Gaussian**, or **normal**, looks like a bell curve and is parameterized by its mean (*μ*) and standard deviation (*σ*). The **standard normal** (*Z*) has a mean of 0 and a standard deviation of 1. Many things in nature happen to follow the normal distribution, such as heights. Note that testing whether a distribution is normal is not trivial—check the *Further reading* section for more information.

The **Poisson distribution** is a discrete distribution that is often used to model arrivals. The time between arrivals can be modeled with the **exponential distribution**. Both are parameterized by lambda (*λ*): it is the mean of the Poisson distribution (the average number of arrivals per interval), while the exponential distribution has a mean of *1/λ* (the average time between arrivals). The **uniform distribution** places equal likelihood on each value within its bounds. We often use this for random number generation. When we generate a random number to simulate a single success/failure outcome, it is called a **Bernoulli trial**. This is parameterized by the probability of success (*p*). When we run the same experiment multiple times (*n*), the total number of successes is then a **binomial** random variable. Both the Bernoulli and binomial distributions are discrete.

We can visualize both discrete and continuous distributions; however, discrete distributions give us a **probability mass function** (**PMF**) instead of a PDF:

We will use some of these distributions in *Chapter 8*, *Rule-Based Anomaly Detection*, when we simulate some login attempt data for anomaly detection.
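As a rough illustration of how these distributions relate, the sketch below draws from each using only Python's `random` module. The helpers `bernoulli`, `binomial`, and `poisson` are written for this example (they are not library functions), and the parameter values are arbitrary.

```python
import random

rng = random.Random(42)  # seeded generator so the draws are repeatable

normal_draw = rng.gauss(0, 1)    # standard normal: mean 0, standard deviation 1
uniform_draw = rng.uniform(0, 1) # equal likelihood anywhere between the bounds
wait_time = rng.expovariate(2)   # exponential with rate 2, so the mean wait is 1/2

def bernoulli(p):
    """One success/failure trial with probability of success p."""
    return 1 if rng.random() < p else 0

def binomial(n, p):
    """Number of successes in n independent Bernoulli trials."""
    return sum(bernoulli(p) for _ in range(n))

def poisson(lam):
    """Count arrivals in one unit of time when gaps between them are exponential."""
    count, elapsed = 0, rng.expovariate(lam)
    while elapsed <= 1:
        count += 1
        elapsed += rng.expovariate(lam)
    return count

coin_flip = bernoulli(0.5)
successes = binomial(10, 0.5)
arrivals = poisson(3)
```

The `poisson` helper makes the link between the Poisson and exponential distributions explicit: counting exponentially spaced arrivals within a fixed interval yields a Poisson count.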

### Scaling data

In order to compare variables from different distributions, we would have to **scale** the data, which we could do with the range by using **min-max scaling**. We take *each* data point, subtract the minimum of the dataset, then divide by the range. This **normalizes** our data (scales it to the range [0, 1]):

$$x_{scaled} = \frac{x - \min(X)}{range(X)}$$

This isn't the only way to scale data; we can also use the mean and standard deviation. In this case, we would subtract the mean from each observation and then divide by the standard deviation to **standardize** the data. This gives us what is known as a **Z-score**:

$$z_i = \frac{x_i - \bar{x}}{s}$$

We are left with a standardized distribution with a mean of 0 and a standard deviation (and variance) of 1. The Z-score tells us how many standard deviations from the mean each observation is; the mean has a Z-score of 0, while an observation of 0.5 standard deviations below the mean will have a Z-score of -0.5.
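Both scaling approaches take only a few lines. The following sketch applies them to the five numbers used earlier; `min_max_scale` and `z_scores` are helper names chosen for this example.

```python
import statistics

def min_max_scale(values):
    """Subtract the minimum, then divide by the range; results lie in [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def z_scores(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

values = [0, 1, 1, 2, 9]

normalized = min_max_scale(values)
standardized = z_scores(values)
```

After min-max scaling, the minimum maps to 0 and the maximum to 1; after standardizing, the values have a mean of 0 and a standard deviation of 1.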

There are, of course, additional ways to scale our data, and the one we end up choosing will be dependent on our data and what we are trying to do with it. By keeping the measures of central tendency and measures of dispersion in mind, you will be able to identify how the scaling of data is being done in any other methods you come across.

### Quantifying relationships between variables

In the previous sections, we were dealing with univariate statistics and were only able to say something about the variable we were looking at. With multivariate statistics, we seek to quantify relationships between variables and attempt to make predictions for future behavior.

The **covariance** is a statistic for quantifying the relationship between variables by showing how one variable changes with respect to another (also referred to as their joint variance):

$$cov(X, Y) = E[(X - E[X])(Y - E[Y])]$$

Important note

*E[X]* is a new notation for us. It is read as *the expected value of X* or *the expectation of X*, and it is calculated by summing all the possible values of *X* multiplied by their probability—it's the long-run average of *X*.

The magnitude of the covariance isn't easy to interpret, but its sign tells us whether the variables are positively or negatively correlated. However, we would also like to quantify how *strong* the relationship is between the variables, which brings us to correlation. **Correlation** tells us how variables change together both in direction (same or opposite) and magnitude (strength of the relationship). To find the correlation, we calculate the **Pearson correlation coefficient**, symbolized by *ρ* (the Greek letter *rho*), by dividing the covariance by the product of the standard deviations of the variables:

$$\rho_{X, Y} = \frac{cov(X, Y)}{s_X s_Y}$$

This normalizes the covariance and results in a statistic bounded between -1 and 1, making it easy to describe both the direction of the correlation (sign) and the strength of it (magnitude). Correlations of 1 are said to be perfect positive (linear) correlations, while those of -1 are perfect negative correlations. Coefficients near 0 indicate that the variables aren't linearly correlated. If correlation coefficients are near 1 in absolute value, then the variables are said to be strongly correlated; those closer to 0.5 are said to be weakly correlated.
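Here is a minimal sketch of both statistics written from their definitions, using the sample (n - 1) versions. The helper names and the small datasets are hypothetical; `y_curve` is included to show that a clearly non-linear relationship can still produce a coefficient close to 1.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # perfectly linear, increasing
y_down = [10, 8, 6, 4, 2]    # perfectly linear, decreasing
y_curve = [1, 4, 9, 16, 25]  # quadratic, not linear

rho_up = pearson(x, y_up)
rho_down = pearson(x, y_down)
rho_curve = pearson(x, y_curve)
```

The covariance of a variable with itself is its variance, which is why `covariance(xs, xs)` supplies the squared standard deviation in `pearson`.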

Let's look at some examples using scatter plots. In the leftmost subplot of *Figure 1.12* (**ρ = 0.11**), we see that there is no correlation between the variables: they appear to be random noise with no pattern. The next plot with **ρ = -0.52** has a weak negative correlation: we can see that the variables appear to move together with the *x* variable increasing, while the *y* variable decreases, but there is still a bit of randomness. In the third plot from the left (**ρ = 0.87**), there is a strong positive correlation: *x* and *y* are increasing together. The rightmost plot with **ρ = -0.99** has a near-perfect negative correlation: as *x* increases, *y* decreases. We can also see how the points form a line:

To quickly eyeball the strength and direction of the relationship between two variables (and see whether there even seems to be one), we will often use scatter plots rather than calculating the exact correlation coefficient. This is for a couple of reasons:

- It's easier to find patterns in visualizations, but it's more work to arrive at the same conclusion by looking at numbers and tables.
- We might see that the variables seem related, but they may not be *linearly* related. Looking at a visual representation will make it easy to see if our data is actually quadratic, exponential, logarithmic, or some other non-linear function.

Both of the following plots depict data with strong positive correlations, but it's pretty obvious when looking at the scatter plots that these are not linear. The one on the left is logarithmic, while the one on the right is exponential:

It's very important to remember that while we may find a correlation between *X* and *Y*, it doesn't mean that *X causes Y* or that *Y causes X*. There could be some *Z* that actually causes both; perhaps *X* causes some intermediary event that causes *Y*, or it is actually just a coincidence. Keep in mind that we often don't have enough information to report causation—*correlation does not imply causation*.

Tip

Be sure to check out Tyler Vigen's *Spurious Correlations* blog (https://www.tylervigen.com/spurious-correlations) for some interesting correlations.

### Pitfalls of summary statistics

There is a very interesting dataset illustrating how careful we must be when only using summary statistics and correlation coefficients to describe our data. It also shows us that plotting is not optional. **Anscombe's quartet** is a collection of four different datasets that have identical summary statistics and correlation coefficients, but when plotted, it is obvious they are not similar:

Notice that each of the plots in *Figure 1.14* has an identical best-fit line defined by the equation **y = 0.50x + 3.00**. In the next section, we will discuss, at a high level, how this line is created and what it means.

Important note

Summary statistics are very helpful when we're getting to know the data, but be wary of relying exclusively on them. Remember, statistics can be misleading; be sure to also plot the data before drawing any conclusions or proceeding with the analysis. You can read more about Anscombe's quartet at https://en.wikipedia.org/wiki/Anscombe%27s_quartet. Also, be sure to check out the **Datasaurus Dozen**, which are 13 datasets that also have the same summary statistics, at https://www.autodeskresearch.com/publications/samestats.

## Prediction and forecasting

Say our favorite ice cream shop has asked us to help predict how many ice creams they can expect to sell on a given day. They are convinced that the temperature outside has a strong influence on their sales, so they have collected data on the number of ice creams sold at a given temperature. We agree to help them, and the first thing we do is make a scatter plot of the data they collected:

We can observe an upward trend in the scatter plot: more ice creams are sold at higher temperatures. In order to help out the ice cream shop, though, we need to find a way to make predictions from this data. We can use a technique called **regression** to model the relationship between temperature and ice cream sales with an equation. Using this equation, we will be able to **predict** ice cream sales at a given temperature.

Important note

Remember that correlation does not imply causation. People may buy ice cream when it is warmer, but warmer temperatures don't necessarily cause people to buy ice cream.

In *Chapter 9*, *Getting Started with Machine Learning in Python*, we will go over regression in depth, so this discussion will be a high-level overview. There are many types of regression that will yield a different type of equation, such as linear (which we will use for this example) and logistic. Our first step will be to identify the **dependent variable**, which is the quantity we want to predict (ice cream sales), and the variables we will use to predict it, which are called **independent variables**. While we can have many independent variables, our ice cream sales example only has one: temperature. Therefore, we will use simple linear regression to model the relationship as a line:

The regression line in the previous scatter plot yields the following equation for the relationship:

$$\text{ice cream sales} = 1.50 \times \text{temperature} - 27.96$$

Suppose that today the temperature is 35°C—we would plug that in for *temperature* in the equation. The result predicts that the ice cream shop will sell 24.54 ice creams. This prediction is along the red line in the previous plot. Note that the ice cream shop can't actually sell fractions of ice cream.

Before leaving the model in the hands of the ice cream shop, it's important to discuss the difference between the dotted and solid portions of the regression line that we obtained. When we make predictions using the solid portion of the line, we are using **interpolation**, meaning that we will be predicting ice cream sales for temperatures the regression was created on. On the other hand, if we try to predict how many ice creams will be sold at 45°C, it is called **extrapolation** (the dotted portion of the line), since we didn't have any temperatures this high when we ran the regression. Extrapolation can be very dangerous as many trends don't continue indefinitely. People may decide not to leave their houses because it is so hot. This means that instead of selling the predicted 39.54 ice creams, they would sell zero.
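A simple linear regression can be fit with ordinary least squares in a few lines. The sketch below is only an illustration: the `temperatures` and `sales` observations are hypothetical (they are not the ice cream shop's data), and `fit_line`, `predict`, and `is_extrapolation` are helper names chosen for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single independent variable."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical (temperature in °C, ice creams sold) observations
temperatures = [20, 24, 28, 32, 36, 40]
sales = [2, 9, 13, 21, 26, 32]

slope, intercept = fit_line(temperatures, sales)

def predict(temperature):
    """Predicted sales from the fitted line."""
    return slope * temperature + intercept

def is_extrapolation(temperature):
    """True when the temperature lies outside the range the line was fit on."""
    return not (min(temperatures) <= temperature <= max(temperatures))
```

Checking `is_extrapolation` before trusting a prediction is one simple guard against the problem described above: `predict(45)` still returns a number, but nothing in the data supports it.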

When working with time series, our terminology is a little different: we often look to **forecast** future values based on past values. Forecasting is a type of prediction for time series. Before we try to model the time series, however, we will often use a process called **time series decomposition** to split the time series into components, which can be combined in an additive or multiplicative fashion and may be used as parts of a model.

The **trend** component describes the behavior of the time series in the **long term** without accounting for seasonal or cyclical effects. Using the trend, we can make broad statements about the time series in the long run, such as *the population of Earth is increasing* or *the value of a stock is stagnating*. The **seasonality** component explains the systematic and calendar-related movements of a time series. For example, the number of ice cream trucks on the streets of New York City is high in the summer and drops to nothing in the winter; this pattern repeats every year, regardless of whether the actual amount each summer is the same. Lastly, the **cyclical** component accounts for anything else unexplained or irregular with the time series; this could be something such as a hurricane driving the number of ice cream trucks down in the **short term** because it isn't safe to be outside. This component is difficult to anticipate with a forecast due to its unexpected nature.

We can use Python to **decompose** the time series into trend, seasonality, and **noise** or **residuals**. The cyclical component is captured in the noise (random, unpredictable data); after we remove the trend and seasonality from the time series, what we are left with is the residual:

When building models to forecast time series, some common methods include exponential smoothing and ARIMA-family models. **ARIMA** stands for **autoregressive** (**AR**), **integrated** (**I**), **moving average** (**MA**). **Autoregressive** models take advantage of the fact that an observation at time *t* is *correlated* with a previous observation, for example, at time *t - 1*. In *Chapter 5*, *Visualizing Data with Pandas and Matplotlib*, we will look at some techniques for determining whether a time series is autoregressive; note that not all time series are. The **integrated** component concerns the **differenced** data, or the *change* in the data from one time to another. For example, if we were concerned with a **lag** (distance between times) of 1, the differenced data would be the value at time *t* minus the value at time *t - 1*. Lastly, the **moving average** component uses a sliding window to average the last *x* observations, where *x* is the length of the sliding window. If, for example, we have a 3-period moving average, by the time we have all of the data up to time 5, our moving average calculation only uses time periods 3, 4, and 5 to forecast time 6. We will build an ARIMA model in *Chapter 7*, *Financial Analysis – Bitcoin and the Stock Market*.
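The differencing and sliding-window calculations described above can be sketched as follows. The series is hypothetical, and the two helpers only illustrate the arithmetic; they are not an ARIMA implementation.

```python
def difference(series, lag=1):
    """Change in the series: the value at time t minus the value at time t - lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

series = [10, 12, 11, 15, 14]  # hypothetical values at times 1 through 5

diffs = difference(series)                               # one fewer value than the series
forecast_6 = moving_average_forecast(series, window=3)   # uses times 3, 4, and 5 only
```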

The moving average puts equal weight on each time period in the past involved in the calculation. In practice, this isn't always a realistic expectation of our data. Sometimes, *all* past values are important, but they *vary* in their influence on future data points. For these cases, we can use **exponential smoothing**, which allows us to put more weight on more recent values and less weight on values further away from what we are predicting.
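As a rough illustration, simple exponential smoothing can be written as a recursive update in which each smoothed value blends the newest observation with the previous smoothed value. The function names, the `observations` values, and the choice of `alpha=0.5` are assumptions made for this sketch:

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing; a larger alpha favors recent values."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    smoothed = [values[0]]  # start from the first observation
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def observation_weights(alpha, steps):
    """Weight given to the observation k steps back, for k = 0..steps - 1."""
    return [alpha * (1 - alpha) ** k for k in range(steps)]


# Hypothetical observations at times 1 through 5
observations = [12, 15, 14, 18, 21]

smoothed = exponential_smoothing(observations, alpha=0.5)
# smoothed == [12, 13.5, 13.75, 15.875, 18.4375]

weights = observation_weights(alpha=0.5, steps=3)
# weights == [0.5, 0.25, 0.125]: the most recent value counts the most
```

Unlike the equally weighted moving average, every past value contributes here, but its influence shrinks geometrically the further back it lies.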

Note that we aren't limited to predicting numbers; in fact, depending on the data, our predictions could be categorical in nature—things such as determining which flavor of ice cream will sell the most on a given day or whether an email is spam or not. This type of prediction will be introduced in *Chapter 9*, *Getting Started with Machine Learning in Python*.

## Inferential statistics

As mentioned earlier, inferential statistics deals with inferring or deducing things from the sample data we have in order to make statements about the population as a whole. When we're looking to state our conclusions, we have to be mindful of whether we conducted an observational study or an experiment. With an **observational study**, the independent variable is not under the control of the researchers, and so we are *observing* those taking part in our study (think about studies on smoking—we can't force people to smoke). The fact that we can't control the independent variable means that we *cannot* conclude causation.

With an **experiment**, we are able to directly influence the independent variable and randomly assign subjects to the control and test groups, as is done in A/B tests (for anything from website redesigns to ad copy). Note that the control group doesn't receive treatment; they can be given a placebo (depending on what the study is). The ideal setup for this is **double-blind**, where neither the subjects nor the researchers administering the treatment know which treatment is the placebo or which subject belongs to which group.
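Random assignment itself is simple to sketch. The following is a minimal illustration using the standard library; the function name `assign_groups` and the subject labels are hypothetical, and a real experiment would typically need more care (for example, balancing groups on known characteristics):

```python
import random


def assign_groups(subjects, seed=None):
    """Randomly split subjects into control and test groups of similar size."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(subjects)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "test": shuffled[half:]}


# Hypothetical subject identifiers
subjects = [f"subject_{i}" for i in range(10)]
groups = assign_groups(subjects, seed=0)
```

Because the shuffle decides who lands in which group, the researchers' own preferences cannot influence the assignment.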

Important note

We will often come across references to Bayesian inference and frequentist inference. These are based on two different ways of approaching probability. Frequentist statistics focuses on the frequency of the event, while Bayesian statistics uses a degree of belief when determining the probability of an event. We will see an example of Bayesian statistics in *Chapter 11*, *Machine Learning Anomaly Detection*. You can read more about how these methods differ at https://www.probabilisticworld.com/frequentist-bayesian-approaches-inferential-statistics/.

Inferential statistics gives us tools to translate our understanding of the sample data to a statement about the population. Remember that the sample statistics we discussed earlier are estimators for the population parameters. Our estimators need **confidence intervals**, which provide a point estimate and a margin of error around it. The resulting interval is the range in which we expect the true population parameter to lie at a given **confidence level**. At the 95% confidence level, 95% of the confidence intervals that are calculated from random samples of the population contain the true population parameter. Frequently, 95% is chosen for the confidence level and other purposes in statistics, although 90% and 99% are also common; the higher the confidence level, the wider the interval.
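The interpretation of the confidence level can be checked with a small simulation. The sketch below builds an interval for the mean using a normal approximation (for small samples, a t-distribution would usually be more appropriate) and counts how often intervals from repeated random samples contain the true mean. The function names, the population parameters, and the seeds are all assumptions made for illustration:

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Point estimate plus/minus margin of error (normal approximation)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    point_estimate = mean(sample)
    margin_of_error = z * stdev(sample) / sqrt(len(sample))
    return point_estimate - margin_of_error, point_estimate + margin_of_error


def simulate_coverage(true_mean=10.0, true_sd=2.0, n=50, trials=1000,
                      confidence=0.95, seed=0):
    """Fraction of intervals from random samples that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_confidence_interval(sample, confidence)
        hits += low <= true_mean <= high
    return hits / trials


coverage_rate = simulate_coverage()  # should land close to 0.95

# The same sample gives a wider interval at a higher confidence level
rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(50)]
ci_95 = mean_confidence_interval(sample, confidence=0.95)
ci_99 = mean_confidence_interval(sample, confidence=0.99)
```

Any single interval either contains the true mean or it doesn't; the 95% describes how often the procedure succeeds across many samples.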

Hypothesis tests allow us to test whether the true population parameter is less than, greater than, or not equal to some value at a certain **significance level** (called **alpha**). The process of performing a hypothesis test starts with stating our initial assumption or **null hypothesis**: for example, *the true population mean is 0*. We pick a level of statistical significance, usually 5%, which is the probability of rejecting the null hypothesis when it is true. Then, we calculate the critical value for the test statistic, which will depend on the amount of data we have and the type of statistic (such as the mean of one population or the proportion of votes for a candidate) we are testing. The critical value is compared to the test statistic from our data and, based on the result, we either reject or fail to reject the null hypothesis. Hypothesis tests are closely related to confidence intervals. The significance level is equivalent to 1 minus the confidence level. This means that a result is statistically significant if the null hypothesis value is not in the confidence interval.
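The steps above can be sketched as a two-sided test of a population mean. This is only an illustration under stated assumptions: it uses a normal approximation (a z-test) even though the sample is small, and the function name `z_test_mean` and the `measurements` values are hypothetical:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_test_mean(sample, null_mean=0.0, alpha=0.05):
    """Two-sided test of 'the true population mean equals null_mean'."""
    standard_error = stdev(sample) / sqrt(len(sample))
    test_statistic = (mean(sample) - null_mean) / standard_error

    # Critical value for a two-sided test at significance level alpha
    critical_value = NormalDist().inv_cdf(1 - alpha / 2)
    reject_null = abs(test_statistic) > critical_value

    # Matching confidence interval at the (1 - alpha) confidence level
    margin = critical_value * standard_error
    interval = (mean(sample) - margin, mean(sample) + margin)
    return test_statistic, critical_value, reject_null, interval


# Hypothetical measurements
measurements = [0.8, 1.3, -0.2, 0.9, 1.1, 0.4, 1.6, 0.7, 1.0, 0.5]

# Null hypothesis: the true population mean is 0
statistic, critical, reject, (low, high) = z_test_mean(measurements, 0.0)

# Null hypothesis: the true population mean is 0.8
_, _, reject_near, _ = z_test_mean(measurements, 0.8)
```

For this data, the null value of 0 falls outside the 95% interval, so the null hypothesis is rejected at the 5% level, while a null value of 0.8 lies inside the interval and we fail to reject it.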

Important note

There are many things we have to be aware of when picking the method to calculate a confidence interval or the proper test statistic for a hypothesis test. This is beyond the scope of this book, but check out the link in the *Further reading* section at the end of this chapter for more information. Also, be sure to look at some of the mishaps with the p-values used in hypothesis testing, such as p-hacking, at https://en.wikipedia.org/wiki/Misuse_of_p-values.

Now that we have an overview of statistics and data analysis, we are ready to get started with the Python portion of this book. Let's start by setting up a virtual environment.