# Understanding the mathematical basis for statistical analysis and ML modeling

Looking at what we have learned so far, it becomes abundantly clear that ML requires a solid understanding of mathematics. We have already come across multiple mathematical functions we have to handle: think of the activation functions of neurons and the optimizer and loss functions used for training. On top of that, we have not yet talked about the second aspect of our new programming paradigm—the data!

To choose the right ML algorithm and derive a good metric for a loss function, we have to analyze the data points we work with. In addition, we need to put these data points in relation to the domain we are working in. Therefore, when defining the role of a data scientist, you will often find a visual like this one:

In this section, we will concentrate on what is referred to in *Figure 1.5* as *statistical research*. We will understand why we need statistics and what base information we can derive from a given dataset, learn what bias is and ways to avoid that, mathematically classify possible ML algorithms, and finally, discuss how we choose useful metrics to define the performance of our trained models.

## The case for statistics in ML

As we have seen, we require statistics to clean and analyze our given data. Therefore, let's start by asking: *What do we understand from the term "statistics"?*

A typical example of applied statistics would be the prediction of election results that you see during a campaign or shortly after the voting booths close. At those points in time, we do not know the precise result for the full **population**, but we can acquire a **sample**, sometimes referred to as an **observation**, by asking people for responses through a questionnaire. Then, based on this subset, we make a sound prediction for the full population by applying statistical methods.

We learned that in ML, we are trying to let the machine figure out a mathematical function that fits our problem, such as *y = f(x)*.

Thinking back to our ANN, *x* would be an input vector and *y* would be the resulting output vector. In ML jargon, they are known under a different name, as seen next.

Features and Labels

One element of the input vector *x* is called a feature; the full output vector *y* is called the label. Often, we only deal with a **one-dimensional** label.
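As a minimal illustration with made-up numbers, features and labels are typically arranged as a feature matrix `X` (one row per sample, one column per feature) and a label vector `y`:

```python
import numpy as np

# Hypothetical dataset: each row is one sample, each column one feature
X = np.array([
    [5.1, 3.5],   # sample 1: two features
    [4.9, 3.0],   # sample 2
    [6.2, 3.4],   # sample 3
])

# One-dimensional label vector: one label per sample
y = np.array([0, 0, 1])

print(X.shape)  # (3, 2) -> 3 samples, 2 features
print(y.shape)  # (3,)   -> one label per sample
```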

Now, to bring this together, when training an ML model, we typically only have a sample of the given world, and as with any other time you are dealing with only a sample or subset of reality, you want to pick highly representative features and samples of the underlying population.

So, what does this mean? Let's think of an example. Imagine you want to train a small robot car to be able to automatically drive through a tunnel. First, we need to think about what our features and labels in this scenario are. As features, we probably need something that measures the distance from the edges of the car to the tunnel in each direction, as we probably do not want to drive into the sides of the tunnel. Let's assume we have some infrared sensors attached to the front, the sides, and the back of the vehicle. Then, the output of our program would probably control the steering and the speed of the vehicle, which would be our labels.

Given that, as a next step, we should think of a whole bunch of scenarios in which the vehicle could find itself. This might be a simple scenario of the vehicle sitting straight-facing in the tunnel, or it could be a bad scenario where the vehicle is nearly stuck in a corner and the tunnel is going left or right from that point on. In all these cases, we read out the values of our infrared sensors and then do the more complicated tasks of making an educated guess as to how the steering has to be changed and how the motor has to operate. Eventually, we end up with a bunch of example situations and corresponding actions to take, which would be our training dataset. This can then be used to train an ANN so that the small car can learn how to follow a tunnel.

If you ever get the opportunity, try to perform this training yourself. If you pick very good examples, you will understand the full power of ML, as you will most likely see something exciting, which I can attest to. In my setup, even though we never had a sample instructing the vehicle to drive backward, the function the machine learned produced outputs that made the vehicle do exactly that.

In an example such as that, we would do everything from scratch and hopefully take representative samples ourselves. In most cases you will encounter, the dataset already exists, and you need to figure out whether it is representative or whether you need to introduce additional data to achieve an optimal training result.

Therefore, let's have a look at some statistical properties you should familiarize yourself with.

## Basics of statistics

We now understand that we need to be able to analyze the statistical properties of single features, derive their distribution, and analyze their relationship with other features and labels in the dataset.

Let's start with the properties of single features and their distribution. All the following operations require numerical data. This means that if you work with categorical data or something such as media files, you need to transform them into some form of numerical representation to get such results.

The following screenshot shows the main statistical properties you are after, their importance, and how you can calculate them:
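Properties such as the mean, median, standard deviation, and range can be computed directly in Python; a minimal sketch with made-up values:

```python
import statistics

# Hypothetical values for a single numerical feature
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(statistics.mean(values))    # arithmetic mean -> 5.0
print(statistics.median(values))  # middle value of the sorted data -> 4.5
print(statistics.pstdev(values))  # population standard deviation -> 2.0
print(min(values), max(values))   # range of the data -> 2.0 9.0
```

In practice, you would usually call `describe()` on a pandas `DataFrame` to get these values for all numerical columns at once.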

From here onward, we make the reasonable assumption that the underlying stochastic process follows a **normal distribution**. Be aware that this need not be the case, and therefore you should make yourself comfortable with other distributions (see https://www.itl.nist.gov/div898/handbook/eda/section3/eda36.htm).

The following screenshot shows a visual representation of a standard normal distribution:

Now, the strength of this normal distribution is that, based on the mean *μ* and standard deviation *σ*, we can make assumptions about the probability of samples falling in a certain range. As shown in *Figure 1.7*, there is a probability of around **68.27%** for a value to have a distance from the mean of 1*σ*, **95.45%** for a distance of 2*σ*, and **99.73%** for a distance of 3*σ*. Based on this, we can ask questions such as this:

*How probable is it to find a value with a distance of 5σ from the mean?*
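The 68-95-99.7 rule above can be verified numerically: for a normal distribution, the probability of landing within *k* standard deviations of the mean is erf(*k*/√2):

```python
import math

def within_k_sigma(k: float) -> float:
    """Probability that a normally distributed value lies
    within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"{k} sigma: {within_k_sigma(k):.4%}")
# 1 sigma: 68.2689%
# 2 sigma: 95.4500%
# 3 sigma: 99.7300%
```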

Through questions such as this, we can start assessing whether what we see in our data is a statistical anomaly of the distribution, is a value that is simply false, or whether our suspected distribution is incorrect. This is done through a process called **hypothesis testing**, defined next.

Hypothesis Testing (Definition)

This is a method of testing whether the so-called null hypothesis *H₀* is false. The null hypothesis typically refers to the currently suspected distribution and states that the unlikely observation we encounter is pure chance. This hypothesis is rejected in favor of an alternative hypothesis *H₁* if the probability of the observation falls below a predefined significance level *α* (typically 5%). The alternative hypothesis thus presumes that the observation we have is due to a real effect that is not taken into account in the initial distribution.

We will not go into further details on how to perform this test properly, but we urge you to familiarize yourself with this process thoroughly.
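As a hedged sketch of the idea (made-up sample values, and a simple two-sided z-test with known standard deviation rather than a full treatment), we can compute a p-value for a sample mean under a suspected normal distribution and compare it against the significance level *α*:

```python
import math

# Suspected (null hypothesis H0) distribution: mean 100, std 15
mu, sigma = 100.0, 15.0

# Hypothetical sample of 10 measurements
sample = [108, 112, 104, 115, 99, 110, 107, 111, 105, 109]
n = len(sample)
sample_mean = sum(sample) / n

# z-statistic of the sample mean under H0
z = (sample_mean - mu) / (sigma / math.sqrt(n))

# Two-sided p-value: probability of a deviation at least this extreme
p_value = math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05  # significance level
print(f"z = {z:.2f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0 in favor of the alternative hypothesis H1")
else:
    print("The observation is compatible with H0")
```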

What we will talk about is the types of errors you can make in this process, as shown in the following screenshot:

We define the errors you see in *Figure 1.8* as follows:

- **Type I error**: This denotes that we reject the null hypothesis *H₀* and the underlying distribution, even though it is correct. This is also referred to as a **false-positive** result or an **alpha error**.
- **Type II error**: This denotes that we do not reject the null hypothesis *H₀*, even though the alternative hypothesis *H₁* is correct. This error is also referred to as a **false-negative** result or a **beta error**.

You might have heard the term *false positive* before. Often, it comes up when you take a medical test. A false positive would denote that you have a positive result from a test, even though you do not have the disease you are testing for. As a medical test is also a **stochastic process**, as with nearly everything else in our world, the term is correctly used in this scenario.

At the end of this section, when we talk about errors and metrics in ML model training, we will come back to these definitions. As a final step, let's discuss relationships among features and between features and labels. Such a relationship is referred to as a **correlation**.

There are multiple ways to calculate a correlation between two vectors *x* and *y*, but what they all have in common is that their results will fall in the range of [-1, 1]. The result of this operation can be broadly defined by the following three categories:

- **Negatively correlated**: The result leans toward -1. When the values of vector *x* rise, the values of vector *y* fall, and vice versa.
- **Uncorrelated**: The result leans toward 0. There is no real interaction between vectors *x* and *y*.
- **Positively correlated**: The result leans toward 1. When the values of vector *x* rise, the values of vector *y* rise, and vice versa.
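One common choice, the Pearson correlation coefficient, can be computed with NumPy's `corrcoef` (made-up vectors):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1   # perfectly positively correlated with x
y_neg = -x + 10     # perfectly negatively correlated with x

# corrcoef returns a 2x2 matrix; the off-diagonal entry
# is the correlation between the two vectors
print(np.corrcoef(x, y_pos)[0, 1])  # close to 1.0
print(np.corrcoef(x, y_neg)[0, 1])  # close to -1.0
```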

Through this, we can get an idea of relationships between data points, but please be aware of the differences between causation and correlation, as outlined next.

Causation versus Correlation

Even if two vectors are correlated with each other, it does not mean that one of them is the cause of the other one. Correlation alone does not establish causation, as we probably do not see the full picture and every single influencing factor; a third, unobserved variable may influence both.

The mathematical theory we discussed so far should give you a good basis to build upon. In the next section, we will have a quick look at what kinds of errors we can make when taking samples, typically referred to as the bias in the data.

## Understanding bias

At any stage of taking samples and when working with data, it is easily possible to introduce what is called **bias**. Typically, this influences the sampling quality and therefore has a big impact on any ML model we would like to fit to the data.

One example would be the *causation versus correlation* we just discussed. Seeing causation where none exists can have consequences in terms of the way you continue processing the data points. Other prominent biases that influence data are shown next:

- **Selection bias**: This bias happens when samples are taken that are not representative of the real-life distribution of the data. This is the case when randomization is not done properly or when only a certain subgroup is selected for a study—for example, when a questionnaire about city planning is only given out to people in half of the neighborhoods of the city.
- **Funding bias**: This bias happens when a study or data project is funded by a sponsor, and the results therefore tend toward the interests of the funding party.
- **Reporting bias**: This bias happens when only a selection of outcomes is represented in a dataset because people tend to underreport certain outcomes. Examples of this are reporting bad weather events but not sunshine, writing negative reviews for a product but not positive ones, or only knowing about results written in your own language or from your own region.
- **Observer bias/confirmation bias**: This bias happens when someone favors results that confirm or support their own beliefs and values. Typically, this results in ignoring contrary information, not following the agreed guidelines, or using ambiguous studies that support an existing preconceived opinion. The dangerous part is that this can happen unconsciously.
- **Exclusion bias**: This bias happens when you remove data points during preprocessing that you consider irrelevant but are not. This includes removing null values, outliers, or other special data points. The removal might result in a loss of accuracy concerning the underlying real-life distribution.
- **Automation bias**: This bias happens when you favor results generated by automated systems over information from humans, even when the human-provided information is correct.
- **Overgeneralization bias**: This bias happens when you project a property of your dataset onto the whole population. An example would be assuming that all cats have gray fur because this is true in your large dataset.
- **Group attribution bias**: This bias happens when stereotypes are attributed to a whole group because of the actions of a few individuals within that group.
- **Survivorship bias**: This bias happens when you focus on successful examples while completely ignoring failures. An example would be studying your company's competition while ignoring all companies that failed, merged, or went bankrupt.

This list should give you a good understanding of problems that may arise when gathering and processing data. We can only urge you to read further into this topic while following these next guidelines.

Guidance for Handling Bias in Data

When using existing datasets, figure out the circumstances in which they were obtained to be able to judge their quality. When processing data either alone or in a team, define clear guidelines on how you define data and how you handle certain situations, and always reflect whether you are making assumptions based on your own predispositions.

To solidify your understanding that things are—most of the time—not as they seem, have a look at what is referred to as **Simpson's paradox** and the corresponding **University of California** (**UC**) Berkeley case (http://corysimon.github.io/articles/simpsons-paradox/).

Now that we have a good understanding of what to look out for when working with data, let's come back to the basics of ML.

## Classifying ML algorithms

In the first section of this chapter, we got a glimpse into ANNs. These are special in the sense that they can be used in a so-called supervised or unsupervised training setup. To understand what is meant by this, let's define the current three major types of ML algorithms, as follows:

- **Supervised learning**: In supervised learning, models are trained with a so-called labeled dataset. That means that besides knowing the input for the algorithm, we also know the required output. This type of learning is split into two groups of problems—namely, **classification problems** and **regression problems**. Classification works with discrete results, where the output is a class or group, while regression works with continuous results, where the output is a certain value. Examples of classification would be identifying fraud in money transactions or doing object detection in images. Examples of regression would be forecasting house or stock prices or predicting population growth. It is important to understand that this type of learning *requires* labels, which often results in the tedious task of labeling the whole dataset.
- **Unsupervised learning**: In unsupervised learning, models are trained on unlabeled data. This is basically self-organized learning to find patterns in data, referred to as **clustering**. Examples of this would be the filtering of spam emails in an inbox or the recommendation of movies or clothing a person might like to watch or purchase. Often, these learning algorithms are used in real-time scenarios where the data needs to be processed directly. The beauty of this type of learning is that we do not have to label the dataset.
- **Reinforcement learning**: In reinforcement learning, algorithms learn by reacting to a given environment on their own. The idea comes from how we as humans learn as we grow up: we perform a certain action, the outcome of that action is good, bad, or somewhere in between, and we then either receive some sort of reward or we do not. Another similar example is the way you would train a dog to behave. Technically, this is realized through a so-called *agent* that is guided by a *policy map* deciding the probability of taking actions when in a specific state. For the environment itself, we define a so-called *state-value function* that returns the *value* of being in a specific state. Good examples of this type of learning are training navigation control for a robot or an AI opponent for a game.

The following diagram provides an overview of the discussed ML types and the corresponding algorithms that are utilized in those areas:

A detailed overview of many of the prominent ML algorithms can be found on the *scikit-learn* web page (https://scikit-learn.org/stable/), which is one of the major Python libraries for ML.

Now that we have an idea of the types of training we can perform, let's have a short look at what types of results we get from a training run and how to interpret them.

## Analyzing errors and the quality of results of model training

As we discussed in the first section of this chapter, we require a loss function that we can minimize to optimize our training results. Typically, this is defined through what is referred to in mathematics as a metric. We need to differentiate at this point between metrics that are used to define a loss function and therefore used in an optimizer to train the model, and metrics that can be calculated to give additional hints toward the performance of the trained model. We will have a look at both kinds in this section.

As we have seen when looking at types of ML algorithms, we might work with an output represented by continuous data (regression), or we might work with an output represented by discrete data (classification).

The most prominent loss functions used in regression are the **mean squared error** (**MSE**) and the **root mean squared error** (**RMSE**). Imagine you try to determine a fitted line for a bunch of samples in linear regression. The distance between the line and a sample point in **two-dimensional** (**2D**) space is your error. To calculate the RMSE for all data points, you would take the expected values *y* and the predicted values *ŷ* and calculate the following:
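With *n* data points, expected values *yᵢ*, and predicted values *ŷᵢ*, the standard RMSE definition is:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}
```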

For classifications, this gets a little bit trickier. In most cases, the model either predicts the correct class or it does not, making the outcome binary. Further, we might have a binary classification problem (1 or 0—yes or no) or a multi-class problem (cat, dog, horse, and so on).

For both classification problems, there is a prominent loss function called **cross-entropy loss**. To solve the problem of having a binary result, this loss function requires a model that outputs a probability *p* between 0 and 1 for a given data point, serving as the suggested prediction for the true label *y*. For a binary classification model, it is calculated as follows:
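For a true label *y* ∈ {0, 1} and a predicted probability *p*, the standard binary cross-entropy is:

```latex
L(y, p) = -\left[\, y \log(p) + (1 - y)\log(1 - p) \,\right]
```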

For multi-class classification, we sum up this error for all classes *c*, as follows:
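Writing *y_c* for the true class indicator and *p_c* for the predicted probability of class *c*, the standard multi-class form over *C* classes is:

```latex
L = -\sum_{c=1}^{C} y_c \log(p_c)
```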

If you want to look further into this topic, consider other useful loss functions for regression, such as the **absolute error** loss and the **Huber loss** functions (used in **support vector machines**, or **SVMs**), useful loss functions for binary classification, such as the **hinge loss** function, and useful loss functions for multi-class classification, such as the **Kullback-Leibler divergence** (**KL-divergence**) function. The last one can also be used in reinforcement learning as a metric to monitor the policy function during training.

Everything we have discussed so far requires something we can put into a mathematical formula. Imagine working with text files to build a model for **natural language processing** (**NLP**). In such a case, we do not have a useful mathematical representation for text besides something such as **Unicode**. We will learn in *Chapter 7*, *Advanced Feature Extraction with NLP*, how to represent it in a useful, vectorized manner. Having vectors, we can use a different kind of metric to calculate how similar vectors are, called the **cosine similarity** metric, which we will discuss in *Chapter 6*,* Feature Engineering and Labeling*.
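As a small preview (a minimal sketch, not the implementation from *Chapter 6*), cosine similarity divides the dot product of two vectors by the product of their norms:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: 1.0 means
    same direction, 0.0 orthogonal, -1.0 opposite direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal)
print(cosine_similarity([1, 2], [2, 4]))  # close to 1.0 (same direction)
```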

So far, we have discussed how to calculate loss functions for a couple of scenarios, but how can we define the performance of our model overall?

For regression models, our loss function was defined over the whole corpus of our training set. The error of a single observation or prediction would be *ŷᵢ − yᵢ*. Therefore, RMSE is already a cost function and can be used by an optimizer to improve the model performance, so we can also use it to judge the performance of the model.

For classification models, this gets a little bit more interesting. Cross-entropy can be used with an optimizer to train the model and can be used to judge the model, but besides that, we can define an additional metric to look out for.

Something obvious would be what is referred to as the **accuracy** of a model, calculated as follows:
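In its simplest form, this is the share of predictions that were correct:

```latex
\mathrm{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}
```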

Now, this looks about right. We just say that the quality of our model is the percentage of how often we guessed correctly, and the reality is that a lot of people agree with this statement. Remember when we defined **false positives** and **false negatives**? These now come into play. Let's look at an example.

Imagine a test that checks for a contagious virus. *Figure 1.10* shows the results for 100 people being tested for this virus, including the correctness of the results:

Now, what would be the accuracy of this test given these results? Let's define it again using the values for true positives (**TP**), false positives (**FP**), false negatives (**FN**), and true negatives (**TN**) and calculate the results for our example, as follows:
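In terms of these four counts, accuracy is defined as:

```latex
\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```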

This sounds like a good test. It gives accurate results in 92% of cases, but perhaps you see the problem here. Accuracy treats every outcome equally. Our test misclassifies someone who has the virus as virus-free eight times, which might have dire ramifications. That means it might be useful to have performance metrics that put more emphasis on false-positive or false-negative outcomes. Therefore, let's define two additional metrics.

The first one we call **precision**, a value that defines how many positive identifications were correct. The formula is shown here:
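Using the counts defined previously, precision is written as:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}
```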

In our example, only in two out of three cases are we correct when we declare someone to be infected. A model with a precision value of 1 would have no false-positive results.

The second one we call **recall**, a value that defines how many of the actual positive cases we identify correctly. The formula is shown here:
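Likewise, recall is written as:

```latex
\mathrm{recall} = \frac{TP}{TP + FN}
```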

This means in our example, we correctly identify 20% of all infected patients, which is a bad result. A model with a recall value of 1 would have no false-negative results.
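Putting these metrics together, here is a sketch with hypothetical confusion-matrix counts (illustrative numbers, not the exact figures from *Figure 1.10*):

```python
# Hypothetical confusion-matrix counts for 100 tested people
tp, fp, fn, tn = 2, 1, 8, 89

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # how many positive calls were right
recall = tp / (tp + fn)     # how many infected people we found

print(f"accuracy:  {accuracy:.2f}")   # 0.91
print(f"precision: {precision:.2f}")  # 0.67
print(f"recall:    {recall:.2f}")     # 0.20
```

Note how a model can score high on accuracy while recall stays very low, which is exactly the trap discussed above.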

To evaluate our test or classification correctly, we need to evaluate accuracy, precision, and recall. Be aware that, as mentioned when we talked about hypothesis testing, precision and recall can work against each other. Therefore, you often have to decide whether you prefer to be precise when saying "*You have the virus*" or whether you prefer to find everyone who has the virus. You might now understand why such tests are often designed toward recall.

With this, we conclude the section on the mathematical basis required to get better at building ML models and working with data. Based on what we have learned so far, you should take the next point with you.

Important Note

Never just use methods from ML libraries for data analysis and modeling; understand them mathematically.

In the next section, we will guide you through the structure of the end-to-end ML process and the structure of this book.