Enhancing Deep Learning with Bayesian Inference

By: Matt Benatan, Jochem Gietema, Marian Schneider

Overview of this book

Deep learning has an increasingly significant impact on our lives, from suggesting content to playing a key role in mission- and safety-critical applications. As the influence of these algorithms grows, so does the concern for the safety and robustness of the systems that rely on them. Simply put, typical deep learning methods do not know when they don’t know. The field of Bayesian Deep Learning contains a range of methods for approximate Bayesian inference with deep networks. These methods help to improve the robustness of deep learning systems, as they tell us how confident they are in their predictions, allowing us to take more care in how we incorporate model predictions within our applications. Through this book, you will be introduced to the rapidly growing field of uncertainty-aware deep learning, developing an understanding of the importance of uncertainty estimation in robust machine learning systems. You will learn about a variety of popular Bayesian Deep Learning methods, and how to implement these through practical Python examples covering a range of application scenarios. By the end of the book, you will have a good understanding of Bayesian Deep Learning and its advantages, and you will be able to develop Bayesian Deep Learning models for safer, more robust deep learning systems.

2.1 Refreshing our knowledge of Bayesian modeling

Bayesian modeling is concerned with understanding the probability of an event occurring given some prior assumptions and some observations. The prior assumptions describe our initial beliefs, or hypothesis, about the event. For example, let’s say we have two six-sided dice, and we want to predict the probability that the sum of the two dice is 5. First, we need to understand how many possible outcomes there are. Because each die has 6 sides, the number of possible outcomes is 6 × 6 = 36. To work out the probability of rolling a sum of 5, we need to work out how many combinations of values will sum to 5:

Figure 2.1: Illustration of all values summing to five when rolling two six-sided dice

As we can see here, there are 4 combinations that add up to 5, thus the probability of having two dice produce a sum of 5 is 4/36, or 1/9. We call this initial belief the prior. Now, what happens if we incorporate information from an observation? Let’s say we know what the value for one of the dice will be – let’s say 3. This shrinks our number of possible values down to 6, as we only have the remaining die to roll, and for the result to be 5, we’d need this value to be 2.
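As a quick check, we can verify both of these probabilities by enumerating the outcome space directly. Here is a minimal Python sketch of that enumeration (our own illustration of the idea):

from itertools import product
from fractions import Fraction

# All 36 outcomes of rolling two fair six-sided dice
outcomes = list(product(range(1, 7), repeat=2))

# Prior: the fraction of outcomes whose sum is 5
prior = Fraction(sum(1 for d1, d2 in outcomes if d1 + d2 == 5), len(outcomes))
print(prior)  # 1/9, in other words 4/36

# Conditioning on the observation d1 = 3 leaves only 6 possible outcomes
observed = [(d1, d2) for d1, d2 in outcomes if d1 == 3]
posterior = Fraction(sum(1 for d1, d2 in observed if d1 + d2 == 5), len(observed))
print(posterior)  # 1/6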

Figure 2.2: Illustration of remaining value, which sums to five after rolling the first die

Because we assume our die is fair, the probability of the sum of the dice being 5 is now 1/6. This probability, called the posterior, is obtained using information from our observation. At the core of Bayesian statistics is Bayes’ rule (hence “Bayesian”), which we use to determine the posterior probability given some prior knowledge. Bayes’ rule is defined as:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$

where we can define P(A|B) as P(d1 + d2 = 5|d1 = 3), with d1 and d2 representing dice 1 and 2 respectively. We can see this in action using our previous example. Starting with the likelihood, that is, the left-hand term of our numerator, we see that:

$$P(B \mid A) = P(d_1 = 3 \mid d_1 + d_2 = 5) = \frac{1}{4}$$

We can verify this by looking at our grid. Moving to the second part of the numerator – the prior – we see that:

$$P(A) = P(d_1 + d_2 = 5) = \frac{4}{36} = \frac{1}{9}$$

In the denominator, we have our normalization constant (also referred to as the marginal likelihood), which is simply:

$$P(B) = P(d_1 = 3) = \frac{1}{6}$$

Putting this all together using Bayes’ theorem, we have:

$$P(d_1 + d_2 = 5 \mid d_1 = 3) = \frac{\frac{1}{4} \times \frac{1}{9}}{\frac{1}{6}} = \frac{1}{6}$$
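We can reproduce this calculation with exact arithmetic; the following is a minimal sketch using Python’s fractions module, plugging in the probabilities derived above:

from fractions import Fraction

likelihood = Fraction(1, 4)  # P(d1 = 3 | d1 + d2 = 5)
prior = Fraction(1, 9)       # P(d1 + d2 = 5)
evidence = Fraction(1, 6)    # P(d1 = 3), the normalization constant

posterior = likelihood * prior / evidence
print(posterior)  # 1/6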

What we have here is the probability of the outcome being 5 if we know one die’s value. However, in this book, we’ll often be referring to uncertainties rather than probabilities – and learning methods to obtain uncertainty estimates with DNNs. These methods belong to the broader field of uncertainty quantification, and aim to quantify the uncertainty in the predictions from an ML model. That is, we want to predict P(ŷ|θ), where ŷ is a prediction from a model, and θ represents the parameters of the model.

As we know from fundamental probability theory, probabilities are bound between 0 and 1. The closer we are to 1, the more likely – or probable – the event is. We can view our uncertainty as the result of subtracting our probability from 1. In the context of the example here, the probability of the sum being 5 is P(d1 + d2 = 5|d1 = 3) = 1/6 ≈ 0.167. So, our uncertainty is simply 1 − 1/6 = 5/6 ≈ 0.833, meaning that there’s a > 80% chance that the outcome will not be 5. As we proceed through the book, we’ll learn about different sources of uncertainty, and how uncertainties can help us to develop more robust deep learning systems.
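In code, this conversion from probability to uncertainty is a single subtraction; a minimal sketch:

from fractions import Fraction

p = Fraction(1, 6)          # P(d1 + d2 = 5 | d1 = 3)
uncertainty = 1 - p         # 5/6
print(float(uncertainty))   # 0.8333..., a > 80% chance the sum is not 5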

Let’s continue using our dice example to build a better understanding of model uncertainty estimates. Many common machine learning models work on the basis of maximum likelihood estimation, or MLE. That is, they look to predict the value that is most likely: tuning their parameters during training to produce the most likely outcome ŷ given some input x. As a simple illustration, let’s say we want to predict the value of d1 + d2 given a value of d1. We can simply define this as the expectation of d1 + d2 conditioned on d1:

$$\hat{y} = \mathbb{E}[d_1 + d_2 \mid d_1]$$

That is, the mean of the possible values of d1 + d2.

Setting d1 = 3, our possible values for d1 + d2 are {4,5,6,7,8,9} (as illustrated in Figure 2.2), making our mean:

$$\mu = \frac{1}{6}\sum_{i=1}^{6} a_i = \frac{4 + 5 + 6 + 7 + 8 + 9}{6} = 6.5$$

This is the value we’d get from a simple linear model, such as a linear regression defined by:

$$\hat{y} = \beta x + \xi$$

In this case, the values of our slope and bias are β = 1 and ξ = 3.5. If we change our value of d1 to 1, we see that this mean changes to 4.5 – the mean of the set of possible values of d1 + d2|d1 = 1, in other words {2,3,4,5,6,7}. This perspective on our model predictions is important: while this example is very straightforward, the same principle applies to far more sophisticated models and data. The value we typically see with ML models is the expectation, otherwise known as the mean. As you are likely aware, the mean is often referred to as the first statistical moment – with the second central moment being the variance, and the variance allows us to quantify uncertainty.
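We can check this equivalence numerically. The sketch below computes the conditional expectation by enumerating the possible sums for a fair d2, and compares it with the linear model above (the names expected_sum and linear_model are our own):

def expected_sum(d1: int) -> float:
    """E[d1 + d2 | d1], computed by enumerating a fair six-sided d2."""
    sums = [d1 + d2 for d2 in range(1, 7)]
    return sum(sums) / len(sums)

def linear_model(x: float, beta: float = 1.0, xi: float = 3.5) -> float:
    """The equivalent linear regression: y = beta * x + xi."""
    return beta * x + xi

for d1 in (3, 1):
    print(d1, expected_sum(d1), linear_model(d1))
# 3 6.5 6.5
# 1 4.5 4.5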

The variance for our simple example is defined as follows:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(a_i - \mu)^2$$

These statistical moments should be familiar to you, as should the fact that the variance here is represented as the square of the standard deviation, σ. Note that because we are averaging over the complete set of equally likely outcomes, rather than a sample, we divide by n (the population variance). For our example here, for which we assume d2 is a fair die, the variance will always be constant: σ² = 2.917. That is to say, given any value of d1, we know that the values of d2 are all equally likely, so the uncertainty does not change. But what if we have an unfair die d2, which has a 50% chance of landing on a 6, and a 10% chance of landing on each other number? This changes both our mean and our variance. We can see this by looking at how we would represent this as a set of possible values (in other words, a perfect sample of the die) – the set of possible values for d1 + d2|d1 = 1 now becomes {2,3,4,5,6,7,7,7,7,7}. Our new model will now have a bias of ξ = 4.5, making our prediction:

$$\hat{y} = 1 \times 1 + 4.5 = 5.5$$

We see that the expectation has increased due to the change in the underlying probability of the values of die d2. However, the important difference here is in the change in the variance value:

$$\sigma^2 = \frac{1}{10}\sum_{i=1}^{10}(a_i - \mu)^2 = 3.25$$

Our variance has increased. As variance essentially gives us the average of the squared distance of each possible value from the mean, this shouldn’t be surprising: given the weighted die, it’s more likely that the outcome will be distant from the mean than with an unweighted die, and thus our variance increases. To summarize, in terms of uncertainty: the greater the likelihood that the outcome will be further from the mean, the greater the uncertainty.
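To confirm these values, the sketch below computes the population variance of the possible sums given d1 = 1, for both the fair die and the weighted die (represented, as above, by ten equally likely values):

def population_variance(values) -> float:
    """Variance over a complete set of equally likely outcomes (divide by n)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

fair = [1 + d2 for d2 in range(1, 7)]       # {2, 3, 4, 5, 6, 7}
weighted = [2, 3, 4, 5, 6, 7, 7, 7, 7, 7]   # d2 lands on 6 half the time

print(round(population_variance(fair), 3))      # 2.917
print(round(population_variance(weighted), 3))  # 3.25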

This has important implications for how we interpret predictions from machine learning models (and statistical models more generally). If our predictions are an approximation of the mean, and our uncertainty quantifies how likely it is for an outcome to be distant from the mean, then our uncertainty tells us how likely it is that our model prediction is incorrect. Thus, model uncertainties allow us to decide when to trust the predictions, and when we should be more cautious.

The examples given here are very basic, but should help to give you an idea of what we’re looking to achieve with model uncertainty quantification. We will continue to explore these concepts as we learn about some of the benchmark methods for Bayesian inference, learning how these concepts apply to more complex, real-world problems. We’ll start with perhaps the most fundamental method of Bayesian inference: sampling.