
Enhancing Deep Learning with Bayesian Inference

By Matt Benatan, Jochem Gietema, Marian Schneider
About this book
Deep learning has an increasingly significant impact on our lives, from suggesting content to playing a key role in mission- and safety-critical applications. As the influence of these algorithms grows, so does the concern for the safety and robustness of the systems which rely on them. Simply put, typical deep learning methods do not know when they don’t know. The field of Bayesian Deep Learning contains a range of methods for approximate Bayesian inference with deep networks. These methods help to improve the robustness of deep learning systems as they tell us how confident they are in their predictions, allowing us to take more care in how we incorporate model predictions within our applications. Through this book, you will be introduced to the rapidly growing field of uncertainty-aware deep learning, developing an understanding of the importance of uncertainty estimation in robust machine learning systems. You will learn about a variety of popular Bayesian Deep Learning methods, and how to implement these through practical Python examples covering a range of application scenarios. By the end of the book, you will have a good understanding of Bayesian Deep Learning and its advantages, and you will be able to develop Bayesian Deep Learning models for safer, more robust deep learning systems.
Publication date:
June 2023
Publisher
Packt
Pages
386
ISBN
9781803246888

 

Chapter 2
Fundamentals of Bayesian Inference

Before we get into Bayesian inference with Deep Neural Networks (DNNs), we should take some time to understand the fundamentals. In this chapter, we’ll do just that, exploring the core concepts of Bayesian modeling and taking a look at some of the popular methods used for Bayesian inference. By the end of this chapter, you should have a good understanding of why we use probabilistic modeling, and of what kinds of properties we look for in well principled – or well conditioned – methods.

This content will be covered in the following sections:

  • Refreshing our knowledge of Bayesian modeling

  • Bayesian inference via sampling

  • Exploring the Gaussian process

 

2.1 Refreshing our knowledge of Bayesian modeling

Bayesian modeling is concerned with understanding the probability of an event occurring given some prior assumptions and some observations. The prior assumptions describe our initial beliefs, or hypothesis, about the event. For example, let’s say we have two six-sided dice, and we want to predict the probability that the sum of the two dice is 5. First, we need to understand how many possible outcomes there are. Because each die has 6 sides, the number of possible outcomes is 6 × 6 = 36. To work out the probability of rolling a sum of 5, we need to work out how many combinations of values will sum to 5:


Figure 2.1: Illustration of all values summing to five when rolling two six-sided dice

As we can see here, there are 4 combinations that add up to 5, thus the probability of two dice producing a sum of 5 is 4/36, or 1/9. We call this initial belief the prior. Now, what happens if we incorporate information from an observation? Let’s say we know the value of one of the dice – say, 3. This shrinks our number of possible outcomes down to 6, as we only have the remaining die to roll, and for the sum to be 5, we’d need this value to be 2.


Figure 2.2: Illustration of remaining value, which sums to five after rolling the first die

Because we assume our die is fair, the probability of the sum of the dice being 5 is now 1/6. This probability, called the posterior, is obtained using information from our observation. At the core of Bayesian statistics is Bayes’ rule (hence “Bayesian”), which we use to determine the posterior probability given some prior knowledge. Bayes’ rule is defined as:

P(A|B) = (P(B|A) × P(A)) / P(B)

Here, we can define P(A|B) as P(d1 + d2 = 5 | d1 = 3), where d1 and d2 represent dice 1 and 2, respectively. We can see this in action using our previous example. Starting with the likelihood – that is, the first term in our numerator – we see that:

P(B|A) = P(d1 = 3 | d1 + d2 = 5) = 1/4

We can verify this by looking at our grid. Moving to the second part of the numerator – the prior – we see that:

P(A) = P(d1 + d2 = 5) = 4/36 = 1/9

On the denominator, we have our normalization constant (also referred to as the marginal likelihood), which is simply:

P(B) = P(d1 = 3) = 1/6

Putting this all together using Bayes’ theorem, we have:

P(d1 + d2 = 5 | d1 = 3) = (1/4 × 1/9) / (1/6) = 1/6
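We can check all of this with a few lines of Python. The following is a minimal sketch (not taken from the book’s code examples) that enumerates all 36 outcomes and applies Bayes’ rule directly:

    from fractions import Fraction
    from itertools import product

    # Enumerate all 36 outcomes of rolling two fair six-sided dice.
    outcomes = list(product(range(1, 7), repeat=2))

    # Prior: P(A) = P(d1 + d2 = 5).
    prior = Fraction(sum(d1 + d2 == 5 for d1, d2 in outcomes), len(outcomes))

    # Likelihood: P(B|A) = P(d1 = 3 | d1 + d2 = 5).
    sums_to_5 = [(d1, d2) for d1, d2 in outcomes if d1 + d2 == 5]
    likelihood = Fraction(sum(d1 == 3 for d1, _ in sums_to_5), len(sums_to_5))

    # Marginal likelihood: P(B) = P(d1 = 3).
    evidence = Fraction(sum(d1 == 3 for d1, _ in outcomes), len(outcomes))

    # Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
    posterior = likelihood * prior / evidence
    print(posterior)  # 1/6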

What we have here is the probability of the outcome being 5 if we know one die’s value. However, in this book, we’ll often be referring to uncertainties rather than probabilities – and learning methods to obtain uncertainty estimates with DNNs. These methods belong to the broader field of uncertainty quantification, and aim to quantify the uncertainty in the predictions of an ML model. That is, we want to predict P(ŷ|𝜃), where ŷ is a prediction from a model, and 𝜃 represents the parameters of the model.

As we know from fundamental probability theory, probabilities are bounded between 0 and 1. The closer we are to 1, the more likely – or probable – the event is. We can view our uncertainty as subtracting our probability from 1. In the context of the example here, the probability of the sum being 5 is P(d1 + d2 = 5 | d1 = 3) = 1/6 ≈ 0.167. So, our uncertainty is simply 1 − 1/6 = 5/6 ≈ 0.833, meaning that there’s a greater than 80% chance that the outcome will not be 5. As we proceed through the book, we’ll learn about different sources of uncertainty, and how uncertainties can help us to develop more robust deep learning systems.

Let’s continue using our dice example to build a better understanding of model uncertainty estimates. Many common machine learning models work on the basis of maximum likelihood estimation, or MLE. That is, they look to predict the value that is most likely: tuning their parameters during training to produce the most likely outcome ŷ given some input x. As a simple illustration, let’s say we want to predict the value of d1 + d2 given a value of d1. We can simply define this as the expectation of d1 + d2 conditioned on d1:

ŷ = 𝔼[d1 + d2 | d1]

That is, the mean of the possible values of d1 + d2.

Setting d1 = 3, our possible values for d1 + d2 are {4,5,6,7,8,9} (as illustrated in Figure 2.2), making our mean:

μ = (1/6) ∑_{i=1}^{6} a_i = (4 + 5 + 6 + 7 + 8 + 9) / 6 = 6.5

This is the value we’d get from a simple linear model, such as a linear regression defined by:

ŷ = βx + ξ

In this case, the values of our slope and bias are β = 1, ξ = 3.5. If we change our value of d1 to 1, we see that this mean changes to 4.5 – the mean of the set of possible values of d1 + d2 | d1 = 1, in other words {2,3,4,5,6,7}. This perspective on our model predictions is important: while this example is very straightforward, the same principle applies to far more sophisticated models and data. The value we typically see from ML models is the expectation, otherwise known as the mean. As you are likely aware, the mean is often referred to as the first statistical moment – with the second statistical moment being the variance, and it is the variance that allows us to quantify uncertainty.

The variance for our simple example is defined as follows:

σ² = (1/n) ∑_{i=1}^{n} (a_i − μ)²

These statistical moments should be familiar to you, as should the fact that the variance is represented as the square of the standard deviation, σ. Note that because our set enumerates all equally likely outcomes – rather than a sample drawn from them – we divide by n rather than n − 1. For our example here, for which we assume d2 is a fair die, the variance will always be constant: σ² = 2.917. That is to say, given any value of d1, the values of d2 are all equally likely, so the uncertainty does not change. But what if we have an unfair die d2, which has a 50% chance of landing on a 6, and a 10% chance of landing on each of the other numbers? This changes both our mean and our variance. We can see this by looking at how we would represent this as a set of possible values (in other words, a perfect sample of the die) – the set of possible values for d1 + d2 | d1 = 1 now becomes {2,3,4,5,6,7,7,7,7,7}. Our new model will now have a bias of ξ = 4.5, making our prediction:

ŷ = 1 × 1 + 4.5 = 5.5

We see that the expectation has increased due to the change in the underlying probabilities of the values of die d2. However, the important difference here is the change in the variance:

σ² = (1/n) ∑_{i=1}^{10} (a_i − μ)² = 3.25

Our variance has increased. As variance essentially gives us the average squared distance of each possible value from the mean, this shouldn’t be surprising: given the weighted die, it’s more likely that the outcome will be distant from the mean than with an unweighted die, and thus our variance increases. To summarize, in terms of uncertainty: the greater the likelihood that the outcome will be further from the mean, the greater the uncertainty.
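These moments are easy to reproduce. Here’s a minimal NumPy sketch (illustrative, not the book’s code) that computes the mean and population variance for both the fair and the weighted die:

    import numpy as np

    # Possible values of d1 + d2 given d1 = 1, treating each set as a
    # perfect sample of the die's outcomes (hence population statistics).
    fair = np.array([2, 3, 4, 5, 6, 7])
    weighted = np.array([2, 3, 4, 5, 6, 7, 7, 7, 7, 7])  # 6 is five times as likely

    for name, values in [("fair", fair), ("weighted", weighted)]:
        mu = values.mean()
        var = values.var()  # population variance: mean squared distance from mu
        print(f"{name}: mean = {mu:.3f}, variance = {var:.3f}")

    # fair: mean = 4.500, variance = 2.917
    # weighted: mean = 5.500, variance = 3.250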

This has important implications for how we interpret predictions from machine learning models (and statistical models more generally). If our predictions are an approximation of the mean, and our uncertainty quantifies how likely it is for an outcome to be distant from the mean, then our uncertainty tells us how likely it is that our model prediction is incorrect. Thus, model uncertainties allow us to decide when to trust the predictions, and when we should be more cautious.

The examples given here are very basic, but should help to give you an idea of what we’re looking to achieve with model uncertainty quantification. We will continue to explore these concepts as we learn about some of the benchmark methods for Bayesian inference, learning how these concepts apply to more complex, real-world problems. We’ll start with perhaps the most fundamental method of Bayesian inference: sampling.

 

2.2 Bayesian inference via sampling

In practical applications, it’s not possible to know exactly what a given outcome would be, and, similarly, it’s not possible to observe all possible outcomes. In these cases, we need to make a best estimate based on the evidence we have. The evidence is formed of samples – observations of possible outcomes. The aim of ML, broadly speaking, is to learn models that generalize well from a subset of data. The aim of Bayesian ML is to do so while also providing an estimate of the uncertainty associated with the model’s predictions. In this section, we’ll learn about how we can use sampling to do this, and will also learn why sampling may not be the most sensible approach.

2.2.1 Approximating distributions

At the most fundamental level, sampling is about approximating distributions. Say we want to know the distribution of the height of people in New York. We could go out and measure everyone, but that would involve measuring the height of 8.4 million people! While this would give us our most accurate answer, it’s also a deeply impractical approach.

Instead, we can sample from the population. This gives us a basic example of Monte Carlo sampling, where we use random sampling to provide data from which we can approximate a distribution. For example, given a database of New York residents, we could select – at random – a sub-population of residents, and use this to approximate the height distribution of all residents. With random sampling – and any sampling, for that matter – the accuracy of the approximation is dependent on the size of the sub-population. What we’re looking to achieve is a statistically significant sub-sample, such that we can be confident in our approximation.

To get a better impression of this, we’ll simulate the problem by generating 100,000 data points from a truncated normal distribution, to approximate the kind of height distribution we may see for a population of 100,000 people. Say we draw 10 samples, at random, from our population. Here’s what our distribution would look like (on the right) compared with the true distribution (on the left):


Figure 2.3: Plot of true distribution (left) versus sample distribution (right)

As we can see, this isn’t a great representation of the true distribution: what we see here is closer to a triangular distribution than a truncated normal. If we were to infer something about the population’s height based on this distribution alone, we’d arrive at a number of inaccurate conclusions, such as missing the truncation above 200 cm, and the tail on the left of the distribution.

We can get a better impression by increasing our sample size – let’s try drawing 100 samples:


Figure 2.4: Plot of true distribution (left) versus sample distribution (right)

Things are starting to look better: we’re starting to see some of the tail on the left, as well as the truncation toward 200 cm. However, this sample has drawn more heavily from some regions than others, leading to misrepresentation: our mean has been pulled down, and we’re seeing two distinct peaks, rather than the single peak we see in the true distribution. Let’s increase our sample size by a further order of magnitude, scaling up to 1,000 samples:


Figure 2.5: Plot of true distribution (left) versus sample distribution (right)

This is looking much better – with a sample set of only one hundredth the size of our true population, we now see a distribution that closely matches our true distribution. This example demonstrates how, through random sampling, we can approximate the true distribution using a significantly smaller pool of observations. But that pool still has to have enough information to allow us to arrive at a good approximation of the true distribution: too few samples and our subset will be statistically insufficient, leading to poor approximation of the underlying distribution.
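This simulation is straightforward to reproduce. Here’s a minimal sketch using SciPy’s truncated normal (the distribution parameters – mean 170 cm, standard deviation 10 cm – are our own illustrative choices; the book doesn’t specify exact values here):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # "True" population: 100,000 heights from a normal truncated above 200 cm.
    mu, sigma = 170, 10
    a, b = -np.inf, (200 - mu) / sigma  # truncation bounds in standard units
    population = stats.truncnorm.rvs(a, b, loc=mu, scale=sigma,
                                     size=100_000, random_state=rng)

    # Simple random sampling at increasing sample sizes.
    for n in (10, 100, 1000):
        sample = rng.choice(population, size=n, replace=False)
        print(f"n = {n:4d}: mean = {sample.mean():6.2f}, std = {sample.std():5.2f}")
    print(f"true    : mean = {population.mean():6.2f}, std = {population.std():5.2f}")

Plotting histograms of these samples against the full population reproduces the behavior seen in Figures 2.3 through 2.5.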

But simple random sampling isn’t the most practical method for approximating distributions. To achieve this, we turn to probabilistic inference. Given a model, probabilistic inference provides a way to find the model parameters that best describe our data. To do so, we need to first define the type of model – this is our prior. For our example, we’ll use a truncated Gaussian: the idea being that, intuitively, it’s reasonable to assume people’s heights follow a normal distribution, but that very few people are taller than, say, 6’5”. So, we’ll specify a truncated Gaussian distribution with an upper limit of 205 cm, or just over 6’5”. As it’s a Gaussian distribution, in other words 𝒩(μ,σ), our model parameters are 𝜃 = {μ,σ} – with the additional constraint that our distribution has an upper limit of b = 205.

This brings us to a fundamental class of algorithms: Markov Chain Monte Carlo, or MCMC, methods. Like simple random sampling, these allow us to build a picture of the true underlying distribution, but they do so sequentially, whereby each sample depends on the sample before it. This sequential dependence is known as the Markov property – hence the Markov chain component of the name. This sequential approach accounts for the probabilistic dependence between samples and allows us to better approximate the probability density.

MCMC achieves this through sequential random sampling. Just as with the random sampling we’re familiar with, MCMC randomly samples from our distribution. But, unlike simple random sampling, MCMC considers pairs of samples: some previous sample xt−1 and some candidate sample xt. For each pair of samples, we have some criterion that specifies whether or not we keep the candidate (this varies depending on the particular flavor of MCMC). If the new value meets this criterion – say, if xt is “preferential to” our previous value xt−1 – then the sample is added to the chain and becomes the previous sample for the next round. If the candidate doesn’t meet the criterion, we stick with our current sample for the next round. We repeat this over a (usually large) number of iterations, and in the end we should arrive at a good approximation of our distribution.
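In code, this accept/reject loop is only a few lines. Here’s a minimal sketch using a symmetric Gaussian proposal (which reduces Metropolis-Hastings, introduced in the next section, to the simpler Metropolis rule); the target density is our illustrative truncated normal from above:

    import numpy as np

    rng = np.random.default_rng(1)

    def target_pdf(x):
        # Unnormalized density of the truncated normal from our height example.
        return 0.0 if x > 200 else np.exp(-0.5 * ((x - 170) / 10) ** 2)

    x_curr, chain = 170.0, []
    for t in range(10_000):
        x_new = x_curr + rng.normal(0, 5)               # propose a nearby value
        alpha = target_pdf(x_new) / target_pdf(x_curr)  # acceptance criterion
        if rng.uniform() < alpha:                       # accept with prob min(1, alpha)
            x_curr = x_new
        chain.append(x_curr)                            # otherwise keep current value

    chain = np.array(chain)
    print(chain.mean(), chain.std())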

The result is an efficient sampling method that is able to closely approximate the true parameters of our distribution. Let’s see how this applies to our height distribution example. Using MCMC with just 10 samples, we arrive at the following approximation:


Figure 2.6: Plot of true distribution (left) versus approximate distribution via MCMC (right)

Not bad for ten samples – certainly far better than the triangular distribution we arrived at with simple random sampling. Let’s see how we do with 100:


Figure 2.7: Plot of true distribution (left) versus approximate distribution via MCMC (right)

This is looking pretty excellent – in fact, we’re able to obtain a better approximation of our distribution with 100 MCMC samples than we are with 1,000 simple random samples. If we continue to larger numbers of samples, we’ll arrive at closer and closer approximations of our true distribution. But our simple example doesn’t fully capture the power of MCMC: MCMC’s true advantage is its ability to approximate high-dimensional distributions, which has made it an invaluable technique for approximating intractable high-dimensional integrals in a variety of domains.

In this book, we’re interested in how we can estimate the probability distribution of the parameters of machine learning models – this allows us to estimate the uncertainty associated with our predictions. In the next section, we’ll take a look at how we do this practically by applying sampling to Bayesian linear regression.

2.2.2 Implementing probabilistic inference with Bayesian linear regression

In typical linear regression, we want to predict some output ŷ from some input x using a linear function f(x), such that ŷ = βx + ξ. With Bayesian linear regression, we do this probabilistically, introducing another parameter, σ2, such that our regression equation becomes:

ŷ ∼ 𝒩(βx + ξ, σ²)

That is, ŷ follows a Gaussian distribution.

Here, we see our familiar slope β and bias term ξ, and we introduce a variance parameter σ². To fit our model, we need to define a prior over these parameters – just as we did for our MCMC example in the last section. We’ll define these priors as:

ξ ∼ 𝒩(0, 1)
β ∼ 𝒩(0, 1)
σ² ∼ |𝒩(0, 1)|

Note that equation 2.15 denotes a half-normal distribution (the positive half of a zero-mean Gaussian, as standard deviation cannot be negative). We’ll refer to our model parameters as 𝜃 = {β, ξ, σ²}, and we’ll use sampling to find the parameters that are most probable given our data – in other words, to maximize the conditional probability of our parameters given our data D: P(𝜃|D).

There are a variety of MCMC sampling approaches we could use to find our model parameters. A common approach is to use the Metropolis-Hastings algorithm. Metropolis-Hastings is particularly useful for sampling from intractable distributions. It does so by drawing candidates from a proposal distribution, Q(𝜃′|𝜃), and evaluating them under a function that is proportional to, but not exactly equal to, our true distribution. This means that, for example, if some value x1 is twice as likely as some other value x2 under our true distribution, this will also hold under our unnormalized function. Because we only ever need ratios of probabilities, we don’t need to know the exact values under our true distribution – proportionality is enough.

Here are the key steps of Metropolis-Hastings for our Bayesian linear regression.

First, we initialize with an arbitrary point 𝜃 sampled from our parameter space, according to the priors for each of our parameters. Using a Gaussian distribution centered on our current set of parameters 𝜃, we select a new candidate point 𝜃′. Then, for each iteration t ∈ {1, …, T}, we do the following:

  1. Calculate the acceptance criterion, defined as:

     α = P(𝜃′|D) / P(𝜃|D)

  2. Generate a random number from a uniform distribution, 𝜖 ∈ [0,1]. If 𝜖 ≤ α, accept the new candidate parameters – adding them to the chain and assigning 𝜃 = 𝜃′. If 𝜖 > α, keep the current 𝜃 and draw a new candidate.

This acceptance criterion means that, if our new set of parameters has a higher posterior probability than our last set, we’ll see α > 1, in which case 𝜖 ≤ α is guaranteed. This means that, when we sample parameters that are more likely given our data, we’ll always accept them. If, on the other hand, α < 1, there’s a chance we’ll reject the parameters, but we may also accept them – allowing us to explore regions of lower likelihood.

These mechanics of Metropolis-Hastings result in samples that can be used to compute high-quality approximations of our posterior distribution. Practically, Metropolis-Hastings (and MCMC methods more generally) requires a burn-in phase – an initial phase of sampling used to escape regions of low density, which are typically encountered given the arbitrary initialization.
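Putting these steps together, the following is a minimal hand-rolled sketch of Metropolis-Hastings for our Bayesian linear regression (the synthetic data and proposal scale are our own illustrative choices, and we work with log probabilities for numerical stability, so the acceptance ratio becomes a difference; in practice you’d typically reach for a probabilistic programming library):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Illustrative synthetic data: y = 2x + 5 + noise.
    x = np.linspace(-5, 5, 100)
    y = 2 * x + 5 + rng.normal(0, 5, size=x.shape)

    def log_posterior(theta):
        """Unnormalized log P(theta|D): log likelihood plus log priors."""
        beta, xi, sigma = theta
        if sigma <= 0:
            return -np.inf  # half-normal prior: sigma must be positive
        log_prior = (stats.norm.logpdf(beta) + stats.norm.logpdf(xi)
                     + stats.halfnorm.logpdf(sigma))
        log_lik = stats.norm.logpdf(y, loc=beta * x + xi, scale=sigma).sum()
        return log_lik + log_prior

    theta = np.array([0.0, 0.0, 1.0])  # arbitrary initialization
    chain = []
    for t in range(20_000):
        theta_new = theta + rng.normal(0, 0.1, size=3)  # Gaussian proposal
        # Acceptance criterion alpha = P(theta'|D) / P(theta|D), in log space.
        log_alpha = log_posterior(theta_new) - log_posterior(theta)
        if np.log(rng.uniform()) < log_alpha:
            theta = theta_new
        chain.append(theta)

    samples = np.array(chain)[5_000:]  # discard burn-in
    print(samples.mean(axis=0))        # posterior means of beta, xi, sigma

Note that the narrow 𝒩(0, 1) priors pull the estimates slightly toward zero relative to the data-generating values – widening the priors reduces this shrinkage.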

Let’s apply this to a simple problem: we’ll generate some data from the function y = 2x + 5 + η, where η is a noise term distributed according to η ∼ 𝒩(0, 5). Using Metropolis-Hastings to fit our Bayesian linear regressor, we get the following fit to the points sampled from our function (represented by the crosses):


Figure 2.8: Bayesian linear regression on generated data with low variance

We see that our model fits the data in the same way we would expect for standard linear regression. However, unlike standard linear regression, our model produces predictive uncertainty: this is represented by the shaded region. This predictive uncertainty gives an impression of how much our underlying data varies; this makes the model much more useful than a standard linear regression, as we now get a sense of the spread of our data, as well as the general trend. We can see how this varies if we generate new data and fit again, this time increasing the spread of the data by modifying our noise distribution to η ∼ 𝒩(0, 20):


Figure 2.9: Bayesian linear regression on generated data with high variance

We see that our predictive uncertainty has increased proportionally to the spread of the data. This is an important property of uncertainty-aware methods: when we have small uncertainty, we know our prediction fits the data well, whereas when we have large uncertainty, we know to treat our prediction with caution, as it indicates the model isn’t fitting this region particularly well. We’ll see a better example of this in the next section, which will go on to demonstrate how regions of more or less data contribute to our model uncertainty estimates.

Here, we see that our predictions fit our data pretty well. In addition, we see that σ² varies according to the availability of data in different regions. What we’re seeing here is a great example of a very important concept, well calibrated uncertainty – also termed high-quality uncertainty. This refers to the fact that, in regions where our predictions are inaccurate, our uncertainty is also high. Our uncertainty estimates are poorly calibrated if we’re very confident in regions with inaccurate predictions, or very uncertain in regions with accurate predictions. As it’s well calibrated, sampling is often used as a benchmark for uncertainty quantification.

Unfortunately, while sampling is effective for many applications, the need to obtain many samples for each parameter means that it quickly becomes computationally prohibitive in high-dimensional parameter spaces. For example, if we wanted to sample parameters for complex, non-linear relationships (such as the weights of a neural network), sampling would no longer be practical. Despite this, it’s still useful in some cases, and later we’ll see how various BDL methods make use of sampling.

In the next section, we’ll explore the Gaussian process – another fundamental method for Bayesian inference, and a method that does not suffer from the same computational overheads as sampling.

 

2.3 Exploring the Gaussian process

As we’ve seen in the previous section, sampling quickly becomes prohibitively expensive. To address this, we can use ML models specifically designed to produce uncertainty estimates – the gold standard of which is the Gaussian process.

The Gaussian process, or GP, has become a staple probabilistic ML model, seeing use in a broad variety of applications from pharmacology through to robotics. Its success is largely down to its ability to produce high-quality uncertainty estimates over its predictions in a well-principled fashion. So, what do we mean by a Gaussian process?

In essence, a GP is a distribution over functions. To understand what we mean by this, let’s take a typical ML use case. We want to learn some function f(x), which maps a series of inputs x onto a series of outputs y, such that we can approximate our output via y = f(x). Before we see any data, we know nothing about our underlying function; there is an infinite number of possible functions this could be:


Figure 2.10: Illustration of space of possible functions before seeing data

Here, the black line is the true function we wish to learn, while the dotted lines are the possible functions given the data (in this case, no data). Once we observe some data, we see that the number of possible functions becomes more constrained, as we see here:


Figure 2.11: Illustration of space of possible functions after seeing some data

Here, we see that our possible functions all pass through our observed data points, but outside of those data points, our functions take on a range of very different values. In a simple linear model, we don’t care about these deviations in possible values: we’re happy to interpolate from one data point to another, as we see in Figure 2.12:


Figure 2.12: Illustration of linearly interpolating through our observations

But this interpolation can lead to wildly inaccurate predictions, and has no way of accounting for the degree of uncertainty associated with our model predictions. The deviations that we see here in the regions without data points are exactly what we want to capture with our GP. When there are a variety of possible values our function can take, then there is uncertainty – and through capturing the degree of uncertainty, we are able to estimate what the possible variation in these regions may be.

Formally, a GP can be defined as a function:

f(x) ∼ 𝒢𝒫(m(x), k(x, x′))

Here, m(x) is simply the mean of our possible function values for a given point x:

m(x) = 𝔼[f(x)]

The next term, k(x, x′), is a covariance function, or kernel. This is a fundamental component of the GP, as it defines the way we model the relationship between different points in our data. GPs use the mean and covariance functions to model the space of possible functions, and thus to produce predictions as well as their associated uncertainties. Now that we’ve introduced some of the high-level concepts, let’s dig a little deeper and understand exactly how GPs model the space of possible functions, and thus estimate uncertainty. To do this, we need to understand GP priors.

2.3.1 Defining our prior beliefs with kernels

GP kernels describe the prior beliefs we have about our data, and so you’ll often see them referred to as GP priors. In the same way that the prior in equation 2.3 tells us something about the probability of the outcome of our two dice rolls, the GP prior tells us something important about the relationship we expect from our data.

While there are advanced methods for inferring a prior from our data, they are beyond the scope of this book. We will instead focus on more traditional uses of GPs, for which we select a prior using our knowledge of the data we’re working with.

In the literature and any implementations you encounter, you’ll see that the GP prior is often referred to as the kernel or covariance function (just as we have here). These three terms are all interchangeable, but for consistency with other work, we will henceforth refer to this as the kernel. Kernels simply provide a means of calculating a distance between two data points, and are expressed as k(x, x′), where x and x′ are data points, and k() represents the kernel function. While the kernel can take on many forms, there are a small number of fundamental kernels that are used in a large proportion of GP applications.

Perhaps the most commonly encountered kernel is the squared exponential or radial basis function (RBF) kernel. This kernel takes the form:

k(x, x′) = σ² exp(−(x − x′)² / (2l²))

This introduces us to a couple of common kernel parameters: l and σ². The output variance parameter σ² is simply a scaling factor, used to control the distance of the function from its mean. The length scale parameter l controls the smoothness of the function – in other words, how much your function is expected to vary across particular dimensions. This parameter can either be a scalar that is applied to all input dimensions, or a vector with a different scalar value for each input dimension. The latter is often achieved using Automatic Relevance Determination, or ARD, which identifies the relevant dimensions in the input space.
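In code, the RBF kernel is just a couple of lines. Here’s a minimal NumPy sketch (the function and parameter names are ours, for illustration), vectorized to return the full covariance matrix between two sets of points:

    import numpy as np

    def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
        """k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
        sq_dist = (x1[:, None] - x2[None, :]) ** 2
        return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

    x = np.array([0.0, 1.0, 2.0])
    print(rbf_kernel(x, x).round(3))  # nearby points covary strongly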

GPs make predictions via a covariance matrix based on the kernel – essentially comparing a new data point to previously observed data points. However, just as with all ML models, GPs need to be trained, and this is where the length scale comes in. The length scale forms the parameters of our GP, and through the training process it learns the optimal value(s) for the length scale(s). This is typically done using a nonlinear optimizer, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) optimizer. Many optimizers can be used, including optimizers you may be familiar with for deep learning, such as stochastic gradient descent and its variants.
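You rarely need to implement this training loop yourself. As a sketch, scikit-learn’s GaussianProcessRegressor optimizes the kernel parameters by maximizing the marginal likelihood (using L-BFGS-B by default; the data here is our own illustrative example):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, ConstantKernel

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 5, size=(8, 1))                   # a handful of observations
    y = np.sin(X).ravel() + rng.normal(0, 0.1, size=8)

    kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)          # sigma^2 * RBF(l)
    gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1 ** 2)  # alpha: noise variance
    gp.fit(X, y)                                         # learns l and sigma^2

    X_test = np.linspace(0, 5, 100).reshape(-1, 1)
    mean, std = gp.predict(X_test, return_std=True)      # predictions + uncertainty
    print(gp.kernel_)                                    # optimized kernel parameters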

Let’s take a look at how different kernels affect GP predictions. We’ll start with a straightforward example – a simple sine wave:


Figure 2.13: Plot of sine wave with four sampled points

We can see the function illustrated here, as well as some points sampled from this function. Now, let’s fit a GP with a periodic kernel to the data. The periodic kernel is defined as:

k_per(x, x′) = σ² exp(−2 sin²(π|x − x′| / p) / l²)

Here, we see a new parameter: p. This is simply the period of the periodic function. Setting p = 1 and applying a GP with a periodic kernel to the preceding example, we get the following:


Figure 2.14: Plot of posterior predictions from a periodic kernel with p = 1

This looks pretty noisy, but you should be able to see that there is clear periodicity in the functions produced by the posterior. It’s noisy for a couple of reasons: a lack of data, and a poor prior. If we’re limited on data, we can try to fix the problem by improving our prior. In this case, we can use our knowledge of the periodicity of the function to improve our prior by setting p = 6:


Figure 2.15: Plot of posterior predictions from a periodic kernel with p = 6

We see that this fits the data pretty well: we’re still uncertain in regions for which we have little data, but the periodicity of our posterior now looks sensible. This is possible because we’re using an informative prior; that is, a prior that incorporates information that describes the data well. This prior is composed of two key components:

  • Our periodic kernel

  • Our knowledge about the periodicity of the function

We can see how important this is if we modify our GP to use an RBF kernel:


Figure 2.16: Plot of posterior predictions from an RBF kernel

With an RBF kernel, we see that things are looking pretty chaotic again: because we have limited data and a poor prior, we’re unable to appropriately constrain the space of possible functions to fit our true function. In the ideal case, we’d fix this by using a more appropriate prior, as we saw in Figure 2.15 – but this isn’t always possible. Another solution is to sample more data. Sticking with our RBF kernel, we sample 10 data points from our function and re-train our GP:


Figure 2.17: Plot of posterior predictions from an RBF kernel, trained on 10 observations

This is looking much better – but what if we have more data and an informative prior?


Figure 2.18: Plot of posterior predictions from a periodic kernel with p = 6, trained on 10 observations

The posterior now fits our true function very closely. Because we don’t have infinite data, there are still some areas of uncertainty, but the uncertainty is relatively small.
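This whole sine wave example amounts to a kernel swap in the scikit-learn sketch above: ExpSineSquared is its periodic kernel, with periodicity playing the role of p (the sample points below are our own illustrative choices):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import ExpSineSquared

    X = np.array([[0.5], [2.0], [3.5], [5.0]])  # a few observations of a sine wave
    y = np.sin(2 * np.pi * X / 6).ravel()       # true function has period 6

    # Informative prior: periodic kernel with the period fixed at p = 6.
    kernel = ExpSineSquared(length_scale=1.0, periodicity=6.0,
                            periodicity_bounds="fixed")
    gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

    X_test = np.linspace(0, 12, 200).reshape(-1, 1)
    mean, std = gp.predict(X_test, return_std=True)  # periodic posterior + uncertainty

Leaving periodicity_bounds at its default would let the optimizer re-fit p from the data – useful when we lack the informative prior.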

Now that we’ve seen some of the core principles in action, let’s return to our example from Figures 2.10-2.12. Here’s a quick reminder of our target function, our posterior samples, and the linear interpolation we saw earlier:


Figure 2.19: Plot illustrating the difference between linear interpolation and the true function

Now that we’ve got some idea of how a GP will affect our predictive posterior, it’s easy to see that linear interpolation falls very short of what we achieve with a GP. To illustrate this more clearly, let’s take a look at what the GP prediction would be for this function given the three samples:


Figure 2.20: Plot illustrating the difference between GP predictions and the true function

Here, the dotted lines are our mean (μ) predictions from the GP, and the shaded area is the uncertainty associated with those predictions – the standard deviation (σ) around the mean. Let’s contrast what we see in Figure 2.20 with Figure 2.19. The differences may seem subtle at first, but we can clearly see that this is no longer a straightforward linear interpolation: the predicted values from the GP are being “pulled” toward our actual function values. As with our earlier sine wave examples, the behavior of the GP predictions is affected by two key factors: the prior (or kernel) and the data.

But there’s another crucial detail illustrated in Figure 2.20: the predictive uncertainties from our GP. We see that, unlike many typical ML models, a GP gives us uncertainties associated with its predictions. This means we can make better decisions about what we do with the model’s predictions – having this information will help us to ensure that our systems are more robust. For example, if the uncertainty is too great, we can fall back to a manual system. We can even keep track of data points with high predictive uncertainty so that we can continuously refine our models.

We can see how this refinement affects our predictions by adding a few more observations – just as we did in the earlier examples:


Figure 2.21: Plot illustrating the difference between GP predictions and the true function, trained on 5 observations

Figure 2.21 illustrates how our uncertainty changes over regions with different numbers of observations. We see here that between x = 3 and x = 4 our uncertainty is quite high. This makes a lot of sense, as we can also see that our GP’s mean predictions deviate significantly from the true function values. Conversely, if we look at the region between x = 0.5 and x = 2, we can see that our GP’s predictions follow the true function fairly closely, and our model is also more confident about these predictions, as we can see from the smaller interval of uncertainty in this region.

What we’re seeing here is a great example of a very important concept: well calibrated uncertainty – also termed high-quality uncertainty. This refers to the fact that, in regions where our predictions are inaccurate, our uncertainty is also high. Our uncertainty estimates are poorly calibrated if we’re very confident in regions with inaccurate predictions, or very uncertain in regions with accurate predictions.

GPs are what we can term a well principled method – this means that they have solid mathematical foundations, and thus come with strong theoretical guarantees. One of these guarantees is that they are well calibrated, and this is what makes GPs so popular: if we use GPs, we know we can rely on their uncertainty estimates.

Unfortunately, however, GPs are not without their shortcomings – we’ll learn more about these in the following section.

2.3.2 Limitations of Gaussian processes

Given the fact that GPs are well-principled and capable of producing high-quality uncertainty estimates, you’d be forgiven for thinking they’re the perfect uncertainty-aware ML model. However, GPs struggle in a few key situations:

  • High-dimensional data

  • Large amounts of data

  • Highly complex data

The first two points here are largely down to the inability of GPs to scale well. To understand this, we just need to look at the training and inference procedures for GPs. While it’s beyond the scope of this book to cover this in detail, the key point here is in the matrix operations required for GP training.

During training, it is necessary to invert an N × N matrix, where N is the number of data points in our training set. Because of this, GP training quickly becomes computationally prohibitive. This can be somewhat alleviated through the use of Cholesky decomposition, rather than direct matrix inversion. As well as being more computationally efficient, Cholesky decomposition is also more numerically stable. Unfortunately, Cholesky decomposition also has its weaknesses: computationally, its complexity is O(n³). This means that, as the size of our dataset increases, GP training becomes more and more expensive.
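For reference, here’s a minimal sketch of exact GP inference via Cholesky decomposition, following the standard formulation in Rasmussen and Williams (the helper names gp_posterior and rbf_kernel are ours; the O(n³) cost lives in the cholesky call):

    import numpy as np

    def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
        # Same RBF kernel as sketched in the previous section.
        return variance * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2
                                 / lengthscale ** 2)

    def gp_posterior(X_train, y_train, X_test, kernel, noise_var=1e-2):
        """Exact GP posterior mean and variance via Cholesky decomposition."""
        K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
        L = np.linalg.cholesky(K)                  # O(n^3): the training bottleneck
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
        K_s = kernel(X_train, X_test)              # train-test covariance
        mean = K_s.T @ alpha                       # posterior mean
        v = np.linalg.solve(L, K_s)                # O(n^2) work per test point
        var = kernel(X_test, X_test).diagonal() - np.sum(v * v, axis=0)
        return mean, var

    X = np.array([0.0, 1.0, 2.5, 4.0])
    mean, var = gp_posterior(X, np.sin(X), np.linspace(0, 5, 50), rbf_kernel)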

But it’s not only training that’s affected: because we need to compute the covariance between a new data point and all observed data points at inference, GPs have O(n²) computational complexity at inference.

As well as the computational cost, GPs aren’t light on memory: because we need to store our covariance matrix K, GPs have O(n²) memory complexity. Thus, in the case of large datasets, even if we have the compute resources necessary to train them, it may not be practical to use them in real-world applications due to their memory requirements.

The last point in our list concerns the complexity of data. As you are probably aware – and as we’ll touch on in Chapter 3, Fundamentals of Deep Learning – one of the major advantages of DNNs is their ability to process complex, high-dimensional data through layers of non-linear transformations. While GPs are powerful, they’re also relatively simple models, and they’re not able to learn the kinds of powerful feature representations that are possible with DNNs.

All of these factors mean that, while GPs are an excellent choice for relatively low-dimensional data and reasonably small datasets, they aren’t practical for many of the complex problems we face in ML. And so, we turn to BDL methods: methods that have the flexibility and scalability of deep learning, while also producing model uncertainty estimates.

 

2.4 Summary

In this chapter, we’ve covered some of the fundamental concepts and methods related to Bayesian inference. First, we reviewed Bayes’ theorem and the fundamentals of probability theory – allowing us to understand the concept of uncertainty, as well as how we apply it to the predictions of ML models. Next, we introduced sampling, and an important class of algorithms: Markov Chain Monte Carlo, or MCMC, methods. Lastly, we covered Gaussian processes, and illustrated the crucial concept of well calibrated uncertainty. These key topics will provide you with the necessary foundation for the content that will follow; however, we encourage you to explore the recommended reading materials for a more comprehensive treatment of the topics introduced in this chapter.

In the next chapter, we will see how DNNs have changed the landscape of machine learning over the last decade, exploring the tremendous advantages offered by deep learning, and the motivation behind the development of BDL methods.

 

2.5 Further reading

There are a variety of techniques being explored to improve the flexibility and scalability of GPs – such as Deep GPs or Sparse GPs. The following resources explore some of these topics, and also provide a more thorough treatment of the content covered in this chapter:

  • Bayesian Analysis with Python, Martin: this book comprehensively covers core topics in statistical modeling and probabilistic programming, and includes practical walk-throughs of various sampling methods, as well as a good overview of Gaussian processes and a variety of other techniques core to Bayesian analysis.

  • Gaussian Processes for Machine Learning, Rasmussen and Williams: this is often considered the definitive text on Gaussian processes, and provides highly detailed explanations of the theory underlying Gaussian processes. A key text for anyone serious about Bayesian inference.

About the Authors
  • Matt Benatan

    Matt Benatan is a Principal Research Scientist at Sonos and a Simon Industrial Fellow at the University of Manchester. His work involves research in robust multimodal machine learning, uncertainty estimation, Bayesian optimization, and scalable Bayesian inference.

  • Jochem Gietema

    Jochem Gietema is an Applied Scientist at Onfido in London where he has developed and deployed several patented solutions related to anomaly detection, computer vision, and interactive data visualisation.

  • Marian Schneider

    Marian Schneider is an applied scientist in machine learning. His work involves developing and deploying applications in computer vision, ranging from brain image segmentation and uncertainty estimation to smarter image capture on mobile devices.
