Enhancing Deep Learning with Bayesian Inference

By: Matt Benatan, Jochem Gietema, Marian Schneider

Overview of this book

Deep learning has an increasingly significant impact on our lives, from suggesting content to playing a key role in mission- and safety-critical applications. As the influence of these algorithms grows, so does the concern for the safety and robustness of the systems that rely on them. Simply put, typical deep learning methods do not know when they don’t know. The field of Bayesian Deep Learning contains a range of methods for approximate Bayesian inference with deep networks. These methods help to improve the robustness of deep learning systems as they tell us how confident they are in their predictions, allowing us to take more care in how we incorporate model predictions within our applications. Through this book, you will be introduced to the rapidly growing field of uncertainty-aware deep learning, developing an understanding of the importance of uncertainty estimation in robust machine learning systems. You will learn about a variety of popular Bayesian Deep Learning methods, and how to implement these through practical Python examples covering a range of application scenarios. By the end of the book, you will have a good understanding of Bayesian Deep Learning and its advantages, and you will be able to develop Bayesian Deep Learning models for safer, more robust deep learning systems.

2.2 Bayesian inference via sampling

In practical applications, it’s not possible to know exactly what a given outcome would be, and, similarly, it’s not possible to observe all possible outcomes. In these cases, we need to make a best estimate based on the evidence we have. The evidence is formed of samples – observations of possible outcomes. The aim of ML, broadly speaking, is to learn models that generalize well from a subset of data. The aim of Bayesian ML is to do so while also providing an estimate of the uncertainty associated with the model’s predictions. In this section, we’ll learn about how we can use sampling to do this, and will also learn why sampling may not be the most sensible approach.

2.2.1 Approximating distributions

At the most fundamental level, sampling is about approximating distributions. Say we want to know the distribution of the height of people in New York. We could go out and measure everyone, but that would involve measuring the height of 8.4 million people! While this would give us our most accurate answer, it’s also a deeply impractical approach.

Instead, we can sample from the population. This gives us a basic example of Monte Carlo sampling, where we use random sampling to provide data from which we can approximate a distribution. For example, given a database of New York residents, we could select – at random – a sub-population of residents, and use this to approximate the height distribution of all residents. With random sampling – and any sampling, for that matter – the accuracy of the approximation is dependent on the size of the sub-population. What we’re looking to achieve is a statistically significant sub-sample, such that we can be confident in our approximation.

To get a better impression of this, we’ll simulate the problem by generating 100,000 data points from a truncated normal distribution, to approximate the kind of height distribution we may see for a population of 100,000 people. Say we draw 10 samples, at random, from our population. Here’s what our distribution would look like (on the right) compared with the true distribution (on the left):


Figure 2.3: Plot of true distribution (left) versus sample distribution (right)

As we can see, this isn’t a great representation of the true distribution: what we see here is closer to a triangular distribution than a truncated normal. If we were to infer something about the population’s height based on this distribution alone, we’d arrive at a number of inaccurate conclusions, such as missing the truncation above 200 cm, and the tail on the left of the distribution.
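The text doesn’t give the exact parameters of the simulation, so here’s a minimal sketch of this kind of experiment, assuming a mean of 170 cm, a standard deviation of 10 cm, and truncation at 205 cm (the upper limit used later in this section):

    import numpy as np
    from scipy.stats import truncnorm

    rng = np.random.default_rng(seed=0)

    # Truncated normal: scipy parameterizes the bounds in standard
    # deviations from the mean, so an upper limit of 205 cm with
    # mean 170 and sd 10 becomes b = (205 - 170) / 10 = 3.5.
    mean, sd = 170.0, 10.0
    population = truncnorm.rvs(a=-np.inf, b=3.5, loc=mean, scale=sd,
                               size=100_000, random_state=rng)

    # Simple (Monte Carlo) random sampling: draw sub-populations of
    # increasing size and compare them with the full population.
    for n in (10, 100, 1_000):
        sample = rng.choice(population, size=n, replace=False)
        print(f"n={n:5d}  mean={sample.mean():6.2f}  sd={sample.std():5.2f}")
    print(f"true   mean={population.mean():6.2f}  sd={population.std():5.2f}")

Plotting histograms of each sub-population against the full population reproduces the kind of comparison shown in the figures.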

We can get a better impression by increasing our sample size – let’s try drawing 100 samples:


Figure 2.4: Plot of true distribution (left) versus sample distribution (right)

Things are starting to look better: we’re starting to see some of the tail on the left, as well as the truncation toward 200 cm. However, this sample over-represents some regions and under-represents others, leading to misrepresentation: our mean has been pulled down, and we’re seeing two distinct peaks, rather than the single peak we see in the true distribution. Let’s increase our sample size by a further order of magnitude, scaling up to 1,000 samples:


Figure 2.5: Plot of true distribution (left) versus sample distribution (right)

This is looking much better – with a sample set of only one hundredth the size of our true population, we now see a distribution that closely matches our true distribution. This example demonstrates how, through random sampling, we can approximate the true distribution using a significantly smaller pool of observations. But that pool still has to have enough information to allow us to arrive at a good approximation of the true distribution: too few samples and our subset will be statistically insufficient, leading to poor approximation of the underlying distribution.

In practice, however, we usually can’t sample directly from the true distribution, so simple random sampling alone isn’t a practical method for approximating distributions. Instead, we turn to probabilistic inference. Given a model, probabilistic inference provides a way to find the model parameters that best describe our data. To do so, we need to first define the type of model – this is our prior. For our example, we’ll use a truncated Gaussian: the idea being that, intuitively, it’s reasonable to assume people’s height follows a normal distribution, but that very few people are taller than, say, 205 cm (roughly 6′9″). So, we’ll specify a truncated Gaussian distribution with an upper limit of 205 cm. As it’s a Gaussian distribution, in other words, 𝒩(μ,σ), our model parameters are 𝜃 = {μ,σ} – with the additional constraint that our distribution has an upper limit of b = 205.

This brings us to a fundamental class of algorithms: Markov Chain Monte Carlo, or MCMC methods. Like simple random sampling, these allow us to build a picture of the true underlying distribution, but they do so sequentially, whereby each sample is dependent on the sample before it. This sequential dependence is known as the Markov property, thus the Markov chain component of the name. This sequential approach accounts for the probabilistic dependence between samples and allows us to better approximate the probability density.

MCMC achieves this through sequential random sampling. Just as with the random sampling we’re familiar with, MCMC randomly samples from our distribution. But, unlike simple random sampling, MCMC considers pairs of samples: the previous sample xₜ₋₁ and a new candidate sample xₜ. For each pair of samples, we have a criterion that specifies whether or not we keep the candidate (this varies depending on the particular flavor of MCMC). If the candidate meets this criterion – say, if xₜ is ”preferential to” our previous value xₜ₋₁ – then it is added to the chain and becomes the starting point for the next round. If it doesn’t meet the criterion, we stick with our previous value for the next round. We repeat this over a (usually large) number of iterations, and in the end we should arrive at a good approximation of our distribution.
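To make these mechanics concrete, here’s a minimal sketch of such a chain in Python, using the Metropolis acceptance rule we’ll meet in the next section. The target density (our assumed height distribution) and the proposal width are illustrative assumptions:

    import numpy as np

    def target_density(x):
        # Unnormalized density of our target: a normal with mean 170
        # and sd 10, truncated above 205 cm (illustrative parameters).
        if x > 205:
            return 0.0
        return np.exp(-0.5 * ((x - 170.0) / 10.0) ** 2)

    rng = np.random.default_rng(seed=0)
    x = 170.0                 # arbitrary starting point
    chain = [x]
    for t in range(10_000):
        candidate = x + rng.normal(0.0, 5.0)      # random-walk proposal
        alpha = target_density(candidate) / target_density(x)
        if rng.uniform() <= alpha:                # acceptance criterion
            x = candidate                         # accept the candidate
        chain.append(x)                           # rejected: repeat current x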

The result is an efficient sampling method that is able to closely approximate the true parameters of our distribution. Let’s see how this applies to our height distribution example. Using MCMC with just 10 samples, we arrive at the following approximation:


Figure 2.6: Plot of true distribution (left) versus approximate distribution via MCMC (right)

Not bad for ten samples – certainly far better than the triangular distribution we arrived at with simple random sampling. Let’s see how we do with 100:


Figure 2.7: Plot of true distribution (left) versus approximate distribution via MCMC (right)

This is looking very good – in fact, we’re able to obtain a better approximation of our distribution with 100 MCMC samples than we are with 1,000 simple random samples. If we continue to larger numbers of samples, we’ll arrive at closer and closer approximations of our true distribution. But our simple example doesn’t fully capture the power of MCMC: its true advantage comes from being able to approximate high-dimensional distributions, which has made it an invaluable technique for approximating intractable high-dimensional integrals in a variety of domains.

In this book, we’re interested in how we can estimate the probability distribution of the parameters of machine learning models – this allows us to estimate the uncertainty associated with our predictions. In the next section, we’ll take a look at how we do this practically by applying sampling to Bayesian linear regression.

2.2.2 Implementing probabilistic inference with Bayesian linear regression

In typical linear regression, we want to predict some output ŷ from some input x using a linear function f(x), such that ŷ = βx + ξ. With Bayesian linear regression, we do this probabilistically, introducing another parameter, σ², such that our regression equation becomes:

ŷ ∼ 𝒩(βx + ξ, σ²)

That is, ŷ follows a Gaussian distribution.

Here, we see our familiar slope β and bias term ξ, and introduce a variance parameter σ². To fit our model, we need to define a prior over these parameters – just as we did for our MCMC example in the last section. We’ll define these priors as:

ξ ∼ 𝒩(0,1)
β ∼ 𝒩(0,1)
σ² ∼ |𝒩(0,1)|

Note that the last of these priors is a half-normal distribution (the positive half of a zero-mean Gaussian, as variance cannot be negative). We’ll refer to our model parameters as 𝜃 = {β, ξ, σ²}, and we’ll use sampling to find the parameters that maximize the conditional probability of our parameters given our data D: P(𝜃|D).

There are a variety of MCMC sampling approaches we could use to find our model parameters. A common approach is to use the Metropolis-Hastings algorithm. Metropolis-Hastings is particularly useful for sampling from intractable distributions: it generates candidate parameters using a proposal distribution, Q(𝜃′|𝜃), and it only requires a function that is proportional to, but not exactly equal to, our true distribution. This means that, for example, if some value 𝜃₁ is twice as likely as some other value 𝜃₂ in our true distribution, this will be true under our unnormalized function too. Because we’re only interested in the relative probability of observations, we don’t need to know what the exact value would be in our true distribution – we just need to know that, proportionally, our function is equivalent to our true distribution.
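To see why proportionality suffices, note that by Bayes’ rule, P(𝜃|D) = P(D|𝜃)P(𝜃)∕P(D), where the evidence term P(D) is generally intractable. But the acceptance rule below only ever uses a ratio of two posterior values, so P(D) cancels:

α = P(𝜃′|D)∕P(𝜃|D) = (P(D|𝜃′)P(𝜃′))∕(P(D|𝜃)P(𝜃))

This means we only need to evaluate likelihoods and priors – both of which we can compute.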

Here are the key steps of Metropolis-Hastings for our Bayesian linear regression.

First, we initialize with an arbitrary point 𝜃 sampled from our parameter space, according to the priors for each of our parameters. Using a Gaussian distribution centered on our current parameters 𝜃, we select a candidate point 𝜃′. Then, for each iteration t = 1,…,T, we do the following:

  1. Calculate the acceptance criterion, defined as:

     α = P(𝜃′|D) / P(𝜃|D)
  2. Generate a random number from a uniform distribution, 𝜖 ∈ [0,1]. If 𝜖 ≤ α, accept the new candidate parameters – adding them to the chain and assigning 𝜃 = 𝜃′. If 𝜖 > α, keep the current 𝜃 and propose a new candidate.

This acceptance criterion means that, if our new set of parameters has a higher posterior probability than our last set of parameters, we’ll see α > 1, in which case 𝜖 ≤ α is guaranteed. This means that, when we sample parameters that are more likely given our data, we’ll always accept them. If, on the other hand, α < 1, there’s a chance we’ll reject the parameters, but we may also accept them – allowing us to explore regions of lower likelihood.

These mechanics of Metropolis-Hastings result in samples that can be used to compute high-quality approximations of our posterior distribution. Practically, Metropolis-Hastings (and MCMC methods more generally) requires a burn-in phase – an initial phase of sampling, later discarded, that allows the chain to escape the regions of low density typically encountered given the arbitrary initialization.
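Putting these steps together, here’s a compact sketch of Metropolis-Hastings for our Bayesian linear regression. The priors follow the text, and the data comes from the generating function used in the example below; the proposal width, iteration counts, and seed are illustrative assumptions:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(seed=0)

    # Data from the example below: y = 2x + 5 + eta, eta ~ N(0, 5)
    # (we read the 5 as the noise standard deviation)
    x_data = np.linspace(0, 10, 50)
    y_data = 2 * x_data + 5 + rng.normal(0, 5, size=x_data.shape)

    def log_posterior(theta):
        beta, xi, sigma2 = theta
        if sigma2 <= 0:
            return -np.inf                    # variance must be positive
        # Log-priors: beta, xi ~ N(0, 1); sigma2 ~ half-normal(0, 1)
        log_prior = (stats.norm.logpdf(beta) + stats.norm.logpdf(xi)
                     + stats.halfnorm.logpdf(sigma2))
        # Log-likelihood: y ~ N(beta * x + xi, sigma2)
        log_lik = stats.norm.logpdf(y_data, loc=beta * x_data + xi,
                                    scale=np.sqrt(sigma2)).sum()
        return log_prior + log_lik

    # Initialize from the priors
    theta = np.array([rng.normal(), rng.normal(), abs(rng.normal())])
    chain = []
    for t in range(20_000):
        # Gaussian proposal centered on the current parameters
        candidate = theta + rng.normal(0, 0.1, size=3)
        # Acceptance ratio, computed in log space for numerical stability
        log_alpha = log_posterior(candidate) - log_posterior(theta)
        if np.log(rng.uniform()) <= log_alpha:
            theta = candidate                 # accept the candidate
        chain.append(theta)

    chain = np.array(chain)[5_000:]           # discard burn-in samples
    print("posterior means (beta, xi, sigma2):", chain.mean(axis=0))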

Let’s apply this to a simple problem: we’ll generate some data for the function y = 2x + 5 + η, where η is a noise parameter distributed according to η ∼ 𝒩(0,5). Using Metropolis-Hastings to fit our Bayesian linear regressor, we get the following fit using the points sampled from our function (represented by the crosses):


Figure 2.8: Bayesian linear regression on generated data with low variance

We see that our model fits the data in the same way we would expect for standard linear regression. However, unlike standard linear regression, our model produces predictive uncertainty: this is represented by the shaded region. This predictive uncertainty gives an impression of how much our underlying data varies, which makes this model much more useful than standard linear regression: now we can get a sense of the spread of our data, as well as the general trend. We can see how this varies if we generate new data and fit again, this time increasing the spread of the data by modifying our noise distribution to η ∼ 𝒩(0,20):


Figure 2.9: Bayesian linear regression on generated data with high variance

We see that our predictive uncertainty has increased proportionally to the spread of the data. This is an important property in uncertainty-aware methods: when we have small uncertainty, we know our prediction fits the data well, whereas when we have large uncertainty, we know to treat our prediction with caution, as it indicates the model isn’t fitting this region particularly well. We’ll see a better example of this in the next section, which will go on to demonstrate how regions of more or less data contribute to our model uncertainty estimates.
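Continuing from the Metropolis-Hastings sketch above, here’s one way we might construct a shaded predictive region like those in Figures 2.8 and 2.9; the prediction grid and the decomposition into parameter and noise variance are our assumptions:

    import matplotlib.pyplot as plt

    x_grid = np.linspace(0, 10, 100)
    # Predicted mean for every posterior sample: shape (n_samples, 100)
    mus = chain[:, 0:1] * x_grid + chain[:, 1:2]
    pred_mean = mus.mean(axis=0)
    # Total predictive variance: variance of the predicted means
    # (parameter uncertainty) plus the mean posterior noise variance
    pred_sd = np.sqrt(mus.var(axis=0) + chain[:, 2].mean())

    plt.fill_between(x_grid, pred_mean - 2 * pred_sd,
                     pred_mean + 2 * pred_sd, alpha=0.3)  # shaded region
    plt.plot(x_grid, pred_mean)                           # regression line
    plt.scatter(x_data, y_data, marker="x")               # observed points
    plt.show()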

Here, we see that our predictions fit our data pretty well. In addition, we see that σ² varies according to the availability of data in different regions. What we’re seeing here is a great example of a very important concept, well-calibrated uncertainty – also termed high-quality uncertainty. This refers to the fact that, in regions where our predictions are inaccurate, our uncertainty is also high. Our uncertainty estimates are poorly calibrated if we’re very confident in regions with inaccurate predictions, or very uncertain in regions with accurate predictions. Because it’s well calibrated, sampling is often used as a benchmark for uncertainty quantification.

Unfortunately, while sampling is effective for many applications, the need to obtain many samples for each parameter means that it quickly becomes computationally prohibitive as the number of parameters grows. For example, if we wanted to start sampling parameters for complex, non-linear relationships (such as sampling the weights of a neural network), sampling would no longer be practical. Despite this, it’s still useful in some cases, and later we’ll see how various BDL methods make use of sampling.

In the next section, we’ll explore the Gaussian process – another fundamental method for Bayesian inference, and a method that does not suffer from the same computational overheads as sampling.