So far, we have learned that simply minimizing the loss function (or equivalently, in the case of normally distributed errors, maximizing the log-likelihood function) is not enough to develop a machine learning model for a given problem. One also has to guard against the model overfitting the training data, which results in larger prediction errors on new datasets. The main advantage of Bayesian methods is that one can, in principle, avoid this problem without explicit regularization and without separate datasets for training and validation. The approach that makes this possible is called Bayesian model averaging, and it is discussed here. It is one of the answers to the main question of this chapter, *why Bayesian inference for machine learning?*
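
The equivalence mentioned in the parenthetical above can be made explicit with a short derivation. This is only a sketch under assumed notation that may differ from the rest of the chapter: $N$ observations $(x_i, y_i)$, a weight vector $w$, a squared-error loss, and independent Gaussian noise with variance $\sigma^2$, so that $y_i = w^{T} x_i + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

$$
\ln p(y \mid X, w, \sigma^2) = -\frac{N}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - w^{T} x_i\right)^2
$$

$$
\arg\max_{w}\, \ln p(y \mid X, w, \sigma^2) = \arg\min_{w}\, \sum_{i=1}^{N}\left(y_i - w^{T} x_i\right)^2
$$

The first term does not depend on $w$, and the second is the sum of squared errors scaled by a negative constant, so under these assumptions the maximum-likelihood weights are the least-squares weights. Neither form contains anything that penalizes an overly flexible model, which is why both are prone to overfitting.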

To see how this works, let's do a full Bayesian treatment of the linear regression problem. Since we only want to explain how Bayesian inference avoids overfitting, we will skip the mathematical derivations and state only the important results here. For more details...