Until this point in the book, we have striven to make the regression model fit data, even by modifying the data itself (inputting missing data, removing outliers, transforming for non-linearity, or creating new features). By keeping an eye on measures such as R-squared, we have tried our best to reduce prediction errors, though we have no idea to what extent this was successful.
The problem we face now is that we shouldn't expect a well fit model to automatically perform well on any new data during production.
While defining and explaining the problem, we recall what we said about underfitting. Since we are working with a linear model, we are actually expecting to apply our work to data that has a linear relationship with the response variable. Having a linear relationship means that, with respect to the level of the response variable, our predictors always tend to constantly increase (or decrease) at the same rate. Graphically, on a scatterplot, this is refigured...