EVALUATION
If you’re familiar with the mechanics of machine learning, you might have noticed the absence of training and test data in the models used in the exercises so far. The reason for this will be explained later in this chapter, but, first, let’s review the rationale of split validation.
The partition of a dataset into training data and test data, known as split validation, is a fundamental part of machine learning. The training data is used to detect general patterns and build a prediction model, while the test data is used to road-test that model and assess its results. Thus, if we reserve 30% of the data for testing and build the model from patterns discovered in the remaining 70%, will the model’s predictions remain accurate on data it has never seen?
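As a minimal sketch of the 70/30 partition described above (the dataset here is synthetic and the variable names are illustrative, not taken from the exercises), the split can be performed with nothing more than a shuffle and a slice:

```python
import random

random.seed(0)
data = list(range(100))       # stand-in for 100 dataset rows
random.shuffle(data)          # shuffle so the split is random, not ordered

split = int(len(data) * 0.7)  # index marking the 70% cutoff
train = data[:split]          # 70 rows for discovering patterns
test = data[split:]           # 30 rows reserved for road-testing

print(len(train), len(test))  # → 70 30
```

In practice, a library helper such as scikit-learn's `train_test_split` performs the same partition (with stratification options) in a single call; the shuffle step matters because rows are often ordered by class or date, and an unshuffled split would give the model an unrepresentative sample.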
Two possible reasons why the model may falter when making predictions on the test data are overfitting and underfitting. Overfitting occurs when the model adjusts itself to fit patterns...