Selecting meaningful features
If we notice that a model performs much better on a training dataset than on the test dataset, this observation is a strong indicator of overfitting. As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn, overfitting means the model fits the parameters too closely with regard to the particular observations in the training dataset, but does not generalize well to new data, and we say the model has a high variance. The reason for the overfitting is that our model is too complex for the given training data. Common solutions to reduce the generalization error are listed as follows:
Collect more training data
Introduce a penalty for complexity via regularization
Choose a simpler model with fewer parameters
Reduce the dimensionality of the data
Collecting more training data is often not applicable. In Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, we will learn about a useful technique to check whether...