A common issue in machine learning is the problem of overfitting. Generally, overfitting refers to the phenomenon where a model performs better on the data used to train it than on data it has not seen (holdout data, future real use, and so on). Overfitting occurs when a model fits what is essentially noise in the training data. The model appears to become more accurate as it accounts for the noise, but because the noise changes from one dataset to the next, this accuracy applies only to the training data; it does not generalize.
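A minimal sketch of this effect, using synthetic data and NumPy (the true relationship, noise level, and sample sizes here are arbitrary choices for illustration): a 15-parameter polynomial fit to 20 noisy points tracks the training data more closely than a straight line does, but that apparent accuracy does not carry over to fresh data drawn from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # True relationship is simply y = x, plus random noise.
    x = rng.uniform(-1, 1, n)
    y = x + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(20)     # small training set
x_test, y_test = make_data(1000)     # fresh data from the same process

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

for degree in (1, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree:2d}: "
          f"train MSE = {mse(coeffs, x_train, y_train):.4f}, "
          f"test MSE = {mse(coeffs, x_test, y_test):.4f}")
```

The high-degree fit always achieves a lower training error, since extra parameters can only improve a least-squares fit to the training points; the gap between its training and test error is the overfitting.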
Overfitting can occur at any time but tends to become more severe as the ratio of parameters to information increases. Usually, this can be thought of as the ratio of parameters to observations, but not always (for example, if the outcome is a rare event that occurs in 1 in 5 million people, a sample of 15 million may still contain only 3 people...