12.1 THE NEED FOR DIMENSION REDUCTION
In data science, high dimensionality means that the data set contains a large number of predictors. For example, 100 predictors describe a 100‐dimensional space. So why do we need dimension reduction in data science?
- Multicollinearity. Typically, large databases have many predictors. It is unlikely that all of these predictors are uncorrelated. Multicollinearity, which occurs when there is substantial correlation among the predictors, can lead to unstable regression models.
- Double‐Counting. Including highly correlated predictors tends to overemphasize a particular aspect of the model, essentially double‐counting that aspect. For example, suppose we are trying to estimate the age of youngsters using math knowledge, height, and weight. Since height and weight are correlated, the model is essentially double‐counting the physical component of the youngster, as compared to the intellectual component...
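To make the height and weight example concrete, the following is a minimal sketch of how one might check predictors for correlation. It rests on several assumptions that are not part of the text above: the data are synthetic (generated with NumPy purely for illustration), the variable names `math_score`, `height`, and `weight` are hypothetical, and the variance inflation factor (VIF) is used as one common diagnostic for multicollinearity rather than a method prescribed here. The helper `vif` is written by hand for this sketch and is not a library function.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (rows = records).

    Each predictor is regressed on all the others (with an intercept);
    VIF_j = 1 / (1 - R_j^2). Hand-rolled helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic data (an assumption, not real measurements):
# weight is constructed to depend strongly on height,
# while math_score is generated independently of both.
rng = np.random.default_rng(0)
n = 500
height = rng.normal(140.0, 10.0, n)
weight = 0.9 * (height - 140.0) + 35.0 + rng.normal(0.0, 3.0, n)
math_score = rng.normal(50.0, 10.0, n)

X = np.column_stack([math_score, height, weight])

corr = np.corrcoef(X, rowvar=False)  # 3 x 3 correlation matrix of the predictors
vifs = vif(X)                        # one VIF per predictor, same column order

print("Correlation matrix (math_score, height, weight):")
print(np.round(corr, 2))
print("VIF (math_score, height, weight):", np.round(vifs, 2))
```

Under these assumptions, height and weight should show a correlation close to 1 and noticeably inflated VIF values, while the math score's VIF stays near 1. This is the pattern that signals the physical component is being represented twice among the predictors.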