5.3 VALIDATING YOUR PARTITION
Because the legitimacy of the entire Data Science Methodology depends on the validity of the partition, it is important to check that the training data set and the test data set do not differ systematically from each other. We can do this by testing, on a variable‐by‐variable basis, whether the training and test sets differ significantly. Because there may be many variables in the data set, we restrict ourselves to spot‐checking a small set of randomly chosen variables. The appropriate statistical test depends on the type of variable involved.
- For a numerical variable, use the two‐sample t‐test for the difference in means.
- For a categorical variable with two classes, use the two‐sample Z‐test for the difference in proportions.
- For a categorical variable with more than two classes, use the test for the homogeneity of proportions.
For details on how to perform these tests, please see our earlier text.¹
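The three tests listed above might be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation, and it rests on several assumptions not stated in the text: NumPy and SciPy are available; the t‐test uses the unequal‐variance (Welch) form; the Z‐test uses the pooled standard error; and the test for homogeneity of proportions is carried out as a chi‐square test on a table of class counts. The function names, the variable names, and the data are all hypothetical and synthetic.

```python
import numpy as np
from scipy import stats


def compare_numeric(train_values, test_values):
    """Two-sample t-test for the difference in means; returns the p-value.

    Assumption: equal_var=False selects Welch's unequal-variance form.
    """
    return stats.ttest_ind(train_values, test_values, equal_var=False).pvalue


def compare_binary(train_hits, train_n, test_hits, test_n):
    """Two-sample Z-test for the difference in proportions; returns the p-value.

    Assumption: the standard error is computed from the pooled proportion.
    """
    p_train = train_hits / train_n
    p_test = test_hits / test_n
    p_pooled = (train_hits + test_hits) / (train_n + test_n)
    std_err = np.sqrt(p_pooled * (1 - p_pooled) * (1 / train_n + 1 / test_n))
    z = (p_train - p_test) / std_err
    # Two-sided p-value from the standard normal distribution.
    return 2 * stats.norm.sf(abs(z))


def compare_multiclass(train_counts, test_counts):
    """Chi-square test for homogeneity of proportions; returns the p-value.

    Each argument lists the count of records in every class, in the same order.
    """
    table = np.array([train_counts, test_counts])
    return stats.chi2_contingency(table)[1]


if __name__ == "__main__":
    # Synthetic data only: a 750/250 partition with made-up variables.
    rng = np.random.default_rng(0)
    train_age = rng.normal(40, 10, size=750)
    test_age = rng.normal(40, 10, size=250)

    p_values = {
        "age (numeric)": compare_numeric(train_age, test_age),
        "response (two classes)": compare_binary(150, 750, 52, 250),
        "marital (three classes)": compare_multiclass([300, 350, 100], [95, 120, 35]),
    }
    for name, p in p_values.items():
        print(f"{name}: p-value = {p:.3f}")
```

A small p‐value for a variable would suggest that the training and test sets differ systematically on that variable, in which case the partition may need to be redone. Note that when many variables are spot‐checked, a few small p‐values can be expected by chance alone.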