5.2 PARTITIONING THE DATA
The Data Science Methodology does not use the statistical inference paradigm, in which results from a sample are generalized to a population. There are two reasons for this.
- Applying statistical inference to the huge sample sizes encountered in data science tends to result in statistical significance, even when the results are not of practical significance.
- In the statistical paradigm, the statistician has an a priori hypothesis in mind, whereas the Data Science Methodology requires no such a priori hypothesis, instead freely searching through the data for actionable results.
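The first reason can be illustrated with a little arithmetic. The sketch below is not from the text: it assumes a two-sample z-test with equal group sizes and a known common standard deviation, and the function name two_sample_z_p_value and the effect size of 0.01 standard deviations are hypothetical choices made only for illustration. It holds a practically negligible difference in means fixed and shows how the two-sided p-value changes as the sample size grows.

```python
import math


def two_sample_z_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value for a two-sample z-test with equal group sizes.

    effect_size is the difference in group means, measured in units of the
    (assumed known, common) standard deviation.
    """
    # Standard error of the difference in means is sqrt(2 / n) in these units,
    # so the test statistic is effect_size / sqrt(2 / n).
    z = effect_size * math.sqrt(n_per_group / 2)
    # Two-sided tail area of the standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical effect: the group means differ by 1% of a standard deviation,
# which would rarely matter in practice.
tiny_effect = 0.01

for n in (1_000, 100_000, 1_000_000):
    p = two_sample_z_p_value(tiny_effect, n)
    print(f"n per group = {n:>9,}   p-value = {p:.3g}")
```

Under these assumptions, the same tiny effect is nowhere near significant at 1,000 records per group (p-value well above 0.05) but is overwhelmingly significant at 1,000,000 records per group (p-value far below 0.001), even though its practical importance has not changed.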
Because there are no a priori hypotheses, data scientists need to beware of data dredging, whereby spurious results are uncovered that are due merely to random variation rather than to real effects. Data science guards against data dredging through cross‐validation, a technique for ensuring that results generalize to an independent, unseen data set. The most common methods...