EXERCISES
CLARIFYING THE CONCEPTS
1. Which four tasks should be undertaken during the Setup Phase?
2. State two reasons why the Data Science Methodology does not follow the usual statistical inference paradigm.
3. Describe what data dredging is and why data scientists need to avoid it.
4. How do data scientists avoid data dredging?
5. Describe the differences between the training data set and the test data set.
6. When validating the partition, does the data scientist need to check every field?
7. When validating a partition, which statistical test is used for a numerical variable?
8. What is balancing? Why is it used?
9. Describe what we mean by resampling.
10. When should the test data set be balanced?
11. Why is it important to establish baseline model performance?
12. Describe the two baseline models for binary classification.
13. True or false: there is no baseline model for k‐nary classification.
14. What is the optimal benchmark for calibrating your model performance?
WORKING WITH THE DATA
For Exercises 15–...