6.1 INTRODUCTION TO DECISION TREES
Thus far, we have become acquainted with the first four phases of the Data Science Methodology:
- Data Understanding Phase
- Data Preparation Phase
- Exploratory Data Analysis Phase
- Setup Phase.
We are finally ready to begin modeling our data in the Modeling Phase. Data science offers a wide variety of methods and algorithms for modeling large data sets. We begin with one of the simplest methods: decision trees. In this chapter we work with the adult_ch6_training and adult_ch6_test data sets, which are adapted from the Adult data set from the UCI repository.¹ For simplicity, only two predictors and the target are retained, as follows:
- Marital status, a categorical predictor with classes married, divorced, never‐married, separated, and widowed.
- Cap_gains_losses, a numerical predictor, equal to capital gains + |capital losses|.
- Income, a categorical target variable with two classes, >50k and ≤50k, representing...
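The definition of Cap_gains_losses above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names capital_gain and capital_loss and the three sample records are hypothetical, chosen for the example rather than taken from the adult_ch6 data sets.

```python
def cap_gains_losses(capital_gain, capital_loss):
    """Return capital gains + |capital losses| for one record."""
    # abs() makes the result the same whether losses are stored
    # as positive or negative numbers.
    return capital_gain + abs(capital_loss)


# Hypothetical records for illustration only (not real data).
records = [
    {"capital_gain": 2174, "capital_loss": 0},
    {"capital_gain": 0, "capital_loss": 1902},
    {"capital_gain": 0, "capital_loss": -1902},
]

# Add the combined predictor to each record.
for record in records:
    record["Cap_gains_losses"] = cap_gains_losses(
        record["capital_gain"], record["capital_loss"]
    )

for record in records:
    print(record)
```

Because of the absolute value, a loss recorded as 1902 and a loss recorded as -1902 both contribute 1902 to the combined predictor.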