Chapter 3. Feature Improvement - Cleaning Datasets
In the last two chapters, we moved from a basic understanding of feature engineering, and how it can enhance our machine learning pipelines, to getting our hands dirty with datasets and learning to evaluate the different types of data that we can encounter in the wild.
In this chapter, we will use what we have learned and take things a step further by beginning to change the datasets that we work with. Specifically, we will start to clean and augment our datasets. By cleaning, we will generally mean altering the columns and rows already given to us. By augmenting, we will generally mean removing columns from and adding columns to datasets. As always, our goal in all of these processes is to enhance our machine learning pipelines.
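As a rough illustration of the difference between cleaning and augmenting, here is a minimal sketch using pandas. The toy DataFrame, its column names (`age`, `income`), and the derived `income_per_year_of_age` feature are all invented for this example, and filling gaps with the column mean is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
})

# Identify missing values: count the NaN entries in each column.
missing_counts = df.isnull().sum()

# Cleaning (altering existing values): fill each gap with its column mean.
cleaned = df.fillna(df.mean())

# Cleaning (altering existing rows): alternatively, drop any row with a missing value.
dropped = df.dropna()

# Augmenting (adding a column): derive a new feature from existing columns.
augmented = cleaned.assign(
    income_per_year_of_age=cleaned["income"] / cleaned["age"]
)

# Augmenting (removing a column): drop a column we no longer want to keep.
reduced = augmented.drop(columns=["income"])

print(missing_counts)
print(cleaned)
print(reduced.columns.tolist())
```

Every step here returns a new DataFrame rather than modifying `df` in place, so the original data stays available for comparison while we experiment.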
In the following chapters, we will be:
- Identifying missing values in data
- Removing harmful data
- Imputing (filling in) these missing values...