Sanitizing data correctly
Sanitizing data means cleaning it up before you use it so that it doesn’t contain things such as personally identifiable information (PII) or unneeded features. In addition, sanitization provides benefits to ML models that shouldn’t be ignored.
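As a quick illustration of the idea (separate from the example this section works through), the following minimal sketch removes PII and unneeded fields from a handful of records. The field names, the sample records, and the sanitize() helper are all hypothetical and chosen only for demonstration; real data requires its own review of which fields count as PII.

```python
# Minimal sketch of field-level sanitization using only the standard library.
# All field names and records below are hypothetical, not taken from the
# chapter's dataset.

PII_FIELDS = {"name", "email", "phone"}      # assumed to identify a person
UNNEEDED_FIELDS = {"internal_notes"}         # assumed irrelevant to the analysis


def sanitize(records, drop_fields):
    """Return new records that omit every field listed in drop_fields."""
    return [
        {key: value for key, value in record.items() if key not in drop_fields}
        for record in records
    ]


raw_records = [
    {"name": "A. Customer", "email": "a@example.com", "phone": "555-0100",
     "age": 34, "rating": 4, "internal_notes": "called twice"},
    {"name": "B. Customer", "email": "b@example.com", "phone": "555-0101",
     "age": 51, "rating": 2, "internal_notes": ""},
]

# Build sanitized copies; the original records are left untouched.
clean_records = sanitize(raw_records, PII_FIELDS | UNNEEDED_FIELDS)
print(clean_records)
```

Returning new records instead of modifying the originals in place keeps the raw data available for auditing while only the sanitized copy moves on to analysis.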
The example in this section relies on a database typical of the information obtained from a corporate customer database combined with an opinion poll. The Importing and combining the datasets section of Chapter 9, Defending against Hackers, shows a similar process in which you combine mobility data with COVID statistics. Combining data like this is common today, when businesses ask people’s opinions about everything, but in its combined form the database is completely inappropriate for analyzing how customers feel about product characteristics. Here are the goals for this analysis (which are likely simplified from what you will encounter in the real world, but work fine here):
- Improve sales...