Scrubbing data, also called data cleansing, is the process of correcting or removing data in a dataset that is incorrect, inaccurate, incomplete, improperly formatted, or duplicated.
The outcome of a data analysis depends not only on the algorithms but also on the quality of the data. That is why the next step after obtaining the data is data scrubbing. To avoid dirty data, the dataset should have the following characteristics:
Correctness
Completeness
Accuracy
Consistency
Uniformity
Dirty data can be detected with simple statistical validation, by parsing text fields, and by checking for duplicate values, which can then be removed. Missing or sparse data can lead to highly misleading results.
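As a rough illustration of these checks, the sketch below flags rows with missing fields, exact duplicates, and numeric outliers using only the Python standard library. The function name find_dirty_rows, the max_deviations threshold, and the sample rows are all hypothetical and chosen for this example; a real project would adapt the rules to its own data.

```python
from statistics import median


def find_dirty_rows(rows, numeric_field, max_deviations=3.5):
    """Flag row indexes that look missing, duplicated, or out of range.

    Hypothetical helper for illustration; the name and the default
    threshold are assumptions, not a standard API.
    """
    report = {"missing": [], "duplicate": [], "outlier": []}
    seen = set()
    numeric = []  # (row index, value) pairs for the numeric field

    for i, row in enumerate(rows):
        # Missing: any field that is None or a blank string.
        if any(v is None or (isinstance(v, str) and not v.strip())
               for v in row.values()):
            report["missing"].append(i)

        # Duplicate: the same field/value pairs as an earlier row.
        key = tuple(sorted((k, str(v)) for k, v in row.items()))
        if key in seen:
            report["duplicate"].append(i)
        seen.add(key)

        value = row.get(numeric_field)
        if isinstance(value, (int, float)):
            numeric.append((i, value))

    if numeric:
        # Outlier: distance from the median, measured in units of the
        # median absolute deviation (MAD), exceeds max_deviations.
        med = median(v for _, v in numeric)
        mad = median(abs(v - med) for _, v in numeric)
        if mad > 0:
            report["outlier"] = [i for i, v in numeric
                                 if abs(v - med) / mad > max_deviations]
    return report


# Made-up sample data for the example.
rows = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 29},
    {"name": "Ann", "age": 34},    # exact duplicate of the first row
    {"name": "Cid", "age": None},  # missing age
    {"name": "Dee", "age": 31},
    {"name": "Eve", "age": 420},   # implausible age
]

report = find_dirty_rows(rows, "age")
print(report)  # {'missing': [3], 'duplicate': [2], 'outlier': [5]}
```

The outlier check uses the median and the median absolute deviation rather than the mean and standard deviation, because a single extreme value inflates the standard deviation and can hide itself. This is one possible validation rule among many; the function only reports suspect rows and leaves the decision to correct or remove them to the analyst.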