Real-world data is frequently dirty and unstructured, and must be reworked before it is usable. Data may contain errors, have duplicate entries, exist in the wrong format, or be inconsistent. The process of addressing these types of issues is called data cleaning. Data cleaning is also referred to as data wrangling, massaging, reshaping, or munging. Data merging, where data from multiple sources is combined, is often considered to be a data cleaning activity.
We need to clean data because any analysis based on inaccurate data can produce misleading results. We want to ensure that the data we work with is of high quality. Data quality involves:
Validity: Ensuring that the data possesses the correct form or structure
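Accuracy: Ensuring that the values correctly represent what they are meant to describe
Completeness: Ensuring that there are no missing elements
Consistency: Ensuring that changes to the data are kept in sync, so that related values do not contradict one another
Uniformity: Ensuring that the same units of measurement are used throughout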
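To make some of these dimensions concrete, the following is a minimal sketch in Python that flags validity, completeness, and uniformity problems, along with duplicate entries. The records, field names, expected date format, and the choice of centimeters as the standard unit are all assumptions invented for illustration. Accuracy and consistency are not covered here, since checking them generally requires a trusted reference or related data to compare against.

```python
from datetime import datetime

# Hypothetical records; the field names and values are invented for illustration.
records = [
    {"id": 1, "name": "Ann", "joined": "2021-03-04", "height": "170 cm"},
    {"id": 2, "name": "Bob", "joined": "04/03/2021", "height": "5.8 ft"},
    {"id": 3, "name": "", "joined": "2021-05-10", "height": "165 cm"},
    {"id": 1, "name": "Ann", "joined": "2021-03-04", "height": "170 cm"},
]


def is_valid_date(text):
    """Validity: the value parses as a YYYY-MM-DD date (assumed format)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def quality_report(rows):
    """Return the indexes of rows that fail each simple quality check."""
    report = {
        "invalid_date": [],   # validity
        "missing_name": [],   # completeness
        "non_metric": [],     # uniformity (assumes centimeters are the standard)
        "duplicate": [],      # exact repeats of an earlier row
    }
    seen = set()
    for index, row in enumerate(rows):
        if not is_valid_date(row["joined"]):
            report["invalid_date"].append(index)
        if not row["name"].strip():
            report["missing_name"].append(index)
        if not row["height"].endswith("cm"):
            report["non_metric"].append(index)
        # A row counts as a duplicate only if every field matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicate"].append(index)
        seen.add(key)
    return report


report = quality_report(records)
print(report)
```

The report only identifies suspect rows by index; deciding whether to correct, convert, or discard them is a separate step that depends on the dataset.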
There are several techniques and tools used...