Handling Incorrect Values
Another issue you may face with a new dataset is incorrect values for some of its observations. Sometimes, this is due to inconsistent formatting; for instance, the name of a country may be written all in lowercase, all in uppercase, in title case (where only the first letter is capitalized), or may even be abbreviated. France may therefore appear as 'France', 'FRANCE', 'france', 'FR', and so on. If you define 'France' as the standard format, then all the other variants are considered incorrect values in the dataset and need to be fixed.
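As a minimal sketch of how such variants could be mapped to a single standard value, the snippet below uses only the Python standard library. The `COUNTRY_ALIASES` lookup and the `standardize_country` helper are hypothetical names introduced for illustration, and the list of variants is limited to the ones mentioned above; a real dataset would likely need a longer mapping.

```python
# Hypothetical lookup from lowercased variants to the standard form.
# Only the variants mentioned in the text are covered here.
COUNTRY_ALIASES = {
    "france": "France",
    "fr": "France",
}


def standardize_country(raw: str) -> str:
    """Return the standard form of a country name if the variant is known."""
    trimmed = raw.strip()
    # Lowercasing makes the lookup independent of the original casing.
    key = trimmed.lower()
    # Unknown values are returned trimmed but otherwise unchanged,
    # so they can be reviewed manually instead of being silently altered.
    return COUNTRY_ALIASES.get(key, trimmed)


raw_values = ["France", "FRANCE", "france", "FR"]
cleaned = [standardize_country(value) for value in raw_values]
print(cleaned)  # ['France', 'France', 'France', 'France']
```

Returning unknown values unchanged is a deliberate choice in this sketch: it keeps the cleaning step from guessing, and any value that does not match the standard format remains visible for inspection.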
If this kind of issue is not handled before the modeling phase, it can lead to incorrect results. The model will treat these variants as completely different values and may pay less attention to each of them, because the frequency of the underlying value is split across the variants. For instance, let's say that 'France' represents 2% of the values, 'FRANCE'...