Dealing with outliers
We are not on this study journey just to pass the AWS Machine Learning Specialty exam, but also to become better data scientists. The outlier problem can be approached in many ways from a purely mathematical perspective; however, the datasets we use are derived from underlying business processes, so an outlier analysis must also include a business perspective.
An outlier is an atypical data point in a set of data. For example, the following chart shows some data points plotted on a two-dimensional plane; that is, x and y. The red point is an outlier, since it is an atypical value in this series of data:
We want to treat outlier values because some statistical methods are affected by them. In the preceding chart, we can see this behavior in action. On the left-hand side, we drew a line that best fits those data points, ignoring the red point. On the right...
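To make the effect of a single outlier on a best-fit line more concrete, here is a minimal sketch in Python. The data is made up for illustration and is not the data behind the chart, and using NumPy's `polyfit` for an ordinary least-squares line is an assumption about how such a line could be drawn, not necessarily how the chart was produced.

```python
import numpy as np

# Hypothetical data: ten points lying exactly on y = 2x + 1.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Add one atypical point (made up) that sits far above the trend.
x_with_outlier = np.append(x, 9.0)
y_with_outlier = np.append(y, 60.0)

# Fit a straight line (degree-1 polynomial) by least squares.
# np.polyfit returns coefficients from highest degree to lowest,
# so the result is (slope, intercept).
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)
slope_outlier, intercept_outlier = np.polyfit(x_with_outlier, y_with_outlier, deg=1)

print(f"Without outlier: slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"With outlier:    slope={slope_outlier:.2f}, intercept={intercept_outlier:.2f}")
```

In this sketch, one extra point is enough to pull the fitted slope away from the trend followed by every other point, which is the kind of distortion that motivates treating outliers before applying methods that are sensitive to them.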