Outliers are observations that differ greatly from the rest of the data; that is, they lie in the long tail(s) of the data distribution. In this recipe, we will learn how to locate and handle outliers.
To execute this recipe, you need a working Spark environment. We will also be working with the imputed DataFrame we created in the previous recipe, so we assume you have followed the steps to handle missing observations.
No other prerequisites are required.
Let's start with a popular definition of an outlier.
A point, x, that meets the following criterion:
Q1 - 1.5 * IQR ≤ x ≤ Q3 + 1.5 * IQR
is not considered an outlier; any point outside this range is. In the preceding expression, Q1 is the first quartile (25th percentile), Q3 is the third quartile (75th percentile), and IQR is the interquartile range, defined as the difference between Q3 and Q1: IQR = Q3 - Q1.
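To make the rule concrete, here is a minimal sketch in plain Python (not Spark) that computes the lower and upper bounds and flags the points falling outside them. It assumes the conventional multiplier of 1.5 on the IQR; the helper names iqr_bounds and flag_outliers and the sample values are hypothetical and chosen only for illustration. Quartile estimates vary slightly with the interpolation method, so the bounds computed by other tools may differ a little.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds of the IQR rule.

    k is the IQR multiplier; 1.5 is the conventional choice (assumption).
    """
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it lies outside the bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v < lower or v > upper for v in values]


# Hypothetical sample with one obvious extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

lower, upper = iqr_bounds(sample)
flags = flag_outliers(sample)
outliers = [v for v, is_outlier in zip(sample, flags) if is_outlier]

print(lower, upper, outliers)
```

For this sample, Q1 is 3.25 and Q3 is 7.75, so the IQR is 4.5, the bounds are -3.5 and 14.5, and only the value 100 is flagged.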
To flag the outliers, follow these steps:
- Let's calculate our ranges first:
features = ['Displacement', 'Cylinders...