Data Cleaning and Exploration with Machine Learning

By: Michael Walker

Overview of this book

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as the book shifts its focus toward the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms such as principal component analysis and k-means clustering.
Table of Contents (23 chapters)

Section 1 – Data Cleaning and Machine Learning Algorithms
Section 2 – Preprocessing, Feature Selection, and Sampling
Section 3 – Modeling Continuous Targets with Supervised Learning
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning

Identifying extreme values and outliers in univariate analysis

An outlier can be thought of as an observation with feature values, or relationships between feature values, that are so unusual that they cannot help explain relationships in the rest of the data. This matters for modeling because we cannot assume that the outliers will have a neutral impact on our parameter estimates. Sometimes, our models work so hard to construct parameter estimates that can account for patterns in outlier observations that we compromise the model's explanatory or predictive power for all other observations. Raise your hand if you have ever spent days trying to interpret a model only to discover that your coefficients and predictions completely changed once you removed a few outliers.

I should quickly add that there is no agreed-upon definition of an outlier. I offer the preceding definition for use in this book because it helps us distinguish between outliers, as I have described them, and extreme values. There is a fair bit of overlap between the two, but many extreme values are not outliers. This is because such values reflect a natural and explainable trend in a feature, or because they reflect the same relationship between features as is observed throughout the data. The reverse is also true. Some outliers are not extreme values. For example, a target value might be right in the middle of the distribution but have quite unexpected predictor values.

For our modeling, then, it is hard to say that a particular feature or target value is an outlier without referencing multivariate relationships. But it should at least raise a red flag when, in our univariate analysis, we see values well to the left or right of the center. This should prompt us to investigate the observation at that value further, including examining the values of other features. We will look at multivariate relationships in more detail in the next two chapters. Here, and in the next section on visualizations, we will focus on identifying extreme values and outliers when looking at one variable.

A good starting point for identifying an extreme value is to look at its distance from the middle of the distribution. One common method for doing that is to calculate each value's distance from the interquartile range (IQR), which is the distance between the first quartile value and the third quartile value. We often flag any value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile. We can use this method to identify outliers in the COVID-19 data.

Let's get started:

  1. Let's start by importing the libraries we will need. In addition to pandas and NumPy, we will use Matplotlib and statsmodels for the plots we will create. We will also load the COVID data and select the variables we need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    covidtotals = pd.read_csv("data/covidtotals.csv")
    covidtotals.set_index("iso_code", inplace=True)
    keyvars = ['location','total_cases_mill','total_deaths_mill',
      'aged_65_older','diabetes_prevalence','gdp_per_capita']
    covidkeys = covidtotals[keyvars]
  2. Let's take a look at total_cases_mill. We get the first and third quartile values and calculate 1.5 times the interquartile range, 1.5*(thirdq-firstq). Then, we calculate the thresholds that determine the high and low extreme values, which are interquartilerange+thirdq and firstq-interquartilerange, respectively (if you are familiar with boxplots, you will notice that this is the same calculation that's used for the whiskers of a boxplot; we will cover boxplots in the next section):
    thirdq, firstq = covidkeys.total_cases_mill.quantile(0.75), covidkeys.total_cases_mill.quantile(0.25)
    interquartilerange = 1.5*(thirdq-firstq)
    extvalhigh, extvallow = interquartilerange+thirdq, firstq-interquartilerange
    print(extvallow, extvalhigh, sep=" <--> ")
    -91002.564625 <--> 158336.930375
  3. This calculation indicates that any value for total_cases_mill that's above 158,337 can be considered extreme. We can ignore extreme values on the low end because they would be negative:
    covidtotals.loc[covidtotals.total_cases_mill>extvalhigh].T
    iso_code                   AND         MNE         SYC
    lastdate            2021-07-07  2021-07-07  2021-07-07
    location            Andorra     Montenegro  Seychelles
    total_cases          14,021        100,392      16,304
    total_deaths            127          1,619          71
    total_cases_mill    181,466        159,844     165,792
    total_deaths_mill     1,644          2,578         722
    population           77,265        628,062      98,340
    population_density      164             46         208
    median_age              NaN             39          36
    gdp_per_capita          NaN         16,409      26,382
    aged_65_older           NaN             15           9
    total_tests_thous       NaN            NaN         NaN
    life_expectancy          84             77          73
    hospital_beds_thous     NaN              4           4
    diabetes_prevalence       8             10          11
    region              Western Europe  Eastern Europe  East Africa
  4. Andorra, Montenegro, and Seychelles all have total_cases_mill above the threshold amount. This invites us to explore other ways these countries might be exceptional, and whether our features can capture that. We will not dive deeply into multivariate analysis here since we will do that in the next chapter, but it is a good idea to start wrapping our brains around why these extreme values may or may not make sense. Looking at the means across the full dataset may help us here:
    covidtotals.mean()
    total_cases               963,933
    total_deaths              21,631
    total_cases_mill          36,649
    total_deaths_mill         683
    population                35,134,957
    population_density        453
    median_age                30
    gdp_per_capita            19,141
    aged_65_older             9
    total_tests_thous         535
    life_expectancy           73
    hospital_beds_thous       3
    diabetes_prevalence       8 

The main difference between these three countries and others is that they have very low populations. Surprisingly, each has a much lower population density than average. That is the opposite of what you would expect and merits further consideration in the analysis we will do throughout this book.

An Alternative to the IQR Calculation

An alternative to using the interquartile range to identify an extreme value is to flag values that are more than a few standard deviations away from the mean, say 3. One drawback of this method is that the mean and standard deviation are themselves more susceptible to extreme values than the quartiles are.
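
To make the comparison concrete, here is a minimal sketch of that approach for total_cases_mill, reusing the covidkeys DataFrame from the earlier steps (the three-standard-deviation cutoff comes from the note above, and the variable names sdhigh and sdlow are mine):

    # flag values more than three standard deviations from the mean
    meanval, stdval = covidkeys.total_cases_mill.mean(), covidkeys.total_cases_mill.std()
    sdhigh, sdlow = meanval + 3*stdval, meanval - 3*stdval
    covidkeys.loc[(covidkeys.total_cases_mill>sdhigh) | (covidkeys.total_cases_mill<sdlow)]

For the rest of this section, though, we will stay with the interquartile range approach.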

I find it helpful to produce this kind of analysis for all the key targets and features in my data, so let's automate this method of identifying extreme values. We should also output the results to a file so that we can use them when we need them:

  1. Let's define a function, getextremevalues, that iterates over all of the columns of our DataFrame (except for the first one, location), calculates the interquartile range thresholds for that column, selects all the observations with values above the high threshold or below the low threshold for that column, and then appends those observations to a new DataFrame (dfout):
    def getextremevalues(dfin):
      # start with an empty DataFrame that has the same columns as the input
      dfout = pd.DataFrame(columns=dfin.columns, data=None)
      # skip the first column (location) and loop over the remaining columns
      for col in dfin.columns[1:]:
        thirdq, firstq = dfin[col].quantile(0.75), \
          dfin[col].quantile(0.25)
        interquartilerange = 1.5*(thirdq-firstq)
        extvalhigh, extvallow = \
          interquartilerange+thirdq, \
          firstq-interquartilerange
        # keep observations above the high threshold or below the low threshold
        df = dfin.loc[(dfin[col]>extvalhigh) | (dfin[col]<extvallow)]
        df = df.assign(varname = col,
          threshlow = extvallow,
          threshhigh = extvalhigh)
        dfout = pd.concat([dfout, df])
      return dfout
  2. Now, we can pass our covidkeys DataFrame to the getextremevalues function to get a DataFrame that contains the extreme values for each column. Then, we can display the number of extreme values for each column, which tells us that there were four extreme values for the total deaths per million people in the population (total_deaths_mill). Now, we can output the data to an Excel file:
    extremevalues = getextremevalues(covidkeys)
    extremevalues.varname.value_counts()
    gdp_per_capita         9
    diabetes_prevalence    8
    total_deaths_mill      4
    total_cases_mill       3
    Name: varname, dtype: int64
    extremevalues.to_excel("views/extremevaluescases.xlsx")
  3. Let's take a closer look at the extreme deaths per million values. We can query the DataFrame we just created to get the threshhigh value for total_deaths_mill, which is 2654. We can also get other key features for those countries with the extreme values since we have included that data in the new DataFrame:
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
        'threshhigh'].iloc[0]
    2653.752
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
          keyvars].sort_values(['total_deaths_mill'], ascending=False)
          location                total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence  gdp_per_capita
    PER   Peru                              62,830              5,876              7                    6          12,237
    HUN   Hungary                           83,676              3,105             19                    8          26,778
    BIH   Bosnia and Herzegovina            62,499              2,947             17                   10          11,714
    CZE   Czechia                          155,783              2,830             19                    7          32,606

Peru, Hungary, Bosnia and Herzegovina, and Czechia have total_deaths_mill above the extreme value threshold. One thing that stands out for three of these countries is how far above average their percentage of the population aged 65 or older is (the average for that feature, as we displayed in the preceding means table, is 9). Although these are extreme values for deaths, the relationship between the elderly share of the population and COVID deaths may account for much of this, and it may do so without overfitting the model to these extreme cases. We will go through some strategies for teasing that out in the next chapter.

So far, we have discussed outliers and extreme values without referencing distribution shape. What we have implied so far is that an extreme value is a rare value – significantly rarer than the values near the center of the distribution. But this makes the most sense when the feature's distribution approaches normal. If, on the other hand, a feature had a uniform distribution, a very high value would be no rarer than any other value.

In practice, then, we think about extreme values or outliers relative to the distribution of the feature. Quantile-quantile (Q-Q) plots can improve our sense of that distribution by allowing us to view it graphically relative to a theoretical distribution: normal, uniform, lognormal, or others. Let's take a look:

  1. Let's create a Q-Q plot of total cases per million relative to the normal distribution. We can use the statsmodels library for this:
    sm.qqplot(covidtotals.total_cases_mill.sort_values().dropna(), line='s')
    plt.show()

This produces the following plot:

Figure 1.1 – Q-Q plot of total cases per million

This Q-Q plot makes it clear that the distribution of total cases across countries is not normal. We can see this in how much the data points deviate from the red line. It is the kind of Q-Q plot we would expect from a distribution with some positive skew. This is consistent with the summary statistics we have already calculated for the total cases feature. It further reinforces our developing sense that we will need to be cautious about parametric models and that we will probably have to account for more than just one or two outlier observations.
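
One way to check how much of this skew a simple transformation would absorb is to log-transform the feature and draw the Q-Q plot again. The following is just a sketch of that idea, not part of the original steps; it uses NumPy's log1p so that any zero counts do not cause problems:

    # a sketch: compare the log-transformed total cases to a normal distribution
    sm.qqplot(np.log1p(covidtotals.total_cases_mill.dropna()), line='s')
    plt.show()

If the transformed values hug the line much more closely, that is another hint that a log transformation may be worth considering before fitting parametric models.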

Let's look at a Q-Q plot for a feature with a distribution that is a little closer to normal. There isn't a great candidate in the COVID data, so we will work with data from the United States National Oceanic and Atmospheric Administration on land temperatures in 2019.

Data Note

The land temperature dataset contains the average temperature readings (in Celsius) in 2019 from over 12,000 stations across the world, though the majority of the stations are in the United States. The raw data was retrieved from the Global Historical Climatology Network integrated database. It has been made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-2.

  1. First, let's load the data into a pandas DataFrame and run some descriptive statistics on the temperature feature, avgtemp. We add a few percentile statistics to the default describe output to get a better sense of the range of values. avgtemp is the average temperature for the year at each of the 12,095 weather stations. The average temperature across all stations was 11.2 degrees Celsius. The median was 10.4. However, there were some very negative values, including 14 weather stations with an average temperature of less than -25. This contributes to a moderately negative skew, though both the skew and kurtosis are closer to what would be expected from a normal distribution than was the case for the COVID data:
    landtemps = pd.read_csv("data/landtemps2019avgs.csv")
    landtemps.avgtemp.describe(percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])
    count        12,095.0
    mean             11.2
    std               8.6
    min             -60.8
    5%               -0.7
    10%               1.7
    25%               5.4
    50%              10.4
    75%              16.9
    90%              23.1
    95%              27.0
    max              33.9
    Name: avgtemp, dtype: float64
    landtemps.loc[landtemps.avgtemp<-25,'avgtemp'].count()
    14
    landtemps.avgtemp.skew()
    -0.2678382583481768
    landtemps.avgtemp.kurtosis()
    2.1698313707061073
  2. Now, let's take a look at a Q-Q plot of the average temperature:
    sm.qqplot(landtemps.avgtemp.sort_values().dropna(), line='s')
    plt.tight_layout()
    plt.show()

This produces the following plot:

Figure 1.2 – Q-Q plot of average temperatures

Along most of the range, the distribution of average temperatures looks pretty close to normal. The exceptions are the extremely low temperatures, contributing to a small amount of negative skew. There is also some deviation from normal at the high end, though this is much less of an issue (you may have noticed that Q-Q plots for features with negative skew have an umbrella-like shape, while those with positive skews, such as total cases, have more of a bowl-like shape).
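
If we want a numeric complement to the visual check, we could also run a formal normality test on avgtemp. The following is a sketch using scipy's D'Agostino-Pearson test; scipy is not used elsewhere in this section, so treat the import as an assumption about the environment:

    # a sketch: test the null hypothesis that avgtemp comes from a normal
    # distribution; a small p-value suggests departure from normality
    from scipy import stats
    stats.normaltest(landtemps.avgtemp.dropna())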

We are off to a good start in our efforts to understand the distribution of possible features and targets and, in a related effort, to identify extreme values and outliers. This is important information to have at our fingertips when we construct, refine, and interpret our models. But there is more that we can do to improve our intuition about the data. A good next step is to construct visualizations of key features.