Data Cleaning and Exploration with Machine Learning

By: Michael Walker

Overview of this book

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as the book shifts its focus toward the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms such as principal component analysis and k-means clustering.
Table of Contents (23 chapters)

Section 1 – Data Cleaning and Machine Learning Algorithms
Section 2 – Preprocessing, Feature Selection, and Sampling
Section 3 – Modeling Continuous Targets with Supervised Learning
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning

Identifying extreme values and outliers in univariate analysis

An outlier can be thought of as an observation with feature values, or relationships between feature values, that are so unusual that they cannot help explain relationships in the rest of the data. This matters for modeling because we cannot assume that the outliers will have a neutral impact on our parameter estimates. Sometimes, our models work so hard to construct parameter estimates that can account for patterns in outlier observations that we compromise the model's explanatory or predictive power for all other observations. Raise your hand if you have ever spent days trying to interpret a model only to discover that your coefficients and predictions completely changed once you removed a few outliers.

I should quickly add that there is no agreed-upon definition of an outlier. I offer the preceding definition for use in this book because it helps us distinguish between outliers, as I have described them, and extreme values. There is a fair bit of overlap between the two, but many extreme values are not outliers. This is because such values reflect a natural and explainable trend in a feature, or because they reflect the same relationship between features as is observed throughout the data. The reverse is also true. Some outliers are not extreme values. For example, a target value might be right in the middle of the distribution but have quite unexpected predictor values.

For our modeling, then, it is hard to say that a particular feature or target value is an outlier without referencing multivariate relationships. But it should at least raise a red flag when, in our univariate analysis, we see values well to the left or right of the center. This should prompt us to investigate the observation at that value further, including examining the values of other features. We will look at multivariate relationships in more detail in the next two chapters. Here, and in the next section on visualizations, we will focus on identifying extreme values and outliers when looking at one variable.

A good starting point for identifying an extreme value is to look at its distance from the middle of the distribution. One common method for doing that is to calculate each value's distance from the interquartile range (IQR), which is the distance between the first quartile value and the third quartile value. We often flag any value that is more than 1.5 times the interquartile range above the third quartile or below the first quartile. We can use this method to identify outliers in the COVID-19 data.

Let's get started:

  1. Let's start by importing the libraries we will need. In addition to pandas and NumPy, we will use Matplotlib and statsmodels for the plots we will create. We will also load the COVID data and select the variables we need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    covidtotals = pd.read_csv("data/covidtotals.csv")
    covidtotals.set_index("iso_code", inplace=True)
    keyvars = ['location','total_cases_mill','total_deaths_mill',
      'aged_65_older','diabetes_prevalence','gdp_per_capita']
    covidkeys = covidtotals[keyvars]
  2. Let's take a look at total_cases_mill. We get the first and third quartile values and calculate 1.5 times the interquartile range, 1.5*(thirdq-firstq). Then, we calculate the thresholds that determine the high and low extreme values, which are interquartilerange+thirdq and firstq-interquartilerange, respectively (if you are familiar with boxplots, you will notice that this is the same calculation that's used for the whiskers of a boxplot; we will cover boxplots in the next section):
    thirdq, firstq = covidkeys.total_cases_mill.quantile(0.75), covidkeys.total_cases_mill.quantile(0.25)
    interquartilerange = 1.5*(thirdq-firstq)
    extvalhigh, extvallow = interquartilerange+thirdq, firstq-interquartilerange
    print(extvallow, extvalhigh, sep=" <--> ")
    -91002.564625 <--> 158336.930375
  3. This calculation indicates that any value for total_cases_mill that's above 158,337 can be considered extreme. We can ignore extreme values on the low end because they would be negative:
    covidtotals.loc[covidtotals.total_cases_mill>extvalhigh].T
    iso_code                   AND         MNE         SYC
    lastdate            2021-07-07  2021-07-07  2021-07-07
    location            Andorra     Montenegro  Seychelles
    total_cases          14,021        100,392      16,304
    total_deaths            127          1,619          71
    total_cases_mill    181,466        159,844     165,792
    total_deaths_mill     1,644          2,578         722
    population           77,265        628,062      98,340
    population_density      164             46         208
    median_age              NaN             39          36
    gdp_per_capita          NaN         16,409      26,382
    aged_65_older           NaN             15           9
    total_tests_thous       NaN            NaN         NaN
    life_expectancy          84             77          73
    hospital_beds_thous     NaN              4           4
    diabetes_prevalence       8             10          11
    region              Western Europe  Eastern Europe  East Africa
  4. Andorra, Montenegro, and Seychelles all have total_cases_mill above the threshold amount. This invites us to explore other ways these countries might be exceptional, and whether our features can capture that. We will not dive deeply into multivariate analysis here since we will do that in the next chapter, but it is a good idea to start wrapping our brains around why these extreme values may or may not make sense. Looking at the means across the full dataset may help us here:
    covidtotals.mean()
    total_cases               963,933
    total_deaths              21,631
    total_cases_mill          36,649
    total_deaths_mill         683
    population                35,134,957
    population_density        453
    median_age                30
    gdp_per_capita            19,141
    aged_65_older             9
    total_tests_thous         535
    life_expectancy           73
    hospital_beds_thous       3
    diabetes_prevalence       8 

The main difference between these three countries and others is that they have very low populations. Surprisingly, each has a much lower population density than average. That is the opposite of what you would expect and merits further consideration in the analysis we will do throughout this book.

An Alternative to the IQR Calculation

An alternative to using the interquartile range to identify an extreme value is to flag values that are more than a few standard deviations away from the mean, say 3. One drawback of this method is that the mean and standard deviation are themselves more susceptible to extreme values than the quartiles are.
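
To make the comparison concrete, here is a minimal sketch of that approach for total_cases_mill, reusing the covidkeys DataFrame from the earlier steps (the three-standard-deviation cutoff comes from the note above, and the variable names sdhigh and sdlow are mine):

    # flag values more than three standard deviations from the mean
    meanval, stdval = covidkeys.total_cases_mill.mean(), covidkeys.total_cases_mill.std()
    sdhigh, sdlow = meanval + 3*stdval, meanval - 3*stdval
    covidkeys.loc[(covidkeys.total_cases_mill>sdhigh) | (covidkeys.total_cases_mill<sdlow)]

For the rest of this section, though, we will stay with the interquartile range approach.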

I find it helpful to produce this kind of analysis for all the key targets and features in my data, so let's automate this method of identifying extreme values. We should also output the results to a file so that we can use them when we need them:

  1. Let's define a function, getextremevalues, that iterates over all of the columns of our DataFrame (except for the first one, location), calculates the interquartile range thresholds for that column, selects all the observations with values above the high threshold or below the low threshold for that column, and then appends those observations to a new DataFrame (dfout):
    def getextremevalues(dfin):
      # start with an empty DataFrame that has the same columns as the input
      dfout = pd.DataFrame(columns=dfin.columns, data=None)
      # skip the first column (location) and loop over the remaining columns
      for col in dfin.columns[1:]:
        thirdq, firstq = dfin[col].quantile(0.75), \
          dfin[col].quantile(0.25)
        interquartilerange = 1.5*(thirdq-firstq)
        extvalhigh, extvallow = \
          interquartilerange+thirdq, \
          firstq-interquartilerange
        # keep observations above the high threshold or below the low threshold
        df = dfin.loc[(dfin[col]>extvalhigh) | (dfin[col]<extvallow)]
        df = df.assign(varname = col,
          threshlow = extvallow,
          threshhigh = extvalhigh)
        dfout = pd.concat([dfout, df])
      return dfout
  2. Now, we can pass our covidkeys DataFrame to the getextremevalues function to get a DataFrame that contains the extreme values for each column. Then, we can display the number of extreme values for each column, which tells us that there were four extreme values for the total deaths per million people in the population (total_deaths_mill). Now, we can output the data to an Excel file:
    extremevalues = getextremevalues(covidkeys)
    extremevalues.varname.value_counts()
    gdp_per_capita         9
    diabetes_prevalence    8
    total_deaths_mill      4
    total_cases_mill       3
    Name: varname, dtype: int64
    extremevalues.to_excel("views/extremevaluescases.xlsx")
  3. Let's take a closer look at the extreme deaths per million values. We can query the DataFrame we just created to get the threshhigh value for total_deaths_mill, which is 2654. We can also get other key features for those countries with the extreme values since we have included that data in the new DataFrame:
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
        'threshhigh'].iloc[0]
    2653.752
    extremevalues.loc[extremevalues.varname=="total_deaths_mill",
          keyvars].sort_values(['total_deaths_mill'], ascending=False)
          location                total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence  gdp_per_capita
    PER   Peru                              62,830              5,876              7                    6          12,237
    HUN   Hungary                           83,676              3,105             19                    8          26,778
    BIH   Bosnia and Herzegovina            62,499              2,947             17                   10          11,714
    CZE   Czechia                          155,783              2,830             19                    7          32,606

Peru, Hungary, Bosnia and Herzegovina, and Czechia have total_deaths_mill above the extreme value threshold. One thing that stands out for three of these countries is how far above average their percentage of the population aged 65 or older is (the average for that feature, as we displayed in the preceding means table, is 9). Although these are extreme values for deaths, the relationship between the elderly share of the population and COVID deaths may account for much of this, and it may do so without overfitting the model to these extreme cases. We will go through some strategies for teasing that out in the next chapter.

So far, we have discussed outliers and extreme values without referencing distribution shape. What we have implied so far is that an extreme value is a rare value – significantly rarer than the values near the center of the distribution. But this makes the most sense when the feature's distribution approaches normal. If, on the other hand, a feature had a uniform distribution, a very high value would be no rarer than any other value.

In practice, then, we think about extreme values or outliers relative to the distribution of the feature. Quantile-quantile (Q-Q) plots can improve our sense of that distribution by allowing us to view it graphically relative to a theoretical distribution: normal, uniform, lognormal, or others. Let's take a look:

  1. Let's create a Q-Q plot of total cases per million relative to the normal distribution. We can use the statsmodels library for this:
    sm.qqplot(covidtotals.total_cases_mill.sort_values().dropna(), line='s')
    plt.show()

This produces the following plot:

Figure 1.1 – Q-Q plot of total cases per million

This Q-Q plot makes it clear that the distribution of total cases across countries is not normal. We can see this in how much the data points deviate from the red line. It is the kind of Q-Q plot we would expect from a distribution with some positive skew. This is consistent with the summary statistics we have already calculated for the total cases feature. It further reinforces our developing sense that we will need to be cautious about parametric models and that we will probably have to account for more than just one or two outlier observations.
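
One way to check how much of this skew a simple transformation would absorb is to log-transform the feature and draw the Q-Q plot again. The following is just a sketch of that idea, not part of the original steps; it uses NumPy's log1p so that any zero counts do not cause problems:

    # a sketch: compare the log-transformed total cases to a normal distribution
    sm.qqplot(np.log1p(covidtotals.total_cases_mill.dropna()), line='s')
    plt.show()

If the transformed values hug the line much more closely, that is another hint that a log transformation may be worth considering before fitting parametric models.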

Let's look at a Q-Q plot for a feature with a distribution that is a little closer to normal. There isn't a great candidate in the COVID data, so we will work with data from the United States National Oceanic and Atmospheric Administration on land temperatures in 2019.

Data Note

The land temperature dataset contains the average temperature readings (in Celsius) in 2019 from over 12,000 stations across the world, though the majority of the stations are in the United States. The raw data was retrieved from the Global Historical Climatology Network integrated database. It has been made available for public use by the United States National Oceanic and Atmospheric Administration at https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-monthly-version-2.

  1. First, let's load the data into a pandas DataFrame and run some descriptive statistics on the temperature feature, avgtemp. We add a few percentile statistics to the default describe output to get a better sense of the range of values. avgtemp is the average temperature for the year at each of the 12,095 weather stations. The average temperature across all stations was 11.2 degrees Celsius. The median was 10.4. However, there were some very negative values, including 14 weather stations with an average temperature of less than -25. This contributes to a moderately negative skew, though both the skew and kurtosis are closer to what would be expected from a normal distribution than was the case for the COVID data:
    landtemps = pd.read_csv("data/landtemps2019avgs.csv")
    landtemps.avgtemp.describe(percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95])
    count        12,095.0
    mean             11.2
    std               8.6
    min             -60.8
    5%               -0.7
    10%               1.7
    25%               5.4
    50%              10.4
    75%              16.9
    90%              23.1
    95%              27.0
    max              33.9
    Name: avgtemp, dtype: float64
    landtemps.loc[landtemps.avgtemp<-25,'avgtemp'].count()
    14
    landtemps.avgtemp.skew()
    -0.2678382583481768
    landtemps.avgtemp.kurtosis()
    2.1698313707061073
  2. Now, let's take a look at a Q-Q plot of the average temperature:
    sm.qqplot(landtemps.avgtemp.sort_values().dropna(), line='s')
    plt.tight_layout()
    plt.show()

This produces the following plot:

Figure 1.2 – Q-Q plot of average temperatures

Along most of the range, the distribution of average temperatures looks pretty close to normal. The exceptions are the extremely low temperatures, contributing to a small amount of negative skew. There is also some deviation from normal at the high end, though this is much less of an issue (you may have noticed that Q-Q plots for features with negative skew have an umbrella-like shape, while those with positive skews, such as total cases, have more of a bowl-like shape).
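
If we want a numeric complement to the visual check, we could also run a formal normality test on avgtemp. The following is a sketch using scipy's D'Agostino-Pearson test; scipy is not used elsewhere in this section, so treat the import as an assumption about the environment:

    # a sketch: test the null hypothesis that avgtemp comes from a normal
    # distribution; a small p-value suggests departure from normality
    from scipy import stats
    stats.normaltest(landtemps.avgtemp.dropna())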

We are off to a good start in our efforts to understand the distribution of possible features and targets and, in a related effort, to identify extreme values and outliers. This is important information to have at our fingertips when we construct, refine, and interpret our models. But there is more that we can do to improve our intuition about the data. A good next step is to construct visualizations of key features.