Data Cleaning and Exploration with Machine Learning

By: Michael Walker

Overview of this book

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions those algorithms make or of how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance, correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as your focus shifts toward the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.
Table of Contents (23 chapters)

Section 1 – Data Cleaning and Machine Learning Algorithms
Section 2 – Preprocessing, Feature Selection, and Sampling
Section 3 – Modeling Continuous Targets with Supervised Learning
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning

Generating summary statistics for continuous and discrete features

Getting a feel for the distribution of continuous or discrete features is a little more complicated than it is for categorical features. A continuous feature can take an infinite number of values. An example of a continuous feature is weight, as someone can weigh 70 kilograms, or 70.1, or 70.01. Discrete features have a finite number of values, such as the number of birds sighted, or the number of apples purchased. One way of thinking about the difference is that a discrete feature is typically something that has been counted, while a continuous feature is usually captured by measurement, weighing, or timekeeping.

Continuous features will generally be stored as floating-point numbers unless they have been constrained to be whole numbers. In that case, they may be stored as integers. Age for individual humans, for example, is continuous but is usually truncated to an integer.
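For instance, here is a minimal sketch on a small hypothetical DataFrame (not the COVID data used below) showing how a whole-number continuous feature such as age might be stored as an integer while a measured feature stays floating point:

    import pandas as pd

    # hypothetical data: age is continuous but truncated to whole years,
    # while weight is a measurement with a fractional part
    df = pd.DataFrame({"age": [25.0, 41.0, 67.0],
        "weight": [70.1, 82.5, 59.9]})
    df["age"] = df["age"].astype("Int64")   # nullable integer dtype
    df.dtypes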

For most modeling purposes, continuous and discrete features are treated similarly. We would not model age as a categorical feature. We assume that the interval between ages has largely the same meaning between 25 and 26 as it has between 35 and 36, though this breaks down at the extremes. The interval between 1 and 2 years of age for humans is not at all like that between 71 and 72. Data analysts and scientists are usually skeptical of assumed linear relationships between continuous features and targets, though modeling is much easier when that is true.

To understand how a continuous feature (or discrete feature) is distributed, we must examine its central tendency, shape, and spread. Key summary statistics are mean and median for central tendency, skewness and kurtosis for shape, and range, interquartile range, variance, and standard deviation for spread. In this section, we will learn how to use pandas, supplemented by the SciPy library, to get these statistics. We will also discuss important implications for modeling.
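As a quick reference, the following is a minimal sketch, using a small hypothetical series rather than the COVID data, of the pandas and SciPy calls that map to these statistics:

    import pandas as pd
    import scipy.stats as scistat

    s = pd.Series([2.1, 3.4, 3.9, 5.2, 8.8, 21.0])   # hypothetical values
    summary = {"mean": s.mean(), "median": s.median(),
        "skew": s.skew(), "kurtosis": s.kurt(),
        "range": s.max() - s.min(),
        "iqr": scistat.iqr(s),                       # interquartile range via SciPy
        "variance": s.var(), "std": s.std()}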

We will work with COVID-19 data in this section. The dataset contains one row per country, with total cases and deaths through June 2021, as well as demographic data for each country.

Note

Our World in Data provides COVID-19 public use data at https://ourworldindata.org/coronavirus-source-data. The data that will be used in this section was downloaded on July 9, 2021. There are more columns in the data than I have included. I created the region column based on country.
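As a purely hypothetical illustration of that last point (the author's actual country-to-region mapping is not reproduced here), a region column could have been added to a DataFrame of the downloaded data, here called owid, with a dictionary and the map method before saving the .csv file:

    # hypothetical: owid holds the downloaded data and has a location column
    region_map = {"Iceland": "Western Europe",
        "Czechia": "Western Europe"}     # ...one entry per country
    owid["region"] = owid["location"].map(region_map)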

Follow these steps to generate the summary statistics:

  1. Let's load the COVID .csv file into pandas, set the index, and look at the data. There are 221 rows and 16 columns. The index we set, iso_code, contains a unique value for each row. We use sample to view two countries randomly, rather than the first two (we set a value for random_state to get the same results each time we run the code):
    import pandas as pd
    import numpy as np
    import scipy.stats as scistat
    covidtotals = pd.read_csv("data/covidtotals.csv",
        parse_dates=['lastdate'])
    covidtotals.set_index("iso_code", inplace=True)
    covidtotals.shape
    (221, 16)
    covidtotals.index.nunique()
    221
    covidtotals.sample(2, random_state=6).T
    iso_code                         ISL               CZE
    lastdate                  2021-07-07        2021-07-07
    location                     Iceland           Czechia
    total_cases                    6,555         1,668,277
    total_deaths                      29            30,311
    total_cases_mill              19,209           155,783
    total_deaths_mill                 85             2,830
    population                   341,250        10,708,982
    population_density                 3               137
    median_age                        37                43
    gdp_per_capita                46,483            32,606
    aged_65_older                     14                19
    total_tests_thous                NaN               NaN
    life_expectancy                   83                79
    hospital_beds_thous                3                 7
    diabetes_prevalence                5                 7
    region                Western Europe    Western Europe

Just by looking at these two rows, we can see significant differences in cases and deaths between Iceland and Czechia, even after accounting for population size. (total_cases_mill and total_deaths_mill are cases and deaths per million of population, respectively.) Data analysts are accustomed to wondering whether anything else in the data might explain the substantially higher cases and deaths in Czechia than in Iceland. In a sense, we are always engaging in feature selection.

  2. Let's take a look at the data types and number of non-null values for each column. Almost all of the columns are continuous or discrete. We have data on cases and deaths, our likely targets, for 192 and 185 rows, respectively. An important data cleaning task will be figuring out what, if anything, we can do about countries that are missing values for our targets (a quick check of those rows follows the output below). We'll discuss how to handle missing values later:
    covidtotals.info()
    <class 'pandas.core.frame.DataFrame'>
    Index: 221 entries, AFG to ZWE
    Data columns (total 16 columns):
     #   Column             Non-Null Count         Dtype 
    ---  -------            --------------  --------------
     0   lastdate             221 non-null  datetime64[ns]
     1   location             221 non-null          object
     2   total_cases          192 non-null         float64
     3   total_deaths         185 non-null         float64
     4   total_cases_mill     192 non-null         float64
     5   total_deaths_mill    185 non-null         float64
     6   population           221 non-null         float64
     7   population_density   206 non-null         float64
     8   median_age           190 non-null         float64
     9   gdp_per_capita       193 non-null         float64
     10  aged_65_older        188 non-null         float64
     11  total_tests_thous     13 non-null         float64
     12  life_expectancy      217 non-null         float64
     13  hospital_beds_thous  170 non-null         float64
     14  diabetes_prevalence  200 non-null         float64
     15  region               221 non-null          object
    dtypes: datetime64[ns](1), float64(13), object(2)
    memory usage: 29.4+ KB
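Before moving on, here is a quick check, not one of the numbered steps, that shows how many rows are missing each likely target and which countries are missing both:

    covidtotals[['total_cases_mill','total_deaths_mill']].isnull().sum()
    # countries reporting neither cases nor deaths per million
    covidtotals.loc[covidtotals.total_cases_mill.isnull() &
        covidtotals.total_deaths_mill.isnull(), 'location']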
  3. Now, we are ready to examine the distribution of some of the features. We can get most of the summary statistics we want by using the describe method. The mean and median (50%) are good indicators of the center of the distribution, each with its strengths. It is also good to notice substantial differences between the mean and median, as an indication of skewness. For example, we can see that the mean cases per million is almost twice the median, with 36.7 thousand compared to 19.5 thousand. This is a clear indicator of positive skew. This is also true for deaths per million.

The interquartile range is also quite large for cases and deaths, with the 75th percentile value being about 25 times the 25th percentile value in both cases. Compare that with the percentage of the population aged 65 and older and with diabetes prevalence, where the 75th percentile is only about four times and two times the 25th percentile, respectively (a quick ratio calculation after the following output makes this explicit). We can tell right away that those two candidate features (aged_65_older and diabetes_prevalence) would have to do a lot of work to explain the huge variance in our targets:

    keyvars = ['location','total_cases_mill','total_deaths_mill',
        'aged_65_older','diabetes_prevalence']
    covidkeys = covidtotals[keyvars]
    covidkeys.describe()
           total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
    count            192.00             185.00         188.00               200.00
    mean          36,649.37             683.14           8.61                 8.44
    std           41,403.98             861.73           6.12                 4.89
    min                8.52               0.35           1.14                 0.99
    25%            2,499.75              43.99           3.50                 5.34
    50%           19,525.73             293.50           6.22                 7.20
    75%           64,834.62           1,087.89          13.92                10.61
    max          181,466.38           5,876.01          27.05                30.53
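Here is a minimal sketch of the ratio calculation mentioned above. Dropping the location column keeps the arithmetic numeric; mean, median, and quantile are standard pandas methods:

    covidnum = covidkeys.drop(columns='location')
    pd.DataFrame({'mean_over_median': covidnum.mean()/covidnum.median(),
        'p75_over_p25': covidnum.quantile(0.75)/covidnum.quantile(0.25)})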
  4. I sometimes find it helpful to look at the decile values to get a better sense of the distribution. The quantile method can take a single value for quantile, such as quantile(0.25) for the 25th percentile, or a list or tuple, such as quantile((0.25,0.5)) for the 25th and 50th percentiles. In the following code, we're using arange from NumPy (np.arange(0.0, 1.1, 0.1)) to generate an array that goes from 0.0 to 1.0 with a 0.1 increment. We would get the same result if we were to use covidkeys.quantile([0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0]):
    covidkeys.quantile(np.arange(0.0, 1.1, 0.1))
          total_cases_mill  total_deaths_mill  aged_65_older  diabetes_prevalence
    0.00              8.52               0.35           1.14                 0.99
    0.10            682.13              10.68           2.80                 3.30
    0.20          1,717.39              30.22           3.16                 4.79
    0.30          3,241.84              66.27           3.86                 5.74
    0.40          9,403.58             145.06           4.69                 6.70
    0.50         19,525.73             293.50           6.22                 7.20
    0.60         33,636.47             556.43           7.93                 8.32
    0.70         55,801.33             949.71          11.19                10.08
    0.80         74,017.81           1,333.79          14.92                11.62
    0.90         94,072.18           1,868.89          18.85                13.75
    1.00        181,466.38           5,876.01          27.05                30.53

For cases, deaths, and diabetes prevalence, much of the range (the distance between the min and max values) is in the last 10% of the distribution. This is particularly true for deaths. This hints at possible modeling problems and invites us to take a close look at outliers, something we will do in the next section.
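To make that concrete, here is a minimal sketch that computes the share of each feature's range that lies above the 90th percentile, using the same quantile method:

    deciles = covidkeys.drop(columns='location').quantile([0.0, 0.9, 1.0])
    (deciles.loc[1.0] - deciles.loc[0.9]) / (deciles.loc[1.0] - deciles.loc[0.0])

From the decile table above, roughly two-thirds of the range for deaths per million lies in the top decile.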

  5. Some machine learning algorithms assume that our features have normal (also referred to as Gaussian) distributions – that they are distributed symmetrically (have low skew) and that they have relatively normal tails (neither excessively high nor excessively low kurtosis). The statistics we have seen so far already suggest a high positive skew for our two likely targets – that is, total cases and deaths per million people in the population. Let's put a finer point on this by calculating both skew and kurtosis for some of the features. Since pandas reports excess kurtosis, we expect values near 0 for both skew and kurtosis for a Gaussian distribution. total_deaths_mill stands out on both measures, indicating a heavy right tail; diabetes_prevalence also shows heavy tails, while aged_65_older has negative excess kurtosis (thin tails). A quick cross-check with SciPy follows the output:
    covidkeys.skew()
    total_cases_mill        1.21
    total_deaths_mill       2.00
    aged_65_older           0.84
    diabetes_prevalence     1.52
    dtype: float64
     covidkeys.kurtosis()
    total_cases_mill        0.91
    total_deaths_mill       6.58
    aged_65_older          -0.56
    diabetes_prevalence     3.31
    dtype: float64
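As a cross-check, here is a minimal sketch that uses SciPy's skew and kurtosis functions on the same features. Passing fisher=False reports Pearson kurtosis (about 3 for a Gaussian), whereas pandas reports excess (Fisher) kurtosis (about 0 for a Gaussian); the values will also differ slightly from the pandas results because of different bias corrections:

    for var in keyvars[1:]:
        vals = covidkeys[var].dropna()
        print(var, "skew=", round(scistat.skew(vals), 2),
            "kurtosis=", round(scistat.kurtosis(vals, fisher=False), 2))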
  6. We can also explicitly test each distribution's normality by looping over the features in the keyvars list and running a Shapiro-Wilk test on each (scistat.shapiro(covidkeys[var].dropna())). Notice that we need to drop missing values with dropna for the test to run. p-values less than 0.05 indicate that we can reject the null hypothesis of normality, which is the case for each of the four features (a second normality test is sketched after the output):
    for var in keyvars[1:]:
          stat, p = scistat.shapiro(covidkeys[var].dropna())
          print("feature=", var, "     p-value=", '{:.6f}'.format(p))
     
    feature= total_cases_mill       p-value= 0.000000
    feature= total_deaths_mill      p-value= 0.000000
    feature= aged_65_older          p-value= 0.000000
    feature= diabetes_prevalence    p-value= 0.000000
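If we want a second opinion, and this is not one of the book's steps, SciPy's normaltest, which is based on D'Agostino and Pearson's test of skewness and kurtosis, can be run the same way; for features this skewed it should also reject normality:

    for var in keyvars[1:]:
        stat, p = scistat.normaltest(covidkeys[var].dropna())
        print("feature=", var, "     p-value=", '{:.6f}'.format(p))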

These results should make us pause if we are considering parametric models such as linear regression. None of the distributions approximates a normal distribution. However, this is not determinative. It is not as simple as deciding that we should use certain models when we have normally distributed features and non-parametric models (say, k-nearest neighbors) when we do not.

We want to do additional data cleaning before we make any modeling decisions. For example, we may decide to remove outliers or determine that it is appropriate to transform the data. We will explore transformations, such as log and polynomial transformations, in several chapters in this book.
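As a preview of one such transformation, and assuming nothing beyond NumPy's log1p applied to the two targets, a log transformation often pulls in a heavy right tail substantially; we could recompute skew afterward and compare it with the values calculated earlier:

    # log transform of the two likely targets; log1p handles values near zero safely
    logged = np.log1p(covidkeys[['total_cases_mill','total_deaths_mill']])
    logged.skew()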

This section showed you how to use pandas and SciPy to understand how continuous and discrete features are distributed, including their central tendency, shape, and spread. It makes sense to generate these statistics for any feature or target that might be included in our modeling. They also point us toward more work we need to do to prepare our data for analysis: we need to identify missing values and outliers and figure out how we will handle them. We should also visualize the distribution of our continuous features, which rarely fails to yield additional insights. We will learn how to identify outliers in the next section and how to create visualizations in the section after that.