Data Cleaning and Exploration with Machine Learning

By: Michael Walker
Overview of this book

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions those algorithms make or of how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, as your focus shifts toward the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms such as principal component analysis and k-means clustering.
Table of Contents (23 chapters)

Section 1 – Data Cleaning and Machine Learning Algorithms
Section 2 – Preprocessing, Feature Selection, and Sampling
Section 3 – Modeling Continuous Targets with Supervised Learning
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning

Generating frequencies for categorical features

Categorical features can be either nominal or ordinal. Nominal features, such as gender, species name, or country, have a limited number of possible values and are either strings or numbers without any intrinsic numerical meaning. For example, if country is represented by 1 for Afghanistan, 2 for Albania, and so on, the data is numerical but it does not make sense to perform arithmetic operations on those values.

Ordinal features also have a limited number of possible values but are different from nominal features in that the order of the values matters. A Likert scale rating (ranging from 1 for very unlikely to 5 for very likely) is an example of an ordinal feature. Nonetheless, arithmetic operations would not typically make sense because there is no uniform and meaningful distance between values.
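
To make the distinction concrete, the following minimal sketch (using a few made-up values rather than the NLS data we'll work with shortly) shows how pandas can represent both kinds of features with its Categorical type; passing ordered=True is what tells pandas that the order of an ordinal feature's values is meaningful:

    import pandas as pd

    # nominal: country names have no meaningful order
    country = pd.Series(["Afghanistan", "Albania", "Albania"], dtype="category")

    # ordinal: a 5-point Likert rating where the order matters
    likert = pd.Series(pd.Categorical([1, 5, 3, 2],
      categories=[1, 2, 3, 4, 5], ordered=True))

    country.cat.ordered         # False
    likert.cat.ordered          # True
    likert.min(), likert.max()  # order-aware operations work: (1, 5)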

Before we begin modeling, we want to have counts of all the possible values for the categorical features we may use. This is typically referred to as a one-way frequency distribution. Fortunately, pandas makes this very easy to do. We can quickly select columns from a pandas DataFrame and use the value_counts method to generate counts for each categorical value:

  1. Let's load the NLS data, create a DataFrame that contains just the first 20 columns of the data, and look at the data types:
    # imports used throughout this section
    import pandas as pd
    import numpy as np

    nls97 = pd.read_csv("data/nls97.csv")
    nls97.set_index("personid", inplace=True)
    nls97abb = nls97.iloc[:,:20]
    nls97abb.dtypes
    gender                   object
    birthmonth               int64
    birthyear                int64
    highestgradecompleted    float64
    maritalstatus            object
    childathome              float64
    childnotathome           float64
    wageincome               float64
    weeklyhrscomputer        object
    weeklyhrstv              object
    nightlyhrssleep          float64
    satverbal                float64
    satmath                  float64
    gpaoverall               float64
    gpaenglish               float64
    gpamath                  float64
    gpascience               float64
    highestdegree            object
    govprovidejobs           object
    govpricecontrols         object
    dtype: object

    Note

    Recall from the previous section how column and row selection works with the loc and iloc accessors. The colon to the left of the comma indicates that we want all the rows, while :20 to the right of the comma gets us the first 20 columns.

  2. All of the object type columns in the preceding code are categorical. We can use value_counts to see the counts for each value for maritalstatus. We can also use dropna=False to get value_counts to show the missing values (NaN):
    nls97abb.maritalstatus.value_counts(dropna=False)
    Married          3066
    Never-married    2766
    NaN              2312
    Divorced         663
    Separated        154
    Widowed          23
    Name: maritalstatus, dtype: int64
  3. If we just want the number of missing values, we can chain the isnull and sum methods. isnull returns a Boolean Series containing True values when maritalstatus is missing and False otherwise. sum then counts the number of True values, since it will interpret True values as 1 and False values as 0:
    nls97abb.maritalstatus.isnull().sum()
    2312
  4. You have probably noticed that the maritalstatus values were sorted in descending order of frequency by default. To sort them alphabetically by value instead, we can sort the index. We can do this by taking advantage of the fact that value_counts returns a Series with the values as the index:
    marstatcnt = nls97abb.maritalstatus.value_counts(dropna=False)
    type(marstatcnt)
    <class 'pandas.core.series.Series'>
    marstatcnt.index
    Index(['Married', 'Never-married', nan, 'Divorced', 'Separated', 'Widowed'], dtype='object')
  5. To sort the index, we just need to call sort_index:
    marstatcnt.sort_index()
    Divorced         663
    Married          3066
    Never-married    2766
    Separated        154
    Widowed          23
    NaN              2312
    Name: maritalstatus, dtype: int64
  6. Of course, we could have gotten the same results in one step with nls97.maritalstatus.value_counts(dropna=False).sort_index(). We can also show ratios instead of counts by setting normalize to True. In the following code, we can see that 34% of the responses were Married (notice that we set dropna to False, so missing values are included in the denominator):
    nls97.maritalstatus.\
      value_counts(normalize=True, dropna=False).\
         sort_index()
     
    Divorced             0.07
    Married              0.34
    Never-married        0.31
    Separated            0.02
    Widowed              0.00
    NaN                  0.26
    Name: maritalstatus, dtype: float64
  7. pandas has a category data type that can store data much more efficiently than the object data type when a column has a limited number of values (a short sketch after these steps illustrates the memory savings). Since we already know that all of our object columns contain categorical data, we should convert those columns into the category data type. In the following code, we're getting the names of the object columns and saving them in catcols. Then, we're looping through those columns and using astype to change the data type to category:
    catcols = nls97abb.select_dtypes(include=["object"]).columns
    for col in catcols:
      nls97abb[col] = nls97abb[col].astype('category')

    nls97abb[catcols].dtypes
    gender                   category
    maritalstatus            category
    weeklyhrscomputer        category
    weeklyhrstv              category
    highestdegree            category
    govprovidejobs           category
    govpricecontrols         category
    dtype: object
  8. Let's check our category features for missing values. There are no missing values for gender and very few for highestdegree. But the overwhelming majority of values for govprovidejobs (the government should provide jobs) and govpricecontrols (the government should control prices) are missing. This means that those features probably won't be useful for most modeling:
    nls97abb[catcols].isnull().sum()
    gender               0
    maritalstatus        2312
    weeklyhrscomputer    2274
    weeklyhrstv          2273
    highestdegree        31
    govprovidejobs       7151
    govpricecontrols     7125
    dtype: int64
  9. We can generate frequencies for multiple features at once by passing the value_counts function to apply. We can use filter to select the columns that we want – in this case, all the columns with gov in their name. Note that the missing values for each feature have been omitted since we did not set dropna to False:
     nls97abb.filter(like="gov").apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.25                0.54
    2. Probably                0.34                0.33
    3. Probably not            0.25                0.09
    4. Definitely not          0.16                0.04
  10. We can generate the same frequencies on a subset of our data. If, for example, we want to see the responses of only married people to the government role questions, we can do that subsetting by placing nls97abb.loc[nls97abb.maritalstatus=="Married"] before filter:
     nls97abb.loc[nls97abb.maritalstatus=="Married"].\
     filter(like="gov").\
       apply(pd.value_counts, normalize=True)
                     govprovidejobs    govpricecontrols
    1. Definitely              0.17                0.46
    2. Probably                0.33                0.38
    3. Probably not            0.31                0.11
    4. Definitely not          0.18                0.05
  11. Since, in this case, there were only two gov columns, it may have been easier to do the following:
     nls97abb.loc[nls97abb.maritalstatus=="Married",
       ['govprovidejobs','govpricecontrols']].\
       apply(pd.value_counts, normalize=True)
                      govprovidejobs     govpricecontrols
    1. Definitely               0.17                 0.46
    2. Probably                 0.33                 0.38
    3. Probably not             0.31                 0.11
    4. Definitely not           0.18                 0.05

Nonetheless, it will often be easier to use filter since it is not unusual to have to do the same cleaning or exploration task on groups of features with similar names.
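
As noted in step 7, the category data type can store a column with a limited number of distinct values much more compactly than object. The following is a minimal sketch of how you might verify that for yourself, assuming the nls97abb DataFrame created earlier; the exact byte counts will vary with your data and pandas version, but the category version of a column like maritalstatus typically needs only a small fraction of the memory of the object version:

    # compare memory used by the same column stored as object and as category
    as_object = nls97abb['maritalstatus'].astype('object')
    as_category = nls97abb['maritalstatus'].astype('category')

    as_object.memory_usage(deep=True)
    as_category.memory_usage(deep=True)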

There are times when we may want to model a continuous or discrete feature as categorical. The NLS DataFrame contains highestgradecompleted. A year increase from 5 to 6 may not be as important as that from 11 to 12 in terms of its impact on a target. Let's create a dichotomous feature instead – that is, 1 when the person has completed 12 or more grades, 0 if they have completed less than that, and missing when highestgradecompleted is missing.

  1. We need to do a little bit of cleaning up first, though. highestgradecompleted has two kinds of logical missing values – an actual NaN value that pandas recognizes as missing and a value of 95 that the survey designers intend for us to also treat as missing for most use cases. Let's use replace to fix that before moving on (assigning the result back to the column is more reliable than an inplace replace on a slice of the original DataFrame):
    nls97abb['highestgradecompleted'] = \
      nls97abb.highestgradecompleted.replace(95, np.nan)
  2. We can use NumPy's where function to assign values to highschoolgrad based on the values of highestgradecompleted. If highestgradecompleted is null (NaN), we assign NaN to our new column, highschoolgrad. If the value for highestgradecompleted is not null, the next clause tests for a value less than 12, setting highschoolgrad to 0 if that is true, and to 1 otherwise. We can confirm that the new column, highschoolgrad, contains the values we want by using groupby to get the min and max values of highestgradecompleted at each level of highschoolgrad:
    nls97abb['highschoolgrad'] = \
      np.where(nls97abb.highestgradecompleted.isnull(),np.nan, \
      np.where(nls97abb.highestgradecompleted<12,0,1))
     
    nls97abb.groupby(['highschoolgrad'], dropna=False) \
      ['highestgradecompleted'].agg(['min','max','size'])
                      min       max       size
    highschoolgrad                
    0                   5        11       1231
    1                  12        20       5421
    nan               nan       nan       2332
     nls97abb['highschoolgrad'] = \
       nls97abb['highschoolgrad'].astype('category')

While 12 makes conceptual sense as the threshold for classifying our new feature, highschoolgrad, this would present some modeling challenges if we intended to use highschoolgrad as a target. There is a pretty substantial class imbalance, with the group where highschoolgrad equals 1 being more than four times the size of the 0 group (a quick check of this ratio is sketched below). We should explore using more groups to represent highestgradecompleted.
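
One quick way to confirm that imbalance is to look at the normalized counts for the highschoolgrad column we just created; based on the group sizes shown above (5,421 versus 1,231), the ratio works out to roughly 4.4 to 1:

    nls97abb['highschoolgrad'].value_counts(normalize=True, dropna=False)
    # the 1 group (5,421) is about 4.4 times the size of the 0 group (1,231)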

  1. One way to do this with pandas is with the qcut function (an alternative using cut with explicit bin edges is sketched after these steps). We can set the q parameter of qcut to 6 to create six groups that are as close to equal in size as possible. These groups are now closer to being balanced:
    nls97abb['highgradegroup'] = \
      pd.qcut(nls97abb['highestgradecompleted'], 
       q=6, labels=[1,2,3,4,5,6])
     
    nls97abb.groupby(['highgradegroup'])['highestgradecompleted'].\
        agg(['min','max','size'])
                      min         max      size
    highgradegroup                
    1                   5          11       1231
    2                  12          12       1389
    3                  13          14       1288
    4                  15          16       1413
    5                  17          17        388
    6                  18          20        943
    nls97abb['highgradegroup'] = \
        nls97abb['highgradegroup'].astype('category')
  2. Finally, I typically find it helpful to generate frequencies for all the categorical features and save that output so that I can refer to it later. I rerun that code whenever I make some change to the data that may change these frequencies. The following code iterates over all the columns that are of the category data type and runs value_counts:
     freqout = open('views/frequencies.txt', 'w') 
     for col in nls97abb.select_dtypes(include=["category"]):
          print(col, "----------------------",
            "frequencies",
          nls97abb[col].value_counts(dropna=False).sort_index(),
            "percentages",
          nls97abb[col].value_counts(normalize=True).\
            sort_index(),
          sep="\n\n", end="\n\n\n", file=freqout)
     
     freqout.close()
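
As mentioned in step 1, qcut is not the only option for grouping a continuous feature. If you would rather choose the group boundaries yourself, pandas' cut function takes explicit bin edges. The following is only a sketch; the bin edges and the highgradegroup2 name are illustrative rather than taken from the NLS analysis:

    # explicit bin edges give the intervals (4,11], (11,12], (12,16], (16,20]
    nls97abb['highgradegroup2'] = pd.cut(
      nls97abb['highestgradecompleted'],
      bins=[4, 11, 12, 16, 20],
      labels=['5-11', '12', '13-16', '17-20'])

    nls97abb['highgradegroup2'].value_counts().sort_index()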

These are the key techniques for generating one-way frequencies for the categorical features in your data. The real star of the show has been the value_counts method. We can use value_counts to generate frequencies one Series at a time, use it with apply across multiple columns, or iterate over several columns and call value_counts for each one. We have looked at examples of each in this section. Next, let's explore some techniques for examining the distribution of continuous features.