Data Cleaning and Exploration with Machine Learning

By: Michael Walker

Overview of this book

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, while keeping the focus on the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.
Table of Contents (23 chapters)

Section 1 – Data Cleaning and Machine Learning Algorithms
Section 2 – Preprocessing, Feature Selection, and Sampling
Section 3 – Modeling Continuous Targets with Supervised Learning
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning

What this book covers

Chapter 1, Examining the Distribution of Features and Targets, explores using common NumPy and pandas techniques to get a better sense of the attributes of our data. We will generate summary statistics, such as the mean, minimum, maximum, and standard deviation, and count the number of missing values. We will also create visualizations of key features, including histograms and boxplots, to give us a better sense of the distribution of each feature than we can get by just looking at summary statistics. We will hint at the implications of feature distribution for data transformation, encoding and scaling, and the modeling that we will be doing in subsequent chapters with the same data.
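
As a preview of the kind of work this chapter covers, here is a minimal sketch with pandas and Matplotlib; the tiny DataFrame is invented for illustration and is not one of the book's datasets.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical data; any DataFrame with numeric columns would do
df = pd.DataFrame({"age": [23, 35, 31, None, 52, 47],
                   "income": [41000, 52000, 47000, 39000, None, 88000]})

# summary statistics: mean, min, max, standard deviation, and more
print(df.describe())

# count missing values per column
print(df.isnull().sum())

# histogram and boxplot to inspect a feature's distribution
df["income"].plot(kind="hist", title="income")
plt.show()
df.boxplot(column="income")
plt.show()
```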

Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets, focuses on the correlation between possible features and target variables. We will use pandas methods for bivariate analysis, and Matplotlib for visualizations. We will discuss the implications of what we find for feature engineering and modeling. We also use multivariate techniques in this chapter to understand the relationship between features.
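
A minimal sketch of bivariate analysis with pandas and Matplotlib, using a small, made-up feature/target DataFrame rather than the book's data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical features and target
df = pd.DataFrame({"hours_studied": [2, 4, 6, 8, 10],
                   "prior_score": [55, 60, 58, 72, 80],
                   "exam_score": [52, 61, 70, 79, 88]})

# pairwise (Pearson) correlations between possible features and the target
print(df.corr())

# a simple bivariate view: scatter plot of one feature against the target
df.plot(kind="scatter", x="hours_studied", y="exam_score")
plt.show()
```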

Chapter 3, Identifying and Fixing Missing Values, goes over techniques for identifying missing values for each feature or target, and for identifying observations where values for a large number of the features are absent. We will explore strategies for imputing values, such as setting values to the overall mean, to the mean for a given category, or forward filling. We will also examine multivariate techniques for imputing missing values and discuss when they are appropriate.
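
A short pandas sketch of these imputation strategies, assuming a hypothetical DataFrame with a categorical column and a numeric column containing missing values:

```python
import pandas as pd

df = pd.DataFrame({"region": ["a", "a", "b", "b"],
                   "sales": [100.0, None, 90.0, None]})

# missing counts per column, and per row (observation)
print(df.isnull().sum())
print(df.isnull().sum(axis=1))

# impute with the overall mean, with the category (group) mean, or forward fill
overall_mean = df["sales"].fillna(df["sales"].mean())
group_mean = df.groupby("region")["sales"].transform(lambda x: x.fillna(x.mean()))
forward_filled = df["sales"].ffill()
```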

Chapter 4, Encoding, Transforming, and Scaling Features, covers a range of feature engineering techniques. We will use tools to drop redundant or highly correlated features. We will explore the most common kinds of encoding – one-hot, ordinal, and hashing encoding. We will also use transformations to improve the distribution of our features. Finally, we will use common binning and scaling approaches to address skew, kurtosis, and outliers, and to adjust for features with widely different ranges.
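
A minimal sketch of encoding, transforming, and scaling with pandas and scikit-learn; the small DataFrame and the choice of a log transformation are illustrative assumptions, not the book's own examples.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["small", "large", "medium"],
                   "income": [40000.0, 95000.0, 52000.0]})

# one-hot encoding with pandas; ordinal encoding with an explicit category order
onehot = pd.get_dummies(df["color"], prefix="color")
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(df[["size"]])

# a log transformation to reduce skew, followed by standard scaling
scaled = StandardScaler().fit_transform(np.log(df[["income"]]))
```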

Chapter 5, Feature Selection, goes over a number of feature selection methods, from filter to wrapper to embedded methods. We will explore how they work with categorical and continuous targets. For wrapper and embedded methods, we consider how well they work with different algorithms.
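
One way to see the three families side by side is the scikit-learn sketch below; the dataset, the estimator, and the number of features to keep are illustrative choices, not the book's.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale once for this illustration

# filter: keep the k features with the strongest univariate relationship to the target
X_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# wrapper: recursive feature elimination around an estimator
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit_transform(X, y)

# embedded: L1-regularized coefficients push weak features toward zero
embedded = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print(np.sum(embedded.coef_ != 0), "features kept by the L1 model")
```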

Chapter 6, Preparing for Model Evaluation, will see us build our first full-fledged pipeline, separating our data into testing and training datasets, and learning how to do preprocessing without data leakage. We will implement k-fold cross-validation and look more closely at assessing the performance of our models.
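
A compact scikit-learn sketch of that workflow, with an illustrative dataset and estimator; the point is that scaling lives inside the pipeline so it is fit only on training folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# hold out a test set, then build a pipeline so preprocessing cannot leak
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# k-fold cross-validation on the training data only
scores = cross_val_score(pipe, X_train, y_train,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())
```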

Chapter 7, Linear Regression Models, is the first of several chapters on building regression models with an old favorite of many data scientists, linear regression. We will run a classical linear model while also examining the qualities of a feature space that make it a good candidate for a linear model. We will explore how to improve linear models, when necessary, with regularization and transformations. We will look into stochastic gradient descent as an alternative to ordinary least squares (OLS) optimization. We will also learn how to do hyperparameter tuning with grid searches.
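
A minimal sketch of OLS, a stochastic gradient descent alternative, and a grid search; the synthetic data and the alpha grid are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data, stand-in for the book's datasets
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

# a classical OLS model
ols = LinearRegression().fit(X, y)

# stochastic gradient descent as an alternative optimizer, with a grid search
# over the regularization strength (the alpha grid here is a hypothetical start)
pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=2000, random_state=0))
grid = GridSearchCV(pipe, {"sgdregressor__alpha": [0.0001, 0.001, 0.01]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```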

Chapter 8, Support Vector Regression, discusses key support vector machine concepts and how they can be applied to regression problems. In particular, we will examine how concepts such as epsilon-insensitive tubes and soft margins can give us the flexibility to get the best fit possible, given our data and domain-related challenges. We will also explore, for the first time but definitely not the last, the very handy kernel trick, which allows us to model nonlinear relationships without transformations or increasing the number of features.
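
A brief scikit-learn sketch of those ideas; the synthetic data and the epsilon and C values are illustrative, not recommendations from the book.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=15, random_state=0)

# epsilon sets the width of the insensitive tube; C controls the soft margin penalty
linear_svr = make_pipeline(StandardScaler(), SVR(kernel="linear", epsilon=0.5, C=1.0))
linear_svr.fit(X, y)

# the kernel trick (here, RBF) models nonlinear relationships without adding features
rbf_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", epsilon=0.5, C=1.0))
rbf_svr.fit(X, y)
```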

Chapter 9, K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression, explores some of the most popular non-parametric regression algorithms. We will discuss the advantages of each algorithm, when you might want to choose one over the other, and possible modeling challenges. These challenges include how to avoid underfitting and overfitting through careful adjustment of hyperparameters.
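
A quick side-by-side sketch of the four regressors with scikit-learn; the synthetic data and the hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

# n_neighbors, max_depth, and n_estimators are the main levers for balancing
# underfitting and overfitting (values here are only starting points)
models = {
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, max_depth=6, random_state=0),
    "boosted": GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```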

Chapter 10, Logistic Regression, is the first of several chapters on building classification models with logistic regression, an efficient algorithm with low bias. We will carefully examine the assumptions of logistic regression and discuss the attributes of a dataset and a modeling problem that make logistic regression a good choice. We will use regularization to address high variance or a large number of highly correlated predictors. We will extend the algorithm to multiclass problems with multinomial logistic regression. We will also discuss how to handle class imbalance for the first, but not the last, time.
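
A minimal scikit-learn sketch; the iris data and the parameter values are illustrative, and with the default lbfgs solver a multiclass target is handled with a multinomial (softmax) model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# C is the inverse of regularization strength, useful with high variance or
# correlated predictors; class_weight="balanced" adjusts for class imbalance
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000),
)
model.fit(X, y)
```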

Chapter 11, Decision Trees and Random Forest Classification, returns to the decision tree and random forest algorithms that were introduced in Chapter 9, K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression, this time dealing with classification problems. This gives us another opportunity to learn how to construct and interpret decision trees. We will adjust key hyperparameters, including the depth of trees, to avoid overfitting. We will then explore random forest and gradient boosted decision trees as good, lower variance alternatives to decision trees.
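
A short sketch of a depth-limited tree and its lower-variance ensemble alternatives; the dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)

# limiting tree depth is one way to keep a single decision tree from overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # a readable view of the fitted tree's splits

# random forests and gradient boosted trees trade interpretability for lower variance
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```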

Chapter 12, K-Nearest Neighbors for Classification, returns to k-nearest neighbors (KNNs) to handle both binary and multiclass modeling problems. We will discuss and demonstrate the advantages of KNN – how easy it is to build a no-frills model and the limited number of hyperparameters to adjust. By the end of the chapter, we will know both how to use KNN and when to consider it for our modeling.
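
A no-frills KNN classifier really is short to set up, as this sketch suggests; the iris data and the neighbor grid are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# KNN has few hyperparameters; the number of neighbors is the one that matters most
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [3, 5, 7, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```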

Chapter 13, Support Vector Machine Classification, explores different strategies for implementing support vector classification (SVC). We will use linear SVC, which can perform very well when our classes are linearly separable. We will then examine how to use the kernel trick to extend SVC to cases where the classes are not linearly separable. Finally, we will use one-versus-one and one-versus-rest classification to handle targets with more than two values.
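
A minimal sketch of those strategies with scikit-learn; the dataset and parameter values are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

# a linear SVC works well when classes are (nearly) linearly separable;
# the RBF kernel extends SVC to classes that are not
linear_clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# explicit one-versus-one and one-versus-rest strategies for multiclass targets
ovo = OneVsOneClassifier(rbf_clf).fit(X, y)
ovr = OneVsRestClassifier(linear_clf).fit(X, y)
```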

Chapter 14, Naïve Bayes Classification, discusses the fundamental assumptions of naïve Bayes and how the algorithm is used to tackle some of the modeling challenges we have already explored, as well as some new ones, such as text classification. We will consider when naïve Bayes is a good option and when it is not. We will also examine the interpretation of naïve Bayes models.
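
A tiny, made-up text classification sketch showing why multinomial naïve Bayes pairs naturally with word-count features; the texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical labeled texts: 1 = complaint, 0 = praise
texts = ["terrible service and a long wait",
         "friendly staff and quick delivery",
         "the product arrived broken",
         "great value and fast shipping"]
labels = [1, 0, 1, 0]

# count words, then fit multinomial naive Bayes on the counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["slow delivery and broken packaging"]))
```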

Chapter 15, Principal Component Analysis, examines principal component analysis (PCA), including how it works and when we might want to use it. We will learn how to interpret the components created from PCA, including how each feature contributes to each component and how much of the variance is explained. We will learn how to visualize components and how to use components in subsequent analyses. We will also examine how to use kernels for PCA and when that might give us better results.
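
A brief sketch of PCA and kernel PCA with scikit-learn; the dataset and the 90% variance threshold are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import KernelPCA, PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# standardize first, then keep enough components to explain 90% of the variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.9))
components = pipe.fit_transform(X)
pca = pipe.named_steps["pca"]
print(pca.explained_variance_ratio_)   # variance explained by each component
print(pca.components_)                 # how each feature loads on each component

# kernel PCA can capture nonlinear structure that linear PCA misses
kpca = make_pipeline(StandardScaler(), KernelPCA(n_components=2, kernel="rbf"))
X_kpca = kpca.fit_transform(X)
```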

Chapter 16, K-Means and DBSCAN Clustering, explores two popular clustering techniques, k-means and density-based spatial clustering of applications with noise (DBSCAN). We will discuss the strengths of each approach and develop a sense of when to choose one clustering algorithm over the other. We will also learn how to evaluate our clusters and how to change hyperparameters to improve our model.
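
A compact sketch of both algorithms plus one common evaluation metric; the synthetic blobs and the eps and min_samples values are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# synthetic, well-separated clusters as a stand-in for real data
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# k-means needs the number of clusters up front; DBSCAN infers it from density
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# the silhouette score is one way to evaluate how well separated the clusters are
print(silhouette_score(X, kmeans_labels))
```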