Data Cleaning and Exploration with Machine Learning

By: Michael Walker

Overview of this book

Many individuals who know how to run machine learning algorithms do not have a good sense of the statistical assumptions they make and how to match the properties of the data to the algorithm for the best results. As you start with this book, models are carefully chosen to help you grasp the underlying data, including feature importance and correlation, and the distribution of features and targets. The first two parts of the book introduce you to techniques for preparing data for ML algorithms, without being bashful about using some ML techniques for data cleaning, including anomaly detection and feature selection. The book then helps you apply that knowledge to a wide variety of ML tasks. You’ll gain an understanding of popular supervised and unsupervised algorithms, how to prepare data for them, and how to evaluate them. Next, you’ll build models and understand the relationships in your data, as well as perform cleaning and exploration tasks with that data. You’ll make quick progress in studying the distribution of variables, identifying anomalies, and examining bivariate relationships, while keeping the focus on the accuracy of predictions. By the end of this book, you’ll be able to deal with complex data problems using unsupervised ML algorithms like principal component analysis and k-means clustering.
Table of Contents (23 chapters)

Section 1 – Data Cleaning and Machine Learning Algorithms
Section 2 – Preprocessing, Feature Selection, and Sampling
Section 3 – Modeling Continuous Targets with Supervised Learning
Section 4 – Modeling Dichotomous and Multiclass Targets with Supervised Learning
Section 5 – Clustering and Dimensionality Reduction with Unsupervised Learning

What this book covers

Chapter 1, Examining the Distribution of Features and Targets, explores using common NumPy and pandas techniques to get a better sense of the attributes of our data. We will generate summary statistics, such as the mean, minimum, maximum, and standard deviation, and count the number of missing values. We will also create visualizations of key features, including histograms and boxplots, to give us a better sense of the distribution of each feature than we can get by just looking at summary statistics. We will hint at the implications of feature distribution for data transformation, encoding and scaling, and the modeling that we will be doing in subsequent chapters with the same data.
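
As a preview of the kind of work this chapter covers, here is a minimal sketch with pandas and Matplotlib; the tiny DataFrame is invented for illustration and is not one of the book's datasets.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical data; any DataFrame with numeric columns would do
df = pd.DataFrame({"age": [23, 35, 31, None, 52, 47],
                   "income": [41000, 52000, 47000, 39000, None, 88000]})

# summary statistics: mean, min, max, standard deviation, and more
print(df.describe())

# count missing values per column
print(df.isnull().sum())

# histogram and boxplot to inspect a feature's distribution
df["income"].plot(kind="hist", title="income")
plt.show()
df.boxplot(column="income")
plt.show()
```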

Chapter 2, Examining Bivariate and Multivariate Relationships between Features and Targets, focuses on the correlation between possible features and target variables. We will use pandas methods for bivariate analysis, and Matplotlib for visualizations. We will discuss the implications of what we find for feature engineering and modeling. We also use multivariate techniques in this chapter to understand the relationship between features.
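
A minimal sketch of bivariate analysis with pandas and Matplotlib, using a small, made-up feature/target DataFrame rather than the book's data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical features and target
df = pd.DataFrame({"hours_studied": [2, 4, 6, 8, 10],
                   "prior_score": [55, 60, 58, 72, 80],
                   "exam_score": [52, 61, 70, 79, 88]})

# pairwise (Pearson) correlations between possible features and the target
print(df.corr())

# a simple bivariate view: scatter plot of one feature against the target
df.plot(kind="scatter", x="hours_studied", y="exam_score")
plt.show()
```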

Chapter 3, Identifying and Fixing Missing Values, goes over techniques for identifying missing values for each feature or target, and for identifying observations where values for a large number of the features are absent. We will explore strategies for imputing values, such as setting values to the overall mean, to the mean for a given category, or forward filling. We will also examine multivariate techniques for imputing missing values and discuss when they are appropriate.
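
A short pandas sketch of these imputation strategies, assuming a hypothetical DataFrame with a categorical column and a numeric column containing missing values:

```python
import pandas as pd

df = pd.DataFrame({"region": ["a", "a", "b", "b"],
                   "sales": [100.0, None, 90.0, None]})

# missing counts per column, and per row (observation)
print(df.isnull().sum())
print(df.isnull().sum(axis=1))

# impute with the overall mean, with the category (group) mean, or forward fill
overall_mean = df["sales"].fillna(df["sales"].mean())
group_mean = df.groupby("region")["sales"].transform(lambda x: x.fillna(x.mean()))
forward_filled = df["sales"].ffill()
```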

Chapter 4, Encoding, Transforming, and Scaling Features, covers a range of feature engineering techniques. We will use tools to drop redundant or highly correlated features. We will explore the most common kinds of encoding – one-hot, ordinal, and hashing encoding. We will also use transformations to improve the distribution of our features. Finally, we will use common binning and scaling approaches to address skew, kurtosis, and outliers, and to adjust for features with widely different ranges.
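
A minimal sketch of encoding, transforming, and scaling with pandas and scikit-learn; the small DataFrame and the choice of a log transformation are illustrative assumptions, not the book's own examples.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["small", "large", "medium"],
                   "income": [40000.0, 95000.0, 52000.0]})

# one-hot encoding with pandas; ordinal encoding with an explicit category order
onehot = pd.get_dummies(df["color"], prefix="color")
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(df[["size"]])

# a log transformation to reduce skew, followed by standard scaling
scaled = StandardScaler().fit_transform(np.log(df[["income"]]))
```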

Chapter 5, Feature Selection, goes over a number of feature selection methods, from filter to wrapper to embedded methods. We will explore how they work with categorical and continuous targets. For wrapper and embedded methods, we consider how well they work with different algorithms.
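
One way to see the three families side by side is the scikit-learn sketch below; the dataset, the estimator, and the number of features to keep are illustrative choices, not the book's.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale once for this illustration

# filter: keep the k features with the strongest univariate relationship to the target
X_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# wrapper: recursive feature elimination around an estimator
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit_transform(X, y)

# embedded: L1-regularized coefficients push weak features toward zero
embedded = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
print(np.sum(embedded.coef_ != 0), "features kept by the L1 model")
```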

Chapter 6, Preparing for Model Evaluation, will see us build our first full-fledged pipeline, separating our data into testing and training datasets, and learning how to do preprocessing without data leakage. We will implement k-fold cross-validation and look more closely at assessing the performance of our models.
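
A compact scikit-learn sketch of that workflow, with an illustrative dataset and estimator; the point is that scaling lives inside the pipeline so it is fit only on training folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# hold out a test set, then build a pipeline so preprocessing cannot leak
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# k-fold cross-validation on the training data only
scores = cross_val_score(pipe, X_train, y_train,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())
```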

Chapter 7, Linear Regression Models, is the first of several chapters on building regression models with an old favorite of many data scientists, linear regression. We will run a classical linear model while also examining the qualities of a feature space that make it a good candidate for a linear model. We will explore how to improve linear models, when necessary, with regularization and transformations. We will look into stochastic gradient descent as an alternative to ordinary least squares (OLS) optimization. We will also learn how to do hyperparameter tuning with grid searches.
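
A minimal sketch of OLS, a stochastic gradient descent alternative, and a grid search; the synthetic data and the alpha grid are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data, stand-in for the book's datasets
X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

# a classical OLS model
ols = LinearRegression().fit(X, y)

# stochastic gradient descent as an alternative optimizer, with a grid search
# over the regularization strength (the alpha grid here is a hypothetical start)
pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=2000, random_state=0))
grid = GridSearchCV(pipe, {"sgdregressor__alpha": [0.0001, 0.001, 0.01]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```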

Chapter 8, Support Vector Regression, discusses key support vector machine concepts and how they can be applied to regression problems. In particular, we will examine how concepts such as epsilon-insensitive tubes and soft margins can give us the flexibility to get the best fit possible, given our data and domain-related challenges. We will also explore, for the first time but definitely not the last, the very handy kernel trick, which allows us to model nonlinear relationships without transformations or increasing the number of features.
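
A brief scikit-learn sketch of those ideas; the synthetic data and the epsilon and C values are illustrative, not recommendations from the book.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=15, random_state=0)

# epsilon sets the width of the insensitive tube; C controls the soft margin penalty
linear_svr = make_pipeline(StandardScaler(), SVR(kernel="linear", epsilon=0.5, C=1.0))
linear_svr.fit(X, y)

# the kernel trick (here, RBF) models nonlinear relationships without adding features
rbf_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", epsilon=0.5, C=1.0))
rbf_svr.fit(X, y)
```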

Chapter 9, K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression, explores some of the most popular non-parametric regression algorithms. We will discuss the advantages of each algorithm, when you might want to choose one over the other, and possible modeling challenges. These challenges include how to avoid underfitting and overfitting through careful adjustment of hyperparameters.
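
A quick side-by-side sketch of the four regressors with scikit-learn; the synthetic data and the hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)

# n_neighbors, max_depth, and n_estimators are the main levers for balancing
# underfitting and overfitting (values here are only starting points)
models = {
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=100, max_depth=6, random_state=0),
    "boosted": GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```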

Chapter 10, Logistic Regression, is the first of several chapters on building classification models with logistic regression, an efficient algorithm with low bias. We will carefully examine the assumptions of logistic regression and discuss the attributes of a dataset and a modeling problem that make logistic regression a good choice. We will use regularization to address high variance or a large number of highly correlated predictors. We will extend the algorithm to multiclass problems with multinomial logistic regression. We will also discuss how to handle class imbalance for the first, but not the last, time.
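
A minimal scikit-learn sketch; the iris data and the parameter values are illustrative, and with the default lbfgs solver a multiclass target is handled with a multinomial (softmax) model.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# C is the inverse of regularization strength, useful with high variance or
# correlated predictors; class_weight="balanced" adjusts for class imbalance
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000),
)
model.fit(X, y)
```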

Chapter 11, Decision Trees and Random Forest Classification, returns to the decision tree and random forest algorithms that were introduced in Chapter 9, K-Nearest Neighbors, Decision Tree, Random Forest, and Gradient Boosted Regression, this time dealing with classification problems. This gives us another opportunity to learn how to construct and interpret decision trees. We will adjust key hyperparameters, including the depth of trees, to avoid overfitting. We will then explore random forest and gradient boosted decision trees as good, lower variance alternatives to decision trees.
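
A short sketch of a depth-limited tree and its lower-variance ensemble alternatives; the dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)

# limiting tree depth is one way to keep a single decision tree from overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree))  # a readable view of the fitted tree's splits

# random forests and gradient boosted trees trade interpretability for lower variance
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```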

Chapter 12, K-Nearest Neighbors for Classification, returns to k-nearest neighbors (KNNs) to handle both binary and multiclass modeling problems. We will discuss and demonstrate the advantages of KNN – how easy it is to build a no-frills model and the limited number of hyperparameters to adjust. By the end of the chapter, we will know both how to use KNN and when to consider it for our modeling.
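
A no-frills KNN classifier really is short to set up, as this sketch suggests; the iris data and the neighbor grid are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# KNN has few hyperparameters; the number of neighbors is the one that matters most
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [3, 5, 7, 11]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```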

Chapter 13, Support Vector Machine Classification, explores different strategies for implementing support vector classification (SVC). We will use linear SVC, which can perform very well when our classes are linearly separable. We will then examine how to use the kernel trick to extend SVC to cases where the classes are not linearly separable. Finally, we will use one-versus-one and one-versus-rest classification to handle targets with more than two values.
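
A minimal sketch of those strategies with scikit-learn; the dataset and parameter values are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

# a linear SVC works well when classes are (nearly) linearly separable;
# the RBF kernel extends SVC to classes that are not
linear_clf = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
rbf_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# explicit one-versus-one and one-versus-rest strategies for multiclass targets
ovo = OneVsOneClassifier(rbf_clf).fit(X, y)
ovr = OneVsRestClassifier(linear_clf).fit(X, y)
```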

Chapter 14, Naïve Bayes Classification, discusses the fundamental assumptions of naïve Bayes and how the algorithm is used to tackle some of the modeling challenges we have already explored, as well as some new ones, such as text classification. We will consider when naïve Bayes is a good option and when it is not. We will also examine the interpretation of naïve Bayes models.
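
A tiny, made-up text classification sketch showing why multinomial naïve Bayes pairs naturally with word-count features; the texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical labeled texts: 1 = complaint, 0 = praise
texts = ["terrible service and a long wait",
         "friendly staff and quick delivery",
         "the product arrived broken",
         "great value and fast shipping"]
labels = [1, 0, 1, 0]

# count words, then fit multinomial naive Bayes on the counts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["slow delivery and broken packaging"]))
```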

Chapter 15, Principal Component Analysis, examines principal component analysis (PCA), including how it works and when we might want to use it. We will learn how to interpret the components created from PCA, including how each feature contributes to each component and how much of the variance is explained. We will learn how to visualize components and how to use components in subsequent analyses. We will also examine how to use kernels for PCA and when that might give us better results.
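
A brief sketch of PCA and kernel PCA with scikit-learn; the dataset and the 90% variance threshold are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import KernelPCA, PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# standardize first, then keep enough components to explain 90% of the variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.9))
components = pipe.fit_transform(X)
pca = pipe.named_steps["pca"]
print(pca.explained_variance_ratio_)   # variance explained by each component
print(pca.components_)                 # how each feature loads on each component

# kernel PCA can capture nonlinear structure that linear PCA misses
kpca = make_pipeline(StandardScaler(), KernelPCA(n_components=2, kernel="rbf"))
X_kpca = kpca.fit_transform(X)
```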

Chapter 16, K-Means and DBSCAN Clustering, explores two popular clustering techniques, k-means and density-based spatial clustering of applications with noise (DBSCAN). We will discuss the strengths of each approach and develop a sense of when to choose one clustering algorithm over the other. We will also learn how to evaluate our clusters and how to change hyperparameters to improve our model.
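
A compact sketch of both algorithms plus one common evaluation metric; the synthetic blobs and the eps and min_samples values are illustrative assumptions.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# synthetic, well-separated clusters as a stand-in for real data
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# k-means needs the number of clusters up front; DBSCAN infers it from density
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# the silhouette score is one way to evaluate how well separated the clusters are
print(silhouette_score(X, kmeans_labels))
```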