Book Image

Essential Statistics for Non-STEM Data Analysts

By : Rongpeng Li
Book Image

Essential Statistics for Non-STEM Data Analysts

By: Rongpeng Li

Overview of this book

Statistics remain the backbone of modern analysis tasks, helping you to interpret the results produced by data science pipelines. This book is a detailed guide covering the math and various statistical methods required for undertaking data science tasks. The book starts by showing you how to preprocess data and inspect distributions and correlations from a statistical perspective. You’ll then get to grips with the fundamentals of statistical analysis and apply its concepts to real-world datasets. As you advance, you’ll find out how statistical concepts emerge from different stages of data science pipelines, understand the summary of datasets in the language of statistics, and use it to build a solid foundation for robust data products such as explanatory models and predictive models. Once you’ve uncovered the working mechanism of data science algorithms, you’ll cover essential concepts for efficient data collection, cleaning, mining, visualization, and analysis. Finally, you’ll implement statistical methods in key machine learning tasks such as classification, regression, tree-based methods, and ensemble learning. By the end of this Essential Statistics for Non-STEM Data Analysts book, you’ll have learned how to build and present a self-contained, statistics-backed data product to meet your business goals.
Table of Contents (19 chapters)
1
Section 1: Getting Started with Statistics for Data Science
5
Section 2: Essentials of Statistical Analysis
10
Section 3: Statistics for Machine Learning
15
Section 4: Appendix

Data standardization – when and how

Data standardization is a common preprocessing step. I use the terms standardization and normalization interchangeably. You may also encounter the concept of rescaling in literature or blogs.

Standardization often means shifting the data to be zero-centered with a standard deviation of 1. The goal is to bring variables with different units/ranges down to the same range. Many machine learning tasks are sensitive to data magnitudes. Standardization is supposed to remove such factors.

Rescaling doesn't necessarily bring the variables to a common range. This is done by means of customized mapping, usually linear, to scale original data to a different range. However, the common approach of min-max scaling does transform different variables into a common range [0, 1].

People may argue about the difference between standardization and normalization. When comparing their differences, normalization will refer to normalizing different variables to the same range [0, 1], and min-max scaling is considered a normalization algorithm. However, there are other normalization algorithms as well. Standardization cares more about the mean and standard deviation.

Standardization also transforms the original distribution closer to a Gaussian distribution. In the event that the original distribution is indeed Gaussian, standardization outputs a standard Gaussian distribution.

When to perform standardization

Perform standardization when your downstream tasks require it. For example, the k-nearest neighbors method is sensitive to variable magnitudes, so you should standardize the data. On the other hand, tree-based methods are not sensitive to different ranges of variables, so standardization is not required.

There are mature libraries to perform standardization. We first calculate the standard deviation and mean of the data, subtract the mean from every entry, and then divide by the standard deviation. Standard deviation describes the level of variety in data that will be discussed more in Chapter 2, Essential Statistics for Data Assessment.

Here is an example involving vanilla Python:

stdChol = np.std(chol)
meanChol = np.mean(chol)
chol2 = chol.apply(lambda x: (x-meanChol)/stdChol)
plt.hist(chol2,bins=range(int(min(chol2)), int(max(chol2))+1, 1));

The output is as follows:

Figure 1.14 – Standardized cholesterol data

Figure 1.14 – Standardized cholesterol data

Note that the standardized distribution looks more like a Gaussian distribution now.

Data standardization is irreversible. Information will be lost in standardization. It is only recommended to do so when no original information, such as magnitudes or original standard deviation, will be required later. In most cases, standardization is a safe choice for most downstream data science tasks.

In the next section, we will use the scikit-learn preprocessing module to demonstrate tasks involving standardization.