
Data imputation

Missing data is ubiquitous, and data imputation techniques help us alleviate its influence.

In this section, we are going to use the heart disease data to examine the pros and cons of basic data imputation. I recommend you read the dataset description beforehand to understand the meaning of each column.

Preparing the dataset for imputation

The heart disease dataset is the same one we used earlier in the Collecting data from various data sources section. It should raise a red flag: you shouldn't take data integrity for granted. The following screenshot shows missing data denoted by question marks:

Figure 1.2 – The head of Hungarian heart disease data in VS Code (CSV rainbow extension enabled)

First, let's do an info() call that lists column data type information:

df.info()

Note

df.info() is a very helpful function that provides you with pointers for your next move. It should be the first function call when given an unknown dataset.

The following screenshot shows the output obtained from the preceding function:

Figure 1.3 – Output of the info() function call

If pandas can't infer a single data type for a column, it falls back to the object dtype. For example, the chol (cholesterol) column contains missing data: the question marks are treated as strings, while the remaining values are floats. Mixed together, the records are collectively stored as object.
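
To see this behavior in isolation, here is a minimal sketch with a made-up toy column (not the actual dataset):

import pandas as pd
import numpy as np

# A toy column mixing floats with the "?" placeholder used for missing data.
mixed = pd.Series([233.0, 286.0, "?", 250.0])
print(mixed.dtype)    # object -- pandas cannot settle on a single numeric type

# Once the placeholder becomes NaN, pandas can store the column as float again.
numeric = mixed.replace("?", np.nan).astype(float)
print(numeric.dtype)  # float64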

Python's type tolerance

As Python is quite error-tolerant, it is good practice to introduce explicit type checks. For example, if a column mixes strings and numerical values, don't rely on truthiness checks; explicitly check the type of each value and handle each case in its own branch, as sketched below. It is also advisable to avoid implicit type conversion on columns with the object dtype. Remember to make your code completely deterministic and future-proof.
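
As an illustration of that advice, an explicit type check with separate branches might look like the following sketch (the function name and the positivity check are made up for illustration, not taken from the book):

import numpy as np

def is_positive_measurement(val):
    """Branch explicitly on type instead of relying on truthiness."""
    if isinstance(val, str):
        # Strings such as "?" are missing-value markers, never measurements.
        return False
    if isinstance(val, (int, float, np.integer, np.floating)):
        # NaN compares as False here, which is the behavior we want.
        return val > 0
    raise TypeError(f"Unexpected type: {type(val)!r}")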

Now, let's replace the question marks with NaN values. The following code snippet declares a function that handles three different cases appropriately. The three cases are listed here:

  • The record value is "?".
  • The record value is an integer. This case is handled separately because columns such as num should remain binary; converting them to floats would defeat the purpose of 0-1 encoding.
  • Everything else: valid strings that can be converted to floats, and values that are already floats.

The code snippet will be as follows:

import numpy as np

def replace_question_mark(val):
    # Case 1: the "?" placeholder marks a missing value.
    if val == "?":
        return np.nan
    # Case 2: integers (for example, 0-1 encoded columns such as num) stay as they are.
    elif type(val) == int:
        return val
    # Case 3: everything else is converted to a float.
    else:
        return float(val)

df2 = df.copy()
for column_name, _ in df2.items():
    df2[column_name] = df2[column_name].apply(replace_question_mark)
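
As an aside, if you prefer a vectorized approach, the same replacement can be done with built-in pandas methods; this sketch is equivalent in spirit to the loop above, although columns containing NaN will end up as float even if they are integer-coded:

import pandas as pd

# Swap the "?" marker for NaN everywhere, then convert each column to a numeric dtype.
df2_alt = df.replace("?", np.nan).apply(pd.to_numeric)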

Now we call the info() function again (we will look at head() shortly):

df2.info()

You should expect that all fields are now either floats or integers, as shown in the following output:

Figure 1.4 – Output of info() after data type conversion

Now you can check the number of non-null entries for each column; different columns have different levels of completeness. age and sex don't contain any missing values, but ca contains almost no valid data. This should guide your choice of data imputation. For example, strictly dropping all rows with missing values, which is sometimes counted as a form of imputation, would remove almost the entire dataset. Let's check the shape of the DataFrame after the default missing-value drop; you will see that there is only one row left, which is not what we want:

df2.dropna().shape

A screenshot of the output is as follows:

Figure 1.5 – Removing records containing NaN values leaves only one entry

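If you prefer an explicit per-column count of missing values over reading them off the info() output, a quick check might look like this:

# Count the missing entries in each column, most complete columns first.
print(df2.isna().sum().sort_values())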

Before moving on to more mainstream imputation methods, let's perform a quick review of our processed DataFrame.

Check the head of the new DataFrame. You should see that all question marks have been replaced by NaN values. NaN is a legitimate floating-point value, so NaN-aware NumPy functions can operate on these columns directly:

df2.head()

The output should look as follows:

Figure 1.6 – The head of the updated DataFrame

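To see how NaN interacts with NumPy, compare a plain mean with its NaN-aware counterpart (a small illustration on the chol column, not from the book):

chol_values = df2["chol"].to_numpy()

print(np.mean(chol_values))         # nan -- a single NaN poisons the plain mean
print(np.nanmean(chol_values))      # the NaN-aware mean ignores missing entries
print(np.isnan(chol_values).sum())  # number of missing entries in the column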

Now, let's call the describe() function, which generates a table of statistics. It is a very helpful and handy function for a quick peek at the common statistics of our dataset:

df2.describe()

Here is a screenshot of the output:

Figure 1.7 – Output from the describe() call

Understanding the describe() limitation

Note that the describe() function only considers valid (non-missing) values, so in this sample the average age value is more trustworthy than the average thal value. Also pay attention to the metadata: a numerical value doesn't necessarily have a numerical meaning. For example, thal values are integer codes with assigned meanings.
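
One way to see how many valid values stand behind each column's statistics is to read off the count row of describe() (a quick check using the df2 DataFrame from above):

# The count row shows how many non-missing values each column's statistics rest on.
print(df2.describe().loc["count"])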

Now, let's examine the two most common ways of imputation.

Imputation with mean or median values

Imputation with mean or median values only works on numerical data. Categorical variables have no ordinal structure (one label isn't larger than another), so the concepts of mean and median don't apply.

There are several advantages associated with mean/median imputation:

  • It is easy to implement.
  • Mean/median imputation doesn't introduce extreme values.
  • It is fast, with no significant computational cost.

However, mean/median imputation has statistical consequences: the statistics of the dataset will change. For reference, the histogram for cholesterol prior to imputation is shown here:

Figure 1.8 – Cholesterol concentration distribution

The following code snippet performs the imputation with the mean. After imputation with the mean, the histogram shifts to the right a little bit:

import matplotlib.pyplot as plt

chol = df2["chol"]
plt.hist(chol.apply(lambda x: np.mean(chol) if np.isnan(x) else x),
         bins=range(0, 630, 30))
plt.xlabel("cholesterol imputation")
plt.ylabel("count")

Figure 1.9 – Cholesterol concentration distribution with mean imputation
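
Mean imputation leaves the column mean untouched but shrinks its spread; the following sketch (reusing the chol series defined above) makes this concrete:

chol_mean_imputed = chol.apply(lambda x: np.mean(chol) if np.isnan(x) else x)

print(chol.mean(), chol.std())                            # statistics over valid values only
print(chol_mean_imputed.mean(), chol_mean_imputed.std())  # same mean, smaller standard deviation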

Imputation with the median will shift the peak to the left instead, because the median is smaller than the mean. However, the difference won't be obvious if you enlarge the bin size, as the median and mean will then likely fall into the same bin:

Figure 1.10 – Cholesterol concentration distribution with median imputation
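
A median-imputation version of the earlier snippet might look like the following sketch (the axis label is illustrative):

# Same idea as the mean-imputation plot, but filling the gaps with the median instead.
plt.hist(chol.apply(lambda x: np.nanmedian(chol) if np.isnan(x) else x),
         bins=range(0, 630, 30))
plt.xlabel("cholesterol median imputation")
plt.ylabel("count")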

The good news is that the shape of the distribution looks rather similar. The bad news is that we have probably made the distribution a little more concentrated, that is, reduced its spread. We will cover such statistics in Chapter 3, Visualization with Statistical Graphs.

Note

In other cases, where the distribution is not centered or a substantial fraction of the data is missing, such imputation can be disastrous. For example, if the waiting time in a restaurant follows an exponential distribution, imputation with mean values will likely break the characteristic shape of the distribution, as the sketch below illustrates.
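
To make this concrete, here is a small simulation with synthetic waiting times (not the heart disease data) showing how mean imputation piles a spike of values at the mean of an exponential distribution:

rng = np.random.default_rng(42)

waiting = rng.exponential(scale=10, size=1000)  # synthetic exponential waiting times
waiting[rng.random(1000) < 0.4] = np.nan        # drop roughly 40% of the values

imputed = np.where(np.isnan(waiting), np.nanmean(waiting), waiting)

plt.hist(imputed, bins=30)
plt.xlabel("waiting time (mean-imputed)")
plt.ylabel("count")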

Imputation with the mode/most frequent value

The advantage of using the most frequent value (the mode) is that it works with categorical features; the drawback is that it will, without a doubt, introduce bias. The slope field is categorical in nature, although it looks numerical: it encodes three statuses of the slope value, namely positive, flat, or negative.

The following code snippet illustrates this:

plt.hist(df2["slope"], bins=5)
plt.xlabel("slope")
plt.ylabel("count");

Here is the output:

Figure 1.11 – Counting of the slope variable

Clearly, the mode is 2. Following imputation with the mode, we obtain the following new distribution:

plt.hist(df2["slope"].apply(lambda x: 2 if np.isnan(x) else x), bins=5)
plt.xlabel("slope mode imputation")
plt.ylabel("count");

In the following graph, pay attention to the scale of the y axis:

Figure 1.12 – Counting of the slope variable after mode imputation

Replacing missing values with the mode is disastrous in this case. If positive and negative slope values have medical consequences, prediction tasks performed on the preprocessed dataset will understate their weight and significance.
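
You can quantify this distortion by comparing the category proportions before and after mode imputation (a quick check on the df2 DataFrame from above):

slope = df2["slope"]
slope_mode_imputed = slope.apply(lambda x: 2 if np.isnan(x) else x)

print(slope.value_counts(normalize=True))               # proportions among observed values
print(slope_mode_imputed.value_counts(normalize=True))  # category 2 is now over-represented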

Different imputation methods have their own pros and cons. The prerequisite is to fully understand your business goals and downstream tasks. If key statistics are important, you should try to avoid distorting them. Also, do remember that collecting more data is always an option.