Essential Statistics for Non-STEM Data Analysts

By: Rongpeng Li

Overview of this book

Statistics remains the backbone of modern analysis tasks, helping you to interpret the results produced by data science pipelines. This book is a detailed guide covering the math and the statistical methods required for undertaking data science tasks. The book starts by showing you how to preprocess data and inspect distributions and correlations from a statistical perspective. You’ll then get to grips with the fundamentals of statistical analysis and apply its concepts to real-world datasets. As you advance, you’ll find out how statistical concepts emerge at different stages of data science pipelines, learn to summarize datasets in the language of statistics, and use that foundation to build robust data products such as explanatory and predictive models. Once you’ve uncovered the working mechanisms of data science algorithms, you’ll cover essential concepts for efficient data collection, cleaning, mining, visualization, and analysis. Finally, you’ll apply statistical methods to key machine learning tasks such as classification, regression, tree-based methods, and ensemble learning. By the end of this book, you’ll have learned how to build and present a self-contained, statistics-backed data product that meets your business goals.
Table of Contents (19 chapters)

Section 1: Getting Started with Statistics for Data Science
Section 2: Essentials of Statistical Analysis
Section 3: Statistics for Machine Learning
Section 4: Appendix

Examples involving the scikit-learn preprocessing module

For both imputation and standardization, scikit-learn offers similar APIs:

  1. First, fit the imputer or standardizer to the data so that it learns the necessary parameters (for example, column means).
  2. Then, use the fitted object to transform new data.

In this section, I will demonstrate two examples, one for imputation and another for standardization.

Note

Scikit-learn uses the same fit and predict syntax for its predictive models. This is very good practice for keeping the interface consistent. We will cover the machine learning methods in later chapters.
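To see this shared interface in action, here is a minimal sketch; the LinearRegression model and the tiny made-up dataset are purely illustrative, but the fit-then-use pattern is the same one described above:

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset, purely for illustration: y = 2 * x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

model = LinearRegression()
model.fit(X, y)                # learn the coefficients, analogous to fitting an imputer
print(model.predict([[4.0]]))  # apply the fitted model, analogous to transform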

Imputation

First, create an imputer from the SimpleImputer class. At initialization, the missing_values parameter lets you specify which value is treated as missing. This is handy because our original data marked missing entries with a question mark; once those are converted to np.nan, the imputer can pick them up:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace every np.nan entry with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
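If your copy of the dataset still contains literal question marks, a common preparatory step, sketched here with a hypothetical raw frame named df_raw, is to replace them with np.nan before imputing:

# Hypothetical raw frame where "?" marks missing entries
df2 = df_raw.replace("?", np.nan).astype(float)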

Note that fit and transform can accept the same input:

imputer.fit(df2)                            # learn each column's mean
df3 = pd.DataFrame(imputer.transform(df2))  # fill in the missing values
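Since we fit and transform the same data here, the two calls can also be combined with scikit-learn's fit_transform method, which is equivalent:

df3 = pd.DataFrame(imputer.fit_transform(df2))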

Now, check the number of missing values – the result should be 0:

np.sum(np.sum(np.isnan(df3)))
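An equivalent check using pandas' own methods:

df3.isna().sum().sum()  # total number of missing entries; should be 0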

Standardization

Standardization can be implemented in a similar fashion:

from sklearn import preprocessing

The scale function provides the default transformation to zero mean and unit standard deviation:

df4 = pd.DataFrame(preprocessing.scale(df2))
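The scale function is a one-off convenience; when the same transformation must later be applied to unseen data, the StandardScaler estimator offers the familiar fit/transform interface. A minimal sketch (the name df4_alt is hypothetical; it reproduces df4 from above):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df2)  # learn each column's mean and standard deviation
df4_alt = pd.DataFrame(scaler.transform(df2))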

Note

In this example, categorical variables encoded as integers are also scaled to zero mean, which should be avoided in production; categorical columns should be excluded from standardization.
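One way to avoid this, sketched here with hypothetical column positions that you would adapt to your dataset, is to scale only the continuous columns and let the categorical ones pass through unchanged using scikit-learn's ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical positions of the continuous columns: adapt to your dataset
numeric_cols = [0, 3, 4, 7]

ct = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",  # categorical columns pass through untouched
)
df_scaled = pd.DataFrame(ct.fit_transform(df2))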

Let's check the mean and standard deviation. The following line outputs values that are effectively zero, differing from 0 only by floating-point error:

df4.mean(axis=0)

The following line outputs values close to 1:

df4.std(axis=0)

Let's look at an example of MinMaxScaler, which transforms every variable into the range [0, 1]. The following code fits and transforms the heart disease dataset in one step. It is left to you to examine its validity:

minMaxScaler = preprocessing.MinMaxScaler()
df5 = pd.DataFrame(minMaxScaler.fit_transform(df2))  # fit and transform in a single call
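As a hint for that check, the per-column minima and maxima of the transformed frame should come out as 0 and 1:

df5.min(axis=0)  # should be 0 for every column
df5.max(axis=0)  # should be 1 for every column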

Let's now summarize what we have learned in this chapter.