The Regularization Cookbook

By Vincent Vandenbussche

Overview of this book

Regularization is a reliable way to produce models that perform well on unseen data; however, applying it is challenging because it comes in many forms, and choosing the appropriate technique for each model matters. The Regularization Cookbook provides you with the tools and methods to handle any case, with ready-to-use working code as well as theoretical explanations. After an introduction to regularization and methods to diagnose when to use it, you’ll start implementing regularization techniques on linear models, such as linear and logistic regression, and on tree-based models, such as random forests and gradient boosting. You’ll then be introduced to regularization methods specific to the data at hand, such as high-cardinality features and imbalanced datasets. In the last five chapters, you’ll discover regularization for deep learning models. After reviewing general methods that apply to any type of neural network, you’ll dive into NLP-specific methods for RNNs and transformers, including BERT and GPT-3. Finally, you’ll explore regularization for computer vision, covering CNN specifics, along with the use of generative models such as Stable Diffusion and DALL-E. By the end of this book, you’ll be armed with a range of regularization techniques to apply to your ML and DL models.

Preparing quantitative data

Depending on the type of data, how the features must be prepared may differ. In this recipe, we’ll cover how to prepare quantitative data, including missing data imputation and rescaling.

Getting ready

The Titanic dataset, like any other dataset, may contain missing data. There are several ways to deal with it: for example, you can drop a column or a row, or impute a value. There are many imputation techniques, ranging from simple to quite sophisticated. scikit-learn supplies several imputer implementations, such as SimpleImputer and KNNImputer.

As we will see in this recipe, using SimpleImputer, we can impute the missing quantitative data with the mean value.
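
For comparison, here is a minimal sketch of the KNNImputer alternative mentioned above, assuming the same X_train and X_test DataFrames and the Titanic columns that are used later in this recipe:

# Sketch: KNN-based imputation as an alternative to SimpleImputer
from sklearn.impute import KNNImputer
quanti_columns = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
# Fit on the train set only, then transform both sets, to avoid leakage
knn_imputer = KNNImputer(n_neighbors=5)
X_train_knn = knn_imputer.fit_transform(X_train[quanti_columns])
X_test_knn = knn_imputer.transform(X_test[quanti_columns])

KNNImputer fills each missing value based on the rows that are closest on the remaining features, which can be more faithful than a global mean when features are correlated.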

Once the missing data has been handled, we can prepare the quantitative data by rescaling it so that all the features are on the same scale.

Several rescaling strategies exist, such as min-max scaling, robust scaling, and standard scaling.
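
For reference, here is a minimal sketch of the first two of these strategies, assuming a hypothetical numeric feature matrix X (not a variable from this recipe):

from sklearn.preprocessing import MinMaxScaler, RobustScaler
# Min-max scaling maps each feature to the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)
# Robust scaling centers each feature on its median and scales by the
# interquartile range, which makes it less sensitive to outliers
X_robust = RobustScaler().fit_transform(X)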

In this recipe, we will use standard scaling. For each feature, we will subtract the mean value of this feature, and then divide it by the standard deviation of that feature:

x_scaled = (x - μ) / σ

Here, μ is the mean and σ is the standard deviation of the feature, both computed on the train set.

Fortunately, scikit-learn provides a fully working implementation via StandardScaler.
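
As a quick sanity check, here is a sketch on made-up toy data showing that StandardScaler matches the formula above:

import numpy as np
from sklearn.preprocessing import StandardScaler
# Toy, illustrative data: one feature, four samples
X = np.array([[1.0], [2.0], [3.0], [4.0]])
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # (x - mean) / std
scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))  # True

Note that StandardScaler, like np.std by default, uses the biased standard deviation, so the two results match exactly.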

How to do it…

We will sequentially handle missing values and rescale the data in this recipe:

  1. Import the required classes – SimpleImputer for missing data imputation and StandardScaler for rescaling:
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
  2. Select the quantitative features we want to keep. Here, we will keep 'Pclass', 'Age', 'Fare', 'SibSp', and 'Parch' and store these features in new variables for both the train and test sets:
    quanti_columns = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
    # Get the quantitative columns
    X_train_quanti = X_train[quanti_columns]
    X_test_quanti = X_test[quanti_columns]
  3. Instantiate the simple imputer with a mean strategy. Here, the missing value of a feature will be replaced with the mean value of that feature:
    # Impute missing quantitative values with mean feature value
    quanti_imputer = SimpleImputer(strategy='mean')
  4. Fit the imputer on the train set, and only apply the fitted imputer to the test set, so that no information from the test set leaks into the imputation:
    # Fit and impute the training set
    X_train_quanti = quanti_imputer.fit_transform(X_train_quanti)
    # Just impute the test set
    X_test_quanti = quanti_imputer.transform(X_test_quanti)
  5. Now that imputation has been performed, instantiate the scaler object:
    # Instantiate the standard scaler
    scaler = StandardScaler()
  6. Finally, fit and apply the standard scaler to the train set, and then apply it to the test set:
    # Fit and transform the training set
    X_train_quanti = scaler.fit_transform(X_train_quanti)
    # Just transform the test set
    X_test_quanti = scaler.transform(X_test_quanti)

We now have quantitative data with no missing values, fully rescaled, with no data leakage.
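
As a side note, steps 1 to 6 can be condensed into a single scikit-learn Pipeline. Here is a minimal sketch of that variant, reusing the quanti_columns, X_train, and X_test variables from above:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Chain the imputer and the scaler into one estimator
quanti_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
# Fit on the train set only, then transform the test set: no leakage
X_train_quanti = quanti_pipeline.fit_transform(X_train[quanti_columns])
X_test_quanti = quanti_pipeline.transform(X_test[quanti_columns])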

There’s more…

In this recipe, we used the simple imputer, assuming there was missing data. In practice, it is highly recommended that you look at the data first to check whether there are missing values, as well as how many. It is possible to look at the number of missing values per column with the following code snippet:

# Display the number of missing data for each column
X_train[quanti_columns].isna().sum()

This will output the following:

Pclass      0
Age       146
Fare        0
SibSp       0
Parch       0

Thanks to this, we know that the Age feature has 146 missing values, while the other features have no missing data.
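
If relative numbers are more convenient, a small variation of the snippet above displays the share of missing values per column:

# Display the percentage of missing data for each column
(X_train[quanti_columns].isna().mean() * 100).round(1)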

See also

A few imputers are available in scikit-learn. The list is available here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute.

There are many ways to scale data, and you can find the methods that are available in scikit-learn here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing.

You might be interested in looking at this comparison of several scalers on some given data: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py.
