Splitting data

After loading data, splitting it is a crucial step. This recipe will explain why we need to split data, as well as how to do it.

Getting ready

Why do we need to split data? An ML model is quite like a student.

You provide a student with many lectures and exercises, with or without the answers, but more often than not, students are evaluated on a completely new problem. To pass such an evaluation, they cannot simply memorize the exercises and their solutions – they must understand the underlying concepts and methods.

An ML model is no different: you train the model on training data and then evaluate it on test data. This way, you make sure the model fully understands the task and generalizes well to new, unseen data.

So, the dataset is usually split into train and test sets:

  • The train set should be as large as possible, so that the model can learn from as many samples as possible
  • The test set must be large enough to be statistically significant in evaluating the model

Typical splits range from 80%/20% for rather small datasets (for example, hundreds of samples) to 99%/1% for very large datasets (for example, millions of samples or more). Even a 1% test set of a million-sample dataset still contains 10,000 samples, which is usually enough for a statistically significant evaluation.

For this recipe and the others in this chapter, it is assumed that the code has been executed in the same notebook as the previous recipe since each recipe reuses the code from the previous ones.

How to do it…

Here are the steps to try out this recipe:

  1. You can split the data rather easily with scikit-learn and the train_test_split() function:
    # Import the train_test_split function
    from sklearn.model_selection import train_test_split
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns=['Survived']), df['Survived'],
        test_size=0.2, stratify=df['Survived'],
        random_state=0)

This function uses the following parameters as input:

  • X: All columns but the 'Survived' label
  • y: The 'Survived' label column
  • test_size: This is 0.2, which means the test set will contain 20% of the samples, leaving 80% for the train set
  • stratify: This specifies the 'Survived' column to ensure the same label balance is used in both splits
  • random_state: Any fixed integer (here, 0) to make the split reproducible

It returns the following outputs:

  • X_train: The train split of X
  • X_test: The test split of X
  • y_train: The training split of y, associated with X_train
  • y_test: The test split of y, associated with X_test
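
To confirm that the split behaves as expected, here is a quick sanity check (a minimal sketch, not part of the recipe itself; it assumes df is the DataFrame loaded in the previous recipe) that prints the split shapes and label proportions:

    # The two splits should contain roughly 80% and 20% of the rows
    print(X_train.shape, X_test.shape)
    # Thanks to stratify, the label proportions should be
    # nearly identical in both splits
    print(y_train.value_counts(normalize=True))
    print(y_test.value_counts(normalize=True))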

Note

The stratify option is not mandatory, but it can be critical for obtaining a balanced split of any qualitative feature, not just the labels – this is especially important with imbalanced data.
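
As an illustration of this point (not taken from the recipe itself; 'Pclass' is simply one qualitative column of this dataset, chosen as an example), the split could instead be stratified on a feature:

    # Illustrative: stratify on a qualitative feature ('Pclass')
    # instead of the labels, so that its proportions are
    # preserved in both splits
    X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
        df.drop(columns=['Survived']), df['Survived'],
        test_size=0.2, stratify=df['Pclass'],
        random_state=0)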

This split should be done as early as possible when performing data processing so that you avoid any potential data leakage. From now on, all the preprocessing will be computed on the train set, and only then applied to the test set, in agreement with Figure 2.2.
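
For example, scaling would be handled as follows (a minimal sketch, not from the book, restricted to numerical columns for simplicity):

    from sklearn.preprocessing import StandardScaler

    # Keep only the numerical columns for this illustration
    num_train = X_train.select_dtypes(include='number')
    num_test = X_test[num_train.columns]
    # Fit the scaler on the train set only...
    scaler = StandardScaler()
    num_train_scaled = scaler.fit_transform(num_train)
    # ...then apply the already-fitted scaler to the test set,
    # so no information from the test set leaks into training
    num_test_scaled = scaler.transform(num_test)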

See also

See the official documentation for the train_test_split function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.
