Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying The Regularization Cookbook
  • Table Of Contents Toc
The Regularization Cookbook

The Regularization Cookbook

By : Vincent Vandenbussche
4.3 (7)
close
close
The Regularization Cookbook

The Regularization Cookbook

4.3 (7)
By: Vincent Vandenbussche

Overview of this book

Regularization is an infallible way to produce accurate results with unseen data, however, applying regularization is challenging as it is available in multiple forms and applying the appropriate technique to every model is a must. The Regularization Cookbook provides you with the appropriate tools and methods to handle any case, with ready-to-use working codes as well as theoretical explanations. After an introduction to regularization and methods to diagnose when to use it, you’ll start implementing regularization techniques on linear models, such as linear and logistic regression, and tree-based models, such as random forest and gradient boosting. You’ll then be introduced to specific regularization methods based on data, high cardinality features, and imbalanced datasets. In the last five chapters, you’ll discover regularization for deep learning models. After reviewing general methods that apply to any type of neural network, you’ll dive into more NLP-specific methods for RNNs and transformers, as well as using BERT or GPT-3. By the end, you’ll explore regularization for computer vision, covering CNN specifics, along with the use of generative models such as stable diffusion and Dall-E. By the end of this book, you’ll be armed with different regularization techniques to apply to your ML and DL models.
Table of Contents (14 chapters)
close
close

Loading data

The primary focus of this recipe is to load data from a CSV file. However, this is not the only thing that this recipe covers. Since the data is usually the first step in any ML project, this recipe is also a good opportunity to give a quick recap of the ML workflow, as well as the different types of data.

Getting ready

Before loading the data, we should keep in mind that an ML model follows a two-step process:

  1. Train a model on a given dataset to create a new model.
  2. Reuse the previously trained model to infer predictions on new data.

These two steps are summarized in the following figure:

Figure 2.1 – A simple view of the two-step ML process

Figure 2.1 – A simple view of the two-step ML process

Of course, in most cases, this is a rather simplistic view. A more detailed view can be seen in Figure 2.2:

Figure 2.2 – A more complete view of the ML process

Figure 2.2 – A more complete view of the ML process

Let’s take a closer look at the training part of the ML process shown in Figure 2.2:

  1. First, training data is queried from a data source (this can be a database, a data lake, an open dataset, and so on).
  2. The data is preprocessed, such as via feature engineering, rescaling, and so on.
  3. A model is trained and stored (on a data lake, locally, on the edge, and so on).
  4. Optionally, the output of this model is post-processed – for example, via formatting, heuristics, business rules, and more.
  5. Optionally again, this model (with or without postprocessing) is stored in a database for later reference or evaluation if needed.

Now, let’s take a look at the inference part of the ML process:

  1. The data is queried from a data source (a database, an API query, and so on).
  2. The data goes through the same preprocessing step as the training data.
  3. The trained model is fetched if it doesn’t already exist locally.
  4. The model is used to infer output.
  5. Optionally, the output of the model is post-processed via the same post-processing step as the training data.
  6. Optionally, the output is stored in a database for monitoring and later reference.

Even in this schema, many steps were not mentioned: splitting data for training purposes, using evaluation metrics, cross-validation, hyperparameter optimization, and others. This chapter will dive into the more training-specific steps and apply them to the very common but practical Titanic dataset, a binary classification problem. But first, we need to load the data.

To do so, you must download the Titanic dataset training set locally. This can be performed with the following command line:

wget https://github.com/PacktPublishing/The-Regularization-Cookbook/blob/main/chapter_02/train.csv

How to do it…

This recipe is about loading a CSV file and displaying a few lines of code so that we can have a first glance at what it is about:

  1. The first step is to import the required libraries. Here, the only library we need is pandas:
    import pandas as pd
  2. Now, we can load the data using the read_csv function provided by pandas. The first argument is the path to the file. Assuming the file is named train.csv and located in the current folder, we only have to provide train.csv as an argument:
    # Load the data as a DataFrame
    df = pd.read_csv('train.csv')

The returned object is a dataframe object, which provides many useful methods for data processing.

  1. Now, we can display the first five lines of the loaded file using the .head() method:
    # Display the first 5 rows of the dataset
    df.head()

This code will output the following:

   PassengerId  Survived  Pclass  \
0        1            0         3
1        2            1         1
2        3            1         3
3        4            1         1
4        5            0         3
      Name                      Sex   Age     SibSp  \
0   Braund, Mr. Owen Harris     male  22.0       1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...
                               female  38.0        1
2  Heikkinen, Miss. Laina  female  26.0        0
3  Futrelle, Mrs. Jacques Heath (Lily May Peel)
                            female  35.0        1
4  Allen, Mr. William Henry     male  35.0        0
 Parch      Ticket   Fare   Cabin        Embarked
0  0         A/5   21171   7.2500   NaN           S
1  0       PC 17599  71.2833   C85       C
2  0      STON/O2. 3101282   7.9250   NaN       S
3  0        113803  53.1000  C123           S
4  0        373450   8.0500   NaN    S

Here is a description of the data types in each column:

  • PassengerId (qualitative): A unique, arbitrary ID for each passenger.
  • Survived (qualitative): 1 for yes, 0 for no. This is our label, so this is a binary classification problem.
  • Pclass (quantitative, discrete): The class, which is arguably quantitative. Is class 1 better than class 2? Most likely yes.
  • Name (unstructured): The name and title of the passenger.
  • Sex (qualitative): The registered sex of the passenger, either male or female.
  • Age (quantitative, discrete): The age of the passenger.
  • SibSp (quantitative, discrete): The number of siblings and spouses on board.
  • Parch (quantitative, discrete): The number of parents and children on board.
  • Ticket (unstructured): The ticket reference.
  • Fare (quantitative, continuous): The ticket price.
  • Cabin (unstructured): The cabin number, which is arguably unstructured. It can be seen as a qualitative feature with high cardinality.
  • Embarked (qualitative): The embarked city, either Southampton (S), Cherbourg (C), or Queenstown (Q).

There’s more…

Let’s talk about the different types of data that are available. Data is a very generic word and can describe many things. We are surrounded by data all the time. One way to specify data is using opposites.

Data can be structured or unstructured:

  • Structured data comes in the form of tables, databases, Excel files, CSV files, and JSON files.
  • Unstructured data does not fit in a table: it can be text, sound, image, videos, and so on. Even if we tend to have tabular representation, this kind of data does not naturally fit in an Excel table.

Data can be quantitative or qualitative.

Quantitative data is ordered. Here are some examples:

  • €100 is greater than €10
  • 1.8 meters is taller than 1.6 meters
  • 18 years old is younger than 80 years old

Qualitative data has no intrinsic order, as shown here:

  • Blue is not intrinsically better than red
  • A dog is not intrinsically greater than a cat
  • A kitchen is not intrinsically more useful than a bathroom

These are not mutually exclusive. An object can have both quantitative and qualitative features, as can be seen in the case of the car in the following figure:

Figure 2.3 – A single object depicted by both quantitative (left) and qualitative (right) features

Figure 2.3 – A single object depicted by both quantitative (left) and qualitative (right) features

Finally, data can be continuous or discrete.

Some data is continuous, as follows:

  • A weight
  • A volume
  • A price

On the other hand, some data is discrete:

  • A color
  • A football score
  • A nationality

Note

Discrete != qualitative.

For example, a football score is discrete, but there is an intrinsic order: 3 points is more than 2.

See also

The pandas read_csv function has a lot of flexibility as it can use other separators, handle headers, and much more. This is described in the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.

The pandas library allows I/O operations that have different types of inputs. For more information, have a look at the official documentation: https://pandas.pydata.org/docs/reference/io.html.

CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
The Regularization Cookbook
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist download Download options font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon