Loading data

The primary focus of this recipe is loading data from a CSV file, but it covers more than that. Since loading data is usually the first step in any ML project, this recipe is also a good opportunity to give a quick recap of the ML workflow, as well as the different types of data.

Getting ready

Before loading the data, we should keep in mind that an ML model follows a two-step process:

Train a model on a given dataset to produce a trained model.
Reuse the previously trained model to infer predictions on new data.

These two steps are summarized in the following figure:

Figure 2.1 – A simple view of the two-step ML process

Of course, in most cases, this is a rather simplistic view. A more detailed view can be seen in Figure 2.2:

Figure 2.2 – A more complete view of the ML process

Let’s take a closer look at the training part of the ML process shown in Figure 2.2:

First, training data is queried from a data source (this can be a database, a data lake, an open dataset, and so on).
The data is preprocessed (for example, through feature engineering or rescaling).
A model is trained and stored (on a data lake, locally, on the edge, and so on).
Optionally, the output of this model is post-processed – for example, via formatting, heuristics, business rules, and more.
Optionally, this model (with or without post-processing) is also stored in a database for later reference or evaluation.

Now, let’s take a look at the inference part of the ML process:

The data is queried from a data source (a database, an API query, and so on).
The data goes through the same preprocessing step as the training data.
The trained model is fetched if it doesn’t already exist locally.
The model is used to infer output.
Optionally, the output of the model is post-processed via the same post-processing step used during training.
Optionally, the output is stored in a database for monitoring and later reference.
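
The training and inference flows above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the ages, labels, storage file (model.json), and helper functions (fit_preprocessing, preprocess, train, and infer) are all hypothetical, and the "model" is a toy threshold rather than a real ML algorithm. What it tries to show is that inference reuses both the stored model and the preprocessing parameters learned at training time:

```python
import json
import tempfile
from pathlib import Path


def fit_preprocessing(values):
    """Learn rescaling parameters from the training data only."""
    return {"min": min(values), "max": max(values)}


def preprocess(values, params):
    """Rescale values with the parameters learned at training time."""
    span = params["max"] - params["min"]
    return [(v - params["min"]) / span for v in values]


def train(features, labels):
    """Toy model: a threshold halfway between the two class means."""
    pos = [x for x, y in zip(features, labels) if y == 1]
    neg = [x for x, y in zip(features, labels) if y == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    return {
        "threshold": (pos_mean + neg_mean) / 2,
        "positive_above": pos_mean > neg_mean,
    }


def infer(model, features):
    """Predict 1 on the side of the threshold where class 1 was seen."""
    return [
        int((x > model["threshold"]) == model["positive_above"])
        for x in features
    ]


# --- Training: query, preprocess, train, store ---
train_ages = [10.0, 20.0, 30.0, 40.0]   # hypothetical training data
train_labels = [1, 1, 0, 0]             # hypothetical labels
params = fit_preprocessing(train_ages)
model = train(preprocess(train_ages, params), train_labels)

model_path = Path(tempfile.mkdtemp()) / "model.json"  # hypothetical storage
model_path.write_text(json.dumps({"params": params, "model": model}))

# --- Inference: query, same preprocessing, fetch model, infer ---
stored = json.loads(model_path.read_text())
new_ages = [16.0, 34.0]                 # hypothetical new data
predictions = infer(stored["model"], preprocess(new_ages, stored["params"]))

# Optional post-processing: turn raw outputs into readable labels
readable = ["survived" if p == 1 else "did not survive" for p in predictions]
print(predictions, readable)
```

Storing the preprocessing parameters next to the model is a design choice made for this sketch: it is one simple way to guarantee that new data is rescaled exactly as the training data was.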

Even in this schema, many steps were not mentioned: splitting data for training purposes, using evaluation metrics, cross-validation, hyperparameter optimization, and others. This chapter will dive into the more training-specific steps and apply them to the very common yet practical Titanic dataset, a binary classification problem. But first, we need to load the data.

To do so, you must download the Titanic dataset training set locally. This can be performed with the following command line:

wget https://github.com/PacktPublishing/The-Regularization-Cookbook/raw/main/chapter_02/train.csv
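
Downloading from a GitHub page URL instead of the raw file URL can silently save an HTML page under the name train.csv. A quick sanity check can catch this before going further. The following is a minimal sketch, assuming the expected header contains the PassengerId and Survived columns; the looks_like_titanic_csv helper and the two temporary files are hypothetical and only serve the demonstration:

```python
import csv
import tempfile
from pathlib import Path


def looks_like_titanic_csv(path):
    """Return True if the first line parses as the expected CSV header."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return "PassengerId" in header and "Survived" in header


# Demonstration on two hypothetical files: a real CSV and an HTML page
tmp_dir = Path(tempfile.mkdtemp())
good_file = tmp_dir / "good.csv"
good_file.write_text("PassengerId,Survived,Pclass\n1,0,3\n", encoding="utf-8")
bad_file = tmp_dir / "bad.csv"
bad_file.write_text("<!DOCTYPE html>\n<html></html>\n", encoding="utf-8")

good_ok = looks_like_titanic_csv(good_file)
bad_ok = looks_like_titanic_csv(bad_file)
print(good_ok, bad_ok)
```

Once the dataset has been downloaded, the same function can be called on train.csv.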

How to do it…

This recipe is about loading a CSV file and displaying its first few rows so that we can have a first glance at what the data looks like:

The first step is to import the required libraries. Here, the only library we need is pandas:
```
import pandas as pd
```
Now, we can load the data using the read_csv function provided by pandas. The first argument is the path to the file. Assuming the file is named train.csv and located in the current folder, we only have to provide train.csv as an argument:
```
# Load the data as a DataFrame
df = pd.read_csv('train.csv')
```

The returned object is a pandas DataFrame, which provides many useful methods for data processing.

Now, we can display the first five rows of the loaded data using the .head() method:
```
# Display the first 5 rows of the dataset
df.head()
```

This code will output the following:

   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

Here is a description of the data types in each column:

PassengerId (qualitative): A unique, arbitrary ID for each passenger.
Survived (qualitative): 1 for yes, 0 for no. This is our label, so this is a binary classification problem.
Pclass (quantitative, discrete): The class, which is arguably quantitative. Is class 1 better than class 2? Most likely yes.
Name (unstructured): The name and title of the passenger.
Sex (qualitative): The registered sex of the passenger, either male or female.
Age (quantitative, discrete): The age of the passenger.
SibSp (quantitative, discrete): The number of siblings and spouses on board.
Parch (quantitative, discrete): The number of parents and children on board.
Ticket (unstructured): The ticket reference.
Fare (quantitative, continuous): The ticket price.
Cabin (unstructured): The cabin number, which is arguably unstructured. It can be seen as a qualitative feature with high cardinality.
Embarked (qualitative): The embarked city, either Southampton (S), Cherbourg (C), or Queenstown (Q).
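
The column descriptions above can be partly cross-checked with pandas. The following sketch rebuilds a subset of the five rows displayed earlier from an in-memory string, so it runs without train.csv; the sample variable and the choice of columns are assumptions made for illustration. It also hints at a caveat: the type pandas infers for a column does not tell us whether the feature is quantitative or qualitative.

```python
import io

import pandas as pd

# A subset of the five rows shown earlier, inlined for illustration only
sample = """Survived,Pclass,Sex,Age,Fare,Cabin,Embarked
0,3,male,22.0,7.25,,S
1,1,female,38.0,71.2833,C85,C
1,3,female,26.0,7.925,,S
1,1,female,35.0,53.1,C123,S
0,3,male,35.0,8.05,,S
"""
df = pd.read_csv(io.StringIO(sample))

# Columns that pandas parsed as numbers
numeric_columns = df.select_dtypes(include="number").columns.tolist()

# Number of missing values per column (empty CSV fields become NaN)
missing = df.isna().sum()

# Number of distinct values per column, a rough proxy for cardinality
cardinality = df.nunique()

print(numeric_columns)
print(missing)
print(cardinality)
```

Here, Survived is parsed as a number even though it is a qualitative label, which is why the description of each column still has to come from knowledge of the data rather than from the inferred types alone.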

There’s more…

Let’s talk about the different types of data that are available. Data is a very generic word and can describe many things. We are surrounded by data all the time. One way to characterize data is through pairs of opposites.

Data can be structured or unstructured:

Structured data comes in the form of tables, databases, Excel files, CSV files, and JSON files.
Unstructured data does not fit in a table: it can be text, sound, images, videos, and so on. Even when we force it into a tabular representation, this kind of data does not naturally fit in an Excel table.

Data can be quantitative or qualitative.

Quantitative data is ordered. Here are some examples:

€100 is greater than €10
1.8 meters is taller than 1.6 meters
18 years old is younger than 80 years old

Qualitative data has no intrinsic order, as shown here:

Blue is not intrinsically better than red
A dog is not intrinsically greater than a cat
A kitchen is not intrinsically more useful than a bathroom

These are not mutually exclusive. An object can have both quantitative and qualitative features, as can be seen in the case of the car in the following figure:

Figure 2.3 – A single object depicted by both quantitative (left) and qualitative (right) features

Finally, data can be continuous or discrete.

Some data is continuous, as follows:

A weight
A volume
A price

On the other hand, some data is discrete:

A color
A football score
A nationality

Note

Discrete != qualitative.

For example, a football score is discrete, but there is an intrinsic order: 3 points is more than 2.
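
As an illustration of this distinction, pandas can represent each kind of data with a different type. This is a minimal sketch, not the approach used later in this book: the values and the first/second/third class labels are hypothetical, and the mapping to pandas types is only one possible choice.

```python
import pandas as pd

# Qualitative, no intrinsic order: an unordered categorical
embarked = pd.Series(["S", "C", "S", "Q"], dtype="category")

# Discrete with an intrinsic order: an ordered categorical
# (hypothetical labels, from the lowest class to the highest)
pclass = pd.Series(
    pd.Categorical(
        ["third", "first", "third", "first"],
        categories=["third", "second", "first"],
        ordered=True,
    )
)

# Discrete and quantitative: plain integers, such as football scores
scores = pd.Series([3, 2, 0])

# Continuous and quantitative: floating-point values, such as prices
fares = pd.Series([7.25, 71.2833, 7.925])

# Order-based operations are only meaningful when an order exists
better_than_third = (pclass > "third").tolist()
best_class = pclass.max()
highest_score = scores.max()

print(embarked.cat.ordered, pclass.cat.ordered)
print(better_than_third, best_class, highest_score)
```

Both embarked and pclass are discrete, but only pclass supports comparisons, which mirrors the note above: discrete does not imply qualitative.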

The Regularization Cookbook

By : Vincent Vandenbussche
