Data is to machine learning models what fuel is to your car, electricity is to your electronic devices, and food is to your body. A machine learning model works by trying to capture the relationships between the provided input and output data. Similar to how the human brain works, a machine learning model will iterate through collected data examples and slowly build a memory of the patterns required to map the provided input data to the provided target output data. The data preparation stage consists of the methods and processes required to produce ready-to-use data for building a machine learning model, which include the following:
We will discuss each of these topics in the following subsections.
Deep learning can be broadly categorized into two problem types, namely supervised learning and unsupervised learning. Both of these problem types involve building a deep learning model that is capable of making informed predictions as outputs, given well-defined data inputs.
Supervised learning is a problem type where labels are involved that act as the source of truth to learn from. Labels can exist in many forms and can be broken down into two problem types, namely classification and regression. Classification is the process where a specific discrete class is predicted among other classes when given input data. Many more complex problems derive from the base classification problem types, such as instance segmentation, multilabel classification, and object detection. Regression, on the other hand, is the process where a continuous numerical value is predicted when given input data. Likewise, complex problem types can be derived from the base regression problem type, such as multi-regression and image bounding box regression.
Unsupervised learning, on the other hand, is a problem type where there aren’t any labels involved and the goals can vary widely. Anomaly detection, clustering, and feature representation learning are the most common problem types that belong to the unsupervised learning category.
We will go through these two problem types separately for deep learning in Chapter 8, Exploring Supervised Deep Learning, and Chapter 9, Exploring Unsupervised Deep Learning.
Next, let’s learn about the things you should consider when acquiring data.
Acquiring data in the context of deep learning usually involves unstructured data, which includes image data, video data, text data, and audio data. Sometimes, data can be readily available and stored through some business processes in a database but very often, it has to be collected manually from the environment from scratch. Additionally, very often, labels for this data are not readily available and require manual annotation work. Along with the capability of deep learning algorithms to process and digest highly complex data comes the need to feed it more data compared to its machine learning counterparts. The requirement to perform data collection and data annotation in high volumes is the main reason why deep learning is considered to have a high barrier of entry today.
Don’t rush into choosing an algorithm in a machine learning project. Spend quality time formally defining the features that can be acquired to predict the target variable. Get help from domain experts during the process and brainstorm potential predictive features that relate to the target variable. In actual projects, it is common to spend a big portion of your time planning and acquiring the data, making sure the acquired data is fit for a machine learning model’s consumption, and only then spending the rest of the time on model building, model deployment, and model governance. A lot of research has been done on handling bad-quality data during the model development stage, but most of these techniques aren’t comprehensive and are limited in how far they can compensate for the inherent quality of the data. Neglecting quality assurance during the data acquisition stage and showing enthusiasm only for the data science portion of the workflow is a strong indicator that the project is doomed to fail right from the inception stage.
Formulating a data acquisition strategy is a daunting task when you don’t know what it means to have good-quality data. Let’s go through a few pillars of data quality you should consider for your data in the context of actual business use cases and machine learning: representativeness, consistency, comprehensiveness, uniqueness, fairness, and validity.
Let’s look at each of these in detail.
Data should be collected in a way that mimics, as closely as possible, the data you will receive during model deployment. Very often in research-based deep learning projects, researchers collect their data in a closed environment with controlled environmental variables. One of the reasons researchers prefer collecting data from a controlled environment is that they can build more stable machine learning models and generally try to prove a point. Eventually, when the research paper is published, you see amazing results that were obtained on handpicked data chosen to impress. These models, which are built on controlled data, fail miserably when you apply them to random, uncontrolled real-world examples. Don’t get me wrong – it’s great to have these controlled datasets available to contribute toward a more stable machine learning model at times, but having uncontrolled real-world examples as a main part of the training and evaluation datasets is key to achieving a generalizable model.
Sometimes, the acquired training data has an expiry date and does not stay representative forever. This scenario is called data drift and will be discussed in more detail in the Managing risk section closer to the end of this chapter. The representativeness metric for data quality should also be evaluated based on the future expectations of the data the model will receive during deployment.
Data labels that are not consistently annotated make it harder for machine learning models to learn from them. This happens when domain ideologies and annotation strategies differ among multiple labelers and are simply not defined properly. For example, “Regular” and “Normal” mean the same thing, but to the machine, they are two completely different classes; so are “Normal” and “normal”, which differ only in capitalization!
Practice formalizing a proper strategy for label annotation during the planning stage before carrying out the actual annotation process. Cleaning the data for simple consistency errors is possible post-data annotation, but some consistency errors can be hard to detect and complex to correct.
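To illustrate the simpler end of this cleanup, here is a minimal, hypothetical sketch of normalizing label strings with pandas; the label column and the synonym mapping are made up for illustration:

import pandas as pd

df = pd.DataFrame({"label": ["Normal", "normal ", "Regular", "Anomaly"]})

# Strip whitespace and lowercase so "Normal" and "normal " collapse into one class
df["label"] = df["label"].str.strip().str.lower()

# Map known synonyms onto a single canonical class (the mapping is hypothetical)
df["label"] = df["label"].replace({"regular": "normal"})

print(df["label"].unique())  # ['normal' 'anomaly']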
Machine learning thrives at building a decision-making mechanism that is robust to the many variations and views of any specific label. Being capable of this and accomplishing it are two different things. One of the prerequisites of decision robustness is that the data used for training and evaluation has to be comprehensive enough to cover all possible variations of each provided label. How can comprehensiveness be judged? Well, that depends on the complexity of the labels and how much variation they naturally present when the model is deployed. More complex labels naturally require more samples, and less complex labels require fewer samples.
A good starting point, in the context of deep learning, is to have at least 100 samples for each label, then experiment with building a model and deriving model insights to see whether there are enough samples for the model to generalize to unseen variations of the label. When the model doesn’t produce convincing results, that’s when you need to cycle back to the data preparation stage to acquire more data variations of the specific label. The machine learning life cycle is inherently a cyclical process where you will experiment, explore, and verify while transitioning between stages to obtain the answers you need to solve your problems, so don’t be afraid to execute these different stages cyclically.
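As a quick sanity check against that rule of thumb, here is a minimal, hypothetical pandas sketch that flags labels with fewer than 100 samples; the toy DataFrame and the category column name are assumptions:

import pandas as pd

# Toy labeled dataset with a "category" column
dataset = pd.DataFrame({"category": ["book"] * 150 + ["toy"] * 40})

# Count samples per label and flag anything under the 100-sample rule of thumb
counts = dataset["category"].value_counts()
print(counts[counts < 100])  # here, "toy" (40 samples) would be flagged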
While having complete and comprehensive data is beneficial to build a machine learning model that is robust to data variations, having duplicated versions of the same data variation in the acquired dataset risks creating a biased model. A biased model makes biased decisions that can be unethical and illegal and sometimes renders such decisions meaningless. Additionally, the amount of data acquired for any specific label is rendered meaningless when all of them are duplicated or very similar to each other.
Machine learning models are generally trained on a subset of the acquired data and then evaluated on other subsets of the data to verify the model’s performance on unseen data. When the non-unique part of the dataset ends up in the evaluation partition of the acquired dataset by chance, the model risks reporting scores that are biased by the duplicated data inputs.
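As a minimal, hypothetical illustration of this risk, the following sketch checks whether any description appears in both the training and evaluation splits; the toy data, column names, and use of scikit-learn's train_test_split are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset in which one description is duplicated
dataset = pd.DataFrame({
    "description": ["red cotton shirt", "wooden chess set", "red cotton shirt"],
    "category": ["Clothing", "Toys", "Clothing"],
})

train_df, eval_df = train_test_split(dataset, test_size=0.34, random_state=0)

# Any overlap means evaluation scores are partly measured on inputs
# the model has already seen during training
print(set(train_df["description"]) & set(eval_df["description"]))

# Dropping duplicates before splitting removes this risk
deduplicated = dataset.drop_duplicates(subset=["description"])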
Does the acquired dataset represent minority groups properly? Is the dataset biased toward the majority groups in the population? There can be many reasons why a machine learning model turns out to be biased, but one of the main causes is data representation bias. Making sure the data is represented fairly and equitably is an ethical responsibility of all machine learning practitioners. There are a lot of types of bias, so this topic will have its own section and will be introduced along with methods of mitigating it in Chapter 13, Exploring Bias and Fairness.
Are there outliers in the dataset? Is there missing data in the dataset? Did you accidentally add a blank audio or image file to the properly collected and annotated dataset? Is the annotated label for the data input considered a valid label? These are some of the questions you should ask when considering the validity of your dataset.
Invalid data is useless for machine learning models, and some of it complicates the pre-processing required for them. The reasons for invalidity can range from simple human errors to complex domain knowledge mistakes. One method of mitigating invalid data is to separate validated and unvalidated data. Include some form of automated or manual data validation process before a data sample gets included in the validated dataset category. Some of this validation logic can be derived from business processes or just from common sense. For example, if we are taking age as input data, there are acceptable age ranges and there are age ranges that are simply impossible, such as 1,000 years old. Having simple guardrails and verifying these values early, when collecting them, makes it possible to correct them then and there and obtain accurate and valid data. Otherwise, this data will likely be discarded when it comes to the model-building stage. Maintaining a structured framework to validate data ensures that the majority of the data stays relevant, usable by machine learning models, and free from simple mistakes.
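To make the age example concrete, here is a minimal, hypothetical guardrail sketch; the acceptable range and field names are assumptions you would replace with your own business rules:

def validate_age(value, min_age=0, max_age=120):
    """Return True if the age value passes a simple business-rule guardrail."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return False  # non-numeric entries are invalid
    return min_age <= age <= max_age

records = [{"age": 34}, {"age": 1000}, {"age": None}]
validated = [r for r in records if validate_age(r["age"])]
unvalidated = [r for r in records if not validate_age(r["age"])]
print(len(validated), len(unvalidated))  # 1 valid record, 2 flagged for review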
As for more complex invalidity, such as errors in the domain ideology, domain experts play a big part in making sure the data stays sane and logical. Always make sure you include domain experts when defining the data inputs and outputs in the discussion about how data should be collected and annotated for model development.
After the acquisition of data, it is crucial to analyze the data to inspect its characteristics, patterns that exist, and the general quality of the data. Knowing the type of data you are dealing with allows you to plan a strategy for the subsequent model-building stage. Plot distribution graphs, calculate statistics, and perform univariate and multivariate analysis to understand the inherent relationships between the data that can help further ensure the validity of the data. The methods of analysis for different variable types are different and can require some form of domain knowledge beforehand. In the following subsections, we will be practically going through exploratory data analysis (EDA) for text-based data to get a sense of the benefits of carrying out an EDA task.
In this section, we will manually explore and analyze a text-specific dataset using Python code, with the motive of building a deep learning model later in this book using the same dataset. The dataset we will use contains textual descriptions of items from an Indian e-commerce website, and the goal is to predict an item’s category based on its description. This use case is useful for automatically grouping advertised items for user recommendations and can help increase purchase volume on the e-commerce website:
First, let’s import the necessary libraries: pandas for data manipulation and structuring, matplotlib and seaborn for plotting graphs, tqdm for visualizing iteration progress, and lingua for text language detection:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from lingua import Language, LanguageDetectorBuilder
tqdm.pandas()
Next, let’s load the dataset with pandas:

dataset = pd.read_csv('ecommerceDataset.csv')

pandas has some convenient functions to visualize and describe the loaded dataset; let’s use them. Let’s start by visualizing three rows of the raw data:

dataset.head(3)
This will display the following figure in your notebook:
Figure 1.3 – Visualizing the text dataset samples
Next, let’s look at the summary statistics of the dataset:

dataset.describe()
This will display the following figure in your notebook:
Figure 1.4 – Showing the statistics of the dataset
Before removing duplicated descriptions, let’s verify that each repeated description maps to exactly one category and then drop the duplicates; here, unique_description_information is assumed to hold the value counts of the description column:

# Assumed definition: occurrence counts of each unique description
unique_description_information = dataset['description'].value_counts()

# Check that every duplicated description is annotated with a single category
for i in tqdm(range(len(unique_description_information))):
    assert len(
        dataset[
            dataset['description'] == unique_description_information.keys()[i]
        ]['category'].unique()
    ) == 1

# Drop duplicated descriptions, keeping the first occurrence
dataset.drop_duplicates(subset=['description'], inplace=True)
Now, let’s check the data types of the columns:

dataset.dtypes
This will display the following figure in your notebook:
Figure 1.5 – Showing the data types of the dataset columns
Both columns have the object data type, which pandas assigns to columns whose data type it cannot infer more specifically. Let’s check for empty values:

dataset.isnull().sum()
This gives us the following output:
Figure 1.6 – Checking empty values
Let’s drop the rows that contain empty values:

dataset.dropna(inplace=True)
Next, let’s cast both columns to the dedicated string data type and then plot the distribution of the categories:

for column in ['category', 'description']:
    dataset[column] = dataset[column].astype("string")

sns.countplot(x="category", data=dataset)
This will result in the following figure:
Figure 1.7 – A graph showing category distribution
Each category has a good number of data samples, and there don’t seem to be any anomalous categories.
Next, let’s detect the language of each description using the lingua library:

detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
Language detection can be slow, so let’s run the detector on a 10% random sample of the dataset:

sampled_dataset = dataset.sample(frac=0.1, random_state=1234)
sampled_dataset['language'] = sampled_dataset[
    'description'
].progress_apply(lambda x: detector.detect_language_of(x))
Now, let’s plot the distribution of the detected languages:

sampled_dataset['language'].value_counts().plot(kind='bar')
This will show the following graph plot:
Figure 1.8 – Text language distribution
Let’s inspect one of the descriptions that was detected as Hindi:

sampled_dataset[
    sampled_dataset['language'] == Language.HINDI
].description.iloc[0]
This will show the following text:
Figure 1.9 – Visualizing Hindi text
Let’s also inspect one of the descriptions that was detected as French:

sampled_dataset[
    sampled_dataset['language'] == Language.FRENCH
].description.iloc[0]
This will show the following text:
Figure 1.10 – Visualizing French text
Finally, let’s compute the word count of each description and plot its distribution:

dataset['word_count'] = dataset['description'].apply(
    lambda x: len(x.split())
)
plt.figure(figsize=(15,4))
sns.histplot(data=dataset, x="word_count", bins=10)
This will show the following bar plot:
Figure 1.11 – Word count distribution
From this exploration and analysis of the text data, we can draw a few conclusions that will help us decide on the model type and structure to use during the model development stage:
The preceding practical example should give you a first taste of performing EDA and should be sufficient to convey the benefit and importance of running an in-depth EDA before going into the model development stage. Similar to this practical text EDA, we have prepared practical EDA sample workflows for other data types, including audio, image, and video datasets, in our Packt GitHub repository, which you should explore to get your hands dirty.
A major concept to grasp in this section is the importance of EDA and the level of curiosity you should display to uncover the truth about your data. Some methods are generalizable to other similar datasets, but treating any specific EDA workflow as a silver bullet blinds you to the increasing research people are contributing to this field. Ask questions about your data whenever you suspect something of it and attempt to uncover the answers yourself by doing manual or automated inspections however possible. Be creative in obtaining these answers and stay hungry in learning new ways you can figure out key information on your data.
In this section, we have methodologically and practically gone through EDA processes for different types of data. Next, we will explore what it takes to prepare the data for actual model consumption.
Data pre-processing involves data cleaning, data structuring, and data transformation so that a deep learning model will be capable of using the processed data for model training, evaluation, and inference during deployment. The processed data should not just be in a form the machine learning model can accept; it should generally be processed in a way that optimizes the learning potential and increases the metric performance of the machine learning model.
Data cleaning is a process that aims to increase the quality of the acquired data. An EDA process is a prerequisite for figuring out what is wrong with the dataset before any data cleaning can be done. Data cleaning and EDA are often executed iteratively until a satisfactory data quality level is achieved. Cleaning can be as simple as removing duplicate values, removing empty values, or removing values that don’t make logical sense, either in terms of common sense or business logic. These are concepts we explained earlier, and the same risks and issues apply here.
Data structuring, on the other hand, is the process that orchestrates data ingestion and loading from stored data that has been cleaned and verified for quality. This process determines how data should be loaded from one or more sources and fed into the deep learning model. Sounds simple enough, right? It can be very simple when the data is a small, single CSV dataset with no performance or memory issues. In reality, it can be very complex when data is partitioned and stored across different sources due to storage limitations. Here are some concrete factors you’d need to consider in this process:
Different deep learning frameworks, such as PyTorch and TensorFlow, provide different application programming interfaces (APIs) to implement the data structuring process. Some frameworks provide simpler interfaces that allow for easy-to-set-up pipelines, while others provide complex interfaces that allow for a higher level of flexibility. Fortunately, many high-level libraries attempt to simplify the complex interfaces while maintaining flexibility, such as Keras on top of TensorFlow, and Catalyst, fastai, PyTorch Lightning, and Ignite on top of PyTorch.
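To give a feel for what data structuring looks like in practice, here is a minimal sketch using PyTorch's Dataset and DataLoader APIs; the CSV file name, column names, and label encoding mirror the e-commerce dataset explored earlier and are assumptions rather than an exact implementation from this book:

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class EcommerceTextDataset(Dataset):
    """Serves (description, label) pairs from a cleaned CSV file."""

    def __init__(self, csv_path):
        self.data = pd.read_csv(csv_path).dropna()
        # Map each category name to an integer label index
        categories = sorted(self.data["category"].unique())
        self.label_map = {name: idx for idx, name in enumerate(categories)}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Tokenization/vectorization of the text would normally happen here
        return row["description"], torch.tensor(self.label_map[row["category"]])

# Batches and shuffles samples for training
loader = DataLoader(
    EcommerceTextDataset("ecommerceDataset.csv"), batch_size=32, shuffle=True
)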
Finally, data transformation is a process that applies unique, variable-specific pre-processing to transform the raw cleaned data into a more representable, usable, and learnable format. An important factor to consider when executing the data structuring and transformation process is the type of deep learning model you intend to use. Any form of data transformation is often dependent on the deep learning architecture and the type of inputs it can accept. The most widely known and common deep learning model architectures were invented to tackle specific data types, such as convolutional neural networks for image data, transformer models for sequence-based data, and basic multilayer perceptrons for tabular data. However, deep learning models are considered flexible algorithms that can twist and bend to accept data of different forms and sizes, even in multimodal data conditions. Through collaboration with domain experts over the past few years, deep learning experts have been able to build creative deep learning architectures that can handle multiple data modalities, even multiple unstructured data modalities, and that succeed in learning cross-modality patterns. Here are two examples:
With that being said, data transformations are mainly differentiated into two parts: feature engineering and data scaling. Deep learning is widely known for its feature engineering capabilities, which replace the need to manually craft custom features from raw data for learning. However, this doesn’t mean that it never makes sense to perform feature engineering. Many successful deep learning models have utilized engineered features as input.
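As a simple illustration of the data scaling part, here is a minimal sketch that standardizes a numerical feature using scikit-learn's StandardScaler; the feature values are made up, and the key point is that the scaler is fit only on the training split and then applied to both splits:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical feature, for example, per-sample word counts
train_word_counts = np.array([[12.0], [85.0], [40.0], [230.0]])
eval_word_counts = np.array([[33.0], [150.0]])

scaler = StandardScaler()
# Fit on training data only to avoid leaking evaluation statistics
train_scaled = scaler.fit_transform(train_word_counts)
eval_scaled = scaler.transform(eval_word_counts)

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1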
Now that we know what data pre-processing entails, let’s discuss and explore different data pre-processing techniques for unstructured data, both theoretically and practically.
Text data can be in different languages and exist in different domains, ranging from description data to informational documents and natural human text comments. Some of the most common text data pre-processing methods that are used for deep learning are as follows:
One uncommon pre-processing method that is useful for building more generalizable text deep learning models is text data augmentation. Text data augmentation can be done in a few ways; one simple way is sketched below.
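To make this concrete, here is a minimal, hypothetical sketch of one commonly used augmentation technique, random word deletion; the deletion probability and the example sentence are arbitrary:

import random

def random_word_deletion(text, drop_prob=0.1, seed=None):
    """Randomly drop words to create an augmented variant of the input text."""
    rng = random.Random(seed)
    words = text.split()
    kept = [word for word in words if rng.random() > drop_prob]
    # Never return an empty string; keep at least one word
    return " ".join(kept) if kept else rng.choice(words)

original = "lightweight cotton shirt with short sleeves and a classic fit"
print(random_word_deletion(original, drop_prob=0.2, seed=42))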
Audio data is essentially sequence-based data and, in some cases, multiple sequences exist. One of the most commonly used pre-processing methods for audio is transforming raw audio data into different forms of spectrograms using the Short-Time Fourier Transform (STFT), a process that converts audio from the time domain into the frequency domain. A spectrogram conversion allows audio data to be broken down and represented across a range of frequencies instead of as a single waveform that combines the signals from all audio frequencies. These spectrograms are two-dimensional data and can thus be treated as images and fed into convolutional neural networks. Data scaling methods such as log scaling and log-mel scaling are also commonly applied to these spectrograms to further emphasize frequency characteristics.
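Here is a minimal sketch of this conversion using the librosa library, which is not otherwise used in this chapter; the file name and parameter values are assumptions:

import librosa
import numpy as np

# Load an audio clip at its native sampling rate (the file name is hypothetical)
waveform, sample_rate = librosa.load("example_clip.wav", sr=None)

# Compute a mel spectrogram, which uses the STFT under the hood
mel_spec = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=2048, hop_length=512, n_mels=128
)

# Apply log-mel scaling (decibels) to emphasize frequency characteristics
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)

# The result is a 2D array (n_mels x time frames) that can be treated as an image
print(log_mel_spec.shape)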
Image data augmentation is a type of image-based feature engineering technique that can increase the comprehensiveness of the original data. A best practice for applying this technique is to structure the data pipeline so that image augmentations are applied randomly, batch by batch, during the training process instead of providing a fixed augmented set of data to the deep learning model. Choosing the type of image augmentation requires some understanding of the business requirements of the use case. Here are some examples where it doesn’t make sense to apply certain augmentations:
After excluding augmentations that are obviously unsuitable, a common but effective way to figure out the best set of augmentations is through iterative experiments and model comparisons.
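Here is a minimal sketch of a randomly applied augmentation pipeline using torchvision; the specific transforms and their parameters are illustrative assumptions, not a recommended recipe:

from torchvision import transforms

# Augmentations are sampled randomly every time an image is drawn for a batch,
# so the model sees a different variation of each image across epochs
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Evaluation data is only resized and converted, without random augmentation
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])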