Data is to machine learning models what fuel is to your car, electricity is to your electronic devices, and food is to your body. A machine learning model works by trying to capture the relationships between the provided input and output data. Similar to how the human brain works, a machine learning model will iterate through collected data examples and slowly build a memory of the patterns required to map the provided input data to the provided target output data. The data preparation stage consists of the methods and processes required to produce ready-to-use data for building a machine learning model, which include the following:
We will discuss each of these topics in the following subsections.
Deep learning can be broadly categorized into two problem types, namely supervised learning and unsupervised learning. Both of these problem types involve building a deep learning model that is capable of making informed predictions as outputs, given well-defined data inputs.
Supervised learning is a problem type where labels are involved that act as the source of truth to learn from. Labels can exist in many forms and can be broken down into two problem types, namely classification and regression. Classification is the process where a specific discrete class is predicted among other classes when given input data. Many more complex problems derive from the base classification problem types, such as instance segmentation, multilabel classification, and object detection. Regression, on the other hand, is the process where a continuous numerical value is predicted when given input data. Likewise, complex problem types can be derived from the base regression problem type, such as multi-regression and image bounding box regression.
Unsupervised learning, on the other hand, is a problem type where there aren’t any labels involved and the goals can vary widely. Anomaly detection, clustering, and feature representation learning are the most common problem types that belong to the unsupervised learning category.
We will go through these two problem types separately for deep learning in Chapter 8, Exploring Supervised Deep Learning, and Chapter 9, Exploring Unsupervised Deep Learning.
Next, let’s learn about the things you should consider when acquiring data.
Acquiring data in the context of deep learning usually involves unstructured data, which includes image data, video data, text data, and audio data. Sometimes, data can be readily available and stored through some business processes in a database but very often, it has to be collected manually from the environment from scratch. Additionally, very often, labels for this data are not readily available and require manual annotation work. Along with the capability of deep learning algorithms to process and digest highly complex data comes the need to feed it more data compared to its machine learning counterparts. The requirement to perform data collection and data annotation in high volumes is the main reason why deep learning is considered to have a high barrier of entry today.
Don’t rush into choosing an algorithm in a machine learning project. Spend quality time formally defining the features that can be acquired to predict the target variable. Get help from domain experts during the process and brainstorm potential predictive features that relate to the target variable. In actual projects, it is common to spend a big portion of your time planning and acquiring the data, making sure the acquired data is fit for a machine learning model’s consumption, and only then spending the rest of the time on model building, model deployment, and model governance. A lot of research has been done on handling bad-quality data during the model development stage, but most of these techniques aren’t comprehensive and are limited in how far they can compensate for the inherent quality of the data. Neglecting quality assurance during the data acquisition stage and showing enthusiasm only for the data science portion of the workflow is a strong indicator that the project is doomed to fail right from the inception stage.
Formulating a data acquisition strategy is a daunting task when you don’t know what it means to have good-quality data. Let’s go through a few pillars of data quality you should consider for your data in the context of actual business use cases and machine learning: representativeness, consistency, comprehensiveness, uniqueness, fairness, and validity.
Let’s look at each of these in detail.
Data should be collected in a way that mimics, as closely as possible, the data you will receive during model deployment. Very often in research-based deep learning projects, researchers collect their data in a closed environment with controlled environmental variables. One of the reasons researchers prefer collecting data from a controlled environment is that they can build more stable machine learning models and generally try to prove a point. Eventually, when the research paper is published, you see amazing results that were obtained on handpicked data chosen to impress. These models, which are built on controlled data, fail miserably when you apply them to random, uncontrolled real-world examples. Don’t get me wrong – it’s great to have these controlled datasets available to contribute toward a more stable machine learning model at times, but having uncontrolled real-world examples as a main part of the training and evaluation datasets is key to achieving a generalizable model.
Sometimes, the acquired training data has an expiry date and does not stay representative forever. This scenario is called data drift and will be discussed in more detail in the Managing risk section closer to the end of this chapter. The representativeness metric for data quality should also be evaluated based on the future expectations of the data the model will receive during deployment.
Data labels that are not consistently annotated make it harder for machine learning models to learn from them. This happens when domain ideologies and annotation strategies differ among multiple labelers and are simply not defined properly. For example, “Regular” and “Normal” mean the same thing, but to the machine, they are two completely different classes; so are “Normal” and “normal”, which differ only in capitalization!
Practice formalizing a proper strategy for label annotation during the planning stage before carrying out the actual annotation process. Cleaning the data for simple consistency errors is possible post-data annotation, but some consistency errors can be hard to detect and complex to correct.
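To illustrate the simpler end of this cleanup, here is a minimal, hypothetical sketch of normalizing label strings with pandas; the label column and the synonym mapping are made up for illustration:

import pandas as pd

df = pd.DataFrame({"label": ["Normal", "normal ", "Regular", "Anomaly"]})

# Strip whitespace and lowercase so "Normal" and "normal " collapse into one class
df["label"] = df["label"].str.strip().str.lower()

# Map known synonyms onto a single canonical class (the mapping is hypothetical)
df["label"] = df["label"].replace({"regular": "normal"})

print(df["label"].unique())  # ['normal' 'anomaly']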
Machine learning thrives at building a decision-making mechanism that is robust to the many variations and views of any specific label. Being capable of this and accomplishing it are two different things. One of the prerequisites of decision robustness is that the data used for training and evaluation has to be comprehensive enough to cover all possible variations of each provided label. How can comprehensiveness be judged? Well, that depends on the complexity of the labels and how much variation they naturally present when the model is deployed. More complex labels naturally require more samples, and less complex labels require fewer samples.
A good starting point, in the context of deep learning, is to have at least 100 samples for each label, then experiment with building a model and deriving model insights to see whether there are enough samples for the model to generalize to unseen variations of the label. When the model doesn’t produce convincing results, that’s when you need to cycle back to the data preparation stage to acquire more data variations of the specific label. The machine learning life cycle is inherently a cyclical process where you will experiment, explore, and verify while transitioning between stages to obtain the answers you need to solve your problems, so don’t be afraid to execute these different stages cyclically.
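As a quick sanity check against that rule of thumb, here is a minimal, hypothetical pandas sketch that flags labels with fewer than 100 samples; the toy DataFrame and the category column name are assumptions:

import pandas as pd

# Toy labeled dataset with a "category" column
dataset = pd.DataFrame({"category": ["book"] * 150 + ["toy"] * 40})

# Count samples per label and flag anything under the 100-sample rule of thumb
counts = dataset["category"].value_counts()
print(counts[counts < 100])  # here, "toy" (40 samples) would be flagged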
While having complete and comprehensive data is beneficial to build a machine learning model that is robust to data variations, having duplicated versions of the same data variation in the acquired dataset risks creating a biased model. A biased model makes biased decisions that can be unethical and illegal and sometimes renders such decisions meaningless. Additionally, the amount of data acquired for any specific label is rendered meaningless when all of them are duplicated or very similar to each other.
Machine learning models are generally trained on a subset of the acquired data and then evaluated on other subsets of the data to verify the model’s performance on unseen data. When the non-unique part of the dataset ends up in the evaluation partition of the acquired dataset by chance, the model risks reporting scores that are biased by the duplicated data inputs.
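As a minimal, hypothetical illustration of this risk, the following sketch checks whether any description appears in both the training and evaluation splits; the toy data, column names, and use of scikit-learn's train_test_split are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset in which one description is duplicated
dataset = pd.DataFrame({
    "description": ["red cotton shirt", "wooden chess set", "red cotton shirt"],
    "category": ["Clothing", "Toys", "Clothing"],
})

train_df, eval_df = train_test_split(dataset, test_size=0.34, random_state=0)

# Any overlap means evaluation scores are partly measured on inputs
# the model has already seen during training
print(set(train_df["description"]) & set(eval_df["description"]))

# Dropping duplicates before splitting removes this risk
deduplicated = dataset.drop_duplicates(subset=["description"])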
Does the acquired dataset represent minority groups properly? Is the dataset biased toward the majority groups in the population? There can be many reasons why a machine learning model turns out to be biased, but one of the main causes is data representation bias. Making sure the data is represented fairly and equitably is an ethical responsibility of all machine learning practitioners. There are a lot of types of bias, so this topic will have its own section and will be introduced along with methods of mitigating it in Chapter 13, Exploring Bias and Fairness.
Are there outliers in the dataset? Is there missing data in the dataset? Did you accidentally add a blank audio or image file to the properly collected and annotated dataset? Is the annotated label for the data input considered a valid label? These are some of the questions you should ask when considering the validity of your dataset.
Invalid data is useless for machine learning models, and some of it complicates the pre-processing required for them. The reasons for invalidity can range from simple human errors to complex domain knowledge mistakes. One method of mitigating invalid data is to separate validated and unvalidated data. Include some form of automated or manual data validation process before a data sample gets included in the validated dataset category. Some of this validation logic can be derived from business processes or just from common sense. For example, if we are taking age as input data, there are acceptable age ranges and there are age ranges that are simply impossible, such as 1,000 years old. Having simple guardrails and verifying these values early, when collecting them, makes it possible to correct them then and there and obtain accurate and valid data. Otherwise, this data will likely be discarded when it comes to the model-building stage. Maintaining a structured framework to validate data ensures that the majority of the data stays relevant, usable by machine learning models, and free from simple mistakes.
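To make the age example concrete, here is a minimal, hypothetical guardrail sketch; the acceptable range and field names are assumptions you would replace with your own business rules:

def validate_age(value, min_age=0, max_age=120):
    """Return True if the age value passes a simple business-rule guardrail."""
    try:
        age = int(value)
    except (TypeError, ValueError):
        return False  # non-numeric entries are invalid
    return min_age <= age <= max_age

records = [{"age": 34}, {"age": 1000}, {"age": None}]
validated = [r for r in records if validate_age(r["age"])]
unvalidated = [r for r in records if not validate_age(r["age"])]
print(len(validated), len(unvalidated))  # 1 valid record, 2 flagged for review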
As for more complex invalidity, such as errors in the domain ideology, domain experts play a big part in making sure the data stays sane and logical. Always make sure you include domain experts when defining the data inputs and outputs in the discussion about how data should be collected and annotated for model development.
After the acquisition of data, it is crucial to analyze the data to inspect its characteristics, patterns that exist, and the general quality of the data. Knowing the type of data you are dealing with allows you to plan a strategy for the subsequent model-building stage. Plot distribution graphs, calculate statistics, and perform univariate and multivariate analysis to understand the inherent relationships between the data that can help further ensure the validity of the data. The methods of analysis for different variable types are different and can require some form of domain knowledge beforehand. In the following subsections, we will be practically going through exploratory data analysis (EDA) for text-based data to get a sense of the benefits of carrying out an EDA task.
In this section, we will manually explore and analyze a text-specific dataset using Python code, with the motive of building a deep learning model later in this book using the same dataset. The dataset we will use contains textual descriptions of items from an Indian e-commerce website, and the goal is to predict an item’s category based on its description. This use case is useful for automatically grouping advertised items for user recommendations and can help increase purchase volume on the e-commerce website:
First, let’s import the necessary libraries: pandas for data manipulation and structuring, matplotlib and seaborn for plotting graphs, tqdm for visualizing iteration progress, and lingua for text language detection:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from lingua import Language, LanguageDetectorBuilder
tqdm.pandas()
Next, let’s load the dataset with pandas:

dataset = pd.read_csv('ecommerceDataset.csv')

pandas has some convenient functions to visualize and describe the loaded dataset; let’s use them. Let’s start by visualizing three rows of the raw data:

dataset.head(3)
This will display the following figure in your notebook:
Figure 1.3 – Visualizing the text dataset samples
Next, let’s look at the summary statistics of the dataset:

dataset.describe()
This will display the following figure in your notebook:
Figure 1.4 – Showing the statistics of the dataset
Before removing duplicated descriptions, let’s verify that each repeated description maps to exactly one category and then drop the duplicates; here, unique_description_information is assumed to hold the value counts of the description column:

# Assumed definition: occurrence counts of each unique description
unique_description_information = dataset['description'].value_counts()

# Check that every duplicated description is annotated with a single category
for i in tqdm(range(len(unique_description_information))):
    assert len(
        dataset[
            dataset['description'] == unique_description_information.keys()[i]
        ]['category'].unique()
    ) == 1

# Drop duplicated descriptions, keeping the first occurrence
dataset.drop_duplicates(subset=['description'], inplace=True)
Now, let’s check the data types of the columns:

dataset.dtypes
This will display the following figure in your notebook:
Figure 1.5 – Showing the data types of the dataset columns
Both columns have the object data type, which pandas assigns to columns whose data type it cannot infer more specifically. Let’s check for empty values:

dataset.isnull().sum()
This gives us the following output:
Figure 1.6 – Checking empty values
Let’s drop the rows that contain empty values:

dataset.dropna(inplace=True)
Next, let’s cast both columns to the dedicated string data type and then plot the distribution of the categories:

for column in ['category', 'description']:
    dataset[column] = dataset[column].astype("string")

sns.countplot(x="category", data=dataset)
This will result in the following figure:
Figure 1.7 – A graph showing category distribution
Each category has a good number of data samples, and there don’t seem to be any anomalous categories.
Next, let’s detect the language of each description using the lingua library:

detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
Language detection can be slow, so let’s run the detector on a 10% random sample of the dataset:

sampled_dataset = dataset.sample(frac=0.1, random_state=1234)
sampled_dataset['language'] = sampled_dataset[
    'description'
].progress_apply(lambda x: detector.detect_language_of(x))
Now, let’s plot the distribution of the detected languages:

sampled_dataset['language'].value_counts().plot(kind='bar')
This will show the following graph plot:
Figure 1.8 – Text language distribution
Let’s inspect one of the descriptions that was detected as Hindi:

sampled_dataset[
    sampled_dataset['language'] == Language.HINDI
].description.iloc[0]
This will show the following text:
Figure 1.9 – Visualizing Hindi text
Let’s also inspect one of the descriptions that was detected as French:

sampled_dataset[
    sampled_dataset['language'] == Language.FRENCH
].description.iloc[0]
This will show the following text:
Figure 1.10 – Visualizing French text
Finally, let’s compute the word count of each description and plot its distribution:

dataset['word_count'] = dataset['description'].apply(
    lambda x: len(x.split())
)
plt.figure(figsize=(15,4))
sns.histplot(data=dataset, x="word_count", bins=10)
This will show the following bar plot:
Figure 1.11 – Word count distribution
From this exploration and analysis of the text data, we can draw a few conclusions that will help us decide on the model type and structure to use during the model development stage:
The preceding practical example should give you a first taste of performing EDA and should be sufficient to convey the benefit and importance of running an in-depth EDA before going into the model development stage. Similar to this practical text EDA, we have prepared practical EDA sample workflows for other data types, including audio, image, and video datasets, in our Packt GitHub repository, which you should explore to get your hands dirty.
A major concept to grasp in this section is the importance of EDA and the level of curiosity you should display to uncover the truth about your data. Some methods are generalizable to other similar datasets, but treating any specific EDA workflow as a silver bullet blinds you to the increasing research people are contributing to this field. Ask questions about your data whenever you suspect something of it and attempt to uncover the answers yourself by doing manual or automated inspections however possible. Be creative in obtaining these answers and stay hungry in learning new ways you can figure out key information on your data.
In this section, we have methodologically and practically gone through EDA processes for different types of data. Next, we will explore what it takes to prepare the data for actual model consumption.
Data pre-processing involves data cleaning, data structuring, and data transformation so that a deep learning model will be capable of using the processed data for model training, evaluation, and inference during deployment. The processed data should not just be in a form the machine learning model can accept; it should generally be processed in a way that optimizes the learning potential and increases the metric performance of the machine learning model.
Data cleaning is a process that aims to increase the quality of the acquired data. An EDA process is a prerequisite for figuring out what is wrong with the dataset before any data cleaning can be done. Data cleaning and EDA are often executed iteratively until a satisfactory data quality level is achieved. Cleaning can be as simple as removing duplicate values, removing empty values, or removing values that don’t make logical sense, either in terms of common sense or business logic. These are concepts we explained earlier, and the same risks and issues apply here.
Data structuring, on the other hand, is the process that orchestrates data ingestion and loading from stored data that has been cleaned and verified for quality. This process determines how data should be loaded from one or more sources and fed into the deep learning model. Sounds simple enough, right? It can be very simple when the data is a small, single CSV dataset with no performance or memory issues. In reality, it can be very complex when data is partitioned and stored across different sources due to storage limitations. Here are some concrete factors you’d need to consider in this process:
Different deep learning frameworks, such as PyTorch and TensorFlow, provide different application programming interfaces (APIs) to implement the data structuring process. Some frameworks provide simpler interfaces that allow for easy-to-set-up pipelines, while others provide complex interfaces that allow for a higher level of flexibility. Fortunately, many high-level libraries attempt to simplify the complex interfaces while maintaining flexibility, such as Keras on top of TensorFlow, and Catalyst, fastai, PyTorch Lightning, and Ignite on top of PyTorch.
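To give a feel for what data structuring looks like in practice, here is a minimal sketch using PyTorch's Dataset and DataLoader APIs; the CSV file name, column names, and label encoding mirror the e-commerce dataset explored earlier and are assumptions rather than an exact implementation from this book:

import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class EcommerceTextDataset(Dataset):
    """Serves (description, label) pairs from a cleaned CSV file."""

    def __init__(self, csv_path):
        self.data = pd.read_csv(csv_path).dropna()
        # Map each category name to an integer label index
        categories = sorted(self.data["category"].unique())
        self.label_map = {name: idx for idx, name in enumerate(categories)}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Tokenization/vectorization of the text would normally happen here
        return row["description"], torch.tensor(self.label_map[row["category"]])

# Batches and shuffles samples for training
loader = DataLoader(
    EcommerceTextDataset("ecommerceDataset.csv"), batch_size=32, shuffle=True
)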
Finally, data transformation is a process that applies unique, variable-specific pre-processing to transform the raw cleaned data into a more representable, usable, and learnable format. An important factor to consider when executing the data structuring and transformation process is the type of deep learning model you intend to use. Any form of data transformation is often dependent on the deep learning architecture and the type of inputs it can accept. The most widely known and common deep learning model architectures were invented to tackle specific data types, such as convolutional neural networks for image data, transformer models for sequence-based data, and basic multilayer perceptrons for tabular data. However, deep learning models are considered flexible algorithms that can twist and bend to accept data of different forms and sizes, even in multimodal data conditions. Through collaboration with domain experts over the past few years, deep learning experts have been able to build creative deep learning architectures that can handle multiple data modalities, even multiple unstructured data modalities, and that succeed in learning cross-modality patterns. Here are two examples:
With that being said, data transformations are mainly differentiated into two parts: feature engineering and data scaling. Deep learning is widely known for its feature engineering capabilities, which replace the need to manually craft custom features from raw data for learning. However, this doesn’t mean that it never makes sense to perform feature engineering. Many successful deep learning models have utilized engineered features as input.
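As a simple illustration of the data scaling part, here is a minimal sketch that standardizes a numerical feature using scikit-learn's StandardScaler; the feature values are made up, and the key point is that the scaler is fit only on the training split and then applied to both splits:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical feature, for example, per-sample word counts
train_word_counts = np.array([[12.0], [85.0], [40.0], [230.0]])
eval_word_counts = np.array([[33.0], [150.0]])

scaler = StandardScaler()
# Fit on training data only to avoid leaking evaluation statistics
train_scaled = scaler.fit_transform(train_word_counts)
eval_scaled = scaler.transform(eval_word_counts)

print(train_scaled.mean(), train_scaled.std())  # approximately 0 and 1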
Now that we know what data pre-processing entails, let’s discuss and explore different data pre-processing techniques for unstructured data, both theoretically and practically.
Text data can be in different languages and exist in different domains, ranging from description data to informational documents and natural human text comments. Some of the most common text data pre-processing methods that are used for deep learning are as follows:
One uncommon pre-processing method that is useful for building more generalizable text deep learning models is text data augmentation. Text data augmentation can be done in a few ways; one simple way is sketched below.
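To make this concrete, here is a minimal, hypothetical sketch of one commonly used augmentation technique, random word deletion; the deletion probability and the example sentence are arbitrary:

import random

def random_word_deletion(text, drop_prob=0.1, seed=None):
    """Randomly drop words to create an augmented variant of the input text."""
    rng = random.Random(seed)
    words = text.split()
    kept = [word for word in words if rng.random() > drop_prob]
    # Never return an empty string; keep at least one word
    return " ".join(kept) if kept else rng.choice(words)

original = "lightweight cotton shirt with short sleeves and a classic fit"
print(random_word_deletion(original, drop_prob=0.2, seed=42))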
Audio data is essentially sequence-based data and, in some cases, multiple sequences exist. One of the most commonly used pre-processing methods for audio is transforming raw audio data into different forms of spectrograms using the Short-Time Fourier Transform (STFT), a process that converts audio from the time domain into the frequency domain. A spectrogram conversion allows audio data to be broken down and represented across a range of frequencies instead of as a single waveform that combines the signals from all audio frequencies. These spectrograms are two-dimensional data and can thus be treated as images and fed into convolutional neural networks. Data scaling methods such as log scaling and log-mel scaling are also commonly applied to these spectrograms to further emphasize frequency characteristics.
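Here is a minimal sketch of this conversion using the librosa library, which is not otherwise used in this chapter; the file name and parameter values are assumptions:

import librosa
import numpy as np

# Load an audio clip at its native sampling rate (the file name is hypothetical)
waveform, sample_rate = librosa.load("example_clip.wav", sr=None)

# Compute a mel spectrogram, which uses the STFT under the hood
mel_spec = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=2048, hop_length=512, n_mels=128
)

# Apply log-mel scaling (decibels) to emphasize frequency characteristics
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)

# The result is a 2D array (n_mels x time frames) that can be treated as an image
print(log_mel_spec.shape)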
Image data augmentation is a type of image-based feature engineering technique that can increase the comprehensiveness of the original data. A best practice for applying this technique is to structure the data pipeline so that image augmentations are applied randomly, batch by batch, during the training process instead of providing a fixed augmented set of data to the deep learning model. Choosing the type of image augmentation requires some understanding of the business requirements of the use case. Here are some examples where it doesn’t make sense to apply certain augmentations:
After excluding augmentations that are obviously unsuitable, a common but effective way to figure out the best set of augmentations is through iterative experiments and model comparisons.
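Here is a minimal sketch of a randomly applied augmentation pipeline using torchvision; the specific transforms and their parameters are illustrative assumptions, not a recommended recipe:

from torchvision import transforms

# Augmentations are sampled randomly every time an image is drawn for a batch,
# so the model sees a different variation of each image across epochs
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Evaluation data is only resized and converted, without random augmentation
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])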