Book Image

Data Augmentation with Python

By : Duc Haba
Book Image

Data Augmentation with Python

By: Duc Haba

Overview of this book

Data is paramount in AI projects, especially for deep learning and generative AI, as forecasting accuracy relies on input datasets being robust. Acquiring additional data through traditional methods can be challenging, expensive, and impractical, and data augmentation offers an economical option to extend the dataset. The book teaches you over 20 geometric, photometric, and random erasing augmentation methods using seven real-world datasets for image classification and segmentation. You’ll also review eight image augmentation open source libraries, write object-oriented programming (OOP) wrapper functions in Python Notebooks, view color image augmentation effects, analyze safe levels and biases, as well as explore fun facts and take on fun challenges. As you advance, you’ll discover over 20 character and word techniques for text augmentation using two real-world datasets and excerpts from four classic books. The chapter on advanced text augmentation uses machine learning to extend the text dataset, such as Transformer, Word2vec, BERT, GPT-2, and others. While chapters on audio and tabular data have real-world data, open source libraries, amazing custom plots, and Python Notebook, along with fun facts and challenges. By the end of this book, you will be proficient in image, text, audio, and tabular data augmentation techniques.
Table of Contents (17 chapters)
1
Part 1: Data Augmentation
4
Part 2: Image Augmentation
7
Part 3: Text Augmentation
10
Part 4: Audio Data Augmentation
13
Part 5: Tabular Data Augmentation

Data input types

The four data input types are self-explanatory, but it is worth clearly defining the data input types and what is out of scope:

  • Image definition
  • Text definition
  • Audio definition
  • Tabular data definition
Figure 1.1 – Image, text, tabular, and audio augmentation

Figure 1.1 – Image, text, tabular, and audio augmentation

Figure 1.1 provides a sneak peek at image, text, tabular and audio augmentation. Later in this book, you will learn how to implement augmentation methods.

Let’s get started with images.

Image definition

Image is a large category because you can represent almost anything as an image, such as people, landscapes, animals, plants, and various objects around us. Pictures can also represent action, such as sports, sign language, yoga poses, and many more. One particularly creative use of images is capturing a computer mouse’s movement over time to predict whether a user is a computer hacker or not.

The techniques for increasing the number of pictures are horizontal flip, vertical flip, enlarge, zoom in, zoom out, skew, warp, and lighting. Humans are experts at processing images. Thus, if a picture is slightly distorted or darkened, you can still tell that it is the same image. However, this is not the same for a computer. AI represents a color picture as a three-dimensional array of float numbers – the width, height, and RGB as depth. Any image distortion will yield an array with different values.

Graphs, such as time series data charts, and mathematical equation plots, such as 3D topology plots, are outside the scope of image augmentation.

Fun fact

You can eliminate the overfitting problem in DL image classification training by creatively using data augmentation methods.

Text augmentation has different concerns than image augmentation. Let’s take a look.

Text definition

The primary text input data is in English, but the same techniques for text augmentation can be applied to other West Germanic languages. Python lessons use English as the text input data.

The techniques for supplementing the text input are back translation, easy data augmentation, and albumentation. A few methods might be counterintuitive at first glance, such as deleting or swamping words in a sentence. However, it is an acceptable practice because, in the real world, not everyone writes perfect English.

For example, movie reviewers on the American Multi-Cinema (AMC) website write incomplete or grammatically incorrect sentences. They omit verbs or use inappropriate words. As a rule of thumb, you should not expect perfect English for text input data in many NLP projects.

If an NLP model is trained in perfect English as text input data, it could cause bias against typical online reviewers. In other words, the NLP model will predict inaccurately when deployed to a real-world audience. For example, in sentiment analysis, the AI system will predict whether a movie review has a positive or negative sentiment. Suppose you trained the system using a perfect English dataset. In that case, the AI system might forecast a false positive or false negative when people write a short line with misspelled words and grammatical errors.

Language translation, ideograms, and hieroglyphs are outside the scope of this book. Now, let’s look at audio augmentation.

Audio definition

Audio input data can be any sound wave recording such as music, speech, and natural sounds. Sound wave attributes such as amplitude and frequency are represented as graphs, which are technically images, but you can’t use any image augmentation methods for audio input data.

The techniques for expanding audio input are split into two types: waveform and spectrograph. For raw audio, the transformation methods range from time-shifting and pitch scaling to random gain, while for spectrographs, the functions are time masking, time stretching, pitch scaling, and many others.

Speech in a language other than English is outside the scope of this book. This is not due to technical difficulties but rather because this book is written in English. Writing about the aftermath effects of switching to a different language would be problematic.Audio augmentation is demanding, but tabular data is even more challenging to expand.

Tabular data definition

Tabular data is information in a relational database, spreadsheet, or text file in comma-separated values (CSV) format. Tabular data augmentation is a fast-growing field in ML and DL. The tabular data augmentation techniques are transforming, interacting, mapping, and extraction.

Fun challenge

Here is a thought experiment. Can you think of data types other than image, text, audio, and tabular? A hint is Casablanca and Blade Runner.

There are two parts to this chapter. The first half discussed the various concepts and techniques; what follows is hands-on Python coding on a Python Notebook. The book will use this learn-then-code pattern in all the chapters. It is time to get your hands dirty and write Python code.