Data Augmentation with Python

By: Duc Haba

Overview of this book

Data is paramount in AI projects, especially for deep learning and generative AI, as forecasting accuracy relies on input datasets being robust. Acquiring additional data through traditional methods can be challenging, expensive, and impractical, and data augmentation offers an economical option to extend the dataset. The book teaches you over 20 geometric, photometric, and random erasing augmentation methods using seven real-world datasets for image classification and segmentation. You’ll also review eight image augmentation open source libraries, write object-oriented programming (OOP) wrapper functions in Python Notebooks, view color image augmentation effects, analyze safe levels and biases, as well as explore fun facts and take on fun challenges. As you advance, you’ll discover over 20 character and word techniques for text augmentation using two real-world datasets and excerpts from four classic books. The chapter on advanced text augmentation uses machine learning models, such as Transformer, Word2vec, BERT, GPT-2, and others, to extend the text dataset. The chapters on audio and tabular data include real-world data, open source libraries, amazing custom plots, and Python Notebooks, along with fun facts and challenges. By the end of this book, you will be proficient in image, text, audio, and tabular data augmentation techniques.
Table of Contents (17 chapters)
Part 1: Data Augmentation
Part 2: Image Augmentation
Part 3: Text Augmentation
Part 4: Audio Data Augmentation
Part 5: Tabular Data Augmentation

Google Colab

Google Colab, a hosted Jupyter Notebook environment for Python, is one of the most popular options for developing AI and ML projects. All you need is a Gmail account.

Colab can be found at https://colab.research.google.com/. The free Colab version is sufficient for the code in this book; the Pro+ version enables more CPU and GPU RAM.

After logging in to Colab, you can retrieve this book’s Python Notebooks from the following GitHub URL: https://github.com/PacktPublishing/data-augmentation-with-python.

You can start using Colab by using one of the following options:

  • The first method of opening a Python Notebook is copying it from GitHub. From Colab, go to the File menu, choose Open Notebook, and then click on the GitHub tab. In the Repository field, enter the GitHub URL specified previously; refer to Figure 1.2. Lastly, select the chapter and Python Notebook (.ipynb) file:
Figure 1.2 – Loading a Python Notebook from GitHub

  • The second method of opening a Python Notebook is auto-loading it from GitHub. Go to the GitHub link mentioned previously and click on the Python Notebook (.ipynb) file. Click the blue Open in Colab button, as shown in Figure 1.3; it should be on the first line of the Python Notebook. It will launch Colab and load the Python Notebook automatically:
Figure 1.3 – Loading a Python Notebook from Colab

  • Ensure you save a copy of the Python Notebook to your Google Drive by clicking on the File menu and selecting the Save a copy in Drive option. Afterward, close the original and work in the copy.
  • The third method of opening a Python Notebook is by downloading a copy from GitHub. Upload the Python Notebook to Colab by clicking on the File menu, choosing Open Notebook, then clicking on the Upload tab, as shown in Figure 1.4:
Figure 1.4 – Loading a Python Notebook by uploading it to Colab
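The second method can also be reproduced by hand, because a Colab link appears to be the GitHub path of the notebook placed behind a fixed Colab prefix. The following is a minimal sketch under that assumption; the pattern is inferred from the tutorial link quoted in the Fun fact that follows, and the helper name colab_url is hypothetical, not part of Colab or this book's code:

```python
from urllib.parse import urlparse

# Prefix observed in the Colab tutorial link quoted in this section.
COLAB_PREFIX = "https://colab.research.google.com/github"


def colab_url(github_url):
    """Turn a github.com notebook URL into the matching Colab URL.

    Hypothetical helper: it assumes Colab mirrors the GitHub path
    (/<owner>/<repo>/blob/<branch>/<file>.ipynb) after its own prefix.
    """
    parsed = urlparse(github_url)
    if parsed.netloc != "github.com" or not parsed.path.endswith(".ipynb"):
        raise ValueError("expected a github.com URL ending in .ipynb")
    return COLAB_PREFIX + parsed.path


# Example using the tutorial notebook referenced in this section.
tutorial = (
    "https://github.com/cs231n/cs231n.github.io/"
    "blob/master/jupyter-notebook-tutorial.ipynb"
)
print(colab_url(tutorial))
```

Pasting the printed address into a browser should have the same effect as clicking the Open in Colab button, provided the notebook is in a public repository.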

Fun fact

For a quick overview of Colab’s features, go to https://colab.research.google.com/notebooks/basic_features_overview.ipynb. For a tutorial on how to use a Python Notebook, go to https://colab.research.google.com/github/cs231n/cs231n.github.io/blob/master/jupyter-notebook-tutorial.ipynb.

Choosing Colab follows the same rationale as selecting an IDE: it is based mainly on your preferences. The following section describes additional Python Notebook options.

Additional Python Notebook options

Python Notebooks are available in free and paid versions from many online vendors, such as Microsoft, Amazon, Kaggle, Paperspace, and others. Using more than one vendor is typical because a Python Notebook behaves the same way across vendors. However, it is similar to choosing an IDE – once selected, we tend to stay in the same environment.

You can use the following feature criteria to select a Python Notebook:

  • Easy to set up. Can you load and run a Python Notebook in 15 minutes?
  • A free version where you can run the Python Notebooks in this book.
  • Free CPU and GPU.
  • Free permanent storage for the Python Notebooks and versioning.
  • Easy access to GitHub.
  • Easy to upload and download the Python Notebooks to and from the local disk drive.
  • Option to upgrade to a paid version for faster CPUs and GPUs and additional RAM.

The choice of Python Notebook is based on your needs, preferences, or familiarity. You don’t have to use Google Colab for the lessons in this book. This book’s Python Notebooks will run on, but are not limited to, the following vendors:

  • Google Colab
  • Kaggle Notebooks
  • Deepnote
  • Amazon SageMaker Studio Lab
  • Paperspace Gradient
  • DataCrunch
  • Microsoft Notebooks in Visual Studio Code
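Because the same Python Notebook may be run on several of these vendors, it can be handy to detect the current environment at runtime, for example, to decide where to store downloaded data. The following is a rough sketch, not a method from this book. It assumes that Colab preloads the google.colab module and that Kaggle sets the KAGGLE_KERNEL_RUN_TYPE environment variable; both are common conventions rather than guarantees, and the function name notebook_environment is hypothetical:

```python
import os
import sys


def notebook_environment(modules=None, environ=None):
    """Guess which vendor is running this Python Notebook.

    Assumptions (not stated in this book): Colab preloads the
    `google.colab` module, and Kaggle sets KAGGLE_KERNEL_RUN_TYPE.
    Any other vendor, or a local install, is reported as "other".
    """
    modules = sys.modules if modules is None else modules
    environ = os.environ if environ is None else environ
    if "google.colab" in modules:
        return "colab"
    if "KAGGLE_KERNEL_RUN_TYPE" in environ:
        return "kaggle"
    return "other"


print("Running on:", notebook_environment())
```

The optional arguments exist only so that the check can be tried with fake values; in a notebook, call the function without arguments.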

The cloud-based options depend on having fast internet access at all times, so if internet access is a problem, you might want to install the Python Notebook locally on your laptop/computer. The installation process is straightforward.

Installing Python Notebook

Python Notebook can be installed on a local desktop or laptop running Windows, Mac, or Linux. The advantages of a local installation over the online versions are as follows:

  • Fully customizable
  • No limit on runtime – that is, no timeout on the Python Notebook during long training sessions
  • No rules or arbitrary limitations

The disadvantage is that you have to set up and maintain the environment. For example, you must do the following:

  • Install Python and Jupyter Notebook
  • Install and configure the NVIDIA graphics card (optional for data augmentation)
  • Maintain and update dozens of dependency Python libraries
  • Upgrade the disk drive, CPU, and GPU RAM

Installing Python Notebook is easy, requiring just one console or terminal command, but first, check the Python version. Type the following command in the terminal or console application:

>python3 --version

You should have version 3.7.0 or later. If you don’t have Python 3 or have an older version, install Python from https://www.python.org/downloads/.
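The same check can be made from inside Python, which is useful when several interpreters are installed and it is unclear which one a notebook is using. This is a minimal sketch; the 3.7 minimum comes from the text above, and the function name python_is_supported is hypothetical:

```python
import sys

MIN_VERSION = (3, 7)  # minimum version stated in the text


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the interpreter is at least `minimum` (major, minor)."""
    return tuple(version_info[:2]) >= minimum


# Report the running interpreter's version and whether it qualifies.
status = "OK" if python_is_supported() else "too old, please upgrade"
print(sys.version.split()[0], status)
```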

Install JupyterLab, which includes Python Notebook, using pip. The same command works on Windows, Mac, and Linux:

>pip install jupyterlab

If you don’t like pip, use conda:

>conda install -c conda-forge jupyterlab

Other than pip and conda, you can use mamba:

>mamba install -c conda-forge jupyterlab
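Whichever installer you used, you can confirm that the installation succeeded before launching anything. The sketch below uses only the Python standard library; the helper names is_installed and jupyter_status are hypothetical, and the check assumes you run it with the same Python interpreter that pip, conda, or mamba installed into:

```python
import importlib.util
import shutil


def is_installed(module_name):
    """Return True if the top-level module can be imported here."""
    return importlib.util.find_spec(module_name) is not None


def jupyter_status():
    """Summarize whether JupyterLab looks ready to start."""
    return {
        # The Python package installed by pip, conda, or mamba.
        "jupyterlab_importable": is_installed("jupyterlab"),
        # The `jupyter` launcher must be on PATH for the start command.
        "jupyter_on_path": shutil.which("jupyter") is not None,
    }


for check, passed in jupyter_status().items():
    print(f"{check}: {'yes' if passed else 'no'}")
```

If the package is importable but the launcher is not on the PATH, the usual cause is a scripts folder that the terminal does not search; opening a new terminal window often resolves it.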

Start JupyterLab or Python Notebook with the following command:

>jupyter lab

The result of installing Python Notebook on a Mac is as follows:

Figure 1.5 – Jupyter Notebook on a local MacBook

The next step is cloning this book’s Python Notebooks from the respective GitHub link. You can use the GitHub desktop app, the git command on the terminal command line, or a Python Notebook cell that starts with the magic exclamation point character (!) followed by the standard git command, as follows:

url = 'https://github.com/PacktPublishing/Data-Augmentation-with-Python'
!git clone {url}
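The exclamation point works only inside a Python Notebook, and running the cell a second time makes git report an error because the folder already exists. For a plain Python script, or for a cell that is safe to rerun, a sketch along these lines may help. It assumes git is installed and on the PATH; the function names are hypothetical, and the --depth 1 option (which downloads only the latest snapshot) is an optional choice, not something this book prescribes:

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/PacktPublishing/Data-Augmentation-with-Python"


def clone_command(url, dest):
    """Build the git command as an argument list, so no shell quoting is needed."""
    return ["git", "clone", "--depth", "1", url, str(dest)]


def clone_once(url=REPO_URL, dest=None, runner=subprocess.run):
    """Clone `url` into `dest` unless that folder already exists.

    Returns True if a clone was started, or False if it was skipped.
    `dest` defaults to the repository name, as git itself would choose.
    """
    dest = Path(dest) if dest else Path(url.rstrip("/").split("/")[-1])
    if dest.exists():
        return False  # already cloned; nothing to do
    runner(clone_command(url, dest), check=True)  # raises if git fails
    return True


# Usage (uncomment to clone into the current folder):
# clone_once()
```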

Regardless of whether you choose the cloud-based options, such as Google Colab or Kaggle, or work offline, the Python Notebook code will work the same. The following section will dive into the Python Notebook programming style and introduce you to Pluto.