Mastering PyTorch - Second Edition

By : Ashish Ranjan Jha

Overview of this book

PyTorch is making it easier than ever before for anyone to build deep learning applications. This PyTorch deep learning book will help you uncover expert techniques to get the most out of your data and build complex neural network models. You’ll build convolutional neural networks for image classification and recurrent neural networks and transformers for sentiment analysis. As you advance, you'll apply deep learning across different domains, such as music, text, and image generation, using generative models, including diffusion models. You'll not only build and train your own deep reinforcement learning models in PyTorch but also learn to optimize model training using multiple CPUs, GPUs, and mixed-precision training. You’ll deploy PyTorch models to production, including mobile devices. Finally, you’ll discover the PyTorch ecosystem and its rich set of libraries. These libraries will add another set of tools to your deep learning toolbelt, teaching you how to use fastai to prototype models and PyTorch Lightning to train models. You’ll discover libraries for AutoML and explainable AI (XAI), create recommendation systems, and build language and vision transformers with Hugging Face. By the end of this book, you'll be able to perform complex deep learning tasks using PyTorch to build smart artificial intelligence models.

Developing LeNet from scratch

LeNet, originally known as LeNet-5, is one of the earliest CNN models, developed in 1998. The number 5 in LeNet-5 represents the total number of layers in this model, that is, two convolutional and three fully connected layers. With roughly 60,000 total parameters, this model gave state-of-the-art performance on handwritten digit recognition at the time. As expected from a CNN model, LeNet demonstrated a degree of invariance to rotation, position, and scale, as well as robustness against distortion in images. Unlike the classical machine learning models of the time, such as SVMs, which treated each pixel of the image separately, LeNet exploited the correlation among neighboring pixels.

Note that although LeNet was developed for handwritten digit recognition, it can certainly be extended for other image classification tasks, as we shall see in our next exercise. The following diagram shows the architecture of a LeNet model:

Figure 2.6: LeNet architecture

As mentioned earlier, there are two convolutional layers followed by three fully connected layers (including the output layer). This approach of stacking convolutional layers followed by fully connected layers later became a common practice in CNN research and is still applied to the latest CNN models.

This is because, by the time we reach the output of the final convolutional layer, it has small spatial dimensions (height and width) but a high depth, which makes it look like an embedding of the input image. This embedding behaves like a vector that can be fed into a fully connected network, which is essentially a stack of fully connected layers. Besides these layers, there are pooling layers in between. These are subsampling layers that reduce the spatial size of the image representation, thereby reducing the number of parameters and computations while condensing the input information. The pooling layer used in LeNet was an average pooling layer that had trainable weights. Soon after, max pooling emerged as the most commonly used pooling function in CNNs.
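To make the subsampling idea concrete, the following is a minimal sketch of non-overlapping 2x2 pooling in plain Python. The helper name pool2x2 and the sample values are made up for illustration, and the averaging shown here has no trainable weights, unlike the pooling layer in the original LeNet.

```python
def pool2x2(grid, reduce_fn):
    """Apply non-overlapping 2x2 pooling to a 2D list using reduce_fn."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            reduce_fn([grid[r][c], grid[r][c + 1],
                       grid[r + 1][c], grid[r + 1][c + 1]])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

# Hypothetical 4x4 feature map
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 4, 6],
    [0, 2, 9, 1],
    [4, 6, 3, 3],
]

max_pooled = pool2x2(feature_map, max)
avg_pooled = pool2x2(feature_map, lambda window: sum(window) / len(window))
print(max_pooled)  # [[7, 6], [6, 9]]
print(avg_pooled)  # [[4.0, 3.0], [3.0, 4.0]]
```

Either way, the 4x4 input shrinks to 2x2, which is the reduction in spatial size described above.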

The numbers in brackets in each layer in the figure demonstrate the dimensions (for input, output, and fully connected layers) or window size (for convolutional and pooling layers). The expected input size for a grayscale image is 32x32 pixels. This image is then operated on by 5x5 convolutional kernels, followed by 2x2 pooling, and so on. The output layer size is 10, representing the 10 classes.
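The size arithmetic behind these numbers can be checked with a few lines of plain Python. This is a minimal sketch, assuming unpadded, stride-1 convolutions and non-overlapping 2x2 pooling; the helper names conv_out and pool_out are illustrative and not part of PyTorch, and the 16 feature maps are taken from the second convolutional layer of the model built in this section.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    """Spatial output size of non-overlapping pooling along one dimension."""
    return size // window

size = 32
trace = [size]
for _ in range(2):                 # two convolution + pooling stages
    size = conv_out(size, 5)
    trace.append(size)
    size = pool_out(size, 2)
    trace.append(size)

print(trace)                       # [32, 28, 14, 10, 5]
flattened = 16 * size * size       # 16 feature maps of size 5x5
print(flattened)                   # 400
```

The final value, 400, is the number of features that the first fully connected layer receives.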

In this section, we will use PyTorch to build LeNet from scratch and train and evaluate it on a dataset of images for the task of image classification. We will see how easy and intuitive it is to build the network architecture in PyTorch using the outline from Figure 2.6.

Furthermore, we will demonstrate how effective LeNet is, even on a dataset different from the one it was originally developed on (that is, MNIST), and how PyTorch makes it easy to train and test the model in a few lines of code.

Using PyTorch to build LeNet

Follow these steps to build the model:

  1. For this exercise, we will need to import a few dependencies. Execute the following import statements:
    import numpy as np
    import matplotlib.pyplot as plt
    import torch
    import torchvision
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.transforms as transforms
    torch.use_deterministic_algorithms(True)
    

Besides the usual imports, we also call torch.use_deterministic_algorithms(True), which requires PyTorch operations to use deterministic algorithms and thereby helps with the reproducibility of this exercise.
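Deterministic algorithms do not by themselves fix the random number generators used for weight initialization and data shuffling. A common companion step, shown here as a sketch rather than as part of the original exercise, is to seed those generators explicitly. The helper name seed_everything and the seed value 0 are arbitrary choices for illustration.

```python
import random

import numpy as np
import torch

def seed_everything(seed=0):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

seed_everything(0)
first = torch.rand(3)
seed_everything(0)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

Re-seeding before each draw yields identical tensors, which is the behavior we want when comparing runs.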

  2. Next, we will define the model architecture based on the outline given in Figure 2.6:
    class LeNet(nn.Module):
        def __init__(self):
            super(LeNet, self).__init__()
            # 3 input image channel, 6 output 
            # feature maps and 5x5 conv kernel
            self.cn1 = nn.Conv2d(3, 6, 5)
            # 6 input image channel, 16 output 
            # feature maps and 5x5 conv kernel
            self.cn2 = nn.Conv2d(6, 16, 5)
            # fully connected layers of size 120, 84 and 10
            # 5*5 is the spatial dimension at this layer
            self.fc1 = nn.Linear(16 * 5 * 5, 120) 
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)
        def forward(self, x):
            # Convolution with 5x5 kernel
            x = F.relu(self.cn1(x))
            # Max pooling over a (2, 2) window
            x = F.max_pool2d(x, (2, 2))
            # Convolution with 5x5 kernel
            x = F.relu(self.cn2(x))
            # Max pooling over a (2, 2) window
            x = F.max_pool2d(x, (2, 2))
            # Flatten spatial and depth dimensions 
            # into a single vector
            x = x.view(-1, self.flattened_features(x))
            # Fully connected operations
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x
        def flattened_features(self, x):
            # all except the first (batch) dimension
            size = x.size()[1:]  
            num_feats = 1
            for s in size:
                num_feats *= s
            return num_feats
    lenet = LeNet()
    print(lenet)
    

In the last two lines, we instantiate the model and print the network architecture. The output will be as follows:

LeNet(
  (cn1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (cn2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

There are the usual __init__ and forward methods for architecture definition and running a forward pass, respectively. The additional flattened_features method is meant to calculate the total number of features in an image representation layer (usually an output of a convolutional layer or pooling layer). This method helps to flatten the spatial representation of features into a single vector of numbers, which is then used as input to fully connected layers.
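As an aside, the same flattening can be done with the built-in torch.flatten function. The following sketch compares the two approaches on a random tensor; the batch size of 8 is an arbitrary choice for illustration.

```python
import torch

x = torch.randn(8, 16, 5, 5)          # (batch, channels, height, width)

# Manual approach: multiply every dimension except the batch dimension
num_feats = 1
for s in x.size()[1:]:
    num_feats *= s
manual = x.view(-1, num_feats)

# Built-in alternative: flatten from dimension 1 onwards
builtin = torch.flatten(x, start_dim=1)

print(manual.shape)                    # torch.Size([8, 400])
print(torch.equal(manual, builtin))    # True
```

Both produce one 400-element vector per image, so either could feed the first fully connected layer.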

Besides the details of the architecture mentioned earlier, ReLU is used throughout the network as the activation function. Also, unlike the original LeNet network, which takes in single-channel images, the current model is modified to accept RGB images, that is, three channels, as input. This is done in order to adapt to the dataset that is used for this exercise.
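The effect of this change on model size can be estimated by counting weights and biases layer by layer. The following is a rough sketch in plain Python for the model as defined in this section (pooling without trainable weights), so it is not an exact count for the original 1998 network. The helper names are illustrative.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel convolutional layer."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_feats, out_feats):
    """Weights plus biases of a fully connected layer."""
    return in_feats * out_feats + out_feats

def lenet_params(input_channels):
    """Total parameters of the LeNet variant defined in this section."""
    return (
        conv_params(input_channels, 6, 5)
        + conv_params(6, 16, 5)
        + linear_params(16 * 5 * 5, 120)
        + linear_params(120, 84)
        + linear_params(84, 10)
    )

print(lenet_params(1))  # 61706 (single-channel input)
print(lenet_params(3))  # 62006 (RGB input, as in this exercise)
```

Moving from one to three input channels adds only 300 weights, all in the first convolutional layer, and both totals are in line with the figure of roughly 60,000 parameters.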

  3. We then define the training routine, that is, the actual backpropagation step:
    def train(net, trainloader, optim, epoch):
        # initialize loss
        loss_total = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            # ip refers to the input images, and ground_truth 
            # refers to the output classes the images belong to
            ip, ground_truth = data
            # zero the parameter gradients
            optim.zero_grad()
            # forward-pass + backward-pass + optimization -step
            op = net(ip)
            loss = nn.CrossEntropyLoss()(op, ground_truth)
            loss.backward()
            optim.step()
            # update loss
            loss_total += loss.item()
            # print loss statistics
            if (i+1) % 1000 == 0:
                # print at the interval of 1000 mini-batches
                print('[Epoch number : %d, Mini-batches: %5d] '
                      'loss: %.3f' % (epoch + 1, i + 1,
                                      loss_total / 200))
                loss_total = 0.0
    

For each epoch, this function iterates through the entire training dataset, runs a forward pass through the network, and, using backpropagation, updates the parameters of the model based on the specified optimizer. After every 1,000 mini-batches, it also logs the loss accumulated since the previous log. Note that the accumulated sum covers 1,000 mini-batches but is divided by 200, so the logged value is five times the mean per-mini-batch loss; dividing by 1,000 instead would report the mean.
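The logging logic in train keeps a running sum and resets it after each report. The following sketch isolates that bookkeeping in plain Python and normalizes by the window size, whereas the training routine above logs every 1,000 mini-batches but divides by 200. The function name windowed_means and the loss values are hypothetical.

```python
def windowed_means(losses, window):
    """Return the mean loss of each complete window of mini-batches."""
    means = []
    running = 0.0
    for i, loss in enumerate(losses):
        running += loss
        if (i + 1) % window == 0:
            means.append(running / window)  # normalize by the window size
            running = 0.0
    return means

# Hypothetical per-batch losses, used only to illustrate the bookkeeping
fake_losses = [2.0, 2.0, 1.0, 1.0, 0.5, 0.5, 9.9]
print(windowed_means(fake_losses, window=2))  # [2.0, 1.0, 0.5]
```

The trailing value is never reported because it does not complete a window, just as the mini-batches after the last multiple of 1,000 in an epoch are not logged by train.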

  4. Similar to the training routine, we will define the test routine that we will use to evaluate model performance:
    def test(net, testloader):
        success = 0
        counter = 0
        with torch.no_grad():
            for data in testloader:
                im, ground_truth = data
                op = net(im)
                _, pred = torch.max(op.data, 1)
                counter += ground_truth.size(0)
                success += (pred == ground_truth).sum().item()
        print('LeNet accuracy on 10000 images from test dataset: %d %%'\
            % (100 * success / counter))
    

This function runs a forward pass through the model for each test-set image, calculates the correct number of predictions, and prints the percentage of correct predictions on the test set.
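The accuracy computation can be tried in isolation on a handful of made-up outputs. This sketch uses argmax, which returns the same class indices as the second value returned by torch.max(op, 1); the logits and labels are hypothetical.

```python
import torch

# Hypothetical logits for 4 samples and 3 classes
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.8, 0.7],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 1, 2, 2])

pred = logits.argmax(dim=1)                 # index of the largest logit per row
correct = (pred == labels).sum().item()     # number of matching predictions
accuracy = 100 * correct / labels.size(0)
print(pred.tolist())                        # [0, 1, 0, 2]
print(accuracy)                             # 75.0
```

Three of the four predictions match their labels, giving 75% accuracy.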

  5. Before we get on to training the model, we need to load the dataset. For this exercise, we will be using the CIFAR-10 dataset.

Dataset citation

The images in this section are from Learning Multiple Layers of Features from Tiny Images, Alex Krizhevsky, 2009: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf. They are part of the CIFAR-10 dataset (toronto.edu): https://www.cs.toronto.edu/~kriz/cifar.html

This dataset consists of 60,000 32x32 RGB images labeled across 10 classes, with 6,000 images per class. The 60,000 images are split into 50,000 training images and 10,000 test images. More details can be found at the dataset website [2]. The torchvision library provides the CIFAR-10 dataset through the CIFAR10 class in its torchvision.datasets module. We will use this module to load the data directly and instantiate the train and test dataloaders, as demonstrated in the following code:

# ToTensor scales pixel values to the range 0 to 1;
# normalizing with a mean and std of 0.5 per channel
# then maps them to the range -1 to 1
train_transform = transforms.Compose(
    [transforms.RandomHorizontalFlip(),
     transforms.RandomCrop(32, 4),
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), 
                          (0.5, 0.5, 0.5))])
trainset = torchvision.datasets.CIFAR10(root='./data', 
    train=True, download=True, transform=train_transform)
trainloader = torch.utils.data.DataLoader(trainset, 
    batch_size=8, shuffle=True)
test_transform = transforms.Compose([transforms.ToTensor(), 
    transforms.Normalize((0.5, 0.5, 0.5), 
                         (0.5, 0.5, 0.5))])
testset = torchvision.datasets.CIFAR10(root='./data', 
    train=False, download=True, transform=test_transform)
testloader = torch.utils.data.DataLoader(testset, 
    batch_size=10000, shuffle=False)
# ordering is important
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 
           'frog', 'horse', 'ship', 'truck')

In the next chapter, we will download the dataset and write a custom dataset class and a dataloader function. We will not need to write those here, thanks to the torchvision.datasets module.

Because we set the download flag to True, the dataset will be downloaded locally. Then, we shall see the following output:

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100%
170498071/170498071 [00:34<00:00, 5191345.41it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified

The transformations used for training and testing datasets are different because we apply some data augmentation to the training dataset, such as flipping and cropping, which are not applicable to the test dataset. Also, after defining trainloader and testloader, we declare the 10 classes in this dataset with a pre-defined ordering.
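Both transforms end with the same normalization step. As a small sketch of the arithmetic, the following plain Python applies (x - mean) / std with a mean and std of 0.5 to a few pixel values, and then inverts it with value / 2 + 0.5, the expression used by the imageshow helper in this section. The function names are illustrative.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel operation applied by Normalize: (x - mean) / std."""
    return (pixel - mean) / std

def unnormalize(value):
    """Inverse used for display: value / 2 + 0.5."""
    return value / 2 + 0.5

for pixel in (0.0, 0.25, 1.0):
    z = normalize(pixel)
    print(pixel, z, unnormalize(z))
# 0.0 -1.0 0.0
# 0.25 -0.5 0.25
# 1.0 1.0 1.0
```

Pixel values in the range 0 to 1 are therefore mapped to the range -1 to 1 before they reach the network.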

  6. After loading the datasets, let’s investigate how the data looks:
    # define a function that displays an image
    def imageshow(image):
        # un-normalize the image
        image = image/2 + 0.5
        npimage = image.numpy()
        plt.imshow(np.transpose(npimage, (1, 2, 0)))
        plt.show()
    # sample images from training set
    dataiter = iter(trainloader)
    images, labels = next(dataiter)
    # display images in a grid
    num_images = 4
    imageshow(torchvision.utils.make_grid(images[:num_images]))
    # print labels
    print('    '+'  ||  '.join(classes[labels[j]] 
                               for j in range(num_images)))
    

The preceding code shows us four sample images with their respective labels from the training dataset. The output will be as follows:

Figure 2.7: CIFAR-10 dataset samples

The preceding output shows us four color images that are 32x32 pixels in size. These four images belong to four different labels, as displayed in the text following the images.

We will now train the LeNet model.

Training LeNet

Let us train the model with the help of the following steps:

  1. We will define the optimizer and start the training loop as shown here:
    # define optimizer
    optim = torch.optim.Adam(lenet.parameters(), lr=0.001)
    # training loop over the dataset multiple times
    for epoch in range(50):  
        train(lenet, trainloader, optim, epoch)
        print()
        test(lenet, testloader)
        print()
    print('Finished Training')
    

The output will be as follows:

[Epoch number : 1, Mini-batches:  1000] loss: 9.804
[Epoch number : 1, Mini-batches:  2000] loss: 8.783
[Epoch number : 1, Mini-batches:  3000] loss: 8.444
[Epoch number : 1, Mini-batches:  4000] loss: 8.118
[Epoch number : 1, Mini-batches:  5000] loss: 7.819
[Epoch number : 1, Mini-batches:  6000] loss: 7.672
LeNet accuracy on 10000 images from test dataset: 44 %
...
[Epoch number : 50, Mini-batches:  1000] loss: 5.022
[Epoch number : 50, Mini-batches:  2000] loss: 5.067
[Epoch number : 50, Mini-batches:  3000] loss: 5.137
[Epoch number : 50, Mini-batches:  4000] loss: 5.009
[Epoch number : 50, Mini-batches:  5000] loss: 5.107
[Epoch number : 50, Mini-batches:  6000] loss: 4.977
LeNet accuracy on 10000 images from test dataset: 67 %
Finished Training

  2. Once the training is finished, we can save the model file locally:
    model_path = './cifar_model.pth'
    torch.save(lenet.state_dict(), model_path)
    

Having trained the LeNet model, we will now test its performance on the test dataset in the next section.
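The shape of the training log follows from the dataset and batch sizes. With 50,000 training images and a batch size of 8, the arithmetic can be sketched as follows; the variable names are illustrative.

```python
import math

train_images = 50_000
batch_size = 8
log_interval = 1_000
epochs = 50

batches_per_epoch = math.ceil(train_images / batch_size)
logs_per_epoch = batches_per_epoch // log_interval
total_updates = batches_per_epoch * epochs

print(batches_per_epoch)  # 6250
print(logs_per_epoch)     # 6
print(total_updates)      # 312500
```

This matches the six log lines printed per epoch; the last 250 mini-batches of each epoch update the model but are not reported.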

Testing LeNet

The following steps need to be followed to test the LeNet model:

  1. Let’s make predictions by loading the saved model and running it on the test dataset:
    # load test dataset images
    d_iter = iter(testloader)
    im, ground_truth = next(d_iter)
    # print images and ground truth
    imageshow(torchvision.utils.make_grid(im[:4]))
    print('Label:      ', ' '.join('%5s' % 
                                   classes[ground_truth[j]] 
                                   for j in range(4)))
    # load model
    lenet_cached = LeNet()
    lenet_cached.load_state_dict(torch.load(model_path))
    # model inference
    op = lenet_cached(im)
    # print predictions
    _, pred = torch.max(op, 1)
    print('Prediction: ', ' '.join('%5s' % classes[pred[j]] 
                                   for j in range(4)))
    

The output will be as follows:

Figure 2.8: LeNet predictions

Evidently, three out of four predictions are correct.
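The save-and-reload pattern used here can be verified on a tiny stand-in model. This is a minimal sketch, assuming a temporary file path and an arbitrary nn.Linear layer in place of LeNet. It also wraps inference in eval() and torch.no_grad(), which is a common practice rather than something the code above requires, since this LeNet has no dropout or batch normalization layers.

```python
import os
import tempfile

import torch
import torch.nn as nn

# A tiny stand-in model; the layer sizes here are arbitrary
model = nn.Linear(4, 2)
x = torch.randn(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tiny_model.pth')
    torch.save(model.state_dict(), path)

    restored = nn.Linear(4, 2)             # same architecture, fresh weights
    restored.load_state_dict(torch.load(path))

restored.eval()                            # switch layers such as dropout to inference mode
with torch.no_grad():                      # skip gradient tracking during inference
    same = torch.equal(model(x), restored(x))
print(same)  # True
```

The restored model reproduces the outputs of the original one, which is what we rely on when loading the saved LeNet weights.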

  2. Finally, we will check the overall accuracy of this model on the test dataset as well as the per-class accuracy:
    success = 0
    counter = 0
    with torch.no_grad():
        for data in testloader:
            im, ground_truth = data
            op = lenet_cached(im)
            _, pred = torch.max(op.data, 1)
            counter += ground_truth.size(0)
            success += (pred == ground_truth).sum().item()
    print('Model accuracy on 10000 images from test dataset: %d %%'
          % (100 * success / counter))
    

The output will be as follows:

Model accuracy on 10000 images from test dataset: 67 %

  3. For per-class accuracy, the code is as follows:
    class_success = [0.0] * 10
    class_counter = [0.0] * 10
    with torch.no_grad():
        for data in testloader:
            im, ground_truth = data
            op = lenet_cached(im)
            _, pred = torch.max(op, 1)
            c = (pred == ground_truth)
            # iterate over the actual batch size rather than a hard-coded 10000
            for i in range(ground_truth.size(0)):
                ground_truth_curr = ground_truth[i]
                class_success[ground_truth_curr] += c[i].item()
                class_counter[ground_truth_curr] += 1
    for i in range(10):
        print('Model accuracy for class %5s : %2d %%' % (
            classes[i], 100 * class_success[i] / class_counter[i]))
    

The output will be as follows:

Model accuracy for class plane : 70 %
Model accuracy for class   car : 83 % 
Model accuracy for class  bird : 45 %  
Model accuracy for class   cat : 37 %  
Model accuracy for class  deer : 80 %  
Model accuracy for class   dog : 52 %  
Model accuracy for class  frog : 81 %  
Model accuracy for class horse : 71 %  
Model accuracy for class  ship : 76 %  
Model accuracy for class truck : 74 %

Some classes have better performance than others. Overall, the model is far from perfect (that is, 100% accuracy) but much better than a model making random predictions, which would have an accuracy of 10% (due to the 10 classes).
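These numbers can be put in context with a little arithmetic. The following sketch uses the per-class accuracies printed above, the accuracy of uniform random guessing, and the cross-entropy loss of a uniform prediction over 10 classes. Since the CIFAR-10 test set contains the same number of images for every class, the mean of the per-class accuracies should agree with the overall accuracy, up to the truncation of the printed values.

```python
import math

# Per-class accuracies (in percent) as printed in the output above
per_class = {
    'plane': 70, 'car': 83, 'bird': 45, 'cat': 37, 'deer': 80,
    'dog': 52, 'frog': 81, 'horse': 71, 'ship': 76, 'truck': 74,
}

num_classes = len(per_class)
chance_accuracy = 100 / num_classes          # uniform random guessing
chance_loss = math.log(num_classes)          # cross-entropy of a uniform prediction
macro_accuracy = sum(per_class.values()) / num_classes

print(chance_accuracy)            # 10.0
print(round(chance_loss, 3))      # 2.303
print(macro_accuracy)             # 66.9
```

The mean per-class accuracy of 66.9% is consistent with the reported overall accuracy of 67%, and both are well above the 10% chance level.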

Having built a LeNet model from scratch and evaluated its performance using PyTorch, we will now move on to a successor of LeNet – AlexNet. For LeNet, we built the model from scratch, trained, and tested it. For AlexNet, we will use a pretrained model, fine-tune it on a smaller dataset, and test it.