Book Image

Modern Computer Vision with PyTorch

By : V Kishore Ayyadevara, Yeshwanth Reddy
Book Image

Modern Computer Vision with PyTorch

By: V Kishore Ayyadevara, Yeshwanth Reddy

Overview of this book

Deep learning is the driving force behind many recent advances in various computer vision (CV) applications. This book takes a hands-on approach to help you to solve over 50 CV problems using PyTorch1.x on real-world datasets. You’ll start by building a neural network (NN) from scratch using NumPy and PyTorch and discover best practices for tweaking its hyperparameters. You’ll then perform image classification using convolutional neural networks and transfer learning and understand how they work. As you progress, you’ll implement multiple use cases of 2D and 3D multi-object detection, segmentation, human-pose-estimation by learning about the R-CNN family, SSD, YOLO, U-Net architectures, and the Detectron2 platform. The book will also guide you in performing facial expression swapping, generating new faces, and manipulating facial expressions as you explore autoencoders and modern generative adversarial networks. You’ll learn how to combine CV with NLP techniques, such as LSTM and transformer, and RL techniques, such as Deep Q-learning, to implement OCR, image captioning, object detection, and a self-driving car agent. Finally, you'll move your NN model to production on the AWS Cloud. By the end of this book, you’ll be able to leverage modern NN architectures to solve over 50 real-world CV problems confidently.
Table of Contents (25 chapters)
1
Section 1 - Fundamentals of Deep Learning for Computer Vision
5
Section 2 - Object Classification and Detection
13
Section 3 - Image Manipulation
17
Section 4 - Combining Computer Vision with Other Techniques

Comparing AI and traditional machine learning

Traditionally, systems were made intelligent by using sophisticated algorithms written by programmers.

For example, say you are interested in recognizing whether a photo contains a dog or not. In the traditional Machine Learning (ML) setting, an ML practitioner or a subject matter expert first identifies the features that need to be extracted from images. Then they extract those features and pass them through a well-written algorithm that deciphers the given features to tell whether the image is of a dog or not. The following diagram illustrates the same idea:

Take the following samples:

From the preceding images, a simple rule might be that if an image contains three black circles aligned in a triangular shape, it can be classified as a dog. However, this rule would fail against this deceptive close-up of a muffin:

Of course, this rule also fails when shown an image with anything other than a dog's face close up. Naturally, therefore, the number of manual rules we'd need to create for the accurate classification of multiple types can be exponential, especially as images become more complex. Therefore, the traditional approach works well in a very constrained environment (say, taking a passport photo where all the dimensions are constrained within millimeters) and works badly in an unconstrained environment, where every image varies a lot.

We can extend the same line of thought to any domain, such as text or structured data. In the past, if someone was interested in programming to solve a real-world task, it became necessary for them to understand everything about the input data and write as many rules as possible to cover every scenario. This is tedious and there is no guarantee that all new scenarios would follow said rules.

However, by leveraging artificial neural networks, we can do this in a single step.

Neural networks provide the unique benefit of combining feature extraction (hand-tuning) and use those features for classification/regression in a single shot with little manual feature engineering. Both these subtasks only require labeled data (for example, which pictures are dogs and which pictures are not dogs) and neural network architecture. It does not require a human to come up with rules to classify an image, which takes away the majority of the burden traditional techniques impose on the programmer.

Notice that the main requirement is that we provide a considerable amount of examples for the task that needs a solution. For example, in the preceding case, we need to provide lots and lots of dog and not-dog pictures to the model so it learns the features. A high-level view of how neural networks are leveraged for the task of classification is as follows:

Now that we have gained a very high-level overview of the fundamental reason why neural networks perform better than traditional computer vision methods, let's gain a deeper understanding of how neural networks work throughout the various sections in this chapter.