Computer Vision on AWS

By: Lauren Mullennex, Nate Bachmeier, Jay Rao

Overview of this book

Computer vision (CV) is a field of artificial intelligence that helps transform visual data into actionable insights to solve a wide range of business challenges. This book provides prescriptive guidance to anyone looking to learn how to approach CV problems for quickly building and deploying production-ready models. You’ll begin by exploring the applications of CV and the features of Amazon Rekognition and Amazon Lookout for Vision. The book will then walk you through real-world use cases such as identity verification, real-time video analysis, content moderation, and detecting manufacturing defects that’ll enable you to understand how to implement AWS AI/ML services. As you make progress, you'll also use Amazon SageMaker for data annotation, training, and deploying CV models. In the concluding chapters, you'll work with practical code examples, and discover best practices and design principles for scaling, reducing cost, improving the security posture, and mitigating bias of CV workloads. By the end of this AWS book, you'll be able to accelerate your business outcomes by building and implementing CV into your production environments with the help of AWS AI/ML services.
Table of Contents (21 chapters)

Part 1: Introduction to CV on AWS and Amazon Rekognition
Part 2: Applying CV to Real-World Use Cases
Part 3: CV at the Edge
Part 4: Building CV Solutions with Amazon SageMaker
Part 5: Best Practices for Production-Ready CV Workloads

Understanding CV

CV is a domain within AI and ML. It enables computers to detect and understand visual inputs (videos and images) to make predictions:

Figure 1.1 – CV is a subdomain of AI and ML

Before we discuss the inner workings of a CV system, let’s summarize the different types of ML algorithms:

  • Supervised learning (SL)—Takes a set of labeled input data and predicts a known target value. For example, a model may be trained on a set of images labeled as either dog or cat. When a new, unlabeled image is processed, the model predicts whether the image contains a dog or a cat.
  • Unsupervised learning (UL)—Unlabeled data is provided, and patterns or structures need to be found within the data since no labeled target value is present. One example of UL is a targeted marketing campaign where customers need to be segmented into groups based on various common attributes such as demographics.
  • Semi-supervised learning—Combines labeled and unlabeled data. This is beneficial for CV tasks, since labeling individual images is time-consuming. With this method, only a subset of the images in the dataset needs to be labeled manually; the model then uses those labels to label and classify the remaining unlabeled images. A short sketch contrasting the supervised and unsupervised approaches follows this list.
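
To make the distinction concrete, here is a minimal sketch that trains a supervised classifier on labeled points and, separately, clusters unlabeled points with an unsupervised algorithm. It assumes scikit-learn and NumPy are installed, and the tiny dataset is invented purely for illustration:

# A toy contrast between supervised and unsupervised learning.
# Assumes scikit-learn is installed; the data below is purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: each sample has a known label (0 = cat, 1 = dog).
X_labeled = np.array([[4.0, 2.5], [4.5, 3.0], [9.0, 7.5], [9.5, 8.0]])
y_labeled = np.array([0, 0, 1, 1])
classifier = LogisticRegression().fit(X_labeled, y_labeled)
print(classifier.predict([[8.8, 7.9]]))    # predicts the known target value

# Unsupervised: no labels, the algorithm finds structure (two clusters).
X_unlabeled = np.array([[1.0, 1.2], [1.1, 0.9], [6.0, 6.3], [6.2, 5.9]])
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X_unlabeled)
print(clusters)                             # group assignment per sample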

CV architecture and applications

Now that we’ve covered the different types of ML training methods, how do they relate to CV? DL algorithms are commonly used to solve CV problems. These algorithms are composed of artificial neural networks (ANNs), which contain layers of nodes that function like the neurons in a human brain. A neural network (NN) has multiple layers: an input layer, one or more hidden layers, and an output layer. Data enters through the input layer, the nodes in the hidden layers apply transformations to it, and the output layer produces the predictions for the input data. The following figure shows an example of a deep NN (DNN) architecture:

Figure 1.2 – DNN architecture
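
To make the layer structure concrete, here is a minimal sketch of a small fully connected DNN. Keras is used purely for illustration, and the input dimension and layer sizes are arbitrary assumptions rather than an architecture from this book:

# Minimal fully connected DNN: input layer -> hidden layers -> output layer.
# Keras is used only for illustration; the sizes below are arbitrary.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),             # input layer (e.g., a flattened 28x28 image)
    tf.keras.layers.Dense(128, activation="relu"),   # hidden layer 1: transforms the inputs
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(10, activation="softmax")  # output layer: class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()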

How does this architecture apply to real-world applications? With CV and DL technology, you can detect patterns in images and use those patterns for classification. One type of NN that excels at classifying images is the convolutional NN (CNN). CNNs are a class of ANNs whose node connectivity is inspired by the way the animal visual cortex processes what the eyes see. One application of CNNs is classifying X-ray images to assist doctors with medical diagnoses:

Figure 1.3 – Image classification of X-rays
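
As a rough sketch of what a small CNN image classifier might look like (again using Keras purely for illustration; the input size, layers, and binary normal/abnormal setup are assumptions, not the model behind Figure 1.3):

# Minimal CNN for binary image classification (e.g., normal vs. abnormal).
# The architecture and input shape are illustrative assumptions.
import tensorflow as tf

cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 1)),        # grayscale input image
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # convolutions learn local patterns
    tf.keras.layers.MaxPooling2D(),                    # downsample the feature maps
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")     # probability of the positive class
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])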

There are multiple types of problems that CV can solve, which we will highlight throughout this book. Localization locates one or more objects in an image and draws a bounding box around each of them. Object detection combines localization and classification to identify and classify one or multiple objects in an image. These tasks are more complicated than image classification. Faster R-CNN (Region-based CNN), SSD (Single Shot Detector), and YOLO (You Only Look Once) are other types of DNN models that can be used for object detection tasks. These models are optimized for performance, with designs that reduce inference latency and increase accuracy.
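
Amazon Rekognition exposes object detection through its DetectLabels API. The following sketch assumes boto3 credentials are configured and that an image exists at the placeholder bucket and key shown; it prints the detected labels and any bounding boxes returned for object instances:

# Detect objects in an S3 image with Amazon Rekognition's DetectLabels API.
# The bucket and key below are placeholders; replace them with your own.
import boto3

rekognition = boto3.client("rekognition")
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "my-example-bucket", "Name": "images/street.jpg"}},
    MaxLabels=10,
    MinConfidence=80,
)

for label in response["Labels"]:
    print(f"{label['Name']}: {label['Confidence']:.1f}%")
    for instance in label.get("Instances", []):  # bounding boxes, when present
        box = instance["BoundingBox"]            # Left/Top/Width/Height as image ratios
        print(f"  box: left={box['Left']:.2f}, top={box['Top']:.2f}, "
              f"width={box['Width']:.2f}, height={box['Height']:.2f}")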

Segmentation—including instance segmentation and semantic segmentation—classifies an image at the pixel level rather than labeling whole objects. Segmentation can also be applied to videos to detect black frames, color bars, end credits, and shot changes:

Figure 1.4 – Examples of different CV problem types
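
The video segmentation capabilities just mentioned are available in Amazon Rekognition Video through its segment detection APIs. Here is a hedged sketch: the bucket, key, and polling loop are illustrative, and a production job would typically use an SNS notification instead of polling:

# Detect technical cues (black frames, color bars, end credits) and shot changes
# in an S3 video using Amazon Rekognition Video segment detection.
# The bucket and key are placeholders.
import time
import boto3

rekognition = boto3.client("rekognition")
job = rekognition.start_segment_detection(
    Video={"S3Object": {"Bucket": "my-example-bucket", "Name": "videos/episode.mp4"}},
    SegmentTypes=["TECHNICAL_CUE", "SHOT"],
)

# Poll until the asynchronous job finishes (for simplicity only).
while True:
    result = rekognition.get_segment_detection(JobId=job["JobId"])
    if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

for segment in result.get("Segments", []):
    cue = segment.get("TechnicalCueSegment", {}).get("Type", "")
    print(segment["Type"], cue,
          segment["StartTimecodeSMPTE"], "->", segment["EndTimecodeSMPTE"])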

Despite recent advances in CV and DL, there are still challenges within the field. CV systems are complex, there are vast amounts of data to process, and several considerations need to be addressed before training a model. Because a model is only as good as the quality of your data, it is important to understand the data available and the steps required to prepare it for model training.

Data processing and feature engineering

CV deals with images and videos, which are a form of unstructured data. Unstructured data does not have a predefined data model and cannot be stored in a database row and column format. This type of data poses unique challenges compared to tabular data. More processing is required to transform the data into a usable format. A computer sees an image as a matrix of pixel values. In the red, green, blue (RGB) system, each pixel is a set of three values, each between 0 and 255. Images vary in their resolutions, dimensions, and colors. In order to train a model, CV algorithms require that images are normalized to the same size. Additional image processing techniques include resizing, rotating, enhancing the resolution, and converting from RGB to grayscale. Another technique is image masking, which allows us to focus on a region of interest. In the following photos, we apply a mask to highlight the motorcycle:

Figure 1.5 – Applying an image mask to highlight the motorcycle

Preprocessing is important since images are often large and take up lots of storage. Resizing an image and converting it to grayscale can speed up the ML training process. However, this technique is not always optimal for the problem we’re trying to solve. For example, in medical image analysis such as skin cancer diagnosis, the colors of the samples are relevant for a proper diagnosis. This is why it’s important to have a complete understanding of the business problem you’re trying to solve before choosing how to process your data. In the following chapters, we’ll provide code examples that detail various image preprocessing steps.
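
As a small preview, here is a minimal preprocessing sketch using Pillow and NumPy; the file path, target size, and mask coordinates are placeholders. It resizes an image, converts it to grayscale, scales pixel values to the 0-1 range, and applies a simple rectangular mask to keep only a region of interest:

# Basic image preprocessing: resize, convert to grayscale, normalize, and mask.
# The file path, target size, and mask coordinates are illustrative placeholders.
import numpy as np
from PIL import Image

image = Image.open("motorcycle.jpg")                 # load the original RGB image
image = image.resize((224, 224))                     # normalize all images to one size
gray = image.convert("L")                            # convert RGB to grayscale

pixels = np.array(gray, dtype=np.float32) / 255.0    # scale pixel values to [0, 1]

# Rectangular mask: keep a region of interest, zero out everything else.
mask = np.zeros_like(pixels)
mask[60:200, 40:180] = 1.0                           # rows/columns of the region to keep
masked = pixels * mask

print(pixels.shape, masked.min(), masked.max())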

Features or attributes in ML are important input data characteristics that affect the output or target variable of a model. Distinct features in an image help a model differentiate objects from one another. Determining relevant features depends on the context of your business problem. If you’re trying to identify a Golden Retriever dog in a group of images also containing cats, then height is an important feature. However, if you’re looking to classify different types of dogs, then height is not always a distinguishing feature since Golden Retrievers are similar in height to many other dog breeds. In this case, color and coat length might be more useful features.

Data labeling

Data annotation, or data labeling, is the process of labeling your input datasets. It helps derive value from your unstructured data for SL. Some of the challenges with data labeling are that it is a time-consuming manual process, human labelers can introduce bias, and it is difficult to scale. Amazon SageMaker Ground Truth Plus (https://aws.amazon.com/sagemaker/data-labeling/) helps address these challenges by automating much of the process. It provides a labeling user interface (UI) and quality workflow customizations. The labeling is done by an expert workforce with domain knowledge of the ML tasks to complete. This improves label quality and leads to better training datasets. In Chapter 9, we will cover a code example using SageMaker Ground Truth Plus.

Amazon Rekognition Custom Labels (https://aws.amazon.com/rekognition/custom-labels-features/) also provides a visual interface to label your images. Labels can be applied to the entire image or you can create bounding boxes to label specific objects. In the next two chapters, we will discuss Amazon Rekognition and Rekognition Custom Labels in more detail.

In this section, we discussed the architecture behind DL CV algorithms. We also covered data processing, feature engineering, and data labeling considerations to create high-quality training datasets. In the next section, we will discuss the evolution of CV and how it can be applied to many different business use cases.