Computer Vision on AWS

By: Lauren Mullennex, Nate Bachmeier, Jay Rao

Overview of this book

Computer vision (CV) is a field of artificial intelligence that helps transform visual data into actionable insights to solve a wide range of business challenges. This book provides prescriptive guidance to anyone looking to learn how to approach CV problems for quickly building and deploying production-ready models. You’ll begin by exploring the applications of CV and the features of Amazon Rekognition and Amazon Lookout for Vision. The book will then walk you through real-world use cases such as identity verification, real-time video analysis, content moderation, and detecting manufacturing defects that’ll enable you to understand how to implement AWS AI/ML services. As you make progress, you'll also use Amazon SageMaker for data annotation, training, and deploying CV models. In the concluding chapters, you'll work with practical code examples, and discover best practices and design principles for scaling, reducing cost, improving the security posture, and mitigating bias of CV workloads. By the end of this AWS book, you'll be able to accelerate your business outcomes by building and implementing CV into your production environments with the help of AWS AI/ML services.
Table of Contents (21 chapters)
Part 1: Introduction to CV on AWS and Amazon Rekognition
Part 2: Applying CV to Real-World Use Cases
Part 3: CV at the Edge
Part 4: Building CV Solutions with Amazon SageMaker
Part 5: Best Practices for Production-Ready CV Workloads

Solving business challenges with CV

CV has tremendous business value across a variety of industries and use cases. There have also been recent technological advancements that are generating excitement within the field. The first use case of CV was noted over 60 years ago when a digital scanner was used to transform images into grids of numbers. Today, vision transformers and generative AI allow us to quickly create images and videos from text prompts. The applications of CV are evident across every industry, including healthcare, manufacturing, media and entertainment, retail, agriculture, sports, education, and transportation. Deriving meaningful insights from images and videos has helped accelerate business efficiency and improved the customer experience. In this section, we will briefly cover the latest CV implementations and highlight use cases that we will be diving deeper into throughout this book.

New applications of CV

In 1963, Lawrence Roberts, who is often considered the “father” of CV, presented in his paper Machine Perception of Three-Dimensional Solids (https://dspace.mit.edu/bitstream/handle/1721.1/11589/33959125-MIT.pdf) how a computer could construct a 3D array of objects from a 2D photograph. This groundbreaking paper led researchers to explore the value of image recognition and object detection. Since the advent of NNs and DL, the field of CV has made great strides in developing more accurate and efficient models. Earlier, we reviewed some of these models, such as CNNs and YOLO, which are widely adopted for a variety of CV tasks. More recently, a new class of models called vision transformers has emerged that can outperform CNNs in accuracy and efficiency on some tasks. Before we review vision transformers in more detail, let’s summarize the idea of transformers and their relevance in CV.

To understand transformers, we first need to explore a DL concept used in natural language processing (NLP), called attention. Transformers and self-attention were introduced in the paper Attention is All You Need (https://arxiv.org/pdf/1706.03762.pdf). The attention mechanism is also used in RNN sequence-to-sequence (seq2seq) models. One application of seq2seq models is language translation. A seq2seq model is composed of an encoder and a decoder. The encoder processes the input sequence, and the decoder generates the transformed output. The encoder summarizes the input sequence into hidden state vectors and a context vector, which are passed to the decoder to predict the output sequence. The following diagram is an illustration of these concepts:

Figure 1.6 – Translating a sentence from English to German using a seq2seq model

In the preceding figure, we pay attention to the context of the words in the input to determine the next word when generating the output. Attention, as described in Attention is All You Need, also weighs the importance of different inputs when making predictions. Here is a sentiment analysis example from the paper for a hotel service task, where the bold words are considered relevant:

Figure 1.7 – Example of attention for sentiment analysis from “Attention is All You Need”

A transformer relies on self-attention, which is defined in the paper as “an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence”. Transformers are important in the application of NLP because they capture the relationship and context of words in text. Take a look at the following sentences:

Andy Jassy is the current CEO of Amazon. He was previously the CEO of Amazon Web Services.

Using transformers, we are able to understand that “He” in the second sentence is referring to Andy Jassy. Without this context of the subject in the first sentence, it is difficult to understand the relationship between the rest of the words in the text.
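
To make the idea of self-attention more concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not how a production transformer is built: the token list and the two-dimensional embeddings are hypothetical values picked by hand, and the queries, keys, and values are the raw embeddings rather than learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention over a single sequence.

    Assumption: queries, keys, and values are the raw embeddings.
    A real transformer first applies learned linear projections.
    """
    dim = len(embeddings[0])
    weights, outputs = [], []
    for query in embeddings:
        # Score this position against every position in the same sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        row = softmax(scores)
        weights.append(row)
        # The new representation is a weighted average of all positions.
        outputs.append(
            [sum(w * value[i] for w, value in zip(row, embeddings))
             for i in range(dim)]
        )
    return weights, outputs


# Hypothetical embeddings, chosen so that "He" points in a similar
# direction to "Andy". Real embeddings are learned and much larger.
tokens = ["Andy", "CEO", "He"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

weights, outputs = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

With these made-up vectors, the row for “He” assigns more weight to “Andy” than to “CEO”, which mirrors how self-attention lets a model relate a pronoun to its subject.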

Now that we’ve reviewed transformers and explained their importance in NLP, how does this relate to CV? The vision transformer was introduced in a 2021 paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (https://arxiv.org/pdf/2010.11929v2.pdf). Vision transformers extend the concept of text transformers to images by treating an image as a sequence of patches, much as a sentence is a sequence of words. The technical details of vision transformers are outside of the scope of this book; however, they have shown strong results compared with CNNs on image classification tasks. The transformer architecture has also enabled new innovations such as generative AI. Generative AI is blurring the line between NLP and CV, which have traditionally required separate models. With generative AI, we can generate images from a text phrase. One image generator developed by OpenAI is called DALL-E (https://openai.com/blog/dall-e-now-available-without-waitlist/). Another example created by Stability AI is called Stable Diffusion (https://huggingface.co/spaces/stabilityai/stable-diffusion). All that is required to generate an image is to type in an English phrase. The following figure shows an example of images generated by Stable Diffusion:

Figure 1.8 – Images created from text “An astronaut in the mountains” using Stable Diffusion

The potential use cases for transformers and generative AI are just beginning to be explored. Throughout the rest of this book, we will discuss the following real-world applications of CV and provide code examples.

Contactless check-in and checkout

To improve the customer experience, many businesses have adopted contactless check-in and checkout processes. This provides a frictionless experience that reduces cost and is easier to scale. It also adds a layer of enhanced security. Instead of checking out from a grocery store using a credit card or trying to remember a PIN, you can use a biometric option such as your palm or facial recognition. In Chapter 3, we will walk through a code example to build a contactless hotel check-in system using identity verification.
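
As a rough preview of what identity verification can look like in code, the following sketch compares a photo from an ID document with a selfie using the Amazon Rekognition CompareFaces API through boto3. It is a minimal sketch, not the implementation from Chapter 3: the function name verify_guest, its parameters, and the 90.0 similarity threshold are assumptions made for illustration.

```python
def verify_guest(rekognition, id_photo, selfie, threshold=90.0):
    """Return the best similarity score (0-100) if the faces match, else None.

    Assumptions: `rekognition` is a boto3 Rekognition client, for example
    boto3.client("rekognition"), and `id_photo` and `selfie` are raw image
    bytes. The 90.0 threshold is illustrative, not a recommendation.
    """
    response = rekognition.compare_faces(
        SourceImage={"Bytes": id_photo},
        TargetImage={"Bytes": selfie},
        SimilarityThreshold=threshold,
    )
    # Only faces at or above the threshold are returned in FaceMatches.
    matches = response.get("FaceMatches", [])
    if not matches:
        return None
    return max(match["Similarity"] for match in matches)
```

Passing the client in as a parameter keeps the function easy to test with a stub, and leaves the choice of region and credentials to the caller.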

Video analysis

You can use CV to analyze videos to detect objects in real time. This helps gather analytics for security footage and helps ensure compliance requirements are met in a manufacturing facility. In the media and entertainment industry, companies can monetize their content by automating the analysis of videos to determine when to insert advertisements. In Chapter 5, we will use CV to annotate and automate security video footage.

Content moderation

The amount of digital content is increasing. Often, this content is moderated manually by human reviewers, which is neither scalable nor cost-effective. Companies in the gaming, social media, financial services, and healthcare industries are looking to protect their brands, create safe online communities that improve the user experience, meet regulatory and compliance requirements, and reduce the cost of content moderation. CV services combined with additional AI services, such as NLP, can automatically moderate image, video, text, and audio workflows to detect unwanted or offensive content and protect sensitive information. In Chapter 6, we will teach you how to incorporate these capabilities into an automated pipeline.
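
For a sense of how automated image moderation might be called from code, here is a minimal sketch that uses the Amazon Rekognition DetectModerationLabels API through boto3. The function name flag_unsafe_labels and the 60.0 confidence cutoff are assumptions made for illustration, and the pipeline built in Chapter 6 may be structured differently.

```python
def flag_unsafe_labels(rekognition, image_bytes, min_confidence=60.0):
    """Return (label name, confidence) pairs for content flagged in an image.

    Assumptions: `rekognition` is a boto3 Rekognition client, for example
    boto3.client("rekognition"), and `image_bytes` is the raw image.
    The 60.0 cutoff is illustrative and should be tuned per use case.
    """
    response = rekognition.detect_moderation_labels(
        Image={"Bytes": image_bytes},
        MinConfidence=min_confidence,
    )
    return [
        (label["Name"], label["Confidence"])
        for label in response.get("ModerationLabels", [])
    ]
```

An empty list means nothing was flagged at the chosen confidence level, so a caller could publish the image automatically and route anything else to a human reviewer.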

CV at the edge

CV at the edge allows you to run your models locally on an edge device to make real-time predictions and reduce latency. Many ML use cases require that models run on the edge. To meet privacy preservation standards, users’ data may need to be kept directly on devices such as mobile phones, smart cameras, and smart speakers. Your devices may also be running in places with limited connectivity, such as oil rigs, or even in space, where it is impractical to send your data to the cloud to perform inference. Consultancy firm Deloitte estimates that today there are over 750 million AI devices, and that number is only continuing to grow. What types of use cases can CV solve at the edge? On a manufacturing floor, camera streams can feed multiple models that detect issues in product quality and alert maintenance teams when equipment defects are identified. CV at the edge also has applications in healthcare. CV models can be deployed on X-ray machines and in operating rooms to quickly process medical images, which helps with faster patient diagnosis. In Chapters 7 and 8, we’ll dive deeper into CV at the edge and provide code examples to solve industrial Internet of Things (IoT) scenarios and defect detection.

In this section, we introduced transformers and discussed their impact on CV. We also covered common challenges that can be solved with CV across multiple industries. These use cases are not an exhaustive list and represent only a small sample of how CV can unlock meaningful insights from your content and accelerate your business outcomes. In the next section, we will introduce the AWS AI/ML services and the benefits of using these services in your downstream applications.