Mobile Deep Learning with TensorFlow Lite, ML Kit and Flutter

By: Anubhav Singh, Rimjhim Bhadani

Overview of this book

Deep learning is rapidly becoming one of the most popular topics in the mobile app industry. This book introduces trending deep learning concepts and their use cases with an industrial and application-focused approach. Through eight projects covering tasks such as mobile vision, facial recognition, smart AI assistants, and augmented reality, you will learn how to integrate deep learning into the iOS and Android mobile platforms and transform deep learning features into robust mobile apps efficiently. You'll get hands-on experience of selecting the right deep learning architectures and optimizing mobile deep learning models, while following an application-oriented approach to deep learning on native mobile apps. The book covers various pre-trained and custom-built deep learning model-based APIs, such as ML Kit through Firebase, and then takes you through examples of creating custom deep learning models with TensorFlow Lite. Each project demonstrates how to integrate deep learning libraries into your mobile apps, right from preparing the model through to deployment. By the end of this book, you'll have mastered the skills to build and deploy deep learning mobile applications on both iOS and Android.

Growth of AI-powered mobile devices

AI is becoming more mobile than it used to be, as smaller devices are packed with ever more computational power. Mobile devices, once used simply to make phone calls and send text messages, have been transformed into smartphones with the introduction of AI. These devices are now capable of leveraging the ever-increasing power of AI to learn user behavior and preferences, enhance photographs, carry out full-fledged conversations, and much more. The capabilities of AI-powered smartphones are only expected to grow day by day. According to Gartner, by 2022, 80% of smartphones will be AI-enabled.

Changes in hardware to support AI

To cope with the high computational demands of AI, there have been regular changes and enhancements to the hardware of cellphones to give them the ability to think and act. Mobile manufacturers have been constantly upgrading the hardware of their devices to provide a seamless and personalized user experience.

Huawei has launched the Kirin 970 SoC, which enables on-device AI experiences using a dedicated neural network processing unit. Apple devices are fitted with an AI chip called the Neural Engine, part of the A11 Bionic chip, which is dedicated to machine learning and deep learning tasks such as facial and voice recognition, recording Animoji, and object detection while capturing a picture. Qualcomm and MediaTek have released their own chips that enable on-device AI solutions. The Exynos 9810, announced by Samsung, is a neural network-capable chip, like Qualcomm's Snapdragon 845. Samsung's 2018 devices, the Galaxy S9 and S9+, included one of these chips depending on the country where they were marketed. With the Galaxy S9, the company made it evident that it would integrate AI to improve the device's camera and to translate text in real time. The latest Samsung Galaxy S10 series is powered by the Qualcomm Snapdragon 855 to support on-device AI computations.

Samsung's real-time translation feature was developed using Google Translate's Word Lens and the Bixby personal assistant. With these technologies in place, the device can translate up to 54 languages. The phones, which are smart enough to switch between f/2.4 and f/1.5 sensor apertures, are well suited to capturing photographs in low-light conditions. The Google Pixel 2 leverages the power of machine learning to integrate eight image processing units using its coprocessor, the Pixel Visual Core.

Why do mobile devices need to have AI chips?

The incorporation of AI chips has not only helped achieve greater efficiency and computational power, but has also preserved user data and privacy. The advantages of including AI chips in mobile devices are as follows:

  • Performance: The CPUs of today's mobile devices are unsuited to the demands of machine learning. Attempting to deploy machine learning models on these devices often results in slow service and faster battery drain, and hence a bad user experience, because general-purpose CPUs lack the efficiency to perform the enormous number of small calculations that AI computations require. AI chips, somewhat similar to Graphics Processing Units (GPUs), the chips responsible for handling graphics on devices, provide a separate space in which to perform calculations exclusively related to machine learning and deep learning, leaving the CPU free to focus on other important tasks. With the incorporation of specialized AI hardware, the performance and battery life of devices have improved.
  • User privacy: The hardware also increases the safety of the user's privacy and security. In traditional mobile devices, data analysis and machine learning processes required chunks of the user's data to be sent to the cloud, posing a threat to data privacy and device security. With on-device AI chips in action, all of the required analysis and calculation can be performed offline, on the device itself (see the inference sketch just after this list). This incorporation of dedicated hardware in mobile devices has tremendously reduced the risk of a user's data being hacked or leaked.
  • Efficiency: In the real world, tasks such as image recognition and processing can be much faster with the incorporation of AI chips. Huawei's neural network processing unit is a well-suited example here: it can recognize images at a rate of 2,000 pictures per minute, which the company claims is 20 times faster than a standard CPU. When working with 16-bit floating-point numbers, it can perform 1.92 teraflops, that is, 1.92 trillion floating-point operations per second. Apple's Neural Engine can handle around 600 billion operations per second.
  • Economy: On-device AI chips reduce the need to send data off to the cloud. This empowers users to access services offline and save on data, and it spares developers from paying for servers to process every request. This is advantageous to users as well as developers.
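
Since on-device inference is central to these advantages, the following is a minimal sketch of running a model with the TensorFlow Lite Interpreter in Python, the same runtime that the Android and iOS bindings wrap. The model path and the zeroed input are placeholders for illustration:

    import numpy as np
    import tensorflow as tf  # assumes TensorFlow 2.x is installed

    # Load a TensorFlow Lite model; "model.tflite" is a placeholder path.
    interpreter = tf.lite.Interpreter(model_path="model.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Fabricate an input matching the model's expected shape and dtype.
    input_data = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
    interpreter.set_tensor(input_details[0]["index"], input_data)

    # Inference runs entirely on the device -- no data leaves it.
    interpreter.invoke()
    prediction = interpreter.get_tensor(output_details[0]["index"])
    print(prediction)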

Let's look at a brief overview of how AI on mobile devices has impacted the way we interact with our smartphones.

Improved user experience with AI on mobile devices

The use of AI has greatly enhanced the user experience on mobile devices. These improvements can be broadly grouped into the following categories.

Personalization

Personalization primarily means modifying a service or product to suit a specific individual's preferences, or sometimes those of a cluster of individuals. On mobile devices, AI improves the user experience by making the device and its apps adapt to a user's habits and unique profile, rather than offering generic, one-size-fits-all applications. AI algorithms on mobile devices leverage available user-specific data, such as location, purchase history, and behavior patterns, to predict and personalize present and future interactions, such as a user's preferred activity or music at a particular time of day.

For instance, AI collects data on the user's purchase history and combines it with other data obtained from online traffic, mobile devices, sensors embedded in electronic devices, and vehicles. This compiled data is then used to analyze the user's behavior and allow brands to take the actions needed to raise user engagement. Users can therefore leverage the benefits of AI-empowered applications to get personalized results, which reduces their scrolling time and lets them explore more products and services.

The best examples out there are the recommendation systems running on shopping platforms such as Walmart and Amazon, and on media platforms such as YouTube and Netflix.

In 2011, Amazon reported a 29% increase in sales to $12.83 billion, up from $9.9 billion. Thanks to its highly successful recommendation engine, 35% of Amazon's sales come from customers who followed the recommendations it generated.
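
A minimal, hypothetical sketch of the content-based flavor of such recommendation engines: items are ranked by cosine similarity between their feature vectors and a taste vector averaged from the user's history. The three feature dimensions here are invented for illustration:

    import numpy as np

    def recommend(user_history, catalog, top_k=3):
        """Rank catalog items by cosine similarity to the mean of the
        user's past interactions (a crude "taste vector")."""
        profile = np.mean(user_history, axis=0)
        norms = np.linalg.norm(catalog, axis=1) * np.linalg.norm(profile)
        scores = catalog @ profile / np.where(norms == 0, 1, norms)
        return np.argsort(scores)[::-1][:top_k]  # indices of best matches

    # Toy data: rows are feature vectors of items the user engaged with.
    history = np.array([[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]])
    catalog = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.8, 0.3, 0.1]])
    print(recommend(history, catalog))  # most similar items first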

Virtual assistants

A virtual assistant is an application that understands voice commands and completes tasks for the user. Virtual assistants interpret human speech using Natural Language Understanding (NLU) and generally respond via synthesized voices. You might use a virtual assistant for nearly all of the tasks a real personal assistant would do for you: making calls on your behalf, taking notes that you dictate, turning the lights in your home or office on or off via home automation, playing music, or even simply chatting with you about any topic you'd like! A virtual assistant might take commands in the form of text, audio, or visual gestures. Virtual assistants adapt to user habits over time and get smarter.

Leveraging the power of NLP, a virtual assistant can recognize commands from spoken language and identify people and pets in images that you upload to your assistant or keep in any online album accessible to it.

The most popular virtual assistants on the market right now are Amazon's Alexa, Google Assistant, Apple's Siri, Microsoft's Cortana, and Bixby, which runs on Samsung devices. Some virtual assistants are passive listeners and respond only when they receive a specific wake-up command. For example, Google Assistant can be activated using "Hey Google" or "OK Google", and can then be commanded to switch off the lights with "Switch off the bedroom lights" or to call a person from your contacts list with "Make a call to <contact name>". At Google I/O '18, Google unveiled Duplex, a phone-calling reservation AI, demonstrating that Google Assistant would not only be capable of making a call but could also carry on a conversation and potentially book a reservation at a hair salon all by itself.
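
Production assistants use trained NLU models for this step, but a toy rule-based parser illustrates how an utterance is mapped to an intent and its arguments. The patterns below simply mirror the example commands above and are purely illustrative:

    import re

    # Each pattern maps a transcribed utterance to a hypothetical intent.
    INTENTS = [
        (re.compile(r"switch (on|off) the (.+)"), "set_light"),
        (re.compile(r"make a call to (.+)"), "call_contact"),
        (re.compile(r"play (.+)"), "play_music"),
    ]

    def parse(utterance):
        utterance = utterance.lower().strip()
        for pattern, intent in INTENTS:
            match = pattern.search(utterance)
            if match:
                return intent, match.groups()
        return "unknown", ()

    print(parse("Switch off the bedroom lights"))  # ('set_light', ('off', 'bedroom lights'))
    print(parse("Make a call to Alice"))           # ('call_contact', ('alice',))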

The use of virtual assistants is growing exponentially and is expected to reach 1.8 billion users by 2021. 54% of users agreed that virtual assistants help make daily tasks simpler, and 31% already use assistants in their daily lives. Additionally, 64% of users take advantage of virtual assistants for more than one purpose.

Facial recognition

Facial recognition is the technology capable of identifying or verifying a face, or understanding a facial expression, from digital images and videos. Such a system generally works by comparing the most prominent facial features in a given image with the faces stored in a database. Facial recognition can also learn the patterns and variations in an individual's facial texture and shape to recognize a person uniquely, and it is often described as a biometric, AI-based application.

Initially, facial recognition was a desktop computer application; recently, however, it has come to be widely used on mobile platforms. Facial recognition, alongside biometrics such as fingerprint and iris recognition, commonly features in the security systems of mobile devices. Generally, facial recognition is performed in two steps: feature extraction and selection first, and classification second. Later developments have introduced several other methods, such as three-dimensional recognition, skin texture analysis, and thermal cameras.
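
A minimal sketch of the classification step, assuming a model such as FaceNet has already converted each face into a fixed-length embedding: identification then reduces to a nearest-neighbor search under a distance threshold. The dimensions, enrollment database, and threshold are all illustrative:

    import numpy as np

    def identify(probe, database, threshold=0.9):
        """Return the enrolled identity closest to the probe embedding,
        or None if no stored face is within the distance threshold."""
        best_name, best_dist = None, float("inf")
        for name, stored in database.items():
            dist = np.linalg.norm(probe - stored)
            if dist < best_dist:
                best_name, best_dist = name, dist
        return best_name if best_dist < threshold else None

    # Hypothetical 128-D embeddings standing in for real model output.
    rng = np.random.default_rng(0)
    db = {"alice": rng.normal(size=128), "bob": rng.normal(size=128)}
    probe = db["alice"] + rng.normal(scale=0.01, size=128)  # a near-match
    print(identify(probe, db))  # alice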

Face ID, introduced with Apple's iPhone X, is the biometric authentication successor to Apple's fingerprint-based Touch ID, an authentication approach still found on several Android-based smartphones. The facial recognition sensor of Face ID consists of two parts: a Romeo module and a Juliet module. The Romeo module projects more than 30,000 infrared dots onto the user's face, while its counterpart, the Juliet module, reads the pattern the dots form. The pattern is then sent to the on-device Secure Enclave in the device's CPU to confirm whether the face matches the owner's. Apple cannot directly access these facial patterns. As an added layer of security, the system does not authorize access when the user's eyes are closed.

The technology learns from changes in a user's appearance, and so works with makeup, beards, spectacles, sunglasses, and hats. It also works in the dark: the Flood Illuminator, a dedicated infrared flash, projects invisible infrared light onto the user's face so that the facial points can be read properly, helping the system function in low light or even complete darkness. In contrast to iPhones, Samsung devices primarily rely on two-dimensional facial recognition, accompanied by an iris scanner that serves as biometric authentication in the Galaxy Note 8. OnePlus, the leading premium smartphone seller in India, also depends on two-dimensional facial recognition only.

The global market for facial recognition software is expected to grow from $3.85 billion in 2017 to $9.78 billion by 2023. The Asia Pacific region, which holds around 16% of the market share, is the fastest growing.

AI-powered cameras

The integration of AI into cameras has empowered them to recognize, understand, and enhance scenes and photographs. AI cameras understand and control the various parameters of the camera using a digital image processing technique called computational photography, which uses algorithms, rather than optical processes, to identify and improve the contents of a picture via machine vision. These cameras use deep learning models trained on huge datasets of images, comprising several million samples, to automatically identify the scene, the availability of light, and the angle of the scene being captured.

When the camera is pointed in the right direction, the camera's AI algorithms take over and change its settings to produce the best quality image. Under the hood, the system that enables AI-powered photography is not simple. The models used are highly optimized to produce the correct camera settings almost in real time upon detecting the features of the scene to be captured. They may also apply dynamic exposure, color adjustments, and the best possible effect for the image. Sometimes, images may be postprocessed automatically by the AI models, rather than processed while the photograph is being taken, in order to reduce the computational overhead on the device.
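
The settings-selection step can be imagined as a mapping from a scene classifier's output to a preset of camera parameters. The labels and settings in this toy sketch are invented and do not reflect any vendor's actual values:

    # Hypothetical presets keyed by scene label.
    SCENE_SETTINGS = {
        "low_light": {"iso": 1600, "exposure_ms": 100, "white_balance": "warm"},
        "landscape": {"iso": 100, "exposure_ms": 8, "white_balance": "daylight"},
        "portrait": {"iso": 200, "exposure_ms": 16, "white_balance": "auto"},
    }

    def configure_camera(scene_probs):
        """Pick settings for the most probable scene the model predicts."""
        scene = max(scene_probs, key=scene_probs.get)
        return scene, SCENE_SETTINGS.get(scene, SCENE_SETTINGS["portrait"])

    # Probabilities as a scene-classification model might emit them.
    print(configure_camera({"low_light": 0.7, "landscape": 0.2, "portrait": 0.1}))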

Nowadays, mobile devices are generally equipped with dual-lens cameras, which use two lenses to add the bokeh effect (bokeh is Japanese for "blur") to pictures. The bokeh effect blurs the background around the main subject, making the image aesthetically pleasing. AI-based algorithms assist in simulating the effect by identifying the subject and blurring the remaining portion of the frame, producing portrait effects.
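
A minimal sketch of the bokeh idea, assuming a segmentation model has already produced a binary subject mask: pixels under the mask stay sharp while the rest are blurred. A naive box blur stands in for the aesthetic lens blur a real pipeline would use:

    import numpy as np

    def bokeh(image, mask, radius=5):
        """Blur the background of a grayscale image (H x W), keeping
        pixels where mask == 1 (the subject) sharp."""
        h, w = image.shape
        blurred = np.empty_like(image, dtype=float)
        for y in range(h):
            for x in range(w):
                y0, y1 = max(0, y - radius), min(h, y + radius + 1)
                x0, x1 = max(0, x - radius), min(w, x + radius + 1)
                blurred[y, x] = image[y0:y1, x0:x1].mean()
        return np.where(mask == 1, image, blurred)

    # Toy image with the "subject" in the center of the mask.
    img = np.random.rand(32, 32)
    mask = np.zeros((32, 32), dtype=int)
    mask[8:24, 8:24] = 1
    result = bokeh(img, mask)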

The Google Pixel 3 camera offers two AI-driven shooting modes, Top Shot and Photobooth. In Top Shot, the camera captures several frames before and after the moment that the user is attempting to capture, and the AI models available on the device then pick the best frame. This is made possible by the vast amount of training provided to the camera's image recognition system, which is able to select the best-looking pictures almost as if a human were picking them. Photobooth mode allows the user to simply hold the device toward a scene of action; images are then taken automatically at the moments the camera predicts to be picture-perfect.

Predictive text

Predictive text is an input technology, generally used in messaging applications, that suggests words to the user depending on the words and phrases currently being entered. The prediction after each keypress is unique, rather than a repeated sequence of letters in the same fixed order. Predictive text can allow an entire word to be input with a single keypress, which can significantly speed up typing: tasks such as writing a text message, composing an email, or making an entry in an address book become highly efficient with fewer keypresses. A predictive text system adapts to the user's preferred interface style and their learned ability to operate the software, and it eventually gets smarter by analyzing and adapting to the user's language. The T9 dictionary is a good example of such a text predictor: it analyzes the frequency of the words used and offers the most probable candidates, and it is also capable of considering combinations of words.
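
A minimal frequency-based sketch in the spirit of such predictors: a bigram table counts which word tends to follow which, and suggestions are simply the most frequent successors. Real keyboards use far richer language models:

    from collections import Counter, defaultdict

    class Predictor:
        def __init__(self):
            self.following = defaultdict(Counter)

        def train(self, text):
            words = text.lower().split()
            for prev, nxt in zip(words, words[1:]):
                self.following[prev][nxt] += 1  # count successor frequency

        def suggest(self, word, k=3):
            return [w for w, _ in self.following[word.lower()].most_common(k)]

    p = Predictor()
    p.train("see you soon . see you tomorrow . see you later")
    print(p.suggest("you"))  # ['soon', 'tomorrow', 'later']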

QuickType is a predictive text feature announced by Apple with its iOS 8 release. It uses machine learning and NLP to build custom dictionaries based on the user's typing habits, which are later used for predictions. These predictions also depend on the context of the conversation, and the system is capable of distinguishing between formal and informal language. Additionally, it supports multiple languages around the world, including U.S. English, U.K. English, Canadian English, Australian English, French, German, Italian, Brazilian Portuguese, Spanish, and Thai.

Google has also introduced a feature to help users compose and send emails faster than before. The feature, called Smart Compose, understands the text being typed so that the AI can suggest words and phrases to finish sentences. Smart Compose helps users save time while writing emails by correcting spelling mistakes and grammatical errors, and by suggesting the words the user most commonly types. Smart Reply is another feature, similar to the reply suggestions in LinkedIn messaging, which suggests replies that can be sent with a single click, according to the context of the email the user received. For example, if the user receives an email congratulating them on an accepted application, Smart Reply is likely to offer options such as "Thank you!", "Thanks for letting me know," and "Thank you for accepting my application." The user can then click the preferred reply and send a quick response.
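
A toy version of the reply-suggestion idea: keyword cues in the received message select canned responses. The real feature ranks candidate replies with a learned model; the cues and replies below are invented, apart from those quoted above:

    # Hypothetical cue -> canned replies table.
    REPLY_BANK = {
        "congrat": ["Thank you!", "Thanks for letting me know.",
                    "Thank you for accepting my application."],
        "meeting": ["Works for me.", "Can we reschedule?", "See you then."],
        "thanks": ["You're welcome!", "Happy to help.", "Anytime."],
    }

    def suggest_replies(email_text):
        text = email_text.lower()
        for cue, replies in REPLY_BANK.items():
            if cue in text:
                return replies
        return ["Got it.", "Thanks!", "I'll get back to you."]

    print(suggest_replies("Congratulations, your application has been accepted!"))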

In the 1940s, Lin Yutang created a typewriter whose actuating keys suggested the characters that might follow the ones already selected.

Most popular mobile applications that use AI

In recent times, we have seen a great surge in the number of applications incorporating AI into their features for increased user engagement and customized service delivery. In this section, we will briefly discuss how some of the largest players in the domain of mobile apps have leveraged the benefits of AI to boost their business.

Netflix

The best and most popular example of machine learning in mobile apps is Netflix. The application uses linear regression, logistic regression, and other machine learning algorithms to provide the user with a perfectly personalized recommendation experience. Content classified by actor, genre, length, review, year, and more is used to train the machine learning algorithms, which learn and adapt to the user's actions, choices, and preferences. For example, suppose John watched the first episode of a new television series but didn't really like it, so he won't watch the subsequent episodes. The recommendation systems in Netflix understand that he does not like TV shows of that kind and remove them from his recommendations. Similarly, if John picked the eighth recommendation in the list, or wrote a bad review after watching a movie trailer, the algorithms adapt to his behavior and preferences to provide extremely personalized content.
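
Since the text mentions logistic regression, here is a minimal sketch of a "will the user keep watching?" predictor trained with plain gradient descent. The features and data are entirely made up for illustration and are not Netflix's actual signals:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train(X, y, lr=0.1, epochs=500):
        """Fit logistic regression by gradient descent on the log loss."""
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(epochs):
            p = sigmoid(X @ w + b)       # predicted watch probability
            grad = p - y                 # gradient of the log loss
            w -= lr * X.T @ grad / len(y)
            b -= lr * grad.mean()
        return w, b

    # Hypothetical features: [genre match, length fit, average rating].
    X = np.array([[0.9, 0.8, 0.7], [0.1, 0.4, 0.3],
                  [0.8, 0.9, 0.9], [0.2, 0.1, 0.5]])
    y = np.array([1, 0, 1, 0])           # 1 = kept watching, 0 = dropped
    w, b = train(X, y)
    print(sigmoid(X @ w + b).round(2))   # higher scores for shows like those finished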

Seeing AI

Seeing AI, developed by Microsoft, is an intelligent camera app that uses computer vision to audibly help blind and visually impaired people understand their surroundings. It comes with functionalities such as reading out short text and documents, describing a person, and identifying currencies, colors, handwriting, light, and even images in other apps, all using the device's camera. To make the app this advanced and responsive in real time, the developers have the app communicate with servers running Microsoft Cognitive Services. OCR, barcode scanning, facial recognition, and scene recognition are the most powerful technologies brought together by the application to provide users with this collection of functionalities.

Allo

Allo was an AI-centric messaging app developed by Google. As of March 2019, Allo has been discontinued; however, it was an important milestone in the journey of AI-powered apps at Google. The application allowed users to perform actions on their Android phones via their voice. It used Smart Reply, a feature that suggested words and phrases by analyzing the context of the conversation. The application was not limited to text: it was equally capable of analyzing images shared during a conversation and suggesting replies, made possible by powerful image recognition algorithms. Later, this Smart Reply feature was also implemented in Google Inbox and is now present in the Gmail app.

English Language Speech Assistant

English Language Speech Assistant (ELSA), rated among the top five AI-based applications, is billed as the world's smartest AI pronunciation tutor. The mobile application helps people improve their pronunciation and is designed as an adventure game differentiated by levels. Each level presents a set of words for the user to pronounce, which is taken as input. The user's response is examined carefully to point out mistakes and help them improve. When the application detects a wrong pronunciation, it teaches the user the correct one by explaining the correct movements of the lips and tongue needed to say the word correctly.
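
A toy scoring sketch, assuming the app has already transcribed both the learner's attempt and the target word into phoneme sequences: the edit distance between the two yields a simple pronunciation score. The phoneme labels here are illustrative:

    def edit_distance(a, b):
        """Levenshtein distance between two phoneme sequences."""
        dp = list(range(len(b) + 1))
        for i, pa in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, pb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (pa != pb))
        return dp[-1]

    # Hypothetical phoneme sequences: the learner's attempt vs. the target.
    target = ["W", "ER", "L", "D"]
    attempt = ["W", "OW", "L", "D"]
    errors = edit_distance(attempt, target)
    score = 1 - errors / max(len(target), len(attempt))
    print(f"pronunciation score: {score:.2f}")  # 0.75 -- one substituted phoneme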

Socratic

Socratic, a tutoring application, allows a user to take pictures of mathematical problems and returns answers explaining the theory behind them, with details of how they should be solved. The application is not limited to mathematics: it can currently help a user in 23 subjects, including English, physics, chemistry, history, psychology, and calculus. Using the power of AI to analyze the required information, the application returns videos with step-by-step solutions. The application's algorithm, combined with computer vision technology, can read questions from images. Furthermore, it uses machine learning classifiers trained on millions of sample questions to accurately predict the concepts involved in solving a question.
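
A toy classifier in this spirit: a bag-of-words score against a handful of made-up subject vocabularies picks the most likely subject. A production system would instead use classifiers trained on millions of real questions:

    from collections import Counter

    # Invented training snippets; each subject gets a word-frequency model.
    TRAINING = {
        "calculus": "derivative integral limit slope tangent rate of change",
        "chemistry": "mole reaction acid base element compound bond",
        "history": "war treaty empire revolution century king independence",
    }
    MODELS = {subject: Counter(text.split()) for subject, text in TRAINING.items()}

    def classify(question):
        words = question.lower().split()
        scores = {s: sum(counts[w] for w in words) for s, counts in MODELS.items()}
        return max(scores, key=scores.get)

    print(classify("Find the derivative and the slope of the tangent"))  # calculus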

Now, let's take a deeper look at machine learning and deep learning.