
Voice User Interface Projects

By Henry Lee
About this book
From touchscreens and mouse clicks, we are moving to voice- and conversation-based user interfaces. By adopting Voice User Interfaces (VUIs), you can create a more compelling and engaging experience for your users. Voice User Interface Projects teaches you how to develop voice-enabled applications for desktop, mobile, and Internet of Things (IoT) devices. This book explains in detail what VUIs are and why they are important, the basic design principles of VUIs, the fundamentals of conversation, and the different voice-enabled applications available in the market. You will learn how to build your first voice-enabled application using Dialogflow and Alexa's natural language processing (NLP) platforms. Once you are comfortable building voice-enabled applications, you will learn how to dynamically process and respond to questions using a Node.js server deployed to the cloud. You will then move on to securing the Node.js RESTful API for Dialogflow and Alexa webhooks, creating unit tests, and building voice-enabled podcasts for cars. Finally, you will discover advanced topics such as handling sessions, creating custom intents, and extending built-in intents in order to build conversational VUIs that engage users. By the end of the book, you will have a thorough understanding of how to design and develop interactive VUIs.
Publication date: July 2018
Publisher: Packt
Pages: 404
ISBN: 9781788473354

 

Introduction

In the future, user interfaces will move away from touch-based and mouse-driven web interfaces to voice- and conversation-based user interfaces. On average, a person speaks about 20,000 words a day (http://bit.ly/2GYZ1du), the equivalent of reading half a book, and voice commands are five times faster to process (http://hci.stanford.edu/research/speech/index.html) than web user interfaces or typing. This efficiency stems from the fact that speaking is natural to everyone. Technological leaps in natural language processing (NLP), thanks to machine learning (ML) and artificial intelligence (AI), together with increased computing power, have made it possible to accurately understand and process complex human voices and speech patterns. This book will teach developers how to create voice-enabled applications using NLP platforms such as Google's Dialogflow and Amazon's Alexa Skills Kit, and deploy them to personal digital assistant devices, such as Google Home, Amazon Echo, and Google's Android Auto.

This chapter covers the history of voice user interfaces (VUIs) and discusses what VUIs are. It will help you understand the roles NLP platforms play in the development of VUIs. The chapter also introduces modern voice-enabled applications such as chatbots, personal digital assistant devices, and automobile virtual assistant systems, and how VUIs can be beneficial to such applications. You will then learn about user experience and design principles to deliver compelling voice-enabled applications to consumers. Finally, this chapter ends by looking at both current and predicted future capabilities of digital personal assistant devices.

This chapter will cover the following topics:

  • History of VUIs
  • Basic concepts of VUIs
  • NLP platforms
  • Benefits of VUIs
  • Chatbots, Amazon Echo, Google Home, and automobile virtual assistant systems
  • Best practices, design principles, and user experiences when creating VUIs
  • Current and predicted future capabilities of digital personal assistant devices
 

Technological advancement of VUIs

In 1952, at Bell Labs, the engineers Davis, Biddulph, and Balashek built the Automatic Digit Recognizer (Audrey), a rudimentary voice recognition system. Audrey was limited by the technology of the time but was able to recognize the digits 0 to 9. The Audrey system, which processed the ten digits through voice recognition, stood 6 feet tall and lined the walls of Bell Labs, housing a large number of analog circuits with capacitors, amplifiers, and filters. Audrey did the following three things:

  • It took the user's voice as input and loaded it into the machine's memory, where the voice input was classified by pattern matching against predefined voice classes for the digits 0 to 9; the identified digit was then stored in memory.
  • It flashed a light that represented the matching number.
  • It was also able to communicate selected digits over the telephone.

Audrey performed what's known today as NLP, using ML with AI.

Although Audrey recognized voice input with an accuracy of 97% to 99%, it was very expensive and physically large, and its complex electronics were extremely difficult to maintain. Thus, Audrey could not be commercialized. However, since Audrey's inception, voice technology and research have continued to leap forward.

First-generation VUIs

The big break came in 1984, when SpeechWorks and Nuance introduced interactive voice response (IVR) systems. IVR systems were able to recognize human voices over the telephone and carry out tasks given to them (Roberto Pieraccini and Lawrence Rabiner 2012, The Voice in the Machine: Building Computers That Understand Speech). You will recognize IVR systems today when you call major companies for support. For example, when you call to make a hotel reservation, you will be familiar with "Press 1 or say reservation, Press 2 or say check reservation, Press 3 or say cancel reservation, Press # or say main menu." In the '90s, I remember working on my first VUIs in an IVR system. To develop the IVR system, I had to work with the Microsoft Speech API (SAPI), http://bit.ly/2osJpdM. With SAPI, the voice received from the user was translated into text in order to evaluate the user's intent; then, after evaluating the user's intent, a text message was created and converted back to voice using text to speech (TTS) to relay the message to the user over the telephone.

Boom of VUIs

In order to really appreciate the emergence of voice technology, let's first look at the year 2005. In 2005, Web 2.0 contributed to a dramatic increase in the volume of data. This increase brought about the creation of Hadoop and big data in order to meet the demand for storing, processing, and understanding data. Big data helped advance data analytics, ML, and AI in order to identify patterns in data in business contexts. The same techniques used for big data, namely ML and AI, have helped advance NLP to recognize speech patterns, fueling VUIs. The Web 2.0 big data boom kick-started the boom in the use of VUIs on smartphones, in the home, and in automobiles.

History of VUIs on mobile devices

In 2006, Apple introduced the concept of Siri, which allows users to interact with machines using their voice. In 2007, Google followed Apple and introduced voice search. In 2011, Apple finally brought the Siri concept to reality by integrating Siri into iOS and the iPhone. Unfortunately, with Steve Jobs' death that same year, voice innovation at Apple slowed down, allowing others, such as Google and Amazon, to catch up. In 2015, Microsoft introduced Cortana for the Windows 10 operating system and smartphones (refer to the following screenshot). In 2016, Google introduced Google Assistant (refer to the following screenshot) for mobile devices. Later, from Chapter 3, Building a Fortune Cookie Application, to Chapter 5, Deploying the Fortune Cookie App to Google Home, you will learn how to create voice assistant applications for mobile devices. One of the major advantages of writing applications for Google Assistant is that the same applications you write for Google Assistant can also be deployed to Google Home.

The following illustration depicts screenshots of the mobile voice assistants Cortana, Siri, and Google Assistant:

Mobile voice assistants—Cortana, Siri, and Google Assistant

History of VUIs for Google Home

In 2014, Amazon introduced Amazon Echo (refer to the following screenshot), the first VUI device designed for consumers' homes. In 2016, Google released Google Home (refer to the following screenshot). In 2017, Amazon and Google continued to compete against each other in the consumer marketplace with the Amazon Echo and Google Home devices. The competition between Amazon and Google with these home devices shares similarities with the competition between Apple's iPhone and Google's Android. Currently, these home devices lack third-party applications that consumers can use and, as such, huge start-up and entrepreneurial opportunities exist. Remember Angry Birds for iPhone and Android? What could be the next big hit in this untapped marketplace? Later, from Chapter 3, Building a Fortune Cookie Application, through Chapter 8, Migrating the Alexa Cooking Application to Google Home, you will learn how to develop applications for Amazon Echo and Google Home devices.

The following photo shows Amazon Echo:

Amazon Echo

The following is a photo of Google Home:

Google Home

History of VUIs in cars

In 2007, Microsoft partnered with Ford and integrated the Microsoft Sync Framework, giving drivers hands-free interaction with the car's features. In 2013, Apple introduced CarPlay for cars, but only a limited number of car manufacturers were willing to adopt it (https://www.apple.com/ios/carplay/). On the other hand, in 2018, major car manufacturers adopted Android Auto (https://www.android.com/auto/), because Android Auto is based on the Android operating system and already has a huge developer ecosystem in the Android marketplace. Later, in Chapter 9, Building a Voice Enabled Podcast for the Car, and Chapter 10, Hosting and Enhancing the Android Auto Podcast, you will learn how to create your own podcast and stream your own content to cars through car dashboards that support Android Auto.

The following photo shows the voice assistant from Apple's CarPlay:

Apple CarPlay

The following screenshot shows Android Auto:

Android Auto
 

Basic design fundamentals of VUIs

There is a plethora of VUI design principles scattered across the internet and in books. In this section, I have aggregated the most common and most important topics, which are useful when developing VUIs. Before diving into these topics, let's first discuss what VUIs are and why they are important.

What are VUIs and why are they important?

VUIs allow you to interact with machines through conversation. If you have ever used the voice-enabled features of Apple's Siri or Google Assistant, you will be well aware of the capabilities they provide, such as asking about today's weather, getting the directions to nearby restaurants, or asking about last night's football scores for your favorite team.

Why are VUIs important? Because user interfaces are moving away from touch-based interaction and toward conversational interfaces. According to Dr. Mehl from the University of Arizona, on average, humans speak anywhere between 15,000 and 47,000 words a day, the equivalent of reading half a book! (http://bit.ly/29hmvcr) The importance and ubiquity of the spoken word in our daily lives will transform next-generation user interfaces from touch-based interfaces into VUIs.

Let's look at the advantages of VUIs:

  • Speed: Processing voice inquiries is 5 to 10 times faster than typing a query into a browser and searching with engines such as Google or Yahoo.
  • Intuitive: Every day, people engage in conversations with other people, and simply asking questions is something that everyone can do without having to learn a new skill.
  • Hands-free: VUIs can eliminate the need for touchscreens. For example, while driving or cooking, you can interact with an application with your voice.
  • Personal: The ability to engage with machines through conversation brings a sense of closeness and almost human-like interactions. This can be a good thing when you want to engage users on a more personal level.

Role of NLP in VUIs

In 1950, Alan Turing published his famous paper entitled Computing Machinery and Intelligence. The paper proposed a way to test whether machines are artificially intelligent. Turing stated that if a machine could trick a human into thinking that they were talking to another human, then that machine is artificially intelligent. Today, this test is known as the Turing Test (https://plato.stanford.edu/entries/turing-test/). In order to pass the Turing Test, machines must understand and speak a human language, and this is known as Natural Language Processing (NLP).

The role of NLP in VUIs is paramount because NLP parses human language so that machines can understand it. In this book, you will be using Node.js, which is, in a sense, a language that machines understand, but Node.js does not understand human language. This is where NLP comes in, translating spoken language into structured data that a machine, in this case your Node.js code, can process.

The following is a question and an answer, to which NLP will be applied in order to parse them into a language the machine can understand:

Question: When is my son's birthday?
Answer: Your son's birthday is on January 1st, 1968.

The following JSON is the result of parsing the preceding question and answer using NLP:

{
  "responseId": "a48eecdd-a9d9-4378-8100-2eeec1d95367",
  "queryResult": {
    "queryText": "When is my son's birthday?",
    "parameters": {
      "family": "son"
    },
    "allRequiredParamsPresent": true,
    "fulfillmentText": "Your son's birthday is on January 1st, 1968.",
    "fulfillmentMessages": [{
      "text": {
        "text": ["Your son's birthday is on January 1st, 1968."]
      }
    }],
    "intent": {
      "name": "projects/test-34631/agent/intents/376d04d6-c929-4485-b701-b6083948a054",
      "displayName": "birthday"
    },
    "intentDetectionConfidence": 1,
    "diagnosticInfo": {},
    "languageCode": "en"
  }
}

Note that Google's NLP platform, Dialogflow, parses the question and sends a JSON request that can be processed in a Node.js middle-tier server, as shown previously. In the preceding JSON request, there is the intent.displayName field, which describes the type of question the user asked, and in the Node.js middle-tier server, you can use the intent name to process the request accordingly and respond to the user with the answer.
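
To make this concrete, the following is a minimal sketch of such a middle-tier server, assuming an Express-based Node.js webhook (the route name, port, and hardcoded answer are illustrative, not taken from the book's code):

const express = require('express');
const app = express();
app.use(express.json());

// Dialogflow POSTs the JSON request shown previously to this endpoint.
app.post('/webhook', (req, res) => {
  const intentName = req.body.queryResult.intent.displayName;
  const family = req.body.queryResult.parameters.family;
  if (intentName === 'birthday') {
    // Respond with fulfillment text that Dialogflow reads back to the user.
    res.json({ fulfillmentText: `Your ${family}'s birthday is on January 1st, 1968.` });
  } else {
    res.json({ fulfillmentText: 'Sorry, I cannot answer that yet.' });
  }
});

app.listen(3000);

Here, the intent name routes the request to the right handler, and the parsed parameters fill in the answer.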

Furthermore, in the preceding JSON request, there is the queryResult.parameters.family field, which describes whose birthday the user was asking about. With NLP combined with ML, you can create a template for the question, allowing the machine to learn variations of the possible questions that a user might ask. This is useful because there are many ways to ask about someone's birthday. For example, note the phrase my son in each of the following variations, which creates a template matching pattern for the machine to learn:

  • Do you know my son's birthday?
  • Can you tell me when my son's birthday is?
  • Tell me the birthday of my son.

VUI design platforms

VUI design platforms are platforms that can handle NLP. In this book, you will be introduced to two of the most popular NLP platforms: Google's Dialogflow and Amazon's Alexa Skills Kit (ASK). You will use Dialogflow to build a chatbot and applications for Google Assistant, Google Home, and Android Auto. Using ASK, you will learn how to build a voice application for Amazon Echo.

Principles of conversation

In this section, you will learn about the fundamentals of conversation, because a good conversation in VUIs can bring a compelling user experience to an application. You will utilize the design principles covered in this section to build a chatbot in Chapter 2, Building an FAQs Chatbot; applications for home assistant devices such as Google Home in Chapter 3, Building a Fortune Cookie Application, and Amazon Echo in Chapter 6, Building a Cooking Application Using Alexa; and a voice-enabled application for cars using Android Auto in Chapter 9, Building a Voice Enabled Podcast for the Car.

Turn taking

A conversation is all about taking turns when people speak to each other. One person speaks while the other listens, and during each turn, the listener should not interrupt the speaker. Look at a simple conversation with MyApp:

Me: Where is the nearest gas station?
MyApp: Here is what I found: 1300 Coraopolis Heights Rd.

Flow

A conversation can contain multiple turns, exchanging words on specific topics in a short and concise manner. Avoid using a one-dimensional script with predefined one-off questions and answers, where the conversation can quickly turn boring and unnatural. A good example of this is the IVR system, where the user has to wait and listen to all of "Press 1 or say reservation, Press 2 or say check reservation, Press 3 or say cancel reservation, Press # or say main menu." Nowadays, the user would simply press 0 repeatedly, which typically sends them straight to the operator. To give a conversation more flow, you should try to anticipate the user's intention and engage with the user.

The following example shows the flow of conversation:

support: How can I help you today?
user: I need to cancel my reservation.
support: Would you like to speak to the operator, or would you like to do it over the phone?
user: Phone.
support: Do you have the reservation number?
user: No.
support: Then I can look it up by your name and the date of the reservation. What is your name and reservation date?

Having the conversation flow makes the user feel like they are having a conversation with a real person, whereas the IVR system did not engage with the user but instead presented all of the options, including options the user did not care about.

Context

In order to create an effective conversation flow, you will need contextual knowledge. In the following exchange, the context starts with a simple question about the weather. Then, I ask MyApp whether I can go snowboarding, and MyApp wishes for snow so that I can go snowboarding today. MyApp understood the context of my question, that I want to go snowboarding, and responded in a thoughtful way. As we begin to design VUIs, it will be critical to capture the context of the conversation in order to properly process the user's intent and deliver a correct and appropriate answer:

Me: What's the weather like today?
MyApp: Cloudy with a chance of snow!
Me: Do you think I will be able to go snowboarding today?
MyApp: I hope it snows!
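
In Dialogflow, this kind of shared context can be represented as named output contexts attached to the session. The following is a minimal sketch of a webhook response that sets a context (the session ID and parameter values are placeholders, not from a real agent):

{
  "fulfillmentText": "Cloudy with a chance of snow!",
  "outputContexts": [{
    "name": "projects/test-34631/agent/sessions/12345/contexts/weather",
    "lifespanCount": 5,
    "parameters": { "forecast": "snow" }
  }]
}

A follow-up question such as Do you think I will be able to go snowboarding today? then arrives with the weather context still active, so the agent can answer in context.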

Verbal confirmation

It is important to let the user know that you have understood their question during the conversation. In the following conversation, note how MyApp responds with Let's see and Here you go before answering my questions, to let me know that it understood them:

Me: What time is it?
MyApp: Let's see. It is 3:14 PM.
Me: Do I have any appointment?
MyApp: Here you go. You have one at 5:00 PM.

What about when you are about to process business-critical functions, such as ordering candy online? In the following conversation, I begin by ordering candy. MyApp then confirms my order by asking whether I really want to place it. Upon my saying Yes, MyApp lets me know that the order has been placed. This kind of confirmation reassures the user that their intent has been processed by the system. However, there is a downside: if overused, the conversation becomes repetitive and annoying to the user. To avoid this, you can provide a confirmation response based on confidence. In ML, for every utterance that is recognized, the matching algorithm reports a confidence level indicating how accurate the match is. In the following example, if MyApp is at least 90% confident that I asked to buy 10 candies, it can skip the extra confirmation question and simply confirm that my order has been placed. If the confidence level falls below the 90% threshold, MyApp asks whether I really want to buy 10 candies and requests further confirmation:

Me: I would like to order 10 M&M candies.
MyApp: Would you like to buy 10 M&M candies for $10?
Me: Yes
MyApp: Order has been processed.
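
A minimal sketch of this thresholding, assuming a Dialogflow-style webhook where intentDetectionConfidence is available (the threshold and messages are illustrative):

// Skip the extra confirmation step only when confidence is high.
const CONFIDENCE_THRESHOLD = 0.9;

function handleOrder(queryResult) {
  if (queryResult.intentDetectionConfidence >= CONFIDENCE_THRESHOLD) {
    // High confidence: process the order immediately.
    return { fulfillmentText: 'Order has been processed.' };
  }
  // Low confidence: ask the user to confirm before processing.
  return { fulfillmentText: 'Would you like to buy 10 M&M candies for $10?' };
}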

Visual confirmation

Sometimes, verbal confirmation will not suffice, and visual confirmation must be presented to the user. For example, when you ask Google for the nearest gas station, Google will not read the exact address out loud, because that would not be helpful. Instead, Google provides a clickable visual card, and the address opens in Google Maps.

The following screenshot shows the visual card confirmation:

Visual response showing Google Maps

Error handling

When errors occur in a voice application, they must be handled properly, and the application must resume where it left off. Let's look at the different types of errors in VUIs one at a time. First, imagine that you are talking to your friend and suddenly your friend stops talking mid-sentence. Do you say nothing and wait for hours for your friend to reengage in the conversation? Most likely, you would ask your friend if they are OK and try to reengage them. The same is true for VUIs: if no response is provided within a given time frame, the voice application needs to recover and try to reengage the user. What if the voice application cannot understand the user's intent? In this case, the application needs to re-ask the question, without blaming the user in any way for the error. Lastly, the user might repeatedly make the same mistake while trying to provide information. In such cases, with every new attempt, the application should provide progressively more detailed instructions, in the hope that the user can eventually express the correct intent.
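
The escalating reprompt pattern can be as simple as indexing into increasingly detailed help messages; a minimal sketch (the messages are illustrative):

// Provide progressively more detailed instructions on each failed attempt.
const reprompts = [
  'Sorry, I did not catch that. What date is your reservation?',
  'Please tell me the reservation date, for example, July 21st.',
  'You can say a full date, such as July 21st 2018, or say operator for help.'
];

function nextReprompt(attempt) {
  // After the last message, keep repeating the most detailed one.
  return reprompts[Math.min(attempt, reprompts.length - 1)];
}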

Session awareness

Every time the user engages with a VUI, there is specific data the application needs to remember. A good example of this is the user's name, so that the application can address the user by their first name. Another good example is the cooking voice application that you will build in Chapter 6, Building a Cooking Application Using Alexa. When a user is following a recipe, there are times when the application asks the user to sauté the chopped onions until brown. Five minutes later, the user might come back and ask for the next step. The cooking application needs to be smart enough to understand that each step can take minutes to carry out, and the user's session must be maintained in order to correctly interact with the user.
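
A minimal sketch of such session tracking, assuming each webhook call carries a session identifier (Dialogflow, for example, includes one in the session field of the request body; the in-memory Map is illustrative, and a real application would use a durable store):

const sessions = new Map(); // sessionId -> saved state

function getSession(sessionId) {
  if (!sessions.has(sessionId)) {
    sessions.set(sessionId, { currentStep: 0 });
  }
  return sessions.get(sessionId);
}

// "Next step" resumes wherever this session left off, even minutes later.
function nextStep(sessionId, recipeSteps) {
  const session = getSession(sessionId);
  const step = recipeSteps[session.currentStep];
  session.currentStep += 1; // remember progress for the next request
  return step;
}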

Providing help

VUIs often have no visual representation, so users might not know what they can do and will ask for help. The application needs to provide help visually and/or verbally. When providing help, it is a good idea to use examples, such as the following one, where, in order to see the photos on the phone, you have to say, Show my photos from last weekend.

The following screenshot shows an example of providing help for the user:

Providing help

Response time

When designing VUIs, response time is a very important factor. A slow response time can jeopardize the user experience and the quality of the application. Note that Alexa and Google each set a limit on how long the response may take: for Alexa, the response must be received within 8 seconds, and for Google, within 5 seconds. If the response from the backend server does not arrive within the given limit, Alexa and Google will send an exception back to the server. Consider the factors that can affect response time, listed as follows (a timeout guard sketch follows the list):

  • Slow internet speeds, which can delay both sending the user's verbal intent and receiving the response from the server
  • A slow backend server or database, which can add latency when sending the result back to the application and, in turn, delay the response heard by the user
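
One defensive pattern is to race the backend call against a timer and fall back to a holding answer before the platform deadline; a minimal sketch (lookupAnswer is a hypothetical slow backend call, and the 4-second budget simply leaves headroom under Google's 5-second limit):

function withTimeout(promise, ms, fallback) {
  // Resolve with the fallback if the real work does not finish in time.
  const timer = new Promise(resolve => setTimeout(() => resolve(fallback), ms));
  return Promise.race([promise, timer]);
}

async function fulfill(request) {
  return withTimeout(
    lookupAnswer(request), // hypothetical call to a slow backend
    4000,
    { fulfillmentText: 'I am still looking that up. Please ask me again in a moment.' }
  );
}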

Empathy

When creating VUIs, you want to make the user feel as though they are talking to a real person. This develops empathy by making a connection with the user. First, empathy can be achieved by allowing the user to choose the voice they want to use. For example, in Google Assistant, the user can choose either a male or female voice by navigating to Settings | Preferences | Assistant voice, shown as follows:

Changing Google Assistant's voice

A second way is to programmatically control the voice using Speech Synthesis Markup Language (SSML), https://www.w3.org/TR/speech-synthesis/, which has been developed to generate synthetic voices for websites and other applications. With SSML, the response to the user can be controlled. Both Amazon Alexa and Google Dialogflow platforms support the use of SSML. Here are the most commonly used SSML tags and their usages in brief:

  • <break time="0.2s" />: Introduces a short pause between sentences.
  • <emphasis level="strong">Come here now!</emphasis>: Emphasizes the enclosed words; a strong level increases the volume and slows the rate, while a reduced level decreases the volume and speeds the rate up.
  • <prosody pitch="medium" rate="slow">great!!</prosody>: Used to customize the pitch, speech rate, and volume.
  • <p> Some paragraph goes here </p>: Similar to adding a long break between paragraphs.
  • <s> Some sentence goes here </s>: The equivalent of putting a period at the end to denote the end of the sentence in order to give a short pause.
  • <say-as interpret-as="cardinal">123</say-as>: Indicates how the text should be read. For example, as a cardinal number, 123 will be read as one hundred and twenty-three; as an ordinal number, it will be read as one hundred and twenty-third.

Both Amazon Alexa and Google Dialogflow support limited sets of SSML tags. Ensure that you check the SSML references for Amazon Alexa at http://amzn.to/2BGLt4M and Google Dialogflow at http://bit.ly/2BHBQmq. You will learn about SSML in greater detail in Chapter 2, Building an FAQs Chatbot.

Using SSML, let's create speech that shows some excitement; you would not want the voice to be monotonous and boring. To create such excitement, you can use prosody with a medium pitch and a slow rate, shown as follows. Also, by emphasizing the word love, you can convey a sense of happiness. You can copy and paste the following SSML into the Watson Text to Speech service interface, found at http://bit.ly/2AlAc9d, and the voice will be played back:

<speak>
  <p>
    <s>
      OK <prosody pitch="medium" rate="slow">great!!</prosody>
    </s>
  </p>
  <break time="0.2s" />
  <p>
    <s>
      I would <emphasis level="strong">love</emphasis> to see you tomorrow!
    </s>
  </p>
</speak>

In order to test the SSML using http://bit.ly/2AlAc9d, it is best to use either the Firefox or Chrome browser.
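
For reference, this is also how a skill's backend returns SSML at runtime. The following is a minimal sketch of an Alexa-style response body carrying SSML (simplified; see the ASK reference above for the full structure):

{
  "version": "1.0",
  "response": {
    "outputSpeech": {
      "type": "SSML",
      "ssml": "<speak>OK <prosody pitch='medium' rate='slow'>great!!</prosody></speak>"
    }
  }
}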
 

Voice-enabled applications

This section discusses the voice-enabled devices that are available today. These include home assistant devices such as Amazon Echo and Google Home, which can integrate with other voice-activated home appliances such as lights, garage doors, thermostats, televisions, stereo systems, and more. Finally, this section also covers voice-enabled virtual car assistants.

Home assistant devices

Home assistant devices are rapidly making their way into homes today. Devices such as Amazon Echo and Google Home can control security cameras, dishwashers, dryers, lights, power outlets, switches, door locks, and thermostats. For example, Amazon Echo and Google Home can turn lights on and off, control air conditioning and heating, and open and close garage doors.

There are three assistant devices from Amazon: Echo Plus (http://amzn.to/2CBLfeu), Echo (http://amzn.to/2CcIFey), and Echo Dot (http://amzn.to/2yVB5mA), shown as follows. All three Amazon Echos can set alarms, play music, search online, call an Uber, order a pizza, and integrate with other home assistant devices from WeMo, TP-Link, Sony, Insteon, ecobee, and others.

The only difference between Amazon Echo and Echo Plus is that Echo Plus has a built-in home assistant hub, which allows other home assistant devices to connect directly to it. For Amazon Echo and Echo Dot, you would need to purchase a separate hub, such as Samsung's SmartThings Smart Home Hub ($73.49, http://amzn.to/2oHkiDV), in order to control other home assistant appliances.

The following photo shows a Smart Home device controlling home appliances:

Smart Home Hub

In the Amazon Echo family, there are two devices that come with a screen, which can display pictures, show a clock, and play videos: Amazon Echo Show ($149.00, http://amzn.to/2sJVoRF) and Echo Spot ($129.00, http://amzn.to/2jxEfeR). Similarly, many of the home assistant devices that integrate with Amazon Echo can also be integrated with and controlled through Google Home ($79.00, http://bit.ly/2eYZq4D), shown as follows. Many of these devices can be found on the Google support page at http://bit.ly/2yV3NnA.

Google Home is shown here:

Google Home

Automobile virtual assistant devices

Nowadays, almost every car's dashboard comes with a voice recognition system, which enables drivers to control the music volume, change radio stations, add Bluetooth devices, turn interior lights on and off, and more. Since Microsoft Sync introduced VUIs to Ford cars in 2007, there has not been much technological advancement. In 2007, Microsoft Sync was ahead of its time, introducing the first voice-enabled features of their kind in cars, but designing VUIs for Ford came with challenges. First, there was road noise while driving at high speed, which hindered recognition of the driver's voice. Second, upgrading the system required bringing the car in to a dealer. Third, the car's operating system was usually proprietary, so it was difficult to write programs to introduce new features or enhance existing ones.

In 2017, many of the challenges faced by Microsoft Sync were eliminated. Many cars had more soundproof bodies, which eliminated much of the road noise. Tesla has shown that, with the right design, a car's system software can be upgraded over Wi-Fi, and many manufacturers have begun to allow car systems to be upgraded remotely. In 2018, many manufacturers began to integrate Android Auto and Apple CarPlay, taking advantage of the Android and iOS operating systems, which have proven track records in the mobile space. Embracing Android and iOS also brings entire ecosystems of developers along, who can drive innovation by developing voice-enabled applications for cars. For example, there is a project on GitHub (http://bit.ly/2D83MjI) that integrates the Tesla API into Google Home and mobile devices via Google Assistant, allowing you to check the battery charge level and door status, flash the lights, and honk the horn. Just as developers have brought innovation through applications in the Android and iOS marketplaces, you will see a huge surge in voice-enabled applications in the Android Auto and Apple CarPlay marketplaces. Beyond car manufacturers, in 2018 you will also see many car stereo manufacturers, such as Pioneer, Alpine, JVC, Kenwood, and Sony, begin to support Android Auto.

Chatbots

Chatbots have been around for a long time, since the days of Internet Relay Chat (IRC), the popular chat room protocol. Back then, chatbots were used by chat room owners to kick and ban users, promote products, provide news updates, and recruit more chat room members. Now, with NLP, chatbots are more than just text-based: they are becoming very popular with companies for promoting products, providing product support, and answering frequently asked questions (FAQs).

The following screenshot shows a chatbot by Patrón on Twitter:

Patrón Twitter chatbot

Converting an existing text-based chat into a voice-enabled chat is simple, because text-based chats are already designed to answer users' questions. In Chapter 2, Building an FAQs Chatbot, a chatbot will be used to demonstrate the basic functionality of the NLP platform, and you will learn how to create a chatbot's VUIs. The chatbot you create there can be used on mobile through Google Assistant or integrated into an existing mobile application.

 

Future of VUIs

VUIs are more than a technological advancement in the voice technology era. VUIs are the next Internet of Things (IoT), ready to skyrocket into an untapped market for software developers. Amazon Echo and Google Home can interconnect home appliances, play music using Spotify, play radio station podcasts, stream Netflix movies on demand with a voice command, send messages to friends and family without a mobile phone, start a hot shower during the winter, brew a pot of coffee, ask a Samsung refrigerator what needs to be restocked, and tell the home air conditioning system to set the heater to 74 degrees 20 minutes before you arrive home from a long vacation. The sky's the limit!

Currently, Amazon Alexa has around 15,000 voice applications, and Google Home has around 1,000. Many voice applications are not even in production yet; you will see a boom in voice applications in 2019. Many of the big players, such as Twitter, Facebook, Airbnb, and Snapchat, have not released any voice applications yet. Look at Logitech, which is releasing automobile voice interfaces (http://engt.co/2khYV5z) for Ford, Volkswagen, and Volvo. Watch closely as Samsung and LG refrigerators start utilizing VUIs (http://on.mash.to/2B4Oj2i). Take a look at Google Pixel Buds (http://bit.ly/2kSP7Uo), which can translate Spanish to English: imagine a Spanish speaker and an English speaker, both wearing Google Pixel Buds, able to communicate with each other. What will companies that create dating applications, such as Tinder, Match, OkCupid, and POF, bring to the world of voice? What about the educational sector? The transportation sector? Voice-enabled gaming? Voice-enabled ticketing systems for movie theaters? The possibilities are endless. I hope this book will fuel your imagination and help you catch the wave of the voice era!

 

Summary

In this chapter, you learned about the history and basic fundamentals of VUIs, and about the importance and role of natural language processing in creating them. You also learned the best practices, design principles, and user experience considerations involved in creating compelling VUIs. Then, you learned about the various devices, such as Amazon Echo, Google Home, and Android Auto, to which you can deploy your voice-enabled applications. Finally, you should now understand the future of VUIs and where they are heading.

In the next chapter, you will learn how to build your first voice-enabled application, a chatbot, using the Dialogflow NLP platform from Google.

About the Author
  • Henry Lee

    Henry Lee has over 18 years of experience in software engineering. His passion for software engineering has led him to work at various start-ups. Currently, he works as a principal architect responsible for R&D and digital strategy. In his spare time, he loves to travel and snowboard, and enjoys discussing the latest technology trends over a cigar! He has also authored three books on mobile development for Apress.
