Voice User Interface Projects

By: Henry Lee
Overview of this book

From touchscreens and mouse clicks, we are moving to voice- and conversation-based user interfaces. By adopting Voice User Interfaces (VUIs), you can create a more compelling and engaging experience for your users. Voice User Interface Projects teaches you how to develop voice-enabled applications for desktop, mobile, and Internet of Things (IoT) devices. This book explains in detail what VUIs are and why they are important, the basic design principles of VUIs, the fundamentals of conversation, and the different voice-enabled applications available in the market. You will learn how to build your first voice-enabled application by utilizing the Dialogflow and Alexa natural language processing (NLP) platforms. Once you are comfortable with building voice-enabled applications, you will understand how to dynamically process and respond to questions by using a Node.js server deployed to the cloud. You will then move on to securing Node.js RESTful APIs for Dialogflow and Alexa webhooks, creating unit tests, and building voice-enabled podcasts for cars. Last but not least, you will discover advanced topics such as handling sessions, creating custom intents, and extending built-in intents in order to build conversational VUIs that will help engage users. By the end of the book, you will have a thorough knowledge of how to design and develop interactive VUIs.

Basic design fundamentals of VUIs

There are a plethora of VUI design principles scattered across the internet and in books. I have aggregated the most common and the most important topics in this section, which are useful when developing VUIs. Before diving into the topics, let's first discuss what VUIs are and why they are important.

What are VUIs and why are they important?

VUIs allow you to interact with machines through conversation. If you have ever used the voice-enabled features of Apple's Siri or Google Assistant, you will be well aware of the capabilities they provide, such as asking about today's weather, getting the directions to nearby restaurants, or asking about last night's football scores for your favorite team.

Why is VUI important? Because user interfaces are moving away from touch-based interactions and toward conversational ones. According to Dr. Mehl from the University of Arizona, humans speak anywhere between 15,000 and 47,000 words a day on average, the equivalent of reading half a book! The importance and ubiquity of the spoken word in our daily lives will transform next-generation user interfaces from being touch-based to being voice-based.

Let's look at the advantages of VUIs:

  • Speed: Processing voice inquiries is 5 to 10 times faster than typing a query into a browser and searching with engines such as Google or Yahoo.

  • Intuitive: Every day, people engage in conversations with other people, and simply asking questions is something that everyone can do without having to learn a new skill.
  • Hands-free: VUIs can eliminate the need for touchscreens. For example, while driving or cooking, you can interact with an application with your voice.
  • Personal: The ability to engage with machines through conversation brings a sense of closeness and almost human-like interactions. This can be a good thing when you want to engage users on a more personal level.

Role of NLP in VUIs

In 1950, Alan Turing published his famous paper entitled Computing Machinery and Intelligence. The paper proposed a way to test whether machines are artificially intelligent. Turing stated that if a machine could trick a human into thinking that they were talking to another human, then that machine is artificially intelligent. Today, this test is known as the Turing Test. In order to pass the Turing Test, machines must understand and speak a human language, and this capability is known as Natural Language Processing (NLP).

The role of NLP in VUIs is paramount because NLP parses human language so that machines can understand it. In this book, you will be using Node.js, which is, in a sense, a language that machines understand, but Node.js does not understand human language. This is where NLP comes in, translating spoken language into a form that machines can process, which in this case is handled by Node.js.

The following is a question and answer, which NLP will be applied to in order to parse it into a language the machine can understand:

Question: When is my son's birthday?
Answer: Your son's birthday is on January 1st, 1968.

The following JSON is the result of parsing the preceding question and answer using NLP:

{
  "responseId": "a48eecdd-a9d9-4378-8100-2eeec1d95367",
  "queryResult": {
    "queryText": "When is my son's birthday?",
    "parameters": {
      "family": "son"
    },
    "allRequiredParamsPresent": true,
    "fulfillmentText": "Your son's birthday is on January 1st, 1968.",
    "fulfillmentMessages": [{
      "text": {
        "text": ["Your son's birthday is on January 1st, 1968."]
      }
    }],
    "intent": {
      "name": "projects/test-34631/agent/intents/376d04d6-c929-4485-b701-b6083948a054",
      "displayName": "birthday"
    },
    "intentDetectionConfidence": 1,
    "diagnosticInfo": {},
    "languageCode": "en"
  }
}
Note that Google's NLP platform, Dialogflow, parses the question and sends a JSON request that can be processed in a Node.js middle-tier server, as shown previously. In the preceding JSON request, there is the queryResult.intent.displayName field, which describes the type of question the user asked, and in the Node.js middle-tier server, you can use the intent name to process the request accordingly and respond to the user with the answer.

Furthermore, in the preceding JSON request, there is the queryResult.parameters.family field, which describes whose birthday the user was asking about. By combining NLP with ML, you can create a template for the question, allowing the machine to learn variations of possible questions that the user might ask. This is useful because there are many ways to ask about someone's birthday. For example, refer to the italicized words, which will create a template matching pattern for the machine to learn:

  • Do you know my son's birthday?
  • Can you tell me when my son's birthday is?
  • Tell me the birthday of my son.
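To make the middle-tier processing concrete, here is a minimal sketch of a Node.js handler for the JSON request shown earlier. The handler shape, the birthdays lookup table, and the handleWebhook function name are illustrative assumptions, not the book's actual code; a real Dialogflow webhook would receive this body over HTTPS in an Express route:

```javascript
// Hypothetical data store for the birthday example (illustrative only).
const birthdays = { son: 'January 1st, 1968' };

// Minimal sketch: route the request by intent name and build a response
// using the parameters that Dialogflow extracted from the user's speech.
function handleWebhook(body) {
  const intent = body.queryResult.intent.displayName;
  if (intent === 'birthday') {
    const family = body.queryResult.parameters.family;
    const date = birthdays[family];
    return { fulfillmentText: `Your ${family}'s birthday is on ${date}.` };
  }
  return { fulfillmentText: "Sorry, I didn't understand that." };
}

// Example request mirroring the JSON shown previously:
const request = {
  queryResult: {
    queryText: "When is my son's birthday?",
    parameters: { family: 'son' },
    intent: { displayName: 'birthday' }
  }
};
console.log(handleWebhook(request).fulfillmentText);
// → Your son's birthday is on January 1st, 1968.
```

Because all of the question variations above resolve to the same birthday intent, this one handler covers every phrasing the user might choose.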

VUI design platforms

VUI design platforms are those that can handle NLP. In this book, you will be introduced to two of the most popular NLP platforms: Google's Dialogflow and Amazon's Alexa Skills Kit (ASK). You will use Dialogflow to build applications for a chatbot, Google Assistant, Google Home, and Google Auto. Using ASK, you will learn how to build a voice application for Amazon Echo.

Principles of conversation

In this section, you will learn about the fundamentals of conversation, because a good conversation in VUIs can bring a compelling user experience to an application. You will utilize the design principles covered in this section to build a chatbot in Chapter 2, Building an FAQs Chatbot, home assistant devices such as Google Home in Chapter 3, Building a Fortune Cookie Application, Amazon Echo in Chapter 6, Building a Cooking Application Using Alexa, and a voice-enabled application for cars using Google Auto in Chapter 9, Building a Voice Enabled Podcast for the Car.

Turn taking

A conversation is all about taking turns when people are speaking to each other. There is one who speaks and there is one who listens, and during each turn, the listener should not interrupt the speaker. Look at a simple conversation with MyApp:

Me: Where is the nearest gas station?
MyApp: Here is what I found: 1300 Coraopolis Heights Rd.


A conversation can contain multiple turns, exchanging words on specific topics in a short and concise manner. Avoid using a one-dimensional script with predefined one-off questions and answers, where the conversation can quickly become boring and unnatural. A good example of this is an interactive voice response (IVR) system, where the user has to wait and listen to an entire menu: "Press 1 or say make reservation, press 2 or say check reservation, press 3 or say cancel reservation, press # or say main menu." Nowadays, the user would simply press 0 repeatedly, which typically sends them to the operator. To give the application a more conversational flow, you should try to anticipate the user's intention and engage with the user.

The following example shows the flow of conversation:

support: How can I help you today?
user: I need to cancel my reservation.
support: Would you like to speak to the operator, or would you like to do it over the phone?
user: Phone.
support: Do you have the reservation number?
user: No.
support: Then I can look it up by your name and the date of the reservation. What is your name and reservation date?

Having this conversation flow makes the user feel like they are talking to a real person, whereas the IVR system did not engage with the user but presented all of the options, including options the user did not care about.


In order to create an effective conversation flow, you will need contextual knowledge. In the following example, the context starts with a simple question about the weather. Then, I ask MyApp if I can go snowboarding, and MyApp replies that it hopes it snows so that I can go snowboarding today. MyApp understood the context of my question, namely that I want to go snowboarding, and responded in a thoughtful way. As we begin to design VUIs, it will be critical to capture the context of the conversation in order to properly process the user's intent and deliver a correct and appropriate answer:

Me: What's the weather like today?
MyApp: Cloudy with a chance of snow!
Me: Do you think I will be able to go snowboarding today?
MyApp: I hope it snows!

Verbal confirmation

It is important to let the user know that you have understood their question during the conversation. In the following conversation, note how MyApp responds with Let's see and Here you go before giving the answers to my questions to let me know that MyApp understood my questions:

Me: What time is it?
MyApp: Let's see. It is 3:14 PM.
Me: Do I have any appointment?
MyApp: Here you go. You have one at 5:00 PM.

How about when you are about to process business-critical functions, such as ordering candy online? In the following conversation, I begin by ordering candy. Then, MyApp confirms my order by asking whether or not I really want to place it. Upon my saying Yes, MyApp lets me know that the order has been placed. This kind of confirmation reassures the user that their intent has been processed by the system. However, there is a downside: if overused, the conversation can become repetitive and annoying to the user. To avoid this, provide a confirmation response based on confidence. Confidence is used in ML in such a way that, for every utterance that is recognized, the matching algorithm gives a confidence level indicating how accurate the match is. If MyApp has a confidence level of 90% or higher that I asked to buy 10 candies, it can skip the question asking whether I really want to buy them and simply respond with a confirmation that the order has been placed. If the confidence level is below the 90% threshold, MyApp can ask whether I want to buy 10 candies and request further confirmation:

Me: I would like to order 10 M&M candies.
MyApp: Would you like to buy 10 M&Ms for $10?
Me: Yes
MyApp: Order has been processed.
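The confidence check above can be sketched in a few lines of Node.js. The 0.9 threshold mirrors the 90% figure mentioned previously (in Dialogflow, this value arrives as the intentDetectionConfidence field shown in the earlier JSON); the function name and order details are illustrative assumptions:

```javascript
// Assumed threshold mirroring the 90% figure in the text.
const CONFIDENCE_THRESHOLD = 0.9;

function respondToOrder(confidence, quantity, item, price) {
  if (confidence >= CONFIDENCE_THRESHOLD) {
    // High confidence: skip the extra question and process the order.
    return `Order has been processed: ${quantity} ${item} for $${price}.`;
  }
  // Low confidence: ask the user to confirm before processing.
  return `Would you like to buy ${quantity} ${item} for $${price}?`;
}

console.log(respondToOrder(0.95, 10, 'M&M candies', 10));
// → Order has been processed: 10 M&M candies for $10.
console.log(respondToOrder(0.6, 10, 'M&M candies', 10));
// → Would you like to buy 10 M&M candies for $10?
```

Tuning the threshold lets you trade off conversation speed against the risk of processing a misheard order.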

Visual confirmation

Sometimes, verbal confirmation will not suffice, and visual confirmation must be presented to the user. For example, when asking Google for the nearest gas station, Google will not read the exact address out loud, because that would not be very helpful. Instead, Google provides a clickable visual card that opens the address in Google Maps.

The following screenshot shows the visual card confirmation:

Visual response showing Google Maps

Error handling

When errors occur in a voice application, they must be handled properly and the application must resume where it left off. Let's look at the different types of errors in VUIs one at a time. First, imagine that you are talking to your friend and suddenly your friend stops talking in the middle of a sentence. Do you say nothing and wait for hours for your friend to reengage in the conversation? Most likely, you would ask your friend if they are OK and try to reengage them. The same is true for VUIs: if no response is provided in a given time frame, the voice application needs to recover and try to reengage the user. What if the voice application cannot understand the user's intent? In this case, the application needs to re-ask the question, and it should never blame the user for the error. Lastly, the user might continuously make the same mistake when attempting to provide information. In such cases, with every new attempt to get the correct information, the application needs to provide more detailed instructions, in the hope that the user can express the correct intent.
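The escalating re-prompt pattern described above can be sketched as follows. The prompt wording and the nextPrompt helper are illustrative assumptions; the idea is simply that each failed attempt selects a more detailed instruction:

```javascript
// Illustrative re-prompts, from gentle nudge to detailed instruction.
const reprompts = [
  'Sorry, I did not catch that. What would you like to do?',
  'You can say things like "check my reservation" or "cancel my reservation".',
  'For example, say "cancel my reservation for January 1st".'
];

function nextPrompt(attempt) {
  // Clamp the index so repeated failures keep using the most
  // detailed prompt instead of running off the end of the list.
  return reprompts[Math.min(attempt, reprompts.length - 1)];
}

console.log(nextPrompt(0)); // first failure: gentle re-ask
console.log(nextPrompt(5)); // repeated failures: most detailed help
```

Note that none of the prompts blame the user; they progressively show what the application can understand.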

Session awareness

Every time the user engages with VUIs, there is specific data the application needs to remember. A good example of this would be the name of the user, so that the application can address the user by their first name. Another good example is the cooking voice application that you will build in Chapter 6, Building a Cooking Application Using Alexa. When a user is making use of the cooking application, there are times when the application asks the user to sauté the chopped onions until brown. Five minutes later, the user might come back and ask for the next step. The cooking application needs to be smart enough to understand that each step can take minutes to carry out, and the user's session must be maintained in order to correctly interact with the user.
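A minimal sketch of this session awareness for the cooking example follows. The in-memory Map and the nextStep helper are illustrative assumptions; a real application would persist sessions in a database, since the user's requests can be minutes apart:

```javascript
// Per-session state keyed by session ID (in-memory, for illustration).
const sessions = new Map();

function nextStep(sessionId, steps) {
  const current = sessions.get(sessionId) || 0;
  if (current >= steps.length) {
    return 'You have finished all of the steps!';
  }
  // Advance this session's position for the next request.
  sessions.set(sessionId, current + 1);
  return steps[current];
}

const recipe = ['Chop the onions.', 'Saute the chopped onions until brown.'];
console.log(nextStep('session-123', recipe)); // first request
console.log(nextStep('session-123', recipe)); // five minutes later
```

Both Dialogflow and Alexa include a session identifier in each request, which is what you would use as the key here.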

Providing help

VUIs do not have any visual representation; thus, the user might not know what to do or how to ask for help. The application needs to provide help visually and/or verbally. When providing help, it is a good idea to use examples, such as the following one, where, in order to see the photos on the phone, you have to say, Show my photos from last weekend.

The following screenshot shows an example of providing help for the user:

Providing help

Response time

When designing VUIs, response time is a very important factor. A slow response time can jeopardize the user experience and the quality of the application. Note that Alexa and Google each set a limit on how long the response can take: for Alexa, the response must be received within 8 seconds, and for Google, within 5 seconds. If the response from the backend server does not arrive within the given limit, Alexa and Google will send an exception back to the server. Consider the factors that can affect responding to the user's intent, as follows:

  • Slow internet speed, which can cause delays for the users in terms of sending their verbal intent and receiving the response from the server
  • A slow backend server or database, which can cause latency issues when sending the instruction back to the application, which in turn delays the response sent to the user
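Given these platform deadlines, one common defensive pattern is to race the backend work against a timer so that some reply always goes back in time. The following is a minimal sketch; the withDeadline helper, timeout values, and fallback text are illustrative assumptions, not a platform API:

```javascript
// Race the backend work against a deadline so the voice platform always
// receives a reply within its limit (5 s for Google, 8 s for Alexa).
function withDeadline(work, ms, fallbackText) {
  const timeout = new Promise(resolve =>
    setTimeout(() => resolve({ fulfillmentText: fallbackText }), ms)
  );
  return Promise.race([work, timeout]);
}

// Usage: a lookup that takes 200 ms misses a 50 ms deadline, so the
// fallback holding response is returned instead.
const slowLookup = new Promise(resolve =>
  setTimeout(() => resolve({ fulfillmentText: 'Here is your answer.' }), 200)
);
withDeadline(slowLookup, 50, 'Still working on it, one moment please.')
  .then(response => console.log(response.fulfillmentText));
```

The deadline should be set comfortably below the platform limit to leave room for network latency between your server and the platform.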


Empathy

When creating VUIs, you want to make the user feel as though they are talking to a real person. This develops empathy by making a connection with the user. First, empathy can be achieved by allowing the user to choose the voice they want to hear. For example, in Google Assistant, the user can choose either a male or a female voice by navigating to Settings | Preferences | Assistant voice, shown as follows:

Changing Google Assistant's voice

A second way is to programmatically control the voice using Speech Synthesis Markup Language (SSML), which was developed to generate synthetic voices for websites and other applications. With SSML, the response to the user can be controlled. Both the Amazon Alexa and Google Dialogflow platforms support the use of SSML. Here are the most commonly used SSML tags and their usages in brief:

  • <break time="0.2s" />: Introduces a short pause between sentences.
  • <emphasis level="strong">Come here now!</emphasis>: Creates speech that increases in volume and slows down, or decreases in volume and speeds up.
  • <prosody pitch="medium" rate="slow">great!!</prosody>: Used to customize the pitch, speech rate, and volume.
  • <p> Some paragraph goes here </p>: Similar to adding a long break between paragraphs.
  • <s> Some sentence goes here </s>: The equivalent of putting a period at the end to denote the end of the sentence in order to give a short pause.
  • <say-as interpret-as="cardinal">123</say-as>: Indicates the type of text. For example, the cardinal number 123 will be read as one hundred and twenty three, whereas interpreting 123 as an ordinal number will have it read as one hundred and twenty-third.
Both Amazon Alexa and Google Dialogflow support limited sets of SSML tags. Ensure that you check the SSML references in the Amazon Alexa and Google Dialogflow documentation. You will learn about SSML in greater detail in Chapter 2, Building an FAQs Chatbot.

Using SSML, let's create speech that shows some excitement. You would not want the voice to be monotonous and boring. For example, sometimes you might want to show excitement, and to create such excitement, you can use prosody with a high pitch and a slow rate, shown as follows. Also, when you emphasize the word love, you will be able to convey a sense of happiness. You can copy and paste the following SSML code into the Watson Text to Speech service's demo interface; enter the SSML and the voice will be played back:

OK <prosody pitch="medium" rate="slow">great!!</prosody>
<break time="0.2s" />
I <emphasis level="strong">love</emphasis> to see you tomorrow!
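To serve this SSML from your own backend rather than a demo page, the markup is wrapped in a response object. The following sketch uses Alexa-style response fields (the ssmlResponse helper name is an assumption; the field names follow the Alexa Skills Kit response format, and Dialogflow uses a different but analogous structure):

```javascript
// Sketch: wrap an SSML fragment in an Alexa-style response object.
function ssmlResponse(ssml) {
  return {
    version: '1.0',
    response: {
      // The SSML body must be enclosed in a <speak> root element.
      outputSpeech: { type: 'SSML', ssml: `<speak>${ssml}</speak>` },
      shouldEndSession: true
    }
  };
}

const reply = ssmlResponse(
  'OK <prosody pitch="medium" rate="slow">great!!</prosody> ' +
  '<break time="0.2s" /> ' +
  'I <emphasis level="strong">love</emphasis> to see you tomorrow!'
);
console.log(reply.response.outputSpeech.ssml);
```

The platform then renders the tags into the pauses, emphasis, and pitch changes described above.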

In order to test the SSML, it is best to use either the Firefox or Chrome browser.