
Voicebot and Chatbot Design

By: Rachel Batish

Overview of this book

We are entering the age of conversational interfaces, where we will interact with AI bots using chat and voice. But how do we create a good conversation? How do we design and build voicebots and chatbots that can carry successful conversations in the real world? In this book, Rachel Batish introduces us to the world of conversational applications, bots and AI. You’ll discover how - with little technical knowledge - you can build successful and meaningful conversational UIs. You’ll find detailed guidance on how to build and deploy bots on the leading conversational platforms, including Amazon Alexa, Google Home, and Facebook Messenger. You’ll then learn key design aspects for building conversational UIs that will really succeed and shine in front of humans. You’ll discover how your AI bots can become part of a meaningful conversation with humans, using techniques such as persona shaping and tone analysis. For successful bots in the real world, you’ll explore important use cases and examples where humans interact with bots. With examples across finance, travel, and e-commerce, you’ll see how you can create successful conversational UIs in any sector. Expand your horizons further as Rachel shares with you her insights into cutting-edge voicebot and chatbot technologies, and how the future might unfold. Join in right now and start building successful, high impact bots!
Table of Contents (16 chapters)
Voicebot and Chatbot Design
Contributors
Preface
Other Books You May Enjoy
Index

The stack of conversational UI


The building blocks required to develop a modern and interactive conversational application include:

  • Speech recognition (for voicebots)

  • NLU

  • Conversational level:

    • Dictionary/samples

    • Context

    • Business logic

In this section, we will walk through the "journey" of a conversational interaction along the conversational stack.

Figure 10: The conversational stack: voice recognition, NLU, and context

Voice recognition technology

Voice recognition (also known as speech recognition or speech-to-text) transcribes voice into text. The computer captures our voice with a microphone and provides a text transcription of the words. Using a simple level of text processing, we can develop a voice control feature with simple commands, such as "turn left" or "call John." Leading providers of speech recognition today include Nuance, Amazon, IBM Watson, Google, Microsoft, and Apple.
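Once a transcription is available, even a thin layer of text processing is enough to drive simple voice commands. The following sketch illustrates the idea in plain Python; the command set and the handler are illustrative, not taken from any specific speech SDK, and the transcript itself would come from one of the providers named above:

```python
# Minimal command dispatch over transcribed text. The transcription
# would come from a speech-to-text provider; here we assume we
# already have the text.

def handle_command(transcript: str) -> str:
    """Map a transcribed utterance to a simple action."""
    text = transcript.lower().strip()
    if text == "turn left":
        return "turning left"
    if text.startswith("call "):
        contact = text[len("call "):]
        return f"calling {contact}"
    return "command not recognized"

print(handle_command("Turn left"))   # turning left
print(handle_command("Call John"))   # calling john
```

Exact string matching like this is the "simple level of text processing" the paragraph describes; anything beyond fixed commands needs the NLU layer discussed next.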

NLU

To achieve a higher level of understanding, beyond simple commands, we must include a layer of NLU. NLU fulfills the task of reading comprehension. The computer "reads the text" (in a voicebot, it will be the transcribed text from the speech recognition) and then tries to grasp the user's intent behind it and translate it into concrete steps.

Let's take a look at a travel bot as an example. The system identifies two individual intents:

  1. Flight booking – BookFlight

  2. Hotel booking – BookHotel

When a user asks to book a flight, the NLU layer is what helps the bot to understand that the intent behind the user's request is BookFlight. However, since people don't talk like computers, and since our goal is to create a humanized experience (and not a computerized one), the NLU layer should understand or be able to connect various requests to a specific intent.

Another example is when a user says, I need to fly to NYC. The NLU layer is expected to understand that the user's intent is to book a flight. A more complex request for our NLU to understand would be when a user says, I'm travelling again.

Similarly, the NLU should connect the user's sentence to the BookFlight intent. This is a much more complex task, since the bot can't identify the word flight in the sentence or a destination out of a list of cities or states. Therefore, the sentence is more difficult for the bot to understand.

Computer science considers NLU to be a "hard AI problem" (Turing Test as a Defining Feature of AI-Completeness, in Artificial Intelligence, Evolutionary Computation and Metaheuristics (AIECM), Roman V. Yampolskiy), meaning that even with AI (powered by deep learning), developers still struggle to provide a high-quality solution. Calling a problem AI-hard means that it cannot be solved by a single, specific algorithm, and that solving it involves dealing with unexpected circumstances, as with any real-world problem. In NLU, those unexpected circumstances are the various configurations of words and sentences in an endless number of languages and dialects. Some leading providers of NLU are Dialogflow (previously api.ai, acquired by Google), wit.ai (acquired by Facebook), Amazon, IBM Watson, and Microsoft.

Dictionaries/samples

To build a good NLU layer that can understand people, we must provide a broad and comprehensive sample set of concepts and categories in a subject area or domain. Simply put, we need to provide a list of associated samples or, even better, a collection of possible sentences for each single intent (request) that a user can activate on our bot. If we go back to our travel example, we would need to build a comprehensive dictionary, as you can see in the following table:

User says (samples)           Related intent

I want to book my travel      BookFlight
I want to book a flight       BookFlight
I need a flight               BookFlight
Please book a hotel room      BookRoom
I need accommodation          BookRoom

Building these dictionaries, or sets of samples, can be a tough and Sisyphean task. It is domain-specific and language-specific and, as such, requires different configurations and tweaks from one use case to another, and from one language to another. Unlike the GUI, where the user is restricted to the choices presented on the screen, the conversational UI is unique, since it offers the user an unlimited experience. However, as such, it is also very difficult to pre-configure to a level of perfection (see the AI-hard problem above). Therefore, the more samples we provide, the better the bot's NLU layer will be able to understand different requests from a user. Beware of the catch-22 in this case: the more intents we build, the more samples are required, and all those samples can easily lead to intents overlapping. For example, when a user says, I need help, they might mean they want to contact support, but they also might require help on how to use the app.
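To make the role of the sample dictionary concrete, here is a toy matcher that scores each intent by token overlap with its samples. This is a sketch only; production NLU engines such as Dialogflow or wit.ai use trained statistical models rather than raw word overlap, but the principle - more samples, better coverage - is the same:

```python
# Toy intent matcher: score each intent by the token overlap between
# the user's utterance and that intent's sample sentences.
SAMPLES = {
    "BookFlight": [
        "I want to book my travel",
        "I want to book a flight",
        "I need a flight",
    ],
    "BookRoom": [
        "Please book a hotel room",
        "I need accommodation",
    ],
}

def match_intent(utterance: str) -> str:
    words = set(utterance.lower().split())
    best_intent, best_score = "Unknown", 0
    for intent, samples in SAMPLES.items():
        for sample in samples:
            score = len(words & set(sample.lower().split()))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent

print(match_intent("I need a flight to NYC"))  # BookFlight
```

Notice how the overlap problem from the paragraph above shows up even here: an utterance like "I need help" shares the words "I need" with samples in both intents, so the score difference between intents can be tiny.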

Context

Contextual conversation is one of the toughest challenges in conversational interaction. Being able to understand context is what makes a bot's interaction a humanized one. As mentioned previously, at its minimum, conversational UI is a series of questions and answers. However, adding a contextual aspect to it is what makes it a "true" conversational experience. By enabling context understanding, the bot can keep track of the conversation in its different stages and relate, and make a connection between, different requests. The entire flow of the conversation is taken into consideration and not just the last request.

In every conversational bot we build – either as a chatbot or a voicebot – the interaction will have two sides:

The end user will ask, Can I book a flight?

The bot will respond, Yes. The bot might also add, Do you want to fly international?

The end user can then approve this or respond by saying, No, domestic.

A contextual conversation is very different from a simple Q&A. In the preceding scenario, there are multiple ways the user could respond, and the bot must be able to handle all those different flows.

State machine

One methodology for dealing with different flows is to use a state machine methodology. This popular and simple way to describe context connects each state (phase) of the conversation to the next state, depending on the user's reaction.

Figure 11: Building contextual conversation using a state machine works better for simple use-cases and flows
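The short booking dialog above can be sketched as a transition table: each state maps the user's possible reactions to the next state. The state and trigger names here are illustrative, not from any particular framework:

```python
# A tiny conversational state machine. Every state enumerates the
# user reactions it understands and the state each one leads to -
# all paths must be mapped in advance.
TRANSITIONS = {
    "start":             {"book flight": "ask_international"},
    "ask_international": {"yes": "book_international",
                          "no, domestic": "book_domestic"},
    "book_international": {},
    "book_domestic":      {},
}

def next_state(state: str, user_input: str) -> str:
    # Stay in the same state if the input is not mapped.
    return TRANSITIONS.get(state, {}).get(user_input.lower(), state)

state = "start"
state = next_state(state, "book flight")   # -> ask_international
state = next_state(state, "no, domestic")  # -> book_domestic
print(state)
```

Even in this five-line flow, every anticipated user reply needs its own entry, which hints at why the approach breaks down for complex flows.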

However, the advantage of a state machine is also its disadvantage. This methodology forces us to map every possible conversational flow in advance. While it is very easy to use for building simple use cases, it is extremely difficult to understand and maintain over time, and it's impossible to use for more complicated flows (flight booking, for example, is a complex flow that can't be supported using a state machine). Another problem with the state machine method is that, even for simple use cases, supporting multiple use cases with the same response still requires duplicating much of the work.

Figure 12: The disadvantage of using a state machine methodology when building complex flows

Event-driven contextual approach

The event-driven contextual approach is a more suitable method for today's conversational UI. It lets the users express themselves in an unlimited flow and doesn't force them through a specific flow. Understanding that it's impossible to map the entire conversational flow in advance, the event-driven contextual approach focuses on the context of the user's request to gather all the information it needs in an unstructured way by minimizing all other options.

Using this methodology, the user leads the conversation and the machine analyzes the data and completes the flow at the back. This method allows us to depart from the restricting GUI state machine flow and provide human-level interaction.

In this example, the machine knows that it needs the following parameters to complete a flight booking:

  • Departure location

  • Destination

  • Date

  • Airline

The user in this case can fluently say, I want to book a flight to NYC, or I want to fly from SF to NYC tomorrow, or I want to fly with Delta.

For each of these flows, the machine will return to the user to collect the missing information:

  • User says: "I want to book a flight to NYC"
    Information bot collects: Destination: NYC
    Information bot requests: Departure location, date, airline
    User replies: "Tomorrow, from SF with Delta"

  • User says: "I want to fly from SF to NYC tomorrow"
    Information bot collects: Departure: SF; Destination: NYC; Date: tomorrow
    Information bot requests: Airline
    User replies: "With Delta"

  • User says: "I want to fly with Delta to NYC"
    Information bot collects: Destination: NYC; Airline: Delta
    Information bot requests: Departure location, date
    User replies: "From NY, tomorrow"

By building a conversational flow in an event-driven contextual approach, we succeed in mimicking our interaction with a human agent. When booking a flight with a travel agent, I start the conversation and provide the details that I know. The agent, in return, will ask me only for the missing details and won't force me to state each detail at a certain time.
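The collection loop described above is often implemented as slot filling: the bot extracts whatever parameters appear in the utterance and asks only for what is still missing. The extraction below is deliberately naive keyword matching standing in for the NLU layer's real entity recognition, and the city, airline, and date lists are illustrative:

```python
REQUIRED_SLOTS = ["departure", "destination", "date", "airline"]

# Naive keyword extraction standing in for real entity recognition.
CITIES = {"nyc", "sf"}
AIRLINES = {"delta", "united"}
DATES = {"tomorrow", "today"}

def extract_slots(utterance: str) -> dict:
    """Pull whatever flight-booking parameters appear in the text."""
    slots = {}
    words = utterance.lower().replace(",", "").split()
    for i, word in enumerate(words):
        if word in CITIES:
            # "from X" marks a departure; otherwise treat X as a destination.
            if i > 0 and words[i - 1] == "from":
                slots["departure"] = word
            else:
                slots["destination"] = word
        elif word in AIRLINES:
            slots["airline"] = word
        elif word in DATES:
            slots["date"] = word
    return slots

def missing_slots(slots: dict) -> list:
    """What the bot still needs to ask the user for."""
    return [s for s in REQUIRED_SLOTS if s not in slots]

slots = extract_slots("I want to fly from SF to NYC tomorrow")
print(slots)                 # departure, destination, and date filled
print(missing_slots(slots))  # ['airline']
```

The bot then asks only for the entries in `missing_slots`, exactly like the travel agent who requests just the details you did not volunteer.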

Business logic/dynamic data

At this stage, I think we can agree that building a conversational UI is not an easy task. In fact, many bots today don't use NLU and avoid free-speech interaction. We had great expectations of chatbots and with those high expectations came a great disappointment. This is why many chatbots and voicebots today provide mostly simple Q&A flows.

Figure 13: The Michael Kors Facebook Messenger bot: conversational UI minimized to a simple Q&A flow with no contextuality

Most of those bots have a limited offering, and the business logic is connected to two or three specific use cases, such as opening hours or a phone number, no matter what the user asks for. In other very popular chat interfaces, bots still lean on the GUI, offering a menu selection and eliminating free text.

Figure 14: The Michael Kors Facebook Messenger bot: forcing graphic UI on conversational UI mediums

However, if we are building a true conversational communication between our bot and our users, we must make sure that we connect it to a dynamic business logic. So, after we have enabled speech recognition, worked on our NLU, built samples, and developed an event-driven contextual flow, it is time to connect our bot to dynamic data. To reach real-time data, and to be able to run transactions, our bot needs to connect to the business logic of our application. This can be done through the usage of APIs to your backend systems.

Going back to our flight booking bot, we would need to retrieve real-time data on when the next flight from SF to NYC is, what seats are still available, and what the price is for the flight. Our APIs can also help us to complete the order and approve a payment. If you are lacking APIs for some of the needed data and functions, you can develop new ones or use screen-scraping techniques to avoid complex development.
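As a sketch of that last step, the fulfillment code might look like the following. The backend client and its methods are hypothetical, standing in for whatever real APIs your booking system exposes; in production, `search` and `book` would wrap authenticated REST calls:

```python
# Hypothetical backend client: in a real bot, these methods would
# wrap REST calls to your booking system's APIs.
class FlightBackend:
    def search(self, departure, destination, date):
        # Placeholder for a real-time availability query.
        return [{"flight": "DL123", "seats": 4, "price": 320}]

    def book(self, flight_id, payment_token):
        # Placeholder for the transaction/payment API call.
        return {"status": "confirmed", "flight": flight_id}

def fulfill_booking(backend, slots, payment_token):
    """Complete a BookFlight intent once all slots are filled."""
    offers = backend.search(slots["departure"],
                            slots["destination"],
                            slots["date"])
    if not offers:
        return {"status": "no_flights"}
    return backend.book(offers[0]["flight"], payment_token)

result = fulfill_booking(FlightBackend(),
                         {"departure": "SF", "destination": "NYC",
                          "date": "tomorrow"},
                         payment_token="tok_demo")
print(result["status"])  # confirmed
```

Injecting the backend as an object (rather than calling APIs inline) also makes the conversational layer testable without live services, which matters when the business logic and the dialog logic evolve at different speeds.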