Conversational AI with Rasa

By: Xiaoquan Kong, Guan Wang

Overview of this book

The Rasa framework enables developers to create industrial-strength chatbots using state-of-the-art natural language processing (NLP) and machine learning technologies quickly, all in open source. Conversational AI with Rasa starts by showing you how the two main components at the heart of Rasa work – Rasa NLU (natural language understanding) and Rasa Core. You'll then learn how to build, configure, train, and serve different types of chatbots from scratch by using the Rasa ecosystem. As you advance, you'll use form-based dialogue management, work with the response selector for chitchat and FAQ-like dialogs, make use of knowledge base actions to answer questions for dynamic queries, and much more. Furthermore, you'll understand how to customize the Rasa framework, use conversation-driven development patterns and tools to develop chatbots, explore what your bot can do, and easily fix any mistakes it makes by using interactive learning. Finally, you'll get to grips with deploying the Rasa system to a production environment with high performance and high scalability and cover best practices for building an efficient and robust chat system. By the end of this book, you'll be able to build and deploy your own chatbots using Rasa, addressing the common pain points encountered in the chatbot life cycle.
Table of Contents (16 chapters)

Section 1: The Rasa Framework
Section 2: Rasa in Action
Section 3: Best Practices

Introduction to Natural Language Processing (NLP)

NLP is a subfield of linguistics and ML, concerned with interactions between computers and humans via text or speech.

Let's start with a brief history of NLP.

Evolution of modern NLP

Before 2013, there was no unified method for NLP. This was because two problems had not been solved well.

The first problem relates to how we represent textual information during the computing process.

Time-series data such as voice can be represented as signals and waveforms. Image information is given by pixel positions and pixel values. However, there was no intuitive way to digitize text. There were some preliminary methods, such as one-hot encoding to represent each word or phrase and bag-of-words (BoW) to represent sentences and paragraphs, but it soon became obvious that these were far from perfect.

After one-hot encoding, the dimension of each vector equals the size of the entire vocabulary, with every value set to 0 except a single 1 that marks the position of that word. Such sparse vectors waste a lot of space and, at the same time, carry no information about the semantic meaning of the word itself: any two different words are always orthogonal to each other.

A BoW model simply counts the frequency of each word that appears in the text and ignores the dependency and order of the words in the context.
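
To make the two representations concrete, here is a minimal Python sketch (not from the book; the tiny corpus is invented for illustration) of one-hot and BoW vectors:

import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = sorted({word for sentence in corpus for word in sentence.split()})
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A sparse vector the size of the whole vocabulary: all zeros except a single 1.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1
    return vec

def bag_of_words(sentence):
    # Counts word frequencies and discards word order entirely.
    vec = np.zeros(len(vocab))
    for word in sentence.split():
        vec[word_to_index[word]] += 1
    return vec

print(one_hot("cat"))
print(bag_of_words("the cat sat on the mat"))
# Any two different one-hot vectors are orthogonal: their dot product is 0.
print(np.dot(one_hot("cat"), one_hot("dog")))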

The second problem relates to how we can build models for text.

Traditional methods rely heavily on manually engineered features. For example, we use Term Frequency-Inverse Document Frequency (TF-IDF) to represent the importance of a word with respect to its frequency in both a single article and the whole group of articles. We use topic modeling to tell us the theme of a document and the ratio of different themes in each article, based on statistical information. We also use lots of linguistic information to manually engineer features.
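
As a concrete illustration of one such engineered feature, here is a quick sketch using scikit-learn's TfidfVectorizer (an assumed library choice; the toy documents are invented):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "rasa builds conversational assistants",
    "rasa nlu extracts intents and entities",
    "the weather is nice today",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # shape: (number of docs, vocabulary size)

# A word that appears in many documents (such as "rasa") gets a lower IDF weight
# than a word that is specific to a single document.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))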

Let's take an example from an open source tool called IEPY that is used for relation extraction. Here is a list of the engineered features that IEPY constructs for its relation extraction task:

  • number_of_tokens
  • symbols_in_between
  • in_same_sentence
  • verbs_count
  • verbs_count_in_between
  • total_number_of_entities
  • other_entities_in_between
  • entity_distance
  • entity_order
  • bag_of_wordpos_bigrams_in_between
  • bag_of_wordpos_in_between
  • bag_of_word_bigrams_in_between
  • bag_of_pos_in_between
  • bag_of_words_in_between
  • bag_of_wordpos_bigrams
  • bag_of_wordpos
  • bag_of_word_bigrams
  • bag_of_pos
  • bag_of_words

After obtaining all these features, traditional methods feed them into classical ML algorithms to build models. Taking IEPY as an example again, it provides the following classification models:

  • Stochastic Gradient Descent (SGD)
  • Nearest Neighbors (NN)
  • Support Vector Classification (SVC)
  • Random Forest (RF)
  • Adaptive Boosting (AdaBoost)
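
To show how such features and classifiers fit together, here is a hedged sketch of the traditional pipeline with scikit-learn: TF-IDF features feeding a Support Vector Classification (SVC) model. The texts and labels are invented examples, not IEPY code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

texts = [
    "book me a flight to london",
    "i want to fly to paris tomorrow",
    "what is the weather like today",
    "will it rain this weekend",
]
labels = ["book_flight", "book_flight", "ask_weather", "ask_weather"]

# Hand-crafted features in, a classical classifier out.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)

print(model.predict(["is it going to rain in london"]))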

Traditional NLP applications usually solve real problems in a way very similar to the one shown previously. We will see later that Rasa solves the entity recognition (ER) problem in a similar way. The advantage is that the training process can be really fast and that less labeled data is needed to train a working model. However, this also means that we need to spend a lot of time and effort manually engineering the features and tuning the models, and it does not work well for more complicated contexts.

In 2013, Tomas Mikolov published two research papers that introduced Continuous BoW (CBOW) and Skip-gram models. Soon after that, an open source tool called word2vec was released.

word2vec solves the main issue of our first problem in an elegant way: it trains a shallow neural network on a large text corpus and, by looking at the context of each word, embeds the semantic meaning of each word into a strong and mysterious dense vector, the so-called word embedding. The vector is strong because it captures the semantic meaning of the word itself, so that we can even perform operations such as King - Man + Woman = Queen that were unimaginable before with one-hot encoding. It is also mysterious because we still do not fully understand what the value in each dimension of the word embedding means.
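
Here is a minimal sketch of training word embeddings with the gensim library (an assumed tool; the toy corpus is invented, and a real analogy such as King - Man + Woman = Queen only emerges when training on a large corpus):

from gensim.models import Word2Vec

toy_corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

model = Word2Vec(
    sentences=toy_corpus,
    vector_size=50,   # dimensionality of the dense word embeddings
    window=2,         # context words considered on each side
    min_count=1,
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)

# A dense vector for a single word (compare with the sparse one-hot vector earlier).
print(model.wv["king"].shape)

# The analogy query: vector("king") - vector("man") + vector("woman")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))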

This basically started a new era for NLP. With word2vec, the first step of an NLP pipeline is normally to transform the words into word embeddings. With the help of word embeddings, the popular deep learning (DL) models from computer vision can also be applied to text, and they are gradually replacing traditional ML models. This solves our second problem: how to model text. With word embeddings trained on a large corpus as the input and deep neural networks (DNNs) as the model, this new pipeline became the standard for many NLP tasks.
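
The following is a hedged PyTorch sketch (an assumed framework; the dimensions are arbitrary) of this standard pipeline: token IDs go through an embedding layer, a DNN encodes the sequence, and a final layer produces the prediction:

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=100, num_classes=2):
        super().__init__()
        # In practice, this layer is often initialized with pretrained
        # word2vec embeddings rather than random weights.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, 64, batch_first=True)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.encoder(embedded)  # final hidden state of the LSTM
        return self.classifier(hidden[-1])       # (batch, num_classes)

model = TextClassifier()
fake_batch = torch.randint(0, 10000, (4, 12))    # 4 sentences of 12 token IDs each
print(model(fake_batch).shape)                   # torch.Size([4, 2])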

The invention of word2vec and word embeddings converted the one-hot encoding of words into vectors that are dense, mysterious, elegant, and expressive. It freed NLP from complicated and tedious linguistic feature engineering and pushed techniques such as DL into the NLP domain. This trend of representation learning has gone beyond NLP, into applications such as knowledge graphs (with graph embeddings) and recommendation systems (with user embeddings and item embeddings).

Although word2vec significantly improved NLP tasks, researchers soon discovered its shortcomings: in reality, the same word has different meanings in different contexts (for example, the word "bank" means different things in "riverbank" and "financial bank"), but the vector representation given by word2vec is static, regardless of the context. So, why not give a word an embedding based on its current context? This new technology is known as contextualized word embeddings. Among the early models that introduced contextualized word embeddings is the famous Embeddings from Language Models (ELMo) model. ELMo does not use fixed embeddings for each word but looks at the entire sentence before assigning an embedding to each word. It uses a bi-directional long short-term memory (LSTM) network trained on a specific task to create these embeddings. LSTM is a special recurrent neural network (RNN) that can learn long-term dependencies (cases where there is a large distance between the relevant information and the point where it is needed). It performs well on various problems and has become a core component of DL-based NLP algorithms.

The Transformer model (https://arxiv.org/abs/1706.03762) was released in 2017 and achieved amazing results on machine translation tasks. The Transformer does not use LSTM in its architecture but instead relies heavily on attention mechanisms. An attention mechanism is a function that maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight of each value is computed by a function of the query and that value's corresponding key. Some NLP researchers believe that the attention mechanism used in the Transformer is a better alternative to LSTM: they believe it handles long-term dependencies better than LSTM and has very promising and broad application prospects. The Transformer adopts an encoder-decoder architecture. The encoder and decoder are highly similar in structure but differ in function. The encoder is composed of a stack of N identical encoder layers, and the decoder is composed of a stack of N identical decoder layers; both use the attention mechanism as their core component.
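
The attention function described above can be written down in a few lines. Here is a minimal NumPy sketch of scaled dot-product attention, the variant used in the Transformer paper (the random inputs are only there to show the shapes):

import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)           # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ values                            # weighted sum of the values

# 3 query vectors attending over 4 key-value pairs, all 8-dimensional.
q = np.random.randn(3, 8)
k = np.random.randn(4, 8)
v = np.random.randn(4, 8)
print(scaled_dot_product_attention(q, k, v).shape)     # (3, 8)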

The great success of the Transformer has attracted the interest of many NLP scientists, who have developed further excellent models based on it. Among these models, two are particularly famous and important: Generative Pre-trained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT). GPT is composed entirely of the Transformer's decoder layers, while BERT is composed entirely of the Transformer's encoder layers. The goal of GPT is to produce human-like text. So far, three versions of GPT have been released, namely GPT-1, GPT-2, and GPT-3. The quality of the text generated by GPT-3 is very high, very close to a human level. The goal of BERT is to provide better language representations that help a wide range of downstream tasks (sentence-pair classification, single-sentence classification, question-answering (QA), and single-sentence tagging tasks) achieve better results. On its release, BERT achieved state-of-the-art results on various NLP tasks and greatly improved the industry's best records on many of them. BERT has since spawned a large family of models, among which the better-known ones are XLNet, RoBERTa, ALBERT, ELECTRA, ERNIE, BERT-WWM, and DistilBERT.
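
To see contextualized representations in action, here is a hedged sketch using the Hugging Face transformers library (an assumed tool, not prescribed by the book): the same word "bank" receives different vectors from BERT depending on the sentence it appears in:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["He sat on the river bank.", "She deposited cash at the bank."]
bank_vectors = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the token "bank" and keep its contextualized hidden state.
    bank_position = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank")
    )
    bank_vectors.append(outputs.last_hidden_state[0, bank_position])

# The two "bank" vectors differ because their contexts differ.
print(torch.nn.functional.cosine_similarity(bank_vectors[0], bank_vectors[1], dim=0))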

We have now seen how modern NLP evolved. In the next section, we will discuss the different types of tasks in NLP.

Basic tasks of NLP

The highly efficient embedding representations of words, phrases, and sentences reduce the heavy workload of feature engineering and open the door to a series of downstream NLP applications.

If we consider texts as sequences and different kinds of labels as categories, then the basic tasks of NLP can be categorized into the following groups with regard to the I/O data structures:

  • From categories to sequences: Examples include text generation and picture-caption generation.
  • From sequences to categories: Examples include text classification, sentiment analysis, and relation extraction. If the goal of text classification is to classify text according to the intent of the text, this is an intent classification task. An intent classification task is one of the two important parts of natural language understanding (NLU), which will be introduced in the next section. Common sequence-to-category algorithms include TextCNN, TextRNN, Transformers, and their variants. Although different algorithms have different structures, in general, a sequence-to-category algorithm extracts the semantics of the sequence (the text) into a vector and then classifies that vector into a category.
  • Synchronous sequence to sequence (Seq2Seq): Examples include tokenization, part-of-speech (POS) tagging, semantic role labeling, and named ER (NER). NER is the other important part of NLU, besides intent classification. Common synchronous Seq2Seq algorithms include Conditional Random Fields (CRF), Bidirectional LSTM (BiLSTM)-CRF, Transformers, and their variants. Although the various algorithms work differently, the most common and classic algorithms in production are based on sequence annotation; that is, each element in the sequence is classified one by one, and finally, the classification results of all the elements are combined into another sequence (see the sketch after this list).
  • Asynchronous Seq2Seq: Examples include machine translation, automatic summarization, and keyboard input methods.
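
As promised above, here is a small Python sketch of sequence annotation for NER: each token is classified one by one (using the common BIO tagging scheme), and the per-token labels are then combined back into entity spans. The sentence and labels are invented for illustration:

tokens = ["book", "a", "flight", "from", "london", "to", "new", "york"]
tags   = ["O",    "O", "O",      "O",    "B-city", "O",  "B-city", "I-city"]

def extract_entities(tokens, tags):
    # Combine the per-token classification results back into entity spans.
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = {"entity": tag[2:], "words": [token]}
            entities.append(current)
        elif tag.startswith("I-") and current is not None:
            current["words"].append(token)
        else:
            current = None
    return [{"entity": e["entity"], "value": " ".join(e["words"])} for e in entities]

print(extract_entities(tokens, tags))
# [{'entity': 'city', 'value': 'london'}, {'entity': 'city', 'value': 'new york'}]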

We will see that in building chatbots, the intent recognition task is a sequence-to-category task, while ER is a synchronous Seq2Seq task. Automatic speech recognition (ASR) can generally be considered a synchronous sequence (voice signals) to sequence (text) task, as can Text to Speech (TTS), but from text to voice signals. Dialogue management (DM) can generally be considered an asynchronous sequence (conversation history) to category (next action) task.

Let's talk more about chatbots.