Transformers for Natural Language Processing

By: Denis Rothman

Overview of this book

The transformer architecture has proved to be revolutionary in outperforming the classical RNN and CNN models in use today. With an apply-as-you-learn approach, Transformers for Natural Language Processing investigates in vast detail the deep learning for machine translations, speech-to-text, text-to-speech, language modeling, question answering, and many more NLP domains with transformers. The book takes you through NLP with Python and examines various eminent models and datasets within the transformer architecture created by pioneers such as Google, Facebook, Microsoft, OpenAI, and Hugging Face. The book trains you in three stages. The first stage introduces you to transformer architectures, starting with the original transformer, before moving on to RoBERTa, BERT, and DistilBERT models. You will discover training methods for smaller transformers that can outperform GPT-3 in some cases. In the second stage, you will apply transformers for Natural Language Understanding (NLU) and Natural Language Generation (NLG). Finally, the third stage will help you grasp advanced language understanding techniques such as optimizing social network datasets and fake news identification. By the end of this NLP book, you will understand transformers from a cognitive science perspective and be proficient in applying pretrained transformer models by tech giants to various datasets.

The architecture of BERT

BERT introduces bidirectional attention to transformer models. Bidirectional attention requires many other changes to the original Transformer model.

We will not go through the building blocks of transformers described in Chapter 1, Getting Started with the Model Architecture of the Transformer. You can consult Chapter 1 at any time to review an aspect of the building blocks of transformers. In this section, we will focus on the specific aspects of BERT models.

We will focus on the evolutions designed by Devlin et al. (2018), which describe the encoder stack.

We will first go through the encoder stack, then the preparation of the pretraining input environment. Then we will describe the two-step framework of BERT: pretraining and fine-tuning.

Let's first explore the encoder stack.

The encoder stack

The first building block we will take from the original Transformer model is an encoder layer. The encoder layer as described in Chapter 1, Getting Started with the Model Architecture of the Transformer, is shown in Figure 2.1:

Figure 2.1: The encoder layer

The BERT model does not use decoder layers: it has an encoder stack but no decoder stack. The masking of the tokens to predict therefore takes place in the attention layers of the encoder, as we will see when we zoom into a BERT encoder layer in the following sections.

The original Transformer contains a stack of N=6 layers. The number of dimensions of the original Transformer is d_model = 512. The number of attention heads of the original Transformer is A=8. The dimension of each head of the original Transformer is:

d_k = d_model / A = 512 / 8 = 64

BERT encoder layers are larger than those of the original Transformer model.

Two BERT models can be built with the encoder layers:

BERT_BASE, which contains a stack of N=12 encoder layers. d_model = 768 and can also be expressed as H=768, as in the BERT paper. A multi-head attention sub-layer contains A=12 heads. The dimension of each head z_A remains 64, as in the original Transformer model:

d_k = d_model / A = 768 / 12 = 64

The output of each multi-head attention sub-layer before concatenation will be the output of the 12 heads:

output_multi-head_attention={z₀, z₁, z₂,…,z₁₁}

BERT_LARGE, which contains a stack of N=24 encoder layers. d_model = 1024. A multi-head attention sub-layer contains A=16 heads. The dimension of each head z_A also remains 64, as in the original Transformer model:

d_k = d_model / A = 1024 / 16 = 64

The output of each multi-head attention sub-layer before concatenation will be the output of the 16 heads:

output_multi-head_attention={z₀, z₁, z₂,…,z₁₅}

The sizes of the models can be summed up as follows:

Figure 2.2: Transformer models
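
The sizes listed above can be checked with a few lines of Python. This is only a minimal sketch: the `TransformerConfig` class and its field names are illustrative and do not belong to any library. The values are the ones given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Illustrative container for the sizes discussed in this section."""
    name: str
    num_layers: int  # N
    d_model: int     # H in the BERT paper
    num_heads: int   # A

    @property
    def head_dim(self) -> int:
        # Each head works on an equal slice of the model dimension.
        if self.d_model % self.num_heads != 0:
            raise ValueError("d_model must be divisible by the number of heads")
        return self.d_model // self.num_heads


CONFIGS = [
    TransformerConfig("Original Transformer", 6, 512, 8),
    TransformerConfig("BERT_BASE", 12, 768, 12),
    TransformerConfig("BERT_LARGE", 24, 1024, 16),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: N={cfg.num_layers}, d_model={cfg.d_model}, "
          f"A={cfg.num_heads}, head dimension={cfg.head_dim}")
```

All three configurations report a head dimension of 64: the BERT models grow by adding layers and heads, not by widening each head.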

Size and dimensions play an essential role in BERT-style pretraining. BERT models are like humans in this respect: they produce better results with more working memory (dimensions) and more knowledge (data). Large transformer models trained on large amounts of data will pretrain better for downstream NLP tasks.

Let's now go to the first sub-layer and see the fundamental aspects of input embedding and positional encoding in a BERT model.

Preparing the pretraining input environment

The BERT model has no decoder stack of layers. As such, it does not have a masked multi-head attention sub-layer. The authors of BERT go further and argue that a masked multi-head attention layer that masks the rest of the sequence impedes the attention process.

A masked multi-head attention layer masks all of the tokens that are beyond the present position. For example, take the following sentence:

The cat sat on it because it was a nice rug.

If we have just reached the word "it," the sequence visible to the model would be:

The cat sat on it<masked sequence>

The motivation of this approach is to prevent the model from seeing the output it is supposed to predict. This left-to-right approach produces relatively good results.

However, the model cannot learn much this way. To know what "it" refers to, we need to see the whole sentence to reach the word "rug" and figure out that "it" was the rug.

The authors of BERT came up with an idea. Why not pretrain the model to make predictions using a different approach?

The authors of BERT came up with bidirectional attention, letting an attention head attend to all of the words both from left to right and right to left. In other words, the self-attention mask of an encoder could do the job without being hindered by the masked multi-head attention sub-layer of the decoder.
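
The difference between the two approaches can be sketched as attention masks. The following is a simplified illustration, not BERT's implementation: the function names are hypothetical, and a value of 1 means "position i may attend to position j."

```python
def causal_mask(seq_len):
    """Left-to-right mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * seq_len for _ in range(seq_len)]


tokens = "The cat sat on it because it was a nice rug .".split()
n = len(tokens)
it_index = tokens.index("it")    # first occurrence of "it"
rug_index = tokens.index("rug")

causal = causal_mask(n)
bidirectional = bidirectional_mask(n)

# With a causal mask, "it" cannot attend to "rug", which comes later.
print("causal:", causal[it_index][rug_index])
# With bidirectional attention, "it" can attend to "rug".
print("bidirectional:", bidirectional[it_index][rug_index])
```

With the causal mask, the entry linking "it" to "rug" is 0, so the model cannot use "rug" to resolve "it." With the bidirectional mask, the entry is 1.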

The model was trained on two tasks. The first task is Masked Language Modeling (MLM). The second task is Next Sentence Prediction (NSP).

Let's start with masked language modeling.

Masked language modeling

Masked language modeling does not require training a model with a sequence of visible words followed by a masked sequence to predict.

BERT introduces the bidirectional analysis of a sentence with a random mask on a word of the sentence.

It is important to note that BERT tokenizes its inputs with WordPiece, a sub-word segmentation method. It also uses learned positional encoding, not the sine-cosine approach.

A potential input sequence could be:

"The cat sat on it because it was a nice rug."

The decoder would mask the attention sequence after the model reached the word "it":

"The cat sat on it <masked sequence>."

But the BERT encoder masks a random token to make a prediction:

"The cat sat on it [MASK] it was a nice rug."

The multi-head attention sub-layer can now see the whole sequence, run the self-attention process, and predict the masked token.

BERT selects 15% of the input tokens at random for prediction (Devlin et al., 2018). Each selected token is then handled in one of three ways, which forces the model to train longer but produces better results:

Surprise the model by leaving the selected token unchanged 10% of the time; for example:
```
"The cat sat on it [because] it was a nice rug."
```

Surprise the model by replacing the selected token with a random token 10% of the time; for example:
```
"The cat sat on it [often] it was a nice rug."
```

Replace the selected token with a [MASK] token 80% of the time; for example:
```
"The cat sat on it [MASK] it was a nice rug."
```

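This mix limits the mismatch between pretraining and fine-tuning, where the [MASK] token never appears, and forces the model to keep a useful representation of every input token, not only of the masked positions.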
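
The three masking cases can be sketched in plain Python. This is a simplified illustration under stated assumptions: `mask_tokens` is a hypothetical helper, the toy vocabulary is invented for the example, and each position is selected independently with probability 0.15. It is not BERT's actual preprocessing code.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels) following the 80/10/10 rule.

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = "the cat sat on it because it was a nice rug".split()
vocab = sorted(set(tokens) | {"often", "dog", "mat"})  # toy vocabulary (assumption)
rng = random.Random(0)
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```

The model is trained to recover `labels[i]` at every selected position, whether the input there shows `[MASK]`, a random token, or the original token.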

BERT was also trained to perform next sentence prediction.

Next sentence prediction

The second task used to train BERT is Next Sentence Prediction (NSP). The input contains two sentences.

Two new tokens were added:

[CLS] is a binary classification token added to the beginning of the first sequence to predict if the second sequence follows the first sequence. A positive sample is usually a pair of consecutive sentences taken from a dataset. A negative sample is created using sequences from different documents.
[SEP] is a separation token that signals the end of a sequence.
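
The construction of positive and negative samples can be sketched as follows. This is a minimal illustration: `make_nsp_pairs` is a hypothetical helper, the two toy documents are invented, and the 50/50 split between positive and negative samples follows the description in the BERT paper.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (sentence_a, sentence_b, is_next) triples.

    documents: list of documents, each a list of sentences (strings).
    At least two non-empty documents are required to draw negative samples.
    """
    pairs = []
    for doc_index, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive sample: the sentence that actually follows.
                pairs.append((doc[i], doc[i + 1], True))
            else:
                # Negative sample: a sentence drawn from a different document.
                other_docs = [d for j, d in enumerate(documents)
                              if j != doc_index and d]
                other = rng.choice(other_docs)
                pairs.append((doc[i], rng.choice(other), False))
    return pairs


documents = [
    ["The cat slept on the rug.", "It likes sleeping all day."],
    ["The market opened higher.", "Investors were optimistic."],
]
for pair in make_nsp_pairs(documents, random.Random(0)):
    print(pair)
```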

For example, the input sentences taken out of a book could be:

"The cat slept on the rug. It likes sleeping all day."

These two sentences would become one input complete sequence:

[CLS] the cat slept on the rug [SEP] it likes sleep ##ing all day [SEP]

This approach requires additional encoding information to distinguish sequence A from sequence B.
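
The additional information can be pictured as a segment id attached to every token. The sketch below uses a hypothetical helper, `build_pair_input`, and assumes the sentences are already split into WordPiece tokens as in the example above.

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens_a = ["the", "cat", "slept", "on", "the", "rug"]
tokens_b = ["it", "likes", "sleep", "##ing", "all", "day"]
tokens, segment_ids, position_ids = build_pair_input(tokens_a, tokens_b)

for token, segment, position in zip(tokens, segment_ids, position_ids):
    print(f"{position:2d}  segment {segment}  {token}")
```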

If we put the whole embedding process together, we obtain:

Figure 2.3: Input embeddings

The input embeddings are obtained by summing the token embeddings, the segment (sentence, phrase, word) embeddings, and the positional encoding embeddings.
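
The summation can be sketched with toy lookup tables. This is an illustration under stated assumptions: the tables hold random values standing in for learned parameters, the sizes and the vocabulary are invented, and the sketch omits the layer normalization and dropout that implementations typically apply after the sum.

```python
import random

rng = random.Random(0)
HIDDEN = 4          # toy size; BERT_BASE uses 768
MAX_POSITIONS = 16  # toy size (assumption)


def random_table(num_rows, dim):
    """Stand-in for a learned embedding table: one vector per row."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


vocab = ["[CLS]", "[SEP]", "[MASK]", "the", "cat", "slept", "it", "likes"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

token_table = random_table(len(vocab), HIDDEN)
segment_table = random_table(2, HIDDEN)               # sentence A / sentence B
position_table = random_table(MAX_POSITIONS, HIDDEN)  # learned, not sine-cosine

tokens = ["[CLS]", "the", "cat", "slept", "[SEP]", "it", "likes", "[SEP]"]
segment_ids = [0, 0, 0, 0, 0, 1, 1, 1]

input_embeddings = []
for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
    tok_vec = token_table[token_to_id[token]]
    seg_vec = segment_table[segment]
    pos_vec = position_table[position]
    # Element-wise sum of the three embeddings for this position.
    input_embeddings.append([t + s + p for t, s, p in zip(tok_vec, seg_vec, pos_vec)])

print(len(input_embeddings), len(input_embeddings[0]))
```

The two [SEP] tokens share the same token embedding but receive different input embeddings, because their segment and position embeddings differ.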

The input embedding and positional encoding sub-layer of a BERT model can be summed up as follows:

A sequence of words is broken down into WordPiece tokens.
A [MASK] token will randomly replace the initial word tokens for masked language modeling training.
A [CLS] classification token is inserted at the beginning of a sequence for classification purposes.
A [SEP] token separates two sentences (segments, phrases) for NSP training.
Sentence embedding is added to token embedding, so that sentence A has a different sentence embedding value than sentence B.
Positional encoding is learned. The sine-cosine positional encoding method of the original Transformer is not applied.

Some additional key features are:

BERT uses bidirectional attention in all of its multi-head attention sub-layers, opening vast horizons of learning and understanding relationships between tokens.
BERT introduces unsupervised pretraining on unlabeled text. The model has to infer the masked tokens from their context during the multi-head attention learning process. BERT thus learns how languages are built and applies this knowledge to downstream tasks without having to be pretrained each time.
BERT also uses supervised learning when it is fine-tuned on labeled downstream datasets, so the framework covers both kinds of training.

BERT has improved the training environment of transformers. Let's now see the motivation of pretraining and how it helps the fine-tuning process.

Pretraining and fine-tuning a BERT model

BERT is a two-step framework. The first step is the pretraining, and the second is fine-tuning, as shown in Figure 2.4:

Figure 2.4: The BERT framework

Training a transformer model can take hours, if not days. It takes quite some time to engineer the architecture and parameters, and select the proper datasets to train a transformer model.

Pretraining is the first step of the BERT framework that can be broken down into two sub-steps:

Defining the model's architecture: number of layers, number of heads, dimensions, and the other building blocks of the model
Training the model on Masked Language Modeling (MLM) and NSP tasks

The second step of the BERT framework is fine-tuning, which can also be broken down into two sub-steps:

Initializing the downstream model chosen with the trained parameters of the pretrained BERT model
Fine-tuning the parameters for specific downstream tasks such as Recognizing Textual Entailment (RTE), Question Answering (SQuAD v1.1, SQuAD v2.0), and Situations With Adversarial Generations (SWAG)

In this section, we covered the information we need to fine-tune a BERT model. In the following chapters, we will explore the topics we brought up in this section in more depth:

In Chapter 3, Pretraining a RoBERTa Model from Scratch, we will pretrain a BERT-like model from scratch in 15 steps. We will even compile our own data, train a tokenizer, and then train the model. The goal of the present chapter, in contrast, is to first go through the specific building blocks of BERT and then fine-tune an existing model.
In Chapter 4, Downstream NLP Tasks with Transformers, we will go through many downstream NLP tasks, exploring GLUE, SQuAD v1.1, SQuAD, SWAG, BLEU, and several other NLP evaluation datasets and metrics. We will run several downstream transformer models to illustrate key tasks. The goal of the present chapter is limited to fine-tuning one downstream model.
In Chapter 6, Text Generation with OpenAI GPT-2 and GPT-3 Models, we will explore the architecture and usage of OpenAI GPT, GPT-2, and GPT-3 transformers. BERT_BASE was configured to be close to OpenAI GPT to show that it produced better performance. However, OpenAI transformers keep evolving too! We will see how.

In this chapter, the BERT model we will fine-tune will be trained on The Corpus of Linguistic Acceptability (CoLA). The downstream task is based on Neural Network Acceptability Judgments by Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman.

We will fine-tune a BERT model that will determine the grammatical acceptability of a sentence. The fine-tuned model will have acquired a certain level of linguistic competence.

We have gone through the BERT architecture and its pretraining and fine-tuning framework. Let's now fine-tune a BERT model.