Book Image

Natural Language Processing with TensorFlow - Second Edition

By : Thushan Ganegedara
2 (1)
Book Image

Natural Language Processing with TensorFlow - Second Edition

2 (1)
By: Thushan Ganegedara

Overview of this book

Learning how to solve natural language processing (NLP) problems is an important skill to master due to the explosive growth of data combined with the demand for machine learning solutions in production. Natural Language Processing with TensorFlow, Second Edition, will teach you how to solve common real-world NLP problems with a variety of deep learning model architectures. The book starts by getting readers familiar with NLP and the basics of TensorFlow. Then, it gradually teaches you different facets of TensorFlow 2.x. In the following chapters, you then learn how to generate powerful word vectors, classify text, generate new text, and generate image captions, among other exciting use-cases of real-world NLP. TensorFlow has evolved to be an ecosystem that supports a machine learning workflow through ingesting and transforming data, building models, monitoring, and productionization. We will then read text directly from files and perform the required transformations through a TensorFlow data pipeline. We will also see how to use a versatile visualization tool known as TensorBoard to visualize our models. By the end of this NLP book, you will be comfortable with using TensorFlow to build deep learning models with many different architectures, and efficiently ingest data using TensorFlow Additionally, you’ll be able to confidently use TensorFlow throughout your machine learning workflow.
Table of Contents (15 chapters)
12
Other Books You May Enjoy
13
Index

The traditional approach to Natural Language Processing

The traditional or classical approach to solving NLP is a sequential flow of several key steps, and it is a statistical approach. When we take a closer look at a traditional NLP learning model, we will be able to see a set of distinct tasks taking place, such as preprocessing data by removing unwanted data, feature engineering to get good numerical representations of textual data, learning to use machine learning algorithms with the aid of training data, and predicting outputs for novel, unseen data. Of these, feature engineering was the most time-consuming and crucial step for obtaining good performance on a given NLP task.

Understanding the traditional approach

The traditional approach to solving NLP tasks involves a collection of distinct subtasks. First, the text corpora need to be preprocessed, focusing on reducing the vocabulary and distractions.

By distractions, I refer to the things that distract the algorithm (for example, punctuation marks and stop word removal) from capturing the vital linguistic information required for the task.

Next come several feature engineering steps. The main objective of feature engineering is to make learning easier for the algorithms. Often the features are hand-engineered and biased toward the human understanding of a language. Feature engineering was of the utmost importance for classical NLP algorithms, and consequently, the best-performing systems often had the best-engineered features. For example, for a sentiment classification task, you can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node/subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database that can provide insights into how different words are related to each other – e.g. synonyms) to develop better features. We will soon look at a simple feature engineering technique known as bag-of-words.

Next, the learning algorithm learns to perform well at the given task using the obtained features and, optionally, the external resources. For example, for a text summarization task, a parallel corpus containing common phrases and succinct paraphrases would be a good external resource. Finally, prediction occurs. Prediction is straightforward, where you will feed a new input and obtain the predicted label by forwarding the input through the learning model. The entire process of the traditional approach is depicted in Figure 1.2:

C:\Users\gauravg\Desktop\14070\CH01\B08681_01_02.png

Figure 1.2: The general approach of classical NLP

Next, let’s discuss a use case where we use NLP to generate football game summaries.

Example – generating football game summaries

To gain an in-depth understanding of the traditional approach to NLP, let’s consider a task of automatic text generation from the statistics of a game of football. We have several sets of game statistics (for example, the score, penalties, and yellow cards) and the corresponding articles generated for that game by a journalist, as the training data. Let’s also assume that for a given game, we have a mapping from each statistical parameter to the most relevant phrase of the summary for that parameter. Our task here is that, given a new game, we need to generate a natural-looking summary of the game. Of course, this can be as simple as finding the best-matching statistics for the new game from the training data and retrieving the corresponding summary. However, there are more sophisticated and elegant ways of generating text.

If we were to incorporate machine learning to generate natural language, a sequence of operations, such as preprocessing the text, feature engineering, learning, and prediction, is likely to be performed.

Preprocessing: The text involves operations, such as tokenization (for example, splitting “I went home” into “I”, “went”, “home”), stemming (for example, converting listened to listen), and removing punctuation (for example, ! and ;), in order to reduce the vocabulary (that is, the features), thus reducing the dimensionality of the data. Tokenization might appear trivial for a language such as English, as the words are isolated; however, this is not the case for certain languages such as Thai, Japanese, and Chinese, as these languages are not consistently delimited. Next, it is important to understand that stemming is not a trivial operation either. It might appear that stemming is a simple operation that relies on a simple set of rules such as removing ed from a verb (for example, the stemmed result of listened is listen); however, it requires more than a simple rule base to develop a good stemming algorithm, as stemming certain words can be tricky (for example, using rule-based stemming, the stemmed result of argued is argu). In addition, the effort required for proper stemming can vary in complexity for other languages.

Feature engineering is used to transform raw text data into an appealing numerical representation so that a model can be trained on that data, for example, converting text into a bag-of-words representation or using n-gram representation, which we will discuss later. However, remember that state-of-the-art classical models rely on much more sophisticated feature engineering techniques.

The following are some of the feature engineering techniques:

Bag-of-words: This is a feature engineering technique that creates feature representations based on the word occurrence frequency. For example, let’s consider the following sentences:

  • Bob went to the market to buy some flowers
  • Bob bought the flowers to give to Mary

The vocabulary for these two sentences would be:

[“Bob”, “went”, “to”, “the”, “market”, “buy”, “some”, “flowers”, “bought”, “give”, “Mary”]

Next, we will create a feature vector of size V (vocabulary size) for each sentence, showing how many times each word in the vocabulary appears in the sentence. In this example, the feature vectors for the sentences would respectively be as follows:

[1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]

[1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]

A crucial limitation of the bag-of-words method is that it loses contextual information as the order of words is no longer preserved.

n-gram: This is another feature engineering technique that breaks down text into smaller components consisting of n letters (or words). For example, 2-gram would break the text into two-letter (or two-word) entities. For example, consider this sentence:

Bob went to the market to buy some flowers

The letter level n-gram decomposition for this sentence is as follows:

[“Bo”, “ob”, “b “, “ w”, “we”, “en”, ..., “me”, “e “,” f”, “fl”, “lo”, “ow”, “we”, “er”, “rs”]

The word-based n-gram decomposition is this:

[“Bob went”, “went to”, “to the”, “the market”, ..., “to buy”, “buy some”, “some flowers”]

The advantage in this representation (letter level) is that the vocabulary will be significantly smaller than if we were to use words as features for large corpora.

Next, we need to structure our data to be able to feed it into a learning model. For example, we will have data tuples of the form (a statistic, a phrase explaining the statistic) as follows:

Total goals = 4, “The game was tied with 2 goals for each team at the end of the first half”

Team 1 = Manchester United, “The game was between Manchester United and Barcelona”

Team 1 goals = 5, “Manchester United managed to get 5 goals”

The learning process may comprise three sub-modules: a Hidden Markov Model (HMM), a sentence planner, and a discourse planner. An HMM is a recurrent model that can be used to solve time-series problems. For example, generating text is a time-series problem as the order of generated words matters. In our example, an HMM might learn to model language (i.e. generate meaningful text) by training on a corpus of statistics and related phrases. We will train the HMM so that it produces a relevant sequence of text, given the statistics as the starting input. Once trained, the HMM can be used for inference in a recursive manner, where we start with a seed (e.g. a statistic) and predict the first word of the description, then use the predicted word to generate the next word, and so on.

Next, we can have a sentence planner that corrects any syntactical or grammatical errors, that might have been introduced by the model. For example, a sentence planner might take in the phrase, I go house and output I go home. For this, it can use a database of rules, which contains the correct way of conveying meanings, such as the need for a preposition between a verb and the word house.

Using the HMM and the sentence planner, we will have syntactically grammatically correct sentences. Then, we need to collate these phrases in such a way that the essay made from the collection of phrases is human readable and flows well. For example, consider the three phrases, Player 10 of the Barcelona team scored a goal in the second half, Barcelona played against Manchester United, and Player 3 from Manchester United got a yellow card in the first half; having these sentences in this order does not make much sense. We like to have them in this order: Barcelona played against Manchester United, Player 3 from Manchester United got a yellow card in the first half, and Player 10 of the Barcelona team scored a goal in the second half. To do this, we use a discourse planner; discourse planners can organize a set of messages so that the meaning of them is conveyed properly.

Now, we can get a set of arbitrary test statistics and obtain an essay explaining the statistics by following the preceding workflow, which is depicted in Figure 1.3:

C:\Users\gauravg\Desktop\14070\CH01\B08681_01_03.png

Figure 1.3: The classical approach to solving a language modeling task

Here, it is important to note that this is a very high-level explanation that only covers the main general-purpose components that are most likely to be included in traditional NLP. The details can largely vary according to the particular application we are interested in solving. For example, additional application-specific crucial components might be needed for certain tasks (a rule base and an alignment model in machine translation). However, in this book, we do not stress about such details as the main objective here is to discuss more modern ways of Natural Language Processing.

Drawbacks of the traditional approach

Let’s list several key drawbacks of the traditional approach as this would lay a good foundation for discussing the motivation for deep learning:

  • The preprocessing steps used in traditional NLP forces a trade-off of potentially useful information embedded in the text (for example, punctuation and tense information) in order to make the learning feasible by reducing the vocabulary. Though preprocessing is still used in modern deep-learning-based solutions, it is not as crucial for them as it is for the traditional NLP workflow due to the large representational capacity of deep networks and their ability to optimize high-end hardware like GPUs.
  • Feature engineering is very labor-intensive. In order to design a reliable system, good features need to be devised. This process can be very tedious as different feature spaces need to be extensively explored and evaluated. Additionally, in order to effectively explore robust features, domain expertise is required, which can be scarce and expensive for certain NLP tasks.
  • Various external resources are needed for it to perform well, and there are not many freely available ones. Such external resources often consist of manually created information stored in large databases. Creating one for a particular task can take several years, depending on the severity of the task (for example, a machine translation rule base).

Now, let’s discuss how deep learning can help to solve NLP problems.