The traditional approach to Natural Language Processing
The traditional, or classical, approach to solving NLP tasks is a sequential flow of several key steps, and it is statistical in nature. A closer look at a traditional NLP learning model reveals a set of distinct tasks: preprocessing the data by removing unwanted content, feature engineering to obtain good numerical representations of the text, learning with machine learning algorithms using training data, and predicting outputs for novel, unseen data. Of these, feature engineering was the most time-consuming and crucial step for obtaining good performance on a given NLP task.
Understanding the traditional approach
The traditional approach to solving NLP tasks involves a collection of distinct subtasks. First, the text corpora need to be preprocessed, with a focus on reducing the vocabulary and removing distractions. By distractions, I refer to the things that distract the algorithm (for example, punctuation marks and stop words) from capturing the vital linguistic information required for the task.
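As a minimal sketch of this preprocessing step, the following Python snippet strips punctuation and stop words from a sentence. The stop word list and the function name preprocess are assumptions made for illustration; practical systems use much larger, task-specific lists.

```python
import string

# Hypothetical, deliberately tiny stop word list (an assumption for this sketch).
STOP_WORDS = {"the", "to", "a", "an", "of", "and"}

def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punctuation.lower().split()
            if word not in STOP_WORDS]

print(preprocess("Bob went to the market, to buy some flowers!"))
# ['bob', 'went', 'market', 'buy', 'some', 'flowers']
```

The result has fewer distinct tokens than the raw sentence, which is the vocabulary reduction described above.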
Next come several feature engineering steps. The main objective of feature engineering is to make learning easier for the algorithms. Often the features are hand-engineered and biased toward the human understanding of a language. Feature engineering was of utmost importance for classical NLP algorithms, and consequently, the best-performing systems often had the best-engineered features. For example, for a sentiment classification task, you can represent a sentence with a parse tree and assign positive, negative, or neutral labels to each node/subtree in the tree to classify that sentence as positive or negative. Additionally, the feature engineering phase can use external resources such as WordNet (a lexical database) to develop better features. We will soon look at a simple feature engineering technique known as bag-of-words.
Next, the learning algorithm learns to perform well at the given task using the obtained features and, optionally, the external resources. For example, for a text summarization task, a thesaurus that contains synonyms of words can be a good external resource. Finally, prediction occurs. Prediction is straightforward: you feed in a new input and obtain the predicted label by forwarding the input through the learned model. The entire process of the traditional approach is depicted in Figure 1.2:
Example – generating football game summaries
To gain an in-depth understanding of the traditional approach to NLP, let's consider the task of automatically generating text from the statistics of a game of football. As training data, we have several sets of game statistics (for example, score, penalties, and yellow cards) and the corresponding articles written about each game by a journalist. Let's also assume that, for a given game, we have a mapping from each statistical parameter to the most relevant phrase of the summary for that parameter. Our task is, given a new game, to generate a natural-looking summary of the game. Of course, this can be as simple as finding the best-matching statistics for the new game in the training data and retrieving the corresponding summary. However, there are more sophisticated and elegant ways of generating text.
If we were to incorporate machine learning to generate natural language, a sequence of operations such as preprocessing the text, tokenization, feature engineering, learning, and prediction is likely to be performed.
Preprocessing the text involves operations such as stemming (for example, converting listened to listen) and removing punctuation (for example, ! and ;) in order to reduce the vocabulary (that is, the features), thus reducing the memory requirement. It is important to understand that stemming is not a trivial operation. It might appear that stemming relies on a simple set of rules such as removing ed from a verb (for example, the stemmed result of listened is listen); however, developing a good stemming algorithm requires more than a simple rule base, as stemming certain words can be tricky (for example, the stemmed result of argued is argue). In addition, the effort required for proper stemming varies from one language to another.
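To see why a simple rule base is not enough, consider the following sketch of a naive suffix-stripping stemmer. The function naive_stem is a hypothetical illustration and not an established stemming algorithm.

```python
def naive_stem(word):
    """Strip a trailing 'ed' or 'ing' while keeping at least three characters."""
    for suffix in ("ed", "ing"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(naive_stem("listened"))  # listen
print(naive_stem("argued"))    # argu, whereas the desired stem is argue
```

The single rule handles listened correctly but fails on argued, so a practical stemmer needs additional rules and exceptions.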
Tokenization is another preprocessing step that might need to be performed. Tokenization is the process of dividing a corpus into small entities (for example, words). This might appear trivial for a language such as English, as the words are isolated; however, this is not the case for certain languages such as Thai, Japanese, and Chinese, as these languages are not consistently delimited.
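The following sketch illustrates the difference, assuming a simple regular-expression tokenizer for English. The function name tokenize and the Japanese example sentence are assumptions added for illustration.

```python
import re

def tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Bob went to the market!"))
# ['Bob', 'went', 'to', 'the', 'market', '!']

# A Japanese sentence has no spaces between words, so splitting on
# whitespace returns the whole sentence as a single token.
print("私は市場に行きました".split())
# ['私は市場に行きました']
```

For languages that are not delimited by spaces, tokenization therefore requires a dedicated word segmentation step rather than a split on whitespace.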
Feature engineering is used to transform raw text data into a suitable numerical format so that a model can be trained on that data, for example, by converting text into a bag-of-words representation or an n-gram representation, both of which are discussed next. However, remember that state-of-the-art classical models rely on much more sophisticated feature engineering techniques.
The following are some of the feature engineering techniques:
Bag-of-words: This is a feature engineering technique that creates feature representations based on the word occurrence frequency. For example, let's consider the following sentences:
Bob went to the market to buy some flowers
Bob bought the flowers to give to Mary
The vocabulary for these two sentences would be:
["Bob", "went", "to", "the", "market", "buy", "some", "flowers", "bought", "give", "Mary"]
Next, we will create a feature vector of size V (vocabulary size) for each sentence showing how many times each word in the vocabulary appears in the sentence. In this example, the feature vectors for the sentences would respectively be as follows:
[1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
[1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]
A crucial limitation of the bag-of-words method is that it loses contextual information as the order of words is no longer preserved.
n-gram: This is another feature engineering technique that breaks down text into smaller components consisting of n letters (or words). For example, 2-gram would break the text into two-letter (or two-word) entities. For example, consider this sentence:
Bob went to the market to buy some flowers
The letter level n-gram decomposition for this sentence is as follows:
["Bo", "ob", "b ", " w", "we", "en", ..., "me", "e ", " f", "fl", "lo", "ow", "we", "er", "rs"]
The word-based n-gram decomposition is this:
["Bob went", "went to", "to the", "the market", ..., "to buy", "buy some", "some flowers"]
The advantage of the letter-level representation is that, for large corpora, the vocabulary will be significantly smaller than if we were to use words as features.
Next, we need to structure our data to be able to feed it into a learning model. For example, we will have data tuples of the form (statistic, a phrase explaining the statistic), as follows:
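Both techniques can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization; the function names are hypothetical and chosen for this example.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Collect the unique words in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

def letter_ngrams(text, n=2):
    """Slide a window of n characters over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n=2):
    """Slide a window of n words over the text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentences = [
    "Bob went to the market to buy some flowers",
    "Bob bought the flowers to give to Mary",
]
vocabulary = build_vocabulary(sentences)
print(bag_of_words(sentences[0], vocabulary))  # [1, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0]
print(bag_of_words(sentences[1], vocabulary))  # [1, 0, 2, 1, 0, 0, 0, 1, 1, 1, 1]
print(letter_ngrams("Bob went"))  # ['Bo', 'ob', 'b ', ' w', 'we', 'en', 'nt']
print(word_ngrams("Bob went to the market"))
# ['Bob went', 'went to', 'to the', 'the market']
```

The printed feature vectors match the ones listed in the bag-of-words example. Note that both sentences would produce the same vectors if their words were shuffled, which is the loss of word order mentioned earlier.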
Total goals = 4, "The game was tied with 2 goals for each team at the end of the first half"
Team 1 = Manchester United, "The game was between Manchester United and Barcelona"
Team 1 goals = 5, "Manchester United managed to get 5 goals"
The learning process may comprise three submodules: a Hidden Markov Model (HMM), a sentence planner, and a discourse planner. In our example, an HMM might learn the morphological structure and grammatical properties of the language by analyzing the corpus of related phrases. More specifically, we will concatenate each statistic with its phrase to form a sequence, where the first element is the statistic, followed by the phrase explaining it. Then, we will train an HMM by asking it to predict the next word, given the current sequence. Concretely, we will first input the statistic to the HMM and get the prediction made by the HMM; then, we will concatenate the last prediction to the current sequence and ask the HMM to give another prediction, and so on. This will enable the HMM to output meaningful phrases, given statistics.
Next, we can have a sentence planner that corrects any linguistic mistakes (for example, morphological or grammatical errors) that we might have in the phrases. For example, a sentence planner would output the phrase I go house as I go home; it can use a database of rules that contains the correct way of conveying meanings (for example, the need for a preposition between a verb and the word house).
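The predict-and-append loop described here can be illustrated with a much simpler stand-in. The sketch below is not an HMM: it is a first-order Markov chain over observed words with no hidden states, used only to show the generation loop. The statistic tokens such as <team_1_goals> and the end marker are hypothetical.

```python
from collections import Counter, defaultdict

END = "<end>"  # hypothetical end-of-sequence marker

def train_bigram_model(sequences):
    """Count, for each token, which tokens follow it in the training data."""
    transitions = defaultdict(Counter)
    for tokens in sequences:
        for current, following in zip(tokens, tokens[1:] + [END]):
            transitions[current][following] += 1
    return transitions

def generate(transitions, start, max_length=20):
    """Repeatedly append the most frequent next token until END is predicted."""
    sequence = [start]
    while len(sequence) < max_length:
        candidates = transitions.get(sequence[-1])
        if not candidates:
            break
        next_token = candidates.most_common(1)[0][0]
        if next_token == END:
            break
        sequence.append(next_token)
    return sequence

# Each training sequence is a statistic token followed by its phrase.
training_data = [
    ("<team_1_goals>", "Manchester United managed to get 5 goals"),
    ("<teams>", "The game was between Manchester United and Barcelona"),
]
sequences = [[statistic] + phrase.split() for statistic, phrase in training_data]
model = train_bigram_model(sequences)

generated = generate(model, "<team_1_goals>")
print(" ".join(generated[1:]))  # drop the statistic token before printing
```

Because this stand-in only looks at the previous word, phrases that share a word (here, United) can blend into each other during generation, which is one reason practical systems rely on richer models.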
Now we can generate a set of phrases for a given set of statistics using an HMM. Then, we need to aggregate these phrases in such a way that an essay made from the collection of phrases is human readable and flows correctly. For example, consider the three phrases, Player 10 of the Barcelona team scored a goal in the second half, Barcelona played against Manchester United, and Player 3 from Manchester United got a yellow card in the first half; having these sentences in this order does not make much sense. We would like to have them in this order: Barcelona played against Manchester United, Player 3 from Manchester United got a yellow card in the first half, and Player 10 of the Barcelona team scored a goal in the second half. To do this, we use a discourse planner; discourse planners can order and structure a set of messages that need to be conveyed.
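As a toy illustration of the ordering step, the sketch below sorts the three phrases using a hand-written rule: the match overview comes first, followed by events in chronological order. The kind and half annotations, and the rule itself, are assumptions made for this example; an actual discourse planner is considerably more involved.

```python
# Hypothetical annotations: each message carries a kind and the half it refers to.
messages = [
    {"kind": "event", "half": 2,
     "text": "Player 10 of the Barcelona team scored a goal in the second half"},
    {"kind": "overview", "half": 0,
     "text": "Barcelona played against Manchester United"},
    {"kind": "event", "half": 1,
     "text": "Player 3 from Manchester United got a yellow card in the first half"},
]

# Assumed rule: overviews come before events.
KIND_ORDER = {"overview": 0, "event": 1}

def plan_discourse(messages):
    """Order messages by kind first, then by the half they refer to."""
    return sorted(messages, key=lambda m: (KIND_ORDER[m["kind"]], m["half"]))

ordered = plan_discourse(messages)
for message in ordered:
    print(message["text"])
```

Running the sketch prints the overview sentence first, then the first-half event, then the second-half event, which is the order preferred in the example above.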
Now we can get a set of arbitrary test statistics and obtain an essay explaining the statistics by following the preceding workflow, which is depicted in Figure 1.3:
Here, it is important to note that this is a very high-level explanation that covers only the main general-purpose components most likely to be included in the traditional approach to NLP. The details can vary considerably according to the particular application we are interested in solving. For example, additional crucial, application-specific components might be needed for certain tasks (such as a rule base and an alignment model in machine translation). However, in this book, we do not dwell on such details, as the main objective here is to discuss more modern ways of natural language processing.
Drawbacks of the traditional approach
Let's list several key drawbacks of the traditional approach as this would lay a good foundation for discussing the motivation for deep learning:
The preprocessing steps used in traditional NLP force a trade-off: potentially useful information embedded in the text (for example, punctuation and tense information) is discarded in order to make learning feasible by reducing the vocabulary. Though preprocessing is still used in modern deep-learning-based solutions, it is not as crucial as it is for the traditional NLP workflow, due to the large representational capacity of deep networks.
Feature engineering needs to be performed by hand. In order to design a reliable system, good features need to be devised. This process can be very tedious, as different feature spaces need to be extensively explored. Additionally, in order to effectively explore robust features, domain expertise is required, which can be scarce for certain NLP tasks.
Various external resources are needed for the traditional approach to perform well, and not many of them are freely available. Such external resources often consist of manually created information stored in large databases. Creating one for a particular task can take several years, depending on the complexity of the task (for example, a machine translation rule base).