Word Embeddings

As mentioned in the earlier sections of this chapter, natural language processing prepares textual data for machine learning and deep learning models. Models perform most efficiently when given numerical input, so a key role of natural language processing is to transform preprocessed textual data into a numerical representation.

This is what word embeddings are: numerical representations of text in the form of real-valued vectors. Words with similar meanings map to similar vectors and thus have similar representations. This helps the machine learn the meaning and context of different words. Since word embeddings are vectors that map to individual words, they can only be generated once tokenization has been performed on the corpus.

Figure 1.17: Example for word embeddings

Word embeddings encompass a variety of techniques used to create a learned numerical representation and are the most popular way to represent a document's vocabulary. The beneficial aspect of word embeddings is that they are able to capture contextual, semantic, and syntactic similarities, and the relations of a word with other words, to effectively train the machine to comprehend natural language. This is the main aim of word embeddings – to form clusters of similar vectors that correspond to words with similar meanings.

The reason for using word embeddings is to make machines understand synonyms the same way we do. Consider an example of online restaurant reviews – they consist of adjectives describing food, ambience, and the overall experience. They are either positive or negative, and comprehending which reviews fall into which of these two categories is important. The automatic categorization of these reviews can provide a restaurant with quick insights as to what areas they need to improve on, what people liked about their restaurant, and so on.

There exist a variety of adjectives that can be classified as positive, and the same goes for negative adjectives. Thus, not only does the machine need to be able to differentiate between negative and positive, it also needs to learn and understand that multiple words can relate to the same category because they ultimately mean the same thing. This is where word embeddings are helpful.

Consider the example of restaurant reviews received on a food service application. The following two sentences are from two separate restaurant reviews:

  • Sentence A – The food here was great.
  • Sentence B – The food here was good.

The machine needs to be able to comprehend that both these reviews are positive and mean a similar thing, despite the adjectives in the two sentences being different. This is done by creating word embeddings, because the two words 'good' and 'great' map to two separate but similar real-valued vectors and can thus be clustered together.
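As a rough illustration of this idea, the sketch below uses made-up three-dimensional vectors for 'good', 'great', and 'terrible' (real embeddings are learned from data and typically have a hundred or more dimensions) and measures how close they are using cosine similarity:

import numpy as np

# Hypothetical, hand-picked vectors purely for illustration;
# real word embeddings are learned from a corpus and are much longer.
good = np.array([0.9, 0.1, 0.3])
great = np.array([0.8, 0.2, 0.3])
terrible = np.array([-0.7, 0.9, -0.2])

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; values near 1 mean very similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(good, great))     # high: 'good' and 'great' cluster together
print(cosine_similarity(good, terrible))  # low: dissimilar meanings sit far apart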

The Generation of Word Embeddings

We've understood what word embeddings are and their importance; now we need to understand how they're generated. The process of transforming words into their real-value vectors is known as vectorization and is done by word embedding techniques. There are many word embedding techniques available, but in this chapter, we will be discussing the two main ones – Word2Vec and GloVe. Once word embeddings (vectors) have been created, they combine to form a vector space, which is an algebraic model consisting of vectors that follow the rules of vector addition and scalar multiplication. If you don't remember your linear algebra, this might be a good time to quickly review it.
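As a quick refresher, the following minimal sketch (with made-up two-dimensional vectors) shows the two operations a vector space supports: vector addition and scalar multiplication.

import numpy as np

# Made-up two-dimensional word vectors, purely to illustrate the operations.
v1 = np.array([0.5, 0.8])
v2 = np.array([0.4, 0.3])

print(v1 + v2)    # vector addition: element-wise sum -> [0.9 1.1]
print(2.0 * v1)   # scalar multiplication: stretches the vector -> [1.  1.6]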

Word2Vec

As mentioned earlier, Word2Vec is one of the word embedding techniques used to generate vectors from words – something you can probably understand from the name itself.

Word2Vec is a shallow neural network – it has only two layers – and thus does not qualify as a deep learning model. The input is a text corpus, which it uses to generate vectors as the output. These vectors are known as feature vectors for the words present in the input corpus. It transforms a corpus into numerical data that can be understood by a deep neural network.

The aim of Word2Vec is to understand the probability of two or more words occurring together and thus to group words with similar meanings together to form a cluster in a vector space. Like any other machine learning or deep learning model, Word2Vec becomes more and more efficient by learning from past data and past occurrences of words. Thus, if provided with enough data and context, it can accurately guess a word's meaning based on past occurrences and context, similar to how we understand language.

For example, we are able to create a connection between the words 'boy' and 'man', and 'girl' and 'woman,' once we have heard and read about them and understood what they mean. Likewise, Word2Vec can also form this connection and generate vectors for these words that lie close together in the same cluster so as to ensure that the machine is aware that these words mean similar things.

Once Word2Vec has been given a corpus, it produces a vocabulary in which each word has its own vector attached to it, known as its neural word embedding. Simply put, a neural word embedding is a word written in numbers.

Functioning of Word2Vec

Word2Vec trains a word against words that neighbor the word in the input corpus, and there are two methods of doing so:

  • Continuous Bag of Words (CBOW):

    This method predicts the current word based on the context. Thus, it takes the word's surrounding words as input to produce the word as output, and it chooses this word based on the probability that this is indeed the word that is a part of the sentence.

    For example, if the algorithm is provided with the words "the food was" and needs to predict the adjective that follows, it is most likely to output the word "good" rather than "delightful," since there would be more instances of the word "good" being used; it has thus learned that "good" has a higher probability than "delightful." CBOW is said to be faster than skip-gram and to have higher accuracy for more frequent words.

Fig 1.18: The CBOW algorithm
  • Skip-gram:

    This method predicts the words surrounding a word by taking the word as input, understanding the meaning of the word, and assigning it to a context. For example, if the algorithm was given the word "delightful," it would have to understand its meaning and learn from past context to predict that the probability that the surrounding words are "the food was" is highest. Skip-gram is said to work best with a small corpus.

Fig 1.19: The skip-gram algorithm

While the two methods seem to work in opposite ways, they are essentially both predicting words based on local (nearby) context: they use a window of context words to make their predictions. This window is a configurable parameter.

The choice of algorithm depends on the corpus at hand. CBOW works on the basis of probability and thus chooses the word that has the highest probability of occurring given a specific context. This means it will usually predict only common and frequent words, since those have the highest probabilities, and rare, infrequent words are unlikely to be produced by CBOW. Skip-gram, on the other hand, predicts context, and thus when given a word, it treats it as a new observation rather than comparing it to an existing word with a similar meaning. Because of this, rare words are not overlooked. However, this also means that a lot of training data is required for skip-gram to work efficiently. Thus, the decision to use either algorithm should be made based on the training data and the corpus at hand.
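In Gensim's Word2Vec implementation, both architectures are available through the same class: the sg parameter selects the training algorithm (0 for CBOW, which is the default, and 1 for skip-gram), and the window parameter sets the size of the context window discussed above. A minimal sketch, assuming a small list of tokenized sentences:

from gensim.models import Word2Vec

sentences = [["the", "food", "was", "great"],
             ["the", "food", "was", "good"]]

# sg=0 (default) trains with CBOW; sg=1 trains with skip-gram.
# window controls how many neighboring words count as context.
cbow_model = Word2Vec(sentences, sg=0, window=5, min_count=1)
skipgram_model = Word2Vec(sentences, sg=1, window=5, min_count=1)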

Essentially, both algorithms, and thus the model as a whole, require an intensive learning phase in which they are trained over thousands or even millions of words to better understand context and meaning. Based on this, they are able to assign vectors to words and thus help the machine learn and predict natural language. To understand Word2Vec better, let's do an exercise using Gensim's Word2Vec model.

Gensim is an open source library for unsupervised topic modeling and natural language processing using statistical machine learning. Gensim's Word2Vec implementation takes as input sequences of sentences, each represented as a list of individual words (tokens).

Also, we can use the min_count parameter. It specifies the minimum number of times a word must occur in the corpus for it to be considered when generating word embeddings. In a real-life scenario, when dealing with millions of words, a word that occurs only once or twice may not be important at all and can therefore be ignored. Here, however, we are training our model on only three sentences, each containing just five or six words, so min_count is set to 1 because a word is important to us even if it occurs only once.
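As a quick illustration of the effect of min_count (using a toy corpus invented purely for this sketch), raising it to 2 silently drops any word that appears only once:

from gensim.models import Word2Vec

toy_corpus = [["the", "food", "was", "good"],
              ["the", "food", "was", "great"]]

# With min_count=2, "good" and "great" (which appear only once each)
# are dropped from the vocabulary before training.
model = Word2Vec(toy_corpus, min_count=2)
print(list(model.wv.vocab))  # in gensim 4.x and later, use model.wv.index_to_key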

Exercise 8: Generating Word Embeddings Using Word2Vec

In this exercise, we will be using Gensim's Word2Vec algorithm to generate word embeddings post tokenization.

Note

You will need to have gensim installed on your system for the following exercise. You can use the following command to install it, if it is not already installed:

pip install --upgrade gensim

For further information, click on https://radimrehurek.com/gensim/models/word2vec.html.

The following steps will help you with the solution:

  1. Open a new Jupyter notebook.
  2. Import the Word2Vec model from gensim, and import word_tokenize from nltk, as shown:

    from gensim.models import Word2Vec as wtv

    from nltk import word_tokenize

  3. Store three strings with some common words into three separate variables, and then tokenize each sentence and store all the tokens in an array, as shown:

    s1 = "Ariana Grande is a singer"

    s2 = "She has been a singer for many years"

    s3 = "Ariana is a great singer"

    sentences = [word_tokenize(s1), word_tokenize(s2), word_tokenize(s3)]

    You can print the array of sentences to view the tokens.

  4. Train the model, as follows:

    model = wtv(sentences, min_count = 1)

    Word2Vec's default value for min_count is 5.

  5. Summarize the model, as demonstrated:

    print('this is the summary of the model: ')

    print(model)

    Your output will look something like this:

    Figure 1.20: Output for model summary

    Vocab = 12 signifies that there are 12 different words present in the sentences that were input to the model.

  6. Let's find out what words are present in the vocabulary by summarizing it, as shown:

    words = list(model.wv.vocab)

    print('this is the vocabulary for our corpus: ')

    print(words)

    Your output will look something like this:

Figure 1.21: Output for the vocabulary of the corpus

Let's see what the vector (word embedding) for the word 'singer' is:

print("the vector for the word singer: ")

print(model.wv['singer'])

Expected output:

Figure 1.22: Vector for the word 'singer'

Our Word2Vec model has been trained on these three sentences, and thus its vocabulary only includes the words present in this sentence. If we were to find words that are similar to a particular input word from our Word2Vec model, we wouldn't get words that actually make sense since the vocabulary is so small. Consider the following examples:

#lookup top 6 similar words to great

w1 = ["great"]

model.wv.most_similar(positive=w1, topn=6)

Here, positive specifies the list of words that contribute positively to the similarity; the model adds their vectors together and returns the words closest to the result.

The top six similar words to 'great' would be:

Figure 1.23: Word vectors similar to the word 'great'

Similarly, for the word 'singer', it could be as follows:

#lookup top 6 similar words to singer

w1 = ["singer"]

model.wv.most_similar(positive=w1, topn=6)

Figure 1.24: Word vector similar to word 'singer'

We know that these words are not actually similar in meaning to our input words at all, and this also shows up in the similarity score beside them. However, they appear because these are the only words that exist in our vocabulary.

Another important parameter of the Gensim Word2Vec model is the size parameter. Its default value is 100, and it sets the dimensionality of the word vectors (which is also the size of the hidden layer used to train the model). This corresponds to the amount of freedom the training algorithm has: a larger size requires more data but can capture more information about each word.
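As a small sketch (reusing the three sentences from the previous exercise, and assuming gensim 3.x, where the parameter is called size; in gensim 4.x it was renamed to vector_size), you could train a model with 50-dimensional vectors like this:

from gensim.models import Word2Vec as wtv
from nltk import word_tokenize

s1 = "Ariana Grande is a singer"
s2 = "She has been a singer for many years"
s3 = "Ariana is a great singer"
sentences = [word_tokenize(s1), word_tokenize(s2), word_tokenize(s3)]

# size sets the dimensionality of each word vector (vector_size in gensim 4.x).
small_model = wtv(sentences, min_count=1, size=50)
print(len(small_model.wv['singer']))  # prints 50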

Note

For more information on Gensim's Word2Vec model, click on

https://rare-technologies.com/word2vec-tutorial/.

GloVe

GloVe, an abbreviation of "global vectors," is a word embedding technique developed at Stanford. It is an unsupervised learning algorithm that builds on the ideas behind Word2Vec. While Word2Vec is quite successful in generating word embeddings, the issue with it is that it has a small window through which it focuses on local words and local context to predict words. This means that it is unable to learn from the frequency of words present globally, that is, in the entire corpus. GloVe, as its name suggests, looks at all the words present in a corpus.

While Word2Vec is a predictive model, as it learns vectors to improve its predictive abilities, GloVe is a count-based model. This means that GloVe learns its vectors by performing dimensionality reduction on a co-occurrence counts matrix. The connections that GloVe is able to make are along the lines of the following:

king – man + woman = queen

This means it's able to understand that "king" and "queen" share a relationship that is similar to that between "man" and "woman".

These are complicated terms, so let's understand them one by one. All of these concepts come from statistics and linear algebra, so if you already know what's going on, you can skip ahead to the exercise!

When dealing with a corpus, there exist algorithms to construct matrices based on term frequencies. Basically, these matrices contain words that occur in a document as rows, and the columns are either paragraphs or separate documents. The elements of the matrices represent the frequency with which the words occur in the documents. Naturally, with a large corpus, this matrix will be huge. Processing such a large matrix will take a lot of time and memory, thus we perform dimensionality reduction. This is the process of reducing the size of the matrix so it is possible to perform further operations on it.

In the case of GloVe, the matrix is known as a co-occurrence counts matrix, which contains information on how many times a word has occurred in a particular context in a corpus. The rows are the words and the columns are the contexts. This matrix is then factorized in order to reduce the dimensions, and the new matrix has a vector representation for each word.
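To make the idea concrete, here is a small hand-rolled sketch, not GloVe itself (GloVe optimizes a weighted least-squares objective rather than performing a plain matrix factorization): it counts co-occurrences within a window over a toy corpus and then reduces the matrix with a truncated SVD to obtain dense word vectors.

import numpy as np

# Toy corpus; a real co-occurrence matrix would be built from millions of tokens.
sentences = [["the", "food", "was", "good"],
             ["the", "food", "was", "great"],
             ["the", "service", "was", "good"]]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
window = 2

# Count how often each pair of words appears within `window` positions of each other.
counts = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[index[word], index[sent[j]]] += 1

# Factorize the co-occurrence matrix (truncated SVD) to get low-dimensional vectors.
u, s, _ = np.linalg.svd(counts)
word_vectors = u[:, :2] * s[:2]  # keep only the top two dimensions
for w in vocab:
    print(w, word_vectors[index[w]])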

GloVe also provides pretrained word vectors that can be used if their semantics match the corpus and task at hand; a quick way to experiment with them is sketched below. The following exercise then guides you through generating GloVe word embeddings in Python step by step.
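A minimal sketch of loading pretrained GloVe vectors, assuming you are using Gensim's downloader module and that the 'glove-wiki-gigaword-50' package is available through gensim-data (an internet connection is required on first use):

import gensim.downloader as api

# Downloads the pretrained 50-dimensional GloVe vectors on first use.
glove_vectors = api.load("glove-wiki-gigaword-50")

print(glove_vectors.most_similar("man", topn=5))

# The analogy from earlier: king - man + woman should land near queen.
print(glove_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))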

Exercise 9: Generating Word Embeddings Using GloVe

In this exercise, we will be generating word embeddings using Glove-Python.

Note

To install Glove-Python on your platform, go to https://pypi.org/project/glove/#files.

Download the Text8Corpus from http://mattmahoney.net/dc/text8.zip.

Extract the file and store it with your Jupyter notebook.

  1. Import itertools:

    import itertools

  2. We need a corpus to generate word embeddings for; luckily, the gensim.models.word2vec module provides a Text8Corpus class for reading the text8 file we downloaded. Import this along with two modules from the Glove-Python library:

    from gensim.models.word2vec import Text8Corpus

    from glove import Corpus, Glove

  3. Convert the corpus into sentences in the form of a list using itertools:

    sentences = list(itertools.islice(Text8Corpus('text8'),None))

  4. Initiate the Corpus() model and fit it on to the sentences:

    corpus = Corpus()

    corpus.fit(sentences, window=10)

    The window parameter controls how many neighboring words are considered.

  5. Now that we have prepared our corpus, we need to train the embeddings. Initiate the Glove() model:

    glove = Glove(no_components=100, learning_rate=0.05)

  6. Generate a co-occurrence matrix based on the corpus and fit the glove model on to this matrix:

    glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)

    The model has been trained!

  7. Add the corpus's dictionary to the glove model so that words can be looked up by name:

    glove.add_dictionary(corpus.dictionary)

  8. Use the following command to see which words are similar to your choice of word based on the word embeddings generated:

    glove.most_similar('man')

    Expected output:

Figure 1.25: Output of word embeddings for 'man'

You can try this out for several different words to see which words neighbor them and are the most similar to them:

glove.most_similar('queen', number = 10)

Expected output:

Figure 1.26: Output of word embeddings for 'queen'

Note

To learn more about GloVe, go to https://nlp.stanford.edu/projects/glove/.

Activity 1: Generating Word Embeddings from a Corpus Using Word2Vec

You have been given the task of training a Word2Vec model on a particular corpus – the Text8Corpus, in this case – to determine which words are similar to each other. The following steps will help you with the solution.

Note

You can find the text corpus file at http://mattmahoney.net/dc/text8.zip.

  1. Download the text corpus from the link given previously and extract it.
  2. Import Word2Vec from gensim.models.
  3. Store the corpus in a variable.
  4. Fit the word2vec model on the corpus.
  5. Find the most similar word to 'man'.
  6. 'Father' is to 'girl' as 'x' is to 'boy'. Find the top three words for 'x'.

    Note

    The solution for the activity can be found on page 296.

    Expected Outputs:

Figure 1.27: Output for similar word embeddings

Top three words for 'x' could be:

Figure 1.28: Output for top three words for 'x'