Natural Language Processing with TensorFlow

Natural Language Processing with TensorFlow

By : Motaz Saad, Thushan Ganegedara

Buy this Book

Natural Language Processing with TensorFlow

By: Motaz Saad, Thushan Ganegedara

Buy this Book

Overview of this book

Natural language processing (NLP) supplies the majority of data available to deep learning applications, while TensorFlow is the most important deep learning framework currently available. Natural Language Processing with TensorFlow brings TensorFlow and NLP together to give you invaluable tools to work with the immense volume of unstructured data in today’s data streams, and apply these tools to specific NLP tasks. Thushan Ganegedara starts by giving you a grounding in NLP and TensorFlow basics. You'll then learn how to use Word2vec, including advanced extensions, to create word embeddings that turn sequences of words into vectors accessible to deep learning algorithms. Chapters on classical deep learning algorithms, like convolutional neural networks (CNN) and recurrent neural networks (RNN), demonstrate important NLP tasks as sentence classification and language generation. You will learn how to apply high-performance RNN models, like long short-term memory (LSTM) cells, to NLP tasks. You will also explore neural machine translation and implement a neural machine translator. After reading this book, you will gain an understanding of NLP and you'll have the skills to apply TensorFlow in deep learning NLP applications, and how to perform specific NLP tasks.

Natural Language Processing with TensorFlow

Contributors

Preface

Free Chapter

Introduction to Natural Language Processing

What is Natural Language Processing?

Tasks of Natural Language Processing

The traditional approach to Natural Language Processing

The deep learning approach to Natural Language Processing

The roadmap – beyond this chapter

Introduction to the technical tools

Summary

Understanding TensorFlow

What is TensorFlow?

Inputs, variables, outputs, and operations

Reusing variables with scoping

Implementing our first neural network

Summary

Word2vec – Learning Word Embeddings

What is a word representation or meaning?

Classical approaches to learning word representation

Word2vec – a neural network-based approach to learning word representation

The skip-gram algorithm

The Continuous Bag-of-Words algorithm

Summary

Advanced Word2vec

The original skip-gram algorithm

Comparing skip-gram with CBOW

Extensions to the word embeddings algorithms

More recent algorithms extending skip-gram and CBOW

GloVe – Global Vectors representation

Document classification with Word2vec

Summary

Sentence Classification with Convolutional Neural Networks

Introducing Convolution Neural Networks

Understanding Convolution Neural Networks

Exercise – image classification on MNIST with CNN

Using CNNs for sentence classification

Summary

Recurrent Neural Networks

Understanding Recurrent Neural Networks

Backpropagation Through Time

Applications of RNNs

Generating text with RNNs

Evaluating text results output from the RNN

Perplexity – measuring the quality of the text result

Recurrent Neural Networks with Context Features – RNNs with longer memory

Summary

Long Short-Term Memory Networks

Understanding Long Short-Term Memory Networks

How LSTMs solve the vanishing gradient problem

Other variants of LSTMs

Summary

Applications of LSTM – Generating Text

Our data

Implementing an LSTM

Comparing LSTMs to LSTMs with peephole connections and GRUs

Improving LSTMs – beam search

Improving LSTMs – generating text with words instead of n-grams

Using the TensorFlow RNN API

Summary

Applications of LSTM – Image Caption Generation

Getting to know the data

The machine learning pipeline for image caption generation

Extracting image features with CNNs

Implementation – loading weights and inferencing with VGG-

Learning word embeddings

Preparing captions for feeding into LSTMs

Generating data for LSTMs

Defining the LSTM

Evaluating the results quantitatively

Captions generated for test images

Using TensorFlow RNN API with pretrained GloVe word vectors

Summary

Sequence-to-Sequence Learning – Neural Machine Translation

Machine translation

A brief historical tour of machine translation

Understanding Neural Machine Translation

Preparing data for the NMT system

Training the NMT

Inference with NMT

The BLEU score – evaluating the machine translation systems

Implementing an NMT from scratch – a German to English translator

Training an NMT jointly with word embeddings

Improving NMTs

Attention

Other applications of Seq2Seq models – chatbots

Summary

Current Trends and the Future of Natural Language Processing

Current trends in NLP

Penetration into other research fields

Towards Artificial General Intelligence

NLP for social media

New tasks emerging

Newer machine learning models

Summary

References

Mathematical Foundations and Advanced TensorFlow

Basic data structures

Special types of matrices

Tensor/matrix operations

Probability

Introduction to Keras

Introduction to the TensorFlow seq2seq library

Visualizing word embeddings with TensorBoard

Summary

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Visualizing word embeddings with TensorBoard

When we wanted to visualize word embedding in Chapter 3, Word2vec – Learning Word Embeddings, we manually implemented the visualization with the t-SNE algorithm. However, you also could use TensorBoard for visualizing word embeddings. TensorBoard is a visualization tool provided with TensorFlow. You can use TensorBoard to visualize the TensorFlow variables in your program. This allows you to see how various variables behave over time (for example, model loss/accuracy), so you can identify potential issues in your model.

TensorBoard enables you to visualize scalar values and vectors as histograms. Apart from this, TensorBoard also allows you to visualize word embeddings. Therefore, it takes all the required code implementation away from you, if you need to analyze what the embeddings look like. Next we will see how we can use TensorBoard to visualize word embeddings. The code for this exercise is provided in tensorboard_word_embeddings.ipynb in the appendix folder.

Starting TensorBoard

First, we will list the steps for starting TensorBoard. TensorBoard acts as a service and runs on a specific port (by default, on 6006). To start TensorBoard, you will need to follow the following steps:

Open up Command Prompt (Windows) or Terminal (Ubuntu/macOS).
Go into the project home directory.
If you are using python virtuanenv, activate the virtual environment where you have installed TensorFlow.
Make sure that you can see the TensorFlow library through Python. To do this, follow these steps:
1. Type in python3, you will get a >>> looking prompt
2. Try import tensorflow as tf
3. If you can run this successfully, you are fine
4. Exit the python prompt (that is, >>>) by typing exit()
Type in tensorboard --logdir=models:
- The --logdir option points to the directory where you will create data to visualize
- Optionally, you can use --port=<port_you_like> to change the port TensorBoard runs on

You should now get the following message:

TensorBoard 1.6.0 at <url>;:6006 (Press CTRL+C to quit)

Enter the <url>:6006 in to the web browser. You should be able to see an orange dashboard at this point. You won't have anything to display because we haven't generated data.

Saving word embeddings and visualizing via TensorBoard

First, we will download and load the 50-dimensional GloVe embeddings we used in Chapter 9, Applications of LSTM – Image Caption Generation. For that first download the GloVe embedding file (glove.6B.zip) from https://nlp.stanford.edu/projects/glove/ and place it in the appendix folder. We will load the first 50,000 word vectors in the file and later use these to initialize a TensorFlow variable. We will also record the word strings of each word, as we will later provide these as labels for each point to display on TensorBoard:

vocabulary_size = 50000
pret_embeddings = np.empty(shape=(vocabulary_size,50),dtype=np.float32)

words = [] 

word_idx = 0
with zipfile.ZipFile('glove.6B.zip') as glovezip:
    with glovezip.open('glove.6B.50d.txt') as glovefile:
        for li, line in enumerate(glovefile):
            if (li+1)%10000==0: print('.',end='')
            line_tokens = line.decode('utf-8').split(' ')
            word = line_tokens[0]
            
            vector = [float(v) for v in line_tokens[1:]]
            assert len(vector)==50
            words.append(word)
            pret_embeddings[word_idx,:] = np.array(vector)
            word_idx += 1
            if word_idx == vocabulary_size:
                break

Now, we will define TensorFlow-related variables and operations. Before this, we will create a directory called models, which will be used to store the variables:

log_dir = 'models'

if not os.path.exists(log_dir):
    os.mkdir(log_dir)

Then, we will define a variable that will be initialized with the word embeddings we copied from the text file earlier:

embeddings = tf.get_variable('embeddings',shape=[vocabulary_size, 50],
                             initializer=tf.constant_initializer(pret_embeddings))

We will next create a session and initialize the variable we defined earlier:

session = tf.InteractiveSession()
tf.global_variables_initializer().run()

Thereafter, we will create a tf.train.Saver object. The Saver object can be used to save TensorFlow variables to the memory, so that they can later be restored if needed. In the following code, we will save the embedding variable to the models directory under the name, model.ckpt:

saver = tf.train.Saver({'embeddings':embeddings})
saver.save(session, os.path.join(log_dir, "model.ckpt"), 0)

We also need to save a metadata file. A metadata file contains labels/images or other types of information associated with the word embeddings, so that when you hover over the embedding visualization the corresponding points will show the word/label they represent. The metadata file should be of the .tsv (tab separated values) format and should contain vocabulary_size + 1 rows in it, where the first row contains the headers for the information you are including. In the following code, we will save two pieces of information: word strings and a unique identifier (that is, row index) for each word:

with open(os.path.join(log_dir,'metadata.tsv'), 'w',encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter='\t',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['Word','Word ID'])
    for wi,w in enumerate(words):
      writer.writerow([w,wi])

Then, we will need to tell TensorFlow where it can find the metadata for the embedding data we saved to the disk. For this, we need to create a ProjectorConfig object, which maintains various configuration details about the embedding we want to display. The details stored in the ProjectorConfig folder will be saved to a file called projector_config.pbtxt in the models directory:

config = projector.ProjectorConfig()

Here, we will populate the required fields of the ProjectorConfig object we created. First, we will tell it the name of the variable we're interested in visualizing. Next, we will tell it where it can find the metadata corresponding to that variable:

embedding_config = config.embeddings.add()
embedding_config.tensor_name = embeddings.name
embedding_config.metadata_path = 'metadata.tsv'

We will now use a summary writer to write this to the projector_config.pbtxt file. TensorBoard will read this file at startup:

summary_writer = tf.summary.FileWriter(log_dir)
projector.visualize_embeddings(summary_writer, config)

Now if you load TensorBoard, you should see something similar to Figure A.3:

Figure A.3: Tensorboard view of the embeddings

When you hover over the displayed point cloud, it will show the label of the word you're currently hovering over, as we provided this information in the metadata.tsv file. Furthermore, you have several options. The first option (shown with a dotted line and marked as 1) will allow you to select a subset of the full embedding space. You can draw a bounding box over the area of the embedding space you're interested in, and it will look as shown in Figure A.4. I have selected the embeddings at the bottom right corner:

Figure A.4: Selecting a subset of the embedding space

Another option you have is the ability to view words themselves, instead of dots. You can do this by selecting the second option in Figure A.3 (show inside a solid box and marked as 2). This would look as shown in Figure A.5. Additionally, you can pan/zoom/rotate the view to your liking. If you click on the help button (shown within a solid box and marked as 1 in Figure A.5), it will show you a guide for controlling the view:

Figure A.5: Embedding vectors displayed as words instead of dots

Finally, you can change the visualization algorithm from the panel on the left-hand side (shown with a dashed line and marked with 3 in Figure A.3).

Natural Language Processing with TensorFlow

By : Motaz Saad, Thushan Ganegedara

Natural Language Processing with TensorFlow

By: Motaz Saad, Thushan Ganegedara

Overview of this book

Related Content you might be interested in

Current Title:

Natural Language Processing with TensorFlow

Deep Learning Essentials

Hands-On Natural Language Processing with PyTorch 1.x

Recurrent Neural Networks with Python Quick Start Guide

Visualizing word embeddings with TensorBoard

Starting TensorBoard

Saving word embeddings and visualizing via TensorBoard