Advanced Natural Language Processing with TensorFlow 2

By: Ashish Bansal, Mullen
Overview of this book

Recently, there have been tremendous advances in NLP, and these advances are now moving out of research labs and into practical applications. This book offers a blend of the theoretical and practical aspects of trending and complex NLP techniques, with a focus on innovative applications in language generation and dialogue systems. It helps you apply the concepts of pre-processing text using techniques such as tokenization, part-of-speech tagging, and lemmatization with popular libraries such as Stanford NLP and SpaCy. You will build Named Entity Recognition (NER) from scratch using Conditional Random Fields and Viterbi decoding on top of RNNs. The book covers key emerging areas such as generating text for sentence completion and text summarization, bridging images and text by generating captions for images, and managing the dialogue aspects of chatbots. You will learn how to apply transfer learning and fine-tuning using TensorFlow 2, and the book covers practical techniques that can simplify the labeling of textual data. Each technique is accompanied by working code that you can adapt to your own use cases. By the end of the book, you will have an advanced knowledge of the tools, techniques, and deep learning architectures used to solve complex NLP problems.

Data collection and labeling

The first step of any Machine Learning (ML) project is to obtain a dataset. Fortunately, in the text domain, there is plenty of data to be found. A common approach is to use libraries such as scrapy or Beautiful Soup to scrape data from the web. However, scraped data is usually unlabeled, and as such can't be used directly in supervised models. Such data is still quite useful, though: through transfer learning, a language model can be trained on it using unsupervised or semi-supervised methods and then adapted with a small labeled dataset specific to the task at hand. We will cover transfer learning in more depth in Chapter 3, Named Entity Recognition (NER) with BiLSTMs, CRFs, and Viterbi Decoding, when we look at transfer learning using BERT embeddings.
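As a minimal sketch of the scraping idea — using only Python's standard library instead of scrapy or Beautiful Soup, and a hypothetical HTML snippet in place of a live page — extracting unlabeled text from markup looks like this:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

# A stand-in for a page fetched from the web; a real scraper would
# download the HTML first with a library such as scrapy.
html = "<html><body><h1>Weekly deals</h1><p>Win a free prize now!</p></body></html>"

parser = TextExtractor()
parser.feed(html)
print(parser.chunks)  # unlabeled text, ready for the labeling step
```

The result is raw, unlabeled text — exactly the kind of data that still needs the labeling step described next before it can feed a supervised model.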

In the labeling step, the textual data sourced during data collection is tagged with the right classes. Let's take some examples. If the task is to build a spam classifier for emails, the previous step would involve collecting lots of emails, and this step would attach a spam or not spam label to each one. Another example is sentiment detection on tweets: data collection would involve gathering a number of tweets, and labeling would attach a sentiment label to each tweet as the ground truth. A more involved example would involve collecting news articles, where the labels are summaries of the articles. Yet another example is an email auto-reply feature: as in the spam case, a number of emails with their replies would need to be collected, and the labels here would be short pieces of text approximating replies. If you are working in a specific domain without much public data, you may have to perform both of these steps yourself.
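The output of the labeling step can be pictured as pairs of raw text and ground-truth label. A toy sketch for the spam example, with invented emails:

```python
# Toy labeled dataset for the spam example: each record pairs the raw
# text with the ground-truth label attached during the labeling step.
# The emails below are made up for illustration.
labeled_emails = [
    ("Congratulations, you have won a prize! Claim now.", "spam"),
    ("Can we move tomorrow's meeting to 3pm?", "not spam"),
    ("Cheap meds, no prescription needed!!!", "spam"),
]

spam_count = sum(1 for _, label in labeled_emails if label == "spam")
print(f"{spam_count} of {len(labeled_emails)} emails labeled as spam")
```

For the summarization and auto-reply examples, the second element of each pair would be a piece of text (a summary or a reply) rather than a class name, but the text-label pairing is the same.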

Given that text data is generally available (outside of specialized domains like healthcare), labeling is usually the biggest challenge. It can be quite time-consuming and resource-intensive to label data, and there has been a lot of recent focus on semi-supervised approaches to labeling. We will cover some methods for labeling data at scale using semi-supervised methods and the snorkel library in Chapter 7, Multi-modal Networks and Image Captioning with ResNets and Transformer, when we look at weakly supervised learning for classification using Snorkel.
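Snorkel itself is covered in Chapter 7; as a rough standalone sketch of the underlying weak-supervision idea (this toy code does not use the snorkel library), several noisy heuristic "labeling functions" can each vote on an example or abstain, and their votes can be combined by simple majority:

```python
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function encodes one noisy heuristic and may abstain.
def lf_keyword_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_keyword_winner(text):
    return SPAM if "winner" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text) < 20 else ABSTAIN

LFS = [lf_keyword_free, lf_keyword_winner, lf_short_message]

def weak_label(text):
    """Combine labeling-function votes by majority, ignoring abstentions."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

print(weak_label("You are a WINNER! Claim your free prize today"))
print(weak_label("ok see you"))
```

Snorkel replaces the naive majority vote with a learned generative model over the labeling functions, but the workflow — cheap heuristics producing noisy labels at scale — is the same.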

There are a number of commonly used datasets available on the web for training models. Using transfer learning, these generic datasets can be used to prime an ML model, which can then be fine-tuned with a small amount of domain-specific data. Publicly available datasets give us a few advantages. First, all the data collection has already been performed. Second, labeling has already been done. Lastly, using such a dataset allows the comparison of results with the state of the art; most papers use specific datasets in their area of research and publish benchmarks on them. For example, the Stanford Question Answering Dataset (SQuAD for short) is often used as a benchmark for question-answering models, and it is a good source to train on as well.

Collecting labeled data

In this book, we will rely on publicly available datasets. The appropriate datasets will be called out in their respective chapters, along with instructions on downloading them. To build a spam detection system, we will use the SMS Spam Collection dataset made available by the University of California, Irvine. This dataset can be downloaded using the instructions in the tip box below. Each SMS is tagged as "spam" or "ham," with the latter indicating that it is not a spam message.

University of California, Irvine, is a great source of machine learning datasets. You can see all the datasets they provide by visiting http://archive.ics.uci.edu/ml/datasets.php. Specifically for NLP, you can see some publicly available datasets on https://github.com/niderhoff/nlp-datasets.

Before we start working with the data, the development environment needs to be set up. Let's take a quick moment to set up the development environment.

Development environment setup

In this chapter, we will be using Google Colaboratory, or Colab for short, to write code. You can use your Google account, or register a new account. Google Colab is free to use, requires no configuration, and also provides access to GPUs. The user interface is very similar to a Jupyter notebook, so it should seem familiar. To get started, please navigate to colab.research.google.com using a supported web browser. A web page similar to the screenshot below should appear:

Figure 1.2: Google Colab website

The next step is to create a new notebook. There are a couple of options. The first option is to create a new notebook in Colab and type in the code as you go along in the chapter. The second option is to upload a notebook from the local drive into Colab. It is also possible to pull in notebooks from GitHub into Colab, the process for which is detailed on the Colab website. For the purposes of this chapter, a complete notebook named SMS_Spam_Detection.ipynb is available in the GitHub repository of the book in the chapter1-nlp-essentials folder. Please upload this notebook into Google Colab by clicking File | Upload Notebook. Specific sections of this notebook will be referred to at the appropriate points in the chapter in tip boxes. The instructions for creating the notebook from scratch are in the main description.

Click on the File menu option at the top left and click on New Notebook. A new notebook will open in a new browser tab. Click on the notebook name at the top left, just above the File menu option, and edit it to read SMS_Spam_Detection. Now the development environment is set up. It is time to begin loading in data.

First, let us edit the first line of the notebook and import TensorFlow 2. Enter the following code in the first cell and execute it:

%tensorflow_version 2.x
import tensorflow as tf
import os
import io
tf.__version__

The output of running this cell should look like this:

TensorFlow 2.x is selected.
'2.4.0'

This confirms that version 2.4.0 of the TensorFlow library was loaded. The first line in the preceding code block is a magic command for Google Colab, instructing it to use a TensorFlow 2.x version. The next step is to download the data file and unzip it to a location in the Colab notebook on the cloud.

The code for loading the data is in the Download Data section of the notebook. Also note that as of writing, the release version of TensorFlow was 2.4.

This can be done with the following code:

# Download the zip file
path_to_zip = tf.keras.utils.get_file(
    "smsspamcollection.zip",
    origin="https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
    extract=True)

# Unzip the file into a folder
!unzip $path_to_zip -d data

The following output confirms that the data was downloaded and extracted:

Archive:  /root/.keras/datasets/smsspamcollection.zip
  inflating: data/SMSSpamCollection  
  inflating: data/readme

Reading the data file is trivial:

# Let's see if we read the data correctly
lines = io.open('data/SMSSpamCollection').read().strip().split('\n')
lines[0]

The last line of code shows a sample line of data:

'ham\tGo until jurong point, crazy.. Available only in bugis n great world'

This example is labeled as not spam. The next step is to split each line into two columns – one with the text of the message and the other as the label. While we are separating these labels, we will also convert the labels to numeric values. Since we are interested in predicting spam messages, we can assign a value of 1 to the spam messages. A value of 0 will be assigned to legitimate messages.

The code for this part is in the Pre-Process Data section of the notebook.

Please note that the following code is verbose for clarity:

spam_dataset = []
for line in lines:
  label, text = line.split('\t')
  if label.strip() == 'spam':
    spam_dataset.append((1, text.strip()))
  else:
    spam_dataset.append((0, text.strip()))
print(spam_dataset[0])

The print statement shows the first message with its numeric label:

(0, 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')

Now the dataset is ready for further processing in the pipeline. However, let's take a short detour to see how to configure GPU access in Google Colab.

Enabling GPUs on Google Colab

One of the advantages of using Google Colab is access to free GPUs for small tasks. GPUs make a big difference in the training time of NLP models, especially ones that use Recurrent Neural Networks (RNNs). To enable GPU access, click on the Runtime menu option and select Change runtime type, as shown in the following screenshot:


Figure 1.3: Colab runtime settings menu option

Next, a dialog box will show up, as shown in the following screenshot. Open the Hardware accelerator drop-down and select GPU:


Figure 1.4: Enabling GPUs on Colab

Now you should have access to a GPU in your Colab notebook! In NLP models, especially those using RNNs, a GPU can shave minutes or even hours off the training time.

For now, let's turn our attention back to the data that has been loaded and is ready to be processed further for use in models.
