Deep Learning for Natural Language Processing

By: Karthiek Reddy Bokka, Shubhangi Hora, Tanuj Jain, Monicah Wambugu

Applications of Natural Language Processing

The following figure depicts the general application areas of natural language processing:

Figure 1.4: Application areas of natural language processing
  • Automatic text summarization

    This involves processing corpora to produce concise summaries that preserve their key information.

  • Translation

    This entails tools that translate text to and from different languages, for example, Google Translate.

  • Sentiment analysis

    This is also known as emotional artificial intelligence or opinion mining, and it is the process of identifying, extracting, and quantifying emotions and affective states from corpora, both written and spoken. Sentiment analysis tools are used to process things such as customer reviews and social media posts to understand emotional responses to and opinions regarding particular things, such as the quality of food at a new restaurant.

  • Information extraction

    This is the process of identifying and extracting important terms, known as entities, from corpora. Named entity recognition falls under this category and is a process that will be explained in the next chapter.

  • Relationship extraction

    Relationship extraction involves extracting semantic relationships from corpora. Semantic relationships occur between two or more entities (such as people, organizations, and things) and fall into one of many semantic categories. For example, if a relationship extraction tool were given a paragraph about Sundar Pichai and how he is the CEO of Google, the tool would be able to produce "Sundar Pichai works for Google" as output, with Sundar Pichai and Google being the two entities, and 'works for' being the semantic category that defines their relationship.

  • Chatbot

    Chatbots are forms of artificial intelligence that are designed to converse with humans via speech and text. The majority of them mimic humans and make it feel as though you are speaking to another human being. Chatbots are being used in the health industry to help people who suffer from depression and anxiety.

  • Social media analysis

    Social media applications such as Twitter and Facebook have hashtags and trends that are tracked and monitored using natural language processing to understand what is being talked about around the world. Additionally, natural language processing aids the process of moderation by filtering out negative, offensive, and inappropriate comments and posts.

  • Personal voice assistants

    Siri, Alexa, Google Assistant, and Cortana are all personal voice assistants that leverage natural language processing techniques to understand and respond to what we say.

  • Grammar checking

    Grammar-checking software automatically checks and corrects your grammar, punctuation, and typing errors.

Text Preprocessing

When you answer questions on a comprehension passage, the questions are specific to different parts of the passage, so while some words and sentences are important to you, others are irrelevant. The trick is to identify keywords in the questions and match them to the passage to find the correct answer.

Text preprocessing works in a similar fashion – the machine doesn't need the irrelevant parts of the corpora; it just needs the important words and phrases required to execute the task at hand. Thus, text preprocessing techniques involve prepping the corpora for proper analysis and for the machine learning and deep learning models. Text preprocessing is basically telling the machine what it needs to take into consideration and what it can disregard.

Each corpus requires different text preprocessing techniques depending on the task that needs to be executed, and once you've learned the different preprocessing techniques, you'll understand where to use what and why. The order in which the techniques have been explained is usually the order in which they are performed.

We will be using the NLTK Python library in the following exercises, but feel free to use different libraries while doing the activities. NLTK stands for Natural Language Toolkit and is the simplest and one of the most popular Python libraries for natural language processing, which is why we will be using it to understand the basic concepts of natural language processing.

Note

For further information on NLTK, go to https://www.nltk.org/.

Text Preprocessing Techniques

The following are the most popular text preprocessing techniques in natural language processing:

  • Lowercasing/uppercasing
  • Noise removal
  • Text normalization
  • Stemming
  • Lemmatization
  • Tokenization
  • Removing stop words

Let's look at each technique one by one.

Lowercasing/Uppercasing

This is one of the simplest and most effective preprocessing techniques, and one that people often forget to use. It either converts all the existing uppercase characters into lowercase ones so that the entire corpus is in lowercase, or it converts all the lowercase characters present in the corpus into uppercase ones so that the entire corpus is in uppercase.

This method is especially useful when the size of the corpus isn't too large and the task involves identifying terms or outputs that could be recognized differently due to the case of the characters, since a machine inherently processes uppercase and lowercase letters as separate entities – 'A' is different from 'a.' This kind of variation in the input capitalization could result in incorrect output or no output at all.

An example of this would be a corpus that contains both 'India' and 'india.' Without lowercasing, the machine would recognize these as two separate terms, when in reality they're both different forms of the same word and correspond to the same country. After lowercasing, only one form of the term would exist – 'india' – simplifying the task of finding all the places where India is mentioned in the corpus.

Note

All exercises and activities will be primarily developed on Jupyter Notebook. You will need to have Python 3.6 and NLTK installed on your system.

Exercises 1 – 6 can be done within the same Jupyter notebook.

Exercise 1: Performing Lowercasing on a Sentence

In this exercise, we will take an input sentence with both uppercase and lowercase characters and convert them all into lowercase characters. The following steps will help you with the solution:

  1. Open cmd or another terminal depending on your operating system.
  2. Navigate to the desired path and use the following command to initiate a Jupyter notebook:

    jupyter notebook

  3. Store an input sentence in an 's' variable, as shown:

    s = "The cities I like most in India are Mumbai, Bangalore, Dharamsala and Allahabad."

  4. Apply the lower() function to convert the capital letters into lowercase characters and then print the new string, as shown:

    s = s.lower()

    print(s)

    Expected output:

    Figure 1.5: Output for lowercasing with mixed casing in a sentence
  5. Create an array of words with capitalized characters, as shown:

    words = ['indiA', 'India', 'india', 'iNDia']

  6. Using list comprehension, apply the lower() function on each element of the words array and then print the new array, as follows:

    words = [word.lower() for word in words]

    print(words)

    Expected output:

Figure 1.6: Output for lowercasing with mixed casing of words

Noise Removal

Noise is a very general term and can mean different things with respect to different corpora and different tasks. What is considered noise for one task may be what is considered important for another, and thus this is a very domain-specific preprocessing technique. For example, when analyzing tweets, hashtags might be important to recognize trends and understand what's being spoken about around the globe, but hashtags may not be important when analyzing a news article, and so hashtags would be considered noise in the latter's case.

Noise doesn't only include words; it can also include symbols and punctuation marks (<, >, *, ?, .), HTML markup, numbers, whitespace, stop words, particular terms, particular regular expression matches such as non-word characters and digits (\W|\d+), and parse terms.

Removing noise is crucial so that only the important parts of the corpora are fed into the models, ensuring accurate results. It also helps by bringing words into their root or standard form. Consider the following example:

Figure 1.7: Output for noise removal

After removing all the symbols and punctuation marks, all the instances of 'sleepy' correspond to a single form of the word, enabling more efficient prediction and analysis of the corpus.

Exercise 2: Removing Noise from Words

In this exercise, we will take an input array containing words with noise attached (such as punctuation marks and HTML markup) and convert these words into their clean, noise-free forms. To do this, we will need to make use of Python's regular expression library. This library has several functions that allow us to filter through input data and remove the unnecessary parts, which is exactly what the process of noise removal aims to do.

Note

To learn more about 're,' click on https://docs.python.org/3/library/re.html.

  1. In the same Jupyter notebook, import the regular expression library, as shown:

    import re

  2. Create a function called 'clean_words', which will contain methods to remove different types of noise from the words, as follows:

    def clean_words(text):
        # remove HTML markup
        text = re.sub(r"(<.*?>)", "", text)
        # remove non-word characters and digits
        text = re.sub(r"(\W|\d+)", " ", text)
        # remove leading and trailing whitespace
        text = text.strip()
        return text

  3. Create an array of raw words with noise, as demonstrated:

    raw = ['..sleepy', 'sleepy!!', '#sleepy', '>>>>>sleepy>>>>', '<a>sleepy</a>']

  4. Apply the clean_words() function on the words in the raw array and then print the array of clean words, as shown:

    clean = [clean_words(r) for r in raw]

    print(clean)

    Expected output:

Figure 1.8: Output for noise removal

Text Normalization

This is the process of converting a raw corpus into a canonical and standard form, which basically ensures that the textual input is consistent before it is analyzed, processed, and operated upon.

Examples of text normalization would be mapping an abbreviation to its full form, converting several spellings of the same word to one spelling of the word, and so on.

The following are examples of canonical forms of incorrect spellings and abbreviations:

Figure 1.9: Canonical form for incorrect spellings
Figure 1.10: Canonical form for abbreviations

There is no standard way to go about normalization, since it is very dependent on the corpus and the task at hand. The most common approach is dictionary mapping, which involves manually creating a dictionary that maps all the various forms of a word to one standard form, and then replacing each of those forms with that standard form.
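
As a rough illustration, here is a minimal sketch of dictionary mapping in Python; the particular mappings shown (such as 'u' to 'you') are illustrative assumptions rather than entries taken from the figures above:

    # hypothetical mapping of non-standard forms to their canonical forms
    normalization_map = {
        'u': 'you',
        'gr8': 'great',
        'b4': 'before',
        'realy': 'really'
    }

    def normalize(tokens):
        # replace each token with its canonical form, if one is defined
        return [normalization_map.get(token, token) for token in tokens]

    print(normalize(['u', 'were', 'realy', 'gr8', 'b4']))
    # ['you', 'were', 'really', 'great', 'before']

The get(token, token) call simply leaves a word unchanged when it has no entry in the dictionary.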

Stemming

Stemming is performed on a corpus to reduce words to their stem or root form. The reason for saying "stem or root form" is that the process of stemming doesn't always reduce the word to its root but sometimes just to its canonical form.

The words that undergo stemming are known as inflected words. These words are in a form that differs from the root form of the word in order to express an attribute such as number or gender. For example, "journalists" is the plural form of "journalist." Thus, stemming would cut off the 's', bringing "journalists" to its root form:

Figure 1.11: Output for stemming

Stemming is beneficial when building search applications because, when searching for something in particular, you might also want to find instances of that thing even if they're spelled differently. For example, if you're searching for 'exercises' in this book, you might also want 'Exercise' to show up in your search.

However, stemming doesn't always produce the desired stem, since it works by chopping off the ends of words. It's possible for the stemmer to reduce 'troubling' to 'troubl' instead of 'trouble', which won't really help in problem solving, and so stemming isn't a method that's used very often. When it is used, the Porter stemming algorithm is the most common choice.

Exercise 3: Performing Stemming on Words

In this exercise, we will take an input array containing various forms of one word and convert these words into their stem forms.

  1. In the same Jupyter notebook, import the nltk and pandas libraries as well as Porter Stemmer, as shown:

    import nltk

    import pandas as pd

    from nltk.stem import PorterStemmer as ps

  2. Create an instance of stemmer, as follows:

    stemmer = ps()

  3. Create an array of different forms of the same word, as shown:

    words=['annoying', 'annoys', 'annoyed', 'annoy']

  4. Apply the stemmer to each of the words in the words array and store them in a new array, as given:

    stems = [stemmer.stem(word=word) for word in words]

  5. Print the raw words and their stems in the form of a DataFrame, as shown:

    sdf = pd.DataFrame({'raw word': words,'stem': stems})

    sdf

    Expected output:

Figure 1.12: Output of stemming

Lemmatization

Lemmatization is a process that is similar to stemming – its purpose is to reduce a word to its root form. What makes it different is that it doesn't just chop the ends off words to obtain this root form, but instead follows a process, abides by rules, and often uses WordNet for mappings to return words to their root forms. (WordNet is an English-language database that consists of words and their definitions, along with synonyms and antonyms. It is considered to be an amalgamation of a dictionary and a thesaurus.) For example, lemmatization is capable of transforming the word 'better' into its root form, 'good', since 'better' is just the comparative form of 'good'.

While this quality makes lemmatization highly appealing and more effective than stemming, the drawback is that, since lemmatization follows such an organized procedure, it takes considerably more time than stemming does. Hence, lemmatization is not recommended when you're working with a large corpus.

Exercise 4: Performing Lemmatization on Words

In this exercise, we will take an input array containing various forms of one word and convert these words into their root form.

  1. In the same Jupyter notebook as the previous exercise, import WordNetLemmatizer and download WordNet, as shown:

    from nltk.stem import WordNetLemmatizer as wnl

    nltk.download('wordnet')

  2. Create an instance of lemmatizer, as follows:

    lemmatizer = wnl()

  3. Create an array of different forms of the same word, as demonstrated:

    words = ['troubling', 'troubled', 'troubles', 'trouble']

  4. Apply lemmatizer to each of the words in the words array and store them in a new array, as follows. The word parameter provides the lemmatize function with the word it is supposed to lemmatize. The pos parameter is the part of speech you want the lemma to be. 'v' stands for verb and thus the lemmatizer will reduce the word to its closest verb form:

    # v denotes verb in "pos"

    lemmatized = [lemmatizer.lemmatize(word = word, pos = 'v') for word in words]

  5. Print the raw words and their root forms in the form of a DataFrame, as shown:

    ldf = pd.DataFrame({'raw word': words,'lemmatized': lemmatized})

    ldf = ldf[['raw word','lemmatized']]

    ldf

    Expected output:

Figure 1.13: Output of lemmatization

Tokenization

Tokenization is the process of breaking down a corpus into individual tokens. Tokens are most commonly words – so this process usually breaks a corpus down into individual words – but tokens can also include punctuation marks and spaces, among other things.

This technique is one of the most important ones since it is a prerequisite for a lot of applications of natural language processing that we will be learning about in the next chapter, such as Parts-of-Speech (PoS) tagging. These algorithms take tokens as input and can't function with strings or paragraphs of text as input.

Tokenization can be performed to obtain individual words as well as individual sentences as tokens. Let's try both of these out in the following exercises.

Exercise 5: Tokenizing Words

In this exercise, we will take an input sentence and produce individual words as tokens from it.

  1. In the same Jupyter notebook, import nltk:

    import nltk

  2. From nltk, download the punkt tokenizer models and import word_tokenize, as shown:

    nltk.download('punkt')

    from nltk import word_tokenize

  3. Store words in a variable and apply word_tokenize() on it, then print the results, as follows:

    s = "hi! my name is john."

    tokens = word_tokenize(s)

    tokens

    Expected output:

Figure 1.14: Output for the tokenization of words

As you can see, even the punctuation marks are tokenized and considered as individual tokens.

Now let's see how we can tokenize sentences.

Exercise 6: Tokenizing Sentences

In this exercise, we will take an input string containing two sentences and produce individual sentences as tokens from it.

  1. In the same Jupyter notebook, import sent_tokenize, as shown:

    from nltk import sent_tokenize

  2. Store two sentences in a variable (the string from the previous exercise actually contained two sentences, so we can use a similar one to see the difference between word and sentence tokenization) and apply sent_tokenize() on it, then print the results, as follows:

    s = "hi! my name is shubhangi."

    tokens = sent_tokenize(s)

    tokens

    Expected output:

Figure 1.15: Output for tokenizing sentences

As you can see, the two sentences have formed two individual tokens.

Additional Techniques

There are several ways to perform text preprocessing, including using a variety of Python libraries, such as BeautifulSoup, to strip away HTML markup. The previous exercises serve the purpose of introducing some of these techniques to you. Depending on the task at hand, you may need to use just one or two of them, or all of them, possibly with modifications. For example, at the noise removal stage, you may find it necessary to remove words such as 'the,' 'and,' 'this,' and 'it.' To do so, you would create an array containing these words and pass the corpus through a for loop, storing only the words that are not part of that array and thereby removing the noisy words from the corpus, as shown in the sketch below. Another way of doing this is given later in this chapter, after tokenization has been performed.
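
The following is a rough sketch of that idea; the list of words to remove and the sample sentence are illustrative assumptions, not taken from the book's corpora:

    # hypothetical list of words to treat as noise for this task
    noisy_words = ['the', 'and', 'this', 'it']

    corpus = "the weather is hot and this is making me sleepy"
    filtered = []
    for word in corpus.split():
        # keep only the words that are not part of the noisy-word list
        if word not in noisy_words:
            filtered.append(word)
    print(filtered)
    # ['weather', 'is', 'hot', 'is', 'making', 'me', 'sleepy']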

Exercise 7: Removing Stop Words

In this exercise, we will take an input sentence and remove the stop words from it.

  1. Open a Jupyter notebook, import nltk and word_tokenize, and download 'stopwords' using the following lines of code:

    import nltk

    from nltk import word_tokenize

    nltk.download('stopwords')

  2. Store a sentence in a variable, as shown:

    s = "the weather is really hot and i want to go for a swim"

  3. Import stopwords and create a set of the English stop words, as follows:

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))

  4. Tokenize the sentence using word_tokenize, and then store those tokens that do not occur in stop_words in an array. Then, print that array:

    tokens = word_tokenize(s)

    tokens = [word for word in tokens if word not in stop_words]

    print(tokens)

    Expected output:

Figure 1.16: Output after removing stopwords

Additionally, you may need to convert numbers into their word forms. This is also a method you can add to the noise removal function. Furthermore, you might need to make use of the contractions library, which serves the purpose of expanding the contractions present in the text. For example, the contractions library will convert 'you're' into 'you are,' and if this is necessary for your task, it is recommended that you install this library and use it, as sketched below.
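
The following sketch assumes the third-party contractions package (installed with pip install contractions) and its fix() function; the number-to-word mapping is just a small illustrative dictionary, not a complete solution:

    import contractions  # assumes the contractions package has been installed via pip

    print(contractions.fix("you're going for a swim"))
    # you are going for a swim

    # illustrative mapping for converting a few digits into their word forms
    number_map = {'1': 'one', '2': 'two', '3': 'three'}
    s = "i have 2 cats and 1 dog"
    print(' '.join(number_map.get(token, token) for token in s.split()))
    # i have two cats and one dog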

Text preprocessing techniques go beyond the ones that have been discussed in this chapter and can include anything and everything that is required for a task or a corpus. In some instances, some words may be important, while in others they won't be.