Book Image

Mastering Natural Language Processing with Python

By : Deepti Chopra, Nisheeth Joshi, Iti Mathur
Book Image

Mastering Natural Language Processing with Python

By: Deepti Chopra, Nisheeth Joshi, Iti Mathur

Overview of this book

<p>Natural Language Processing is one of the fields of computational linguistics and artificial intelligence that is concerned with human-computer interaction. It provides a seamless interaction between computers and human beings and gives computers the ability to understand human speech with the help of machine learning.</p> <p>This book will give you expertise on how to employ various NLP tasks in Python, giving you an insight into the best practices when designing and building NLP-based applications using Python. It will help you become an expert in no time and assist you in creating your own NLP projects using NLTK.</p> <p>You will sequentially be guided through applying machine learning tools to develop various models. We’ll give you clarity on how to create training data and how to implement major NLP applications such as Named Entity Recognition, Question Answering System, Discourse Analysis, Transliteration, Word Sense disambiguation, Information Retrieval, Sentiment Analysis, Text Summarization, and Anaphora Resolution.</p>
Table of Contents (17 chapters)
Mastering Natural Language Processing with Python
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Preface
Index

Substituting and correcting tokens


In this section, we will discuss the replacement of tokens with other tokens. We will also about how we can correct the spelling of tokens by replacing incorrectly spelled tokens with correctly spelled tokens.

Replacing words using regular expressions

In order to remove errors or perform text normalization, word replacement is done. One way by which text replacement is done is by using regular expressions. Previously, we faced problems while performing tokenization for contractions. Using text replacement, we can replace contractions with their expanded versions. For example, doesn't can be replaced by does not.

We will begin by writing the following code, naming this program replacers.py, and saving it in the nltkdata folder:

import re
replacement_patterns = [
(r'won\'t', 'will not'),
(r'can\'t', 'cannot'),
(r'i\'m', 'i am'),
(r'ain\'t', 'is not'),
(r'(\w+)\'ll', '\g<1> will'),
(r'(\w+)n\'t', '\g<1> not'),
(r'(\w+)\'ve', '\g<1> have'),
(r'(\w+)\'s', '\g<1> is'),
(r'(\w+)\'re', '\g<1> are'),
(r'(\w+)\'d', '\g<1> would')
]
class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in
        patterns]
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s

Here, replacement patterns are defined in which the first term denotes the pattern to be matched and the second term is its corresponding replacement pattern. The RegexpReplacer class has been defined to perform the task of compiling pattern pairs and it provides a method called replace(), whose function is to perform the replacement of a pattern with another pattern.

Example of the replacement of a text with another text

Let's see an example of how we can substitute a text with another text:

>>> import nltk
>>> from replacers import RegexpReplacer
>>> replacer= RegexpReplacer()
>>> replacer.replace("Don't hesitate to ask questions")
'Do not hesitate to ask questions'
>>> replacer.replace("She must've gone to the market but she didn't go")
'She must have gone to the market but she did not go'

The function of RegexpReplacer.replace() is substituting every instance of a replacement pattern with its corresponding substitution pattern. Here, must've is replaced by must have and didn't is replaced by did not, since the replacement pattern in replacers.py has already been defined by tuple pairs, that is,(r'(\w+)\'ve', '\g<1> have') and (r'(\w+)n\'t', '\g<1> not').

We can not only perform the replacement of contractions; we can also substitute a token with any other token.

Performing substitution before tokenization

Tokens substitution can be performed prior to tokenization so as to avoid the problem that occurs during tokenization for contractions:

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> from replacers import RegexpReplacer
>>> replacer=RegexpReplacer()
>>> word_tokenize("Don't hesitate to ask questions")
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
>>> word_tokenize(replacer.replace("Don't hesitate to ask questions"))
['Do', 'not', 'hesitate', 'to', 'ask', 'questions']

Dealing with repeating characters

Sometimes, people write words involving repeating characters that cause grammatical errors. For instance consider a sentence, I like it lotttttt. Here, lotttttt refers to lot. So now, we'll eliminate these repeating characters using the backreference approach, in which a character refers to the previous characters in a group in a regular expression. This is also considered one of the normalization tasks.

Firstly, append the following code to the previously created replacers.py:

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

Example of deleting repeating characters

Let's see an example of how we can delete repeating characters from a token:

>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer=RepeatReplacer()
>>> replacer.replace('lotttt')
'lot'
>>> replacer.replace('ohhhhh')
'oh'
>>> replacer.replace('ooohhhhh')
'oh'

The RepeatReplacer class works by compiling regular expressions and replacement strings and is defined using backreference.Repeat_regexp, which is present in replacers.py. It matches the starting characters that can be zero or many (\w*), ending characters that can be zero or many (\w*), or a character (\w)that is followed by same character.

For example, lotttt is split into (lo)(t)t(tt). Here, one t is reduced and the string becomes lottt. The process of splitting continues, and finally, the resultant string obtained is lot.

The problem with RepeatReplacer is that it will convert happy to hapy, which is inappropriate. To avoid this problem, we can embed wordnet along with it.

In the replacers.py program created previously, add the following lines to include wordnet:

import re
from nltk.corpus import wordnet 
class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    def replace(self, word):
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

Now, let's take a look at how the previously mentioned problem can be overcome:

>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer=RepeatReplacer()
>>> replacer.replace('happy')
'happy'

Replacing a word with its synonym

Now we will see how we can substitute a given word by its synonym. To the already existing replacers.py, we can add a class called WordReplacer that provides mapping between a word and its synonym:

class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map
    def replace(self, word):
        return self.word_map.get(word, word)

Example of substituting word a with its synonym

Let's have a look at an example of substituting a word with its synonym:


>>> import nltk
>>> from replacers import WordReplacer
>>> replacer=WordReplacer({'congrats':'congratulations'})
>>> replacer.replace('congrats')
'congratulations'
>>> replacer.replace('maths')
'maths'

In this code, the replace() function looks for the corresponding synonym for a word in word_map. If the synonym is present for a given word, then the word will be replaced by its synonym. If the synonym for a given word is not present, then no replacement will be performed; the same word will be returned.