Mastering Natural Language Processing with Python

By : Deepti Chopra, Nisheeth Joshi, Iti Mathur

Overview of this book

Natural Language Processing is one of the fields of computational linguistics and artificial intelligence that is concerned with human-computer interaction. It provides a seamless interaction between computers and human beings and gives computers the ability to understand human speech with the help of machine learning. This book will give you expertise on how to employ various NLP tasks in Python, giving you an insight into the best practices when designing and building NLP-based applications using Python. It will help you become an expert in no time and assist you in creating your own NLP projects using NLTK. You will sequentially be guided through applying machine learning tools to develop various models. We’ll give you clarity on how to create training data and how to implement major NLP applications such as Named Entity Recognition, Question Answering System, Discourse Analysis, Transliteration, Word Sense disambiguation, Information Retrieval, Sentiment Analysis, Text Summarization, and Anaphora Resolution.

Substituting and correcting tokens

In this section, we will discuss the replacement of tokens with other tokens. We will also discuss how to correct the spelling of tokens by replacing incorrectly spelled tokens with correctly spelled ones.

Replacing words using regular expressions

Word replacement is used to remove errors or to normalize text. One way to perform text replacement is with regular expressions. Previously, we faced problems while tokenizing contractions. Using text replacement, we can replace contractions with their expanded versions. For example, doesn't can be replaced by does not.

We will begin by writing the following code, naming this program replacers.py, and saving it in the nltkdata folder:

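import re

# Each pair is (pattern to match, replacement text). The specific
# contractions come first so the generic rules do not pre-empt them.
# Replacements are raw strings so that \g<1> is not treated as a
# string escape sequence.
replacement_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile every pattern once so that replace() can reuse it.
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        # Apply the pattern pairs in order; later pairs see earlier results.
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s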

Here, each replacement pattern is a pair in which the first term is the pattern to be matched and the second term is its replacement. The RegexpReplacer class compiles these pattern pairs and provides a replace() method, which applies each pair to the given text in turn.

Example of the replacement of a text with another text

Let's see an example of how we can substitute a text with another text:

>>> import nltk
>>> from replacers import RegexpReplacer
>>> replacer = RegexpReplacer()
>>> replacer.replace("Don't hesitate to ask questions")
'Do not hesitate to ask questions'
>>> replacer.replace("She must've gone to the market but she didn't go")
'She must have gone to the market but she did not go'

RegexpReplacer.replace() substitutes every match of a replacement pattern with its corresponding replacement text. Here, must've is replaced by must have and didn't is replaced by did not, because replacers.py already defines pattern pairs for the 've and n't endings. In both pairs, the group (\w+) captures the text before the contraction, and \g<1> inserts that text back into the replacement.
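To see how \g<1> carries the captured prefix into the replacement, the following minimal sketch applies a single pattern pair directly with subn(), outside the RegexpReplacer class. The variable names and the sample sentence are illustrative only.

```python
import re

# One pattern pair: group 1 captures the text before n't, and \g<1>
# re-inserts that text in the replacement string.
pattern = re.compile(r"(\w+)n't")
replacement = r"\g<1> not"

# subn() returns the new string together with the number of substitutions.
result, count = pattern.subn(replacement, "She didn't go and he wasn't there")
print(result)  # She did not go and he was not there
print(count)   # 2
```

The count returned by subn() is handy when you want to know whether a pattern fired at all; when only the text is needed, sub() is enough.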

Replacement is not limited to contractions; the same approach can substitute any token with any other token.
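As a hedged illustration of substituting arbitrary tokens, the sketch below applies a custom list of pattern pairs. The pairs and the replace_all() helper are hypothetical and are not part of replacers.py; with the class shown earlier, the same list could be passed through its patterns argument.

```python
import re

# Hypothetical pattern pairs, for illustration only. The \b anchors keep
# the patterns from matching inside longer words.
custom_patterns = [
    (r"\bu\b", "you"),
    (r"\bgr8\b", "great"),
]

def replace_all(text, patterns):
    """Apply each (regex, replacement) pair to the text in order."""
    for regex, repl in patterns:
        text = re.sub(regex, repl, text)
    return text

print(replace_all("u did a gr8 job", custom_patterns))  # you did a great job
```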

Performing substitution before tokenization

Token substitution can be performed prior to tokenization to avoid the problem that occurs when contractions are tokenized:

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> from replacers import RegexpReplacer
>>> replacer = RegexpReplacer()
>>> word_tokenize("Don't hesitate to ask questions")
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
>>> word_tokenize(replacer.replace("Don't hesitate to ask questions"))
['Do', 'not', 'hesitate', 'to', 'ask', 'questions']

Dealing with repeating characters

Sometimes, people write words with repeated characters, which results in misspellings. For instance, consider the sentence I like it lotttttt. Here, lotttttt refers to lot. We can eliminate these repeated characters using a backreference, which lets a regular expression refer to text that one of its earlier groups has already matched. This is also considered one of the normalization tasks.

Firstly, append the following code to the previously created replacers.py:

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

Example of deleting repeating characters

Let's see an example of how we can delete repeating characters from a token:

>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer = RepeatReplacer()
>>> replacer.replace('lotttt')
'lot'
>>> replacer.replace('ohhhhh')
'oh'
>>> replacer.replace('ooohhhhh')
'oh'

The RepeatReplacer class compiles a regular expression and a replacement string that both rely on backreferences. The repeat_regexp pattern in replacers.py matches zero or more leading characters (\w*), then a character (\w) that is followed by the same character (\2), and then zero or more trailing characters (\w*). The replacement string keeps the three groups but drops the duplicated character.

For example, because the leading (\w*) is greedy, lotttt is split into (lott)(t)t(). Here, one t is removed and the string becomes lottt. The process repeats on the result until no repeated character remains, and finally, the resultant string obtained is lot.
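The step-by-step reduction can be made visible with a small sketch that records every intermediate string. The trace_repeats() helper is an assumption introduced here for illustration; it uses the same regular expression as RepeatReplacer but loops instead of recursing.

```python
import re

# Same pattern as RepeatReplacer: \2 must repeat the character in group 2.
repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")

def trace_repeats(word):
    """Return every intermediate string produced while collapsing repeats."""
    steps = [word]
    while True:
        # Each pass drops one duplicated character from the word.
        reduced = repeat_regexp.sub(r"\1\2\3", steps[-1])
        if reduced == steps[-1]:
            return steps
        steps.append(reduced)

print(trace_repeats("lotttt"))  # ['lotttt', 'lottt', 'lott', 'lot']
```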

The problem with RepeatReplacer is that it will convert happy to hapy, which is inappropriate. To avoid this problem, we can use WordNet to check whether a word is already valid before removing any characters.

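In the replacers.py program created previously, update the RepeatReplacer class as follows so that it consults WordNet. The WordNet corpus must be available locally; it can be installed with nltk.download('wordnet'):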

import re
from nltk.corpus import wordnet 
class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    def replace(self, word):
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

Now, let's take a look at how the previously mentioned problem can be overcome:

>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer = RepeatReplacer()
>>> replacer.replace('happy')
'happy'
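The effect of the dictionary check can be sketched without the WordNet corpus by substituting a plain set of known words. The VocabRepeatReplacer class and its vocabulary are assumptions made for illustration; in replacers.py the check is wordnet.synsets(word).

```python
import re

class VocabRepeatReplacer:
    """Repeat replacer that stops at the first word found in a vocabulary.

    The vocabulary set is a stand-in for the WordNet lookup, used here
    only so that the sketch runs without any corpus installed.
    """

    def __init__(self, vocabulary):
        self.vocabulary = set(vocabulary)
        self.repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")
        self.repl = r"\1\2\3"

    def replace(self, word):
        # A known word is returned as-is, so valid doubles survive.
        if word in self.vocabulary:
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        return repl_word

replacer = VocabRepeatReplacer({"happy", "lot"})
print(replacer.replace("happy"))    # happy
print(replacer.replace("happppy"))  # happy
print(replacer.replace("lotttt"))   # lot
```

Because the check runs at every recursive step, the reduction stops as soon as it reaches a known word, rather than continuing until every doubled character is gone.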

Replacing a word with its synonym

Now we will see how we can substitute a given word with its synonym. To the already existing replacers.py, we can add a class called WordReplacer that provides a mapping between a word and its synonym:

class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map
    def replace(self, word):
        return self.word_map.get(word, word)

Example of substituting a word with its synonym

Let's have a look at an example of substituting a word with its synonym: 

>>> import nltk
>>> from replacers import WordReplacer
>>> replacer = WordReplacer({'congrats': 'congratulations'})
>>> replacer.replace('congrats')
'congratulations'
>>> replacer.replace('maths')
'maths'

In this code, the replace() function looks up the given word in word_map. If a synonym is present, the word is replaced by that synonym; otherwise, no replacement is performed and the same word is returned.
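In practice, the replacer is usually applied to a whole list of tokens rather than to a single word. The sketch below repeats the WordReplacer class so that it is self-contained; the mapping and the token list are hypothetical examples.

```python
class WordReplacer:
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        # dict.get() falls back to the word itself when no mapping exists.
        return self.word_map.get(word, word)

# Hypothetical mapping and tokens, for illustration only.
replacer = WordReplacer({"congrats": "congratulations", "bday": "birthday"})
tokens = ["congrats", "on", "your", "bday"]
normalized = [replacer.replace(token) for token in tokens]
print(normalized)  # ['congratulations', 'on', 'your', 'birthday']
```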
