Mastering Natural Language Processing with Python

By: Deepti Chopra, Nisheeth Joshi, Iti Mathur

Overview of this book

Natural Language Processing (NLP) is a field of computational linguistics and artificial intelligence concerned with human-computer interaction. It gives computers the ability to understand human language with the help of machine learning. This book shows how to carry out various NLP tasks in Python and offers insight into best practices for designing and building NLP-based applications. It guides you step by step through applying machine learning tools to develop various models and through creating your own NLP projects using NLTK. It also explains how to create training data and how to implement major NLP applications such as named entity recognition, question answering, discourse analysis, transliteration, word sense disambiguation, information retrieval, sentiment analysis, text summarization, and anaphora resolution.

Normalization

Before natural language text can be processed, it usually needs to be normalized. Normalization mainly involves eliminating punctuation, converting the entire text to lowercase or uppercase, converting numbers into words, expanding abbreviations, canonicalizing the text, and so on.

Eliminating punctuation

Sometimes, while tokenizing, it is desirable to remove punctuation. Removing punctuation is one of the primary normalization tasks when working with NLTK.

Consider the following example:

>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
>>> from nltk.tokenize import word_tokenize
>>> tokenized_docs=[word_tokenize(doc) for doc in text]
>>> print(tokenized_docs)
[['It', 'is', 'a', 'pleasant', 'evening', '.'], ['Guests', ',', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was', 'tasty', '.']]

The preceding code tokenizes the text. The following code removes punctuation from the tokenized text:

>>> import re
>>> import string
>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
>>> from nltk.tokenize import word_tokenize
>>> tokenized_docs=[word_tokenize(doc) for doc in text]
>>> x=re.compile('[%s]' % re.escape(string.punctuation))
>>> tokenized_docs_no_punctuation = []
>>> for review in tokenized_docs:
    new_review = []
    for token in review:
        # Strip punctuation characters; skip tokens that become empty.
        new_token = x.sub(u'', token)
        if not new_token == u'':
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)
>>> print(tokenized_docs_no_punctuation)
[['It', 'is', 'a', 'pleasant', 'evening'], ['Guests', 'who', 'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was', 'tasty']]
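The same result can also be reached without a regular expression. The following is a minimal sketch, not the book's own method, that deletes punctuation with str.translate() from the standard library. The helper name strip_punctuation and the sample docs list are introduced here purely for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(tokenized_docs):
    """Return a copy of tokenized_docs with punctuation removed
    and tokens that become empty dropped."""
    cleaned = []
    for doc in tokenized_docs:
        stripped = (token.translate(_PUNCT_TABLE) for token in doc)
        cleaned.append([token for token in stripped if token])
    return cleaned

# Hypothetical input: already tokenized sentences.
docs = [['It', 'is', 'a', 'pleasant', 'evening', '.'],
        ['Guests', ',', 'who', 'came', 'from', 'US'],
        ['Food', 'was', 'tasty', '.']]
print(strip_punctuation(docs))
```

Like the regular expression version, this sketch also removes punctuation that occurs inside a token, so a token such as "Don't" becomes "Dont". Whether that is acceptable depends on the application.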

Conversion into lowercase and uppercase

A given text can be converted entirely into lowercase or uppercase using the string methods lower() and upper(). Case conversion is one of the normalization tasks.

Consider the following example of case conversion:

>>> text='HARdWork IS KEy to SUCCESS'
>>> print(text.lower())
hardwork is key to success
>>> print(text.upper())
HARDWORK IS KEY TO SUCCESS
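When the goal of case conversion is case-insensitive comparison rather than display, Python also provides str.casefold(), which is more aggressive than lower() for some non-English characters. The sketch below assumes that comparison is the goal; the helper name normalize_case is hypothetical.

```python
def normalize_case(tokens):
    """Return the tokens case-folded for case-insensitive comparison."""
    return [token.casefold() for token in tokens]

print(normalize_case(['HARdWork', 'IS', 'KEy']))

# lower() keeps the German sharp s, casefold() expands it to 'ss'.
print('Straße'.lower() == 'STRASSE'.lower())        # False
print('Straße'.casefold() == 'STRASSE'.casefold())  # True
```

For plain English text, lower() and casefold() give the same result, so the choice matters mainly for multilingual data.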

Dealing with stop words

Stop words are words that are filtered out during information retrieval or other natural language tasks because they contribute little to the overall meaning of a sentence. Many search engines remove stop words in order to reduce the search space. Eliminating stop words is considered one of the crucial normalization tasks in NLP.

NLTK has a list of stop words for many languages. The data file must be unzipped so that the stop word lists can be accessed from nltk_data/corpora/stopwords/ (running nltk.download('stopwords') fetches and unpacks it):

>>> import nltk
>>> from nltk.corpus import stopwords
>>> stops=set(stopwords.words('english'))
>>> words=["Don't", 'hesitate','to','ask','questions']
>>> [word for word in words if word not in stops]
["Don't", 'hesitate', 'ask', 'questions']
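Note that the comparison in the preceding example is case-sensitive, while the NLTK stop word lists are lowercase, so a capitalized token such as 'To' would not be filtered. The following sketch lowercases each word before the lookup. To stay self-contained it uses a tiny hand-written stop list, SAMPLE_STOPS, which is an assumption made for illustration; in practice the set would come from stopwords.words('english').

```python
# A tiny illustrative stop list (an assumption for this sketch);
# with NLTK this would be set(stopwords.words('english')).
SAMPLE_STOPS = {'to', 'the', 'a', 'is', 'of'}

def remove_stop_words(words, stops=SAMPLE_STOPS):
    """Drop every word whose lowercase form is in the stop list."""
    return [w for w in words if w.lower() not in stops]

print(remove_stop_words(['To', 'ask', 'is', 'THE', 'key']))
```

The original casing of the surviving words is preserved; only the membership test is case-insensitive.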

The stopwords corpus is an instance of nltk.corpus.reader.WordListCorpusReader. Its words() function takes a fileid as its argument. Here, the fileid is 'english', which refers to all the stop words present in the English file. If words() is called with no argument, it returns the stop words of all the languages.

The languages for which NLTK provides a stop word file can be listed using the fileids() function (the exact list depends on the installed version of the corpus):

>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish', 'turkish']

Any of these previously listed languages can be used as an argument to the words() function so as to get the stop words in that language.

Calculate stopwords in English

Let's see an example that lists the English stop words and then calculates the fraction of words in a text that are not stop words:

>>> import nltk
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
>>> def para_fraction(text):
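    # Use a set so that each membership test is fast.
    stops = set(nltk.corpus.stopwords.words('english'))
    para = [w for w in text if w.lower() not in stops]
    # Fraction of the text that is made up of non-stop words.
    return len(para) / len(text)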

>>> para_fraction(nltk.corpus.reuters.words())
0.7364374824583169

>>> para_fraction(nltk.corpus.inaugural.words())
0.5229560503653893
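The calculation performed by para_fraction() can be tried on a small input without downloading any corpus. The sketch below is a variant written for illustration, with the stop list passed in as a parameter and a guard for empty input. The name content_fraction and the sample data are assumptions, not part of NLTK.

```python
def content_fraction(tokens, stops):
    """Return the fraction of tokens that are not stop words.

    An empty token list yields 0.0 instead of raising ZeroDivisionError.
    """
    if not tokens:
        return 0.0
    content = [t for t in tokens if t.lower() not in stops]
    return len(content) / len(tokens)

# Hypothetical sample data: two of the six tokens are not stop words.
sample_tokens = ['The', 'cat', 'is', 'on', 'the', 'mat']
sample_stops = {'the', 'is', 'on'}
print(content_fraction(sample_tokens, sample_stops))
```

Passing the stop list as an argument makes it easy to reuse the function for any language returned by fileids().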

Normalization may also involve converting numbers into words (for example, 1 can be replaced by one) and expanding contractions and abbreviations (for instance, can't can be replaced by cannot). This can be achieved by representing them as replacement patterns, which are discussed in the next section.
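As a rough preview of the idea, replacement patterns can be expressed as pairs of a regular expression and its substitution. The sketch below is an assumption about how such patterns might look, not the code from the next section. The names REPLACEMENT_PATTERNS, NUMBER_WORDS, and expand are hypothetical, and only a few contractions and standalone single digits are handled.

```python
import re

# (pattern, replacement) pairs; specific cases come before the generic one.
REPLACEMENT_PATTERNS = [
    (re.compile(r"\bcan't\b", re.IGNORECASE), 'cannot'),
    (re.compile(r"\bwon't\b", re.IGNORECASE), 'will not'),
    (re.compile(r"(\w+)n't\b", re.IGNORECASE), r'\1 not'),
]

NUMBER_WORDS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three',
                '4': 'four', '5': 'five', '6': 'six', '7': 'seven',
                '8': 'eight', '9': 'nine'}

def expand(text):
    """Expand a few contractions and spell out standalone single digits."""
    for pattern, replacement in REPLACEMENT_PATTERNS:
        text = pattern.sub(replacement, text)
    # Only a digit standing alone as a word is replaced; '12' is left as is.
    return re.sub(r'\b[0-9]\b', lambda m: NUMBER_WORDS[m.group()], text)

print(expand("I can't find 1 book"))
```

This sketch does not preserve the capitalization of a replaced contraction and does not treat decimals such as 1.5 specially, so a production version would need more patterns.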
