Python Natural Language Processing Cookbook

By: Zhenya Antić

Overview of this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization. Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and parts of speech tagging to help you to prepare your data. You’ll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing the semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you’ll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data. By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.

Counting nouns – plural and singular nouns

In this recipe, we will do two things:

Determine whether a noun is plural or singular
Turn plural nouns into singular nouns and vice versa

You might need these two things in a variety of tasks: in making your chatbot speak in grammatically correct sentences, in coming up with text classification features, and so on.

Getting ready

We will be using nltk for this task, as well as the inflect module described in the Technical requirements section. The code for this chapter is located in the Chapter02 directory of this book's GitHub repository. We will be working with the first part of The Adventures of Sherlock Holmes, available in the sherlock_holmes_1.txt file.

How to do it…

We will be using code from Chapter 1, Learning NLP Basics, to tokenize the text into words and tag them with parts of speech. Then, we will use one of two ways to determine if a noun is singular or plural, and then use the inflect module to change the number of the noun.

The steps are as follows:

1. Do the necessary imports:

import nltk
from nltk.stem import WordNetLemmatizer
import inflect
from Chapter01.pos_tagging import pos_tag_nltk

2. Read in the text file:

filename = "sherlock_holmes_1.txt"  # adjust the path to where the file is stored
with open(filename, "r", encoding="utf-8") as file:
    sherlock_holmes_text = file.read()

3. Remove newlines for better readability:

sherlock_holmes_text = sherlock_holmes_text.replace("\n", " ")

4. Do part of speech tagging:

words_with_pos = pos_tag_nltk(sherlock_holmes_text)

5. Define the get_nouns function, which will filter out the nouns from all the words:

def get_nouns(words_with_pos):
    noun_set = ["NN", "NNS"]
    nouns = [word for word in words_with_pos if 
             word[1] in noun_set]
    return nouns

6. Run the preceding function on the list of POS-tagged words and print it:

nouns = get_nouns(words_with_pos)
print(nouns)

The resulting list will be as follows:

[('woman', 'NN'), ('name', 'NN'), ('eyes', 'NNS'), ('whole', 'NN'), ('sex', 'NN'), ('emotion', 'NN'), ('akin', 'NN'), ('emotions', 'NNS'), ('cold', 'NN'), ('precise', 'NN'), ('mind', 'NN'), ('reasoning', 'NN'), ('machine', 'NN'), ('world', 'NN'), ('lover', 'NN'), ('position', 'NN'), ('passions', 'NNS'), ('gibe', 'NN'), ('sneer', 'NN'), ('things', 'NNS'), ('observer—excellent', 'NN'), ('veil', 'NN'), ('men', 'NNS'), ('motives', 'NNS'), ('actions', 'NNS'), ('reasoner', 'NN'), ('intrusions', 'NNS'), ('delicate', 'NN'), ('temperament', 'NN'), ('distracting', 'NN'), ('factor', 'NN'), ('doubt', 'NN'), ('results', 'NNS'), ('instrument', 'NN'), ('crack', 'NN'), ('high-power', 'NN'), ('lenses', 'NNS'), ('emotion', 'NN'), ('nature', 'NN'), ('woman', 'NN'), ('woman', 'NN'), ('memory', 'NN')]

7. To determine whether a noun is singular or plural, we have two options. The first option is to use the NLTK tags, where NN indicates a singular noun and NNS indicates a plural noun. The following function uses the NLTK tags and returns True if the input noun is plural:
```
def is_plural_nltk(noun_info):
    # noun_info is a (word, tag) tuple; NNS marks a plural noun
    pos = noun_info[1]
    return pos == "NNS"
```
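
Since this recipe is about counting nouns, the tag-based check also gives a direct way to tally singular and plural forms. The following is a minimal sketch and not code from the book's repository: count_noun_numbers is a hypothetical helper name, and the sample list is a handful of entries copied from the output in the previous step.

```python
from collections import Counter


def count_noun_numbers(nouns_with_pos):
    """Tally nouns by number, using the tag in each (word, tag) tuple."""
    # NNS is counted as plural; every other tag in the list as singular
    labels = ("plural" if tag == "NNS" else "singular"
              for _, tag in nouns_with_pos)
    return Counter(labels)


sample_nouns = [('woman', 'NN'), ('name', 'NN'), ('eyes', 'NNS'),
                ('emotions', 'NNS'), ('men', 'NNS'), ('memory', 'NN')]
counts = count_noun_numbers(sample_nouns)
print(counts["singular"], counts["plural"])  # 3 3
```

Because the count relies only on the tags, any tagging mistake (for example, an adjective labeled NN) is counted as a noun as well.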

8. The other option is to use the WordNetLemmatizer class in the nltk.stem package. The following function returns True if the noun is plural:

def is_plural_wn(noun):
    wnl = WordNetLemmatizer()
    lemma = wnl.lemmatize(noun, 'n')
    # Compare by value: a noun that differs from its lemma is treated as plural
    return noun != lemma
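
One detail worth checking in a lemma comparison is how the two strings are compared. The sketch below is an illustration only and does not call WordNet: the lemma is assembled at run time to stand in for a lemmatizer's return value, and differs_from_lemma is a hypothetical name. It shows that a value comparison (!=) gives the intended answer whether or not the two strings are the same object in memory, which an identity check (is not) does not guarantee.

```python
def differs_from_lemma(noun, lemma):
    """Return True when the noun and its lemma have different text."""
    return noun != lemma


# Stand-in for a lemmatizer result: equal text, assembled at run time
lemma = "".join(["wo", "man"])

print(differs_from_lemma("woman", lemma))  # False: same text, so singular
print(differs_from_lemma("eyes", "eye"))   # True: text differs, so plural
```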

9. The following function will change a singular noun into plural:

def get_plural(singular_noun):
    p = inflect.engine()
    return p.plural(singular_noun)

10. The following function will change a plural noun into singular:

def get_singular(plural_noun):
    p = inflect.engine()
    # singular_noun returns False when it finds no singular form
    singular = p.singular_noun(plural_noun)
    if singular:
        return singular
    else:
        return plural_noun

11. We can now use the two preceding functions to return a list of nouns changed into plural or singular, depending on the original noun. The following code uses the is_plural_wn function to determine if the noun is plural. You can also use the is_plural_nltk function:

def plurals_wn(words_with_pos):
    other_nouns = []
    for noun_info in words_with_pos:
        word = noun_info[0]
        plural = is_plural_wn(word)
        if (plural):
            singular = get_singular(word)
            other_nouns.append(singular)
        else:
            plural = get_plural(word)
            other_nouns.append(plural)
    return other_nouns

12. Use the preceding function to return a list of changed nouns:

other_nouns_wn = plurals_wn(nouns)

The result will be as follows:

['women', 'names', 'eye', 'wholes', 'sexes', 'emotions', 'akins', 'emotion', 'colds', 'precises', 'minds', 'reasonings', 'machines', 'worlds', 'lovers', 'positions', 'passion', 'gibes', 'sneers', 'thing', 'observer—excellents', 'veils', 'mens', 'motive', 'action', 'reasoners', 'intrusion', 'delicates', 'temperaments', 'distractings', 'factors', 'doubts', 'result', 'instruments', 'cracks', 'high-powers', 'lens', 'emotions', 'natures', 'women', 'women', 'memories']

How it works…

Number detection works in one of two ways. One is by reading the part of speech tag assigned by NLTK. If the tag is NN, then the noun is singular, and if it is NNS, then it's plural. The other way is to use the WordNet lemmatizer and to compare the lemma and the original word. The noun is singular if the lemma and the original input noun are the same, and plural otherwise.

To find the singular form of a plural noun and the plural form of a singular noun, we can use the inflect package. Its plural and singular_noun methods return the correct forms.

In step 1, we import the necessary modules and functions. You can find the pos_tag_nltk function in this book's GitHub repository, in the Chapter01 module, in the pos_tagging.py file. It uses the code we wrote for Chapter 1, Learning NLP Basics. In step 2, we read the file's contents into a string. In step 3, we remove newlines from the text; this is an optional step. In step 4, we use the pos_tag_nltk function defined in the code from the previous chapter to tag parts of speech for the words.

In step 5, we create the get_nouns function, which filters out the words that are singular or plural nouns. In this function, we use a list comprehension and keep only words that have the NN or NNS tags.

In step 6, we run the preceding function on the word list and print the result. As you will notice, NLTK tags several words incorrectly as nouns, such as cold and precise. These errors will propagate into the next steps, and it is something to keep in mind when working with NLP tasks.

In steps 7 and 8, we define two functions to determine whether a noun is singular or plural. In step 7, we define the is_plural_nltk function, which uses NLTK POS tagging information to determine if the noun is plural. In step 8, we define the is_plural_wn function, which compares the noun with its lemma, as determined by the NLTK lemmatizer. If those two forms are the same, the noun is singular, and if they are different, the noun is plural. Both functions can return incorrect results that will propagate downstream.

In step 9, we define the get_plural function, which will return the plural form of the noun by using the inflect package. In step 10, we define the get_singular function, which uses the same package to get the singular form of the noun. If there is no output from inflect, the function returns the input.

In step 11, we define the plurals_wn function, which takes in a list of words with the parts of speech that we got in step 6 and changes plural nouns into singular and singular nouns into plural.

In step 12, we run the plurals_wn function on the nouns list. Most of the words are changed correctly; for example, woman becomes women and emotions becomes emotion. We also see two kinds of error propagation, where either the part of speech or the number of the noun was determined incorrectly. For example, the word akins appears here because akin was incorrectly labeled as a noun. On the other hand, the word men was incorrectly determined to be singular, which resulted in the wrong output, mens.
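
To make this kind of error propagation easier to spot, it can help to view each original noun next to its converted form. The sketch below uses a few pairs copied from the outputs of steps 6 and 12. The flag_double_plurals helper is hypothetical, and its rule (a noun tagged NNS whose converted form still starts with the original word was probably pluralized a second time) is an assumed heuristic, not something taken from the book.

```python
def flag_double_plurals(nouns_with_pos, converted):
    """Return (word, tag, new_form) triples that look pluralized twice."""
    flagged = []
    for (word, tag), new_form in zip(nouns_with_pos, converted):
        # Heuristic: tagged plural, yet the output merely extends the input
        if tag == "NNS" and new_form != word and new_form.startswith(word):
            flagged.append((word, tag, new_form))
    return flagged


nouns = [('woman', 'NN'), ('akin', 'NN'), ('eyes', 'NNS'), ('men', 'NNS')]
other_nouns_wn = ['women', 'akins', 'eye', 'mens']
print(flag_double_plurals(nouns, other_nouns_wn))  # [('men', 'NNS', 'mens')]
```

This check catches disagreements about number, such as men, but not part of speech errors such as akin, because there the tag itself is the source of the mistake.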

There's more…

The results will differ, depending on which function you use to detect number: is_plural_nltk or is_plural_wn. If you tag the word men with its part of speech, you will see that NLTK returns the NNS tag, which means that the word is plural. You can experiment with different inputs and see which function works best for you.
