Python Natural Language Processing Cookbook

By: Zhenya Antić

Overview of this book

Python is the most widely used language for natural language processing (NLP) thanks to its extensive tools and libraries for analyzing text and extracting computer-usable data. This book will take you through a range of techniques for text processing, from basics such as parsing the parts of speech to complex topics such as topic modeling, text classification, and visualization. Starting with an overview of NLP, the book presents recipes for dividing text into sentences, stemming and lemmatization, removing stopwords, and parts of speech tagging to help you to prepare your data. You’ll then learn ways of extracting and representing grammatical information, such as dependency parsing and anaphora resolution, discover different ways of representing the semantics using bag-of-words, TF-IDF, word embeddings, and BERT, and develop skills for text classification using keywords, SVMs, LSTMs, and other techniques. As you advance, you’ll also see how to extract information from text, implement unsupervised and supervised techniques for topic modeling, and perform topic modeling of short texts, such as tweets. Additionally, the book shows you how to develop chatbots using NLTK and Rasa and visualize text data. By the end of this NLP book, you’ll have developed the skills to use a powerful set of tools for text processing.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the color images

Conventions used

Sections

Get in touch

Reviews

Chapter 1: Learning NLP Basics

Technical requirements

Dividing text into sentences

Dividing sentences into words – tokenization

Parts of speech tagging

Word stemming

Combining similar words – lemmatization

Removing stopwords

Chapter 2: Playing with Grammar

Technical requirements

Counting nouns – plural and singular nouns

Getting the dependency parse

Splitting sentences into clauses

Extracting noun chunks

Extracting entities and relations

Extracting subjects and objects of the sentence

Finding references – anaphora resolution

Chapter 3: Representing Text – Capturing Semantics

Technical requirements

Putting documents into a bag of words

Constructing the N-gram model

Representing texts with TF-IDF

Using word embeddings

Training your own embeddings model

Representing phrases – phrase2vec

Using BERT instead of word embeddings

Getting started with semantic search

Chapter 4: Classifying Texts

Technical requirements

Getting the dataset and evaluation baseline ready

Performing rule-based text classification using keywords

Clustering sentences using K-means – unsupervised text classification

Using SVMs for supervised text classification

Using LSTMs for supervised text classification

Chapter 5: Getting Started with Information Extraction

Technical requirements

Using regular expressions

Performing named entity recognition using spaCy

Training your own NER model with spaCy

Discovering sentiment analysis

Sentiment for short texts using LSTM: Twitter

Using BERT for sentiment analysis

Chapter 6: Topic Modeling

Technical requirements

LDA topic modeling with sklearn

LDA topic modeling with gensim

NMF topic modeling

K-means topic modeling with BERT

Topic modeling of short texts

Chapter 7: Building Chatbots

Technical requirements

Building a basic chatbot with keyword matching

Building a basic Rasa chatbot

Creating question-answer pairs with Rasa

Creating and visualizing conversation paths with Rasa

Creating actions for the Rasa chatbot

Chapter 8: Visualizing Text Data

Technical requirements

Visualizing the dependency parse

Visualizing parts of speech

Visualizing NER

Constructing word clouds

Visualizing topics

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Leave a review - let other readers know what you think

Getting the dependency parse

A dependency parse shows the dependencies between the words in a sentence. For example, in the sentence The cat wore a hat, the root of the sentence is the verb, wore, and both the subject, the cat, and the object, a hat, are its dependents. The dependency parse is useful in many NLP tasks because it exposes the grammatical structure of the sentence: the subject, the main verb, the object, and so on. This structure can then be used in downstream processing.
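As a rough illustration of the idea, the parse of The cat wore a hat can be written down by hand as a list of (word, relation, head index) entries. This is only a sketch: the labels (det, nsubj, dobj, ROOT) are assumed to follow the scheme spaCy uses for English, and the structure is hand-written here rather than produced by a parser.

```python
# Hand-written dependency parse of "The cat wore a hat".
# Each entry is (word, dependency label, index of the head word).
# By convention in this sketch, the root points to itself.
parse = [
    ("The", "det", 1),
    ("cat", "nsubj", 2),
    ("wore", "ROOT", 2),
    ("a", "det", 4),
    ("hat", "dobj", 2),
]

# The root is the word whose label is ROOT
root_index = next(i for i, (_, dep, _) in enumerate(parse) if dep == "ROOT")
root_word = parse[root_index][0]

# Direct dependents of the root: every other word whose head is the root
dependents = [
    (word, dep)
    for i, (word, dep, head) in enumerate(parse)
    if head == root_index and i != root_index
]

print(root_word)   # wore
print(dependents)  # [('cat', 'nsubj'), ('hat', 'dobj')]
```

The determiners The and a do not appear among the root's dependents because they attach to their nouns, cat and hat, rather than to the verb.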

Getting ready

We will use spacy to create the dependency parse. If you already downloaded it while working on the previous chapter, you do not need to do anything more. Otherwise, please follow the instructions at the beginning of Chapter 1, Learning NLP Basics, to install the necessary packages.

How to do it…

We will take a few sentences from the sherlock_holmes1.txt file to illustrate the dependency parse. The steps are as follows:

Import spacy:
```
import spacy
```

Load the sentence to be parsed:

```
sentence = 'I have seldom heard him mention her under any other name.'
```

Load the spacy engine:
```
nlp = spacy.load('en_core_web_sm')
```
Process the sentence using the spacy engine:
```
doc = nlp(sentence)
```
The dependency information will be contained in the doc object. We can see the dependency tags by looping through the tokens in doc:
```
for token in doc:
    print(token.text, "\t", token.dep_, "\t",
          spacy.explain(token.dep_))
```

The result will be as follows. The third column comes from spaCy's explain function, which gives a short description of each tag:

```
I        nsubj   nominal subject
have     aux     auxiliary
seldom   advmod  adverbial modifier
heard    ROOT    None
him      nsubj   nominal subject
mention  ccomp   clausal complement
her      dobj    direct object
under    prep    prepositional modifier
any      det     determiner
other    amod    adjectival modifier
name     pobj    object of preposition
.        punct   punctuation
```

To explore the dependency parse structure, we can use the attributes of the Token class. Using its ancestors and children attributes, we can get the tokens that this token depends on and the tokens that depend on it, respectively. The code to get these ancestors is as follows:

```
for token in doc:
    print(token.text)
    ancestors = [t.text for t in token.ancestors]
    print(ancestors)
```

The output will be as follows:

I
['heard']
have
['heard']
seldom
['heard']
heard
[]
him
['mention', 'heard']
mention
['heard']
her
['mention', 'heard']
under
['mention', 'heard']
any
['name', 'under', 'mention', 'heard']
other
['name', 'under', 'mention', 'heard']
name
['under', 'mention', 'heard']
.
['heard']

To see all the child tokens of each token, use the following code:

```
for token in doc:
    print(token.text)
    children = [t.text for t in token.children]
    print(children)
```

The output will be as follows:

I
[]
have
[]
seldom
[]
heard
['I', 'have', 'seldom', 'mention', '.']
him
[]
mention
['him', 'her', 'under']
her
[]
under
['name']
any
[]
other
[]
name
['any', 'other']
.
[]

We can also see the subtree of each token, that is, the token itself together with all of the tokens that depend on it, directly or indirectly:

```
for token in doc:
    print(token.text)
    subtree = [t.text for t in token.subtree]
    print(subtree)
```

This will produce the following output:

I
['I']
have
['have']
seldom
['seldom']
heard
['I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other', 'name', '.']
him
['him']
mention
['him', 'mention', 'her', 'under', 'any', 'other', 'name']
her
['her']
under
['under', 'any', 'other', 'name']
any
['any']
other
['other']
name
['any', 'other', 'name']
.
['.']

How it works…

The spacy NLP engine does the dependency parse as part of its overall analysis. The dependency parse tags explain the role of each word in the sentence. ROOT is the main word that all the other words depend on, usually the verb.

From the subtree of each word, we can see the grammatical phrases that appear in the sentence, such as the noun phrase (NP) any other name and the prepositional phrase (PP) under any other name.
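To make the link between subtrees and phrases concrete, here is a minimal pure-Python sketch that rebuilds the subtrees from head indices and joins them into phrases. The heads list is transcribed from the ancestors output shown in this recipe, and the helper names subtree and phrase are hypothetical; this imitates what token.subtree returns and is not spaCy's own implementation.

```python
# Tokens of the example sentence and the index of each token's head,
# transcribed from the ancestors output shown in this recipe.
# The root ("heard") points to itself in this sketch.
words = ["I", "have", "seldom", "heard", "him", "mention",
         "her", "under", "any", "other", "name", "."]
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]


def subtree(index):
    """Return the indices of a token and all its descendants, in sentence order."""
    result = [index]
    for child, head in enumerate(heads):
        if head == index and child != index:
            result.extend(subtree(child))
    return sorted(result)


def phrase(word):
    """Join the subtree of the given word into a phrase."""
    return " ".join(words[i] for i in subtree(words.index(word)))


print(phrase("name"))   # any other name
print(phrase("under"))  # under any other name
```

Sorting the indices keeps the words in their original sentence order, which is why the subtree of name reads as a well-formed noun phrase.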

The dependency chain can be seen by following the ancestor links for each word. For example, if we look at the word name, we will see that its ancestors are under, mention, and heard. The immediate parent of name is under, under's parent is mention, and mention's parent is heard. A dependency chain will always lead to the root, or the main word, of the sentence.
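The chain described above can also be walked one link at a time using the head attribute of Token, which holds a token's immediate parent. The sketch below makes two assumptions: it relies on spaCy v3, where Doc accepts heads and deps arguments directly, and it builds the Doc by hand from the tags and ancestors printed in this recipe so that it does not depend on a downloaded model. The helper chain_to_root is a hypothetical name.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Words, head indices, and dependency labels transcribed from the
# output shown in this recipe (the root, "heard", is its own head).
words = ["I", "have", "seldom", "heard", "him", "mention",
         "her", "under", "any", "other", "name", "."]
heads = [3, 3, 3, 3, 5, 3, 5, 5, 10, 10, 7, 3]
deps = ["nsubj", "aux", "advmod", "ROOT", "nsubj", "ccomp",
        "dobj", "prep", "det", "amod", "pobj", "punct"]

# Assumption: spaCy v3, where Doc accepts heads and deps directly
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)


def chain_to_root(token):
    """Follow token.head links upward until the root is reached."""
    chain = []
    while token.head.i != token.i:  # the root is its own head
        token = token.head
        chain.append(token.text)
    return chain


name = doc[10]
print(name.head.text)                     # under
print(chain_to_root(name))                # ['under', 'mention', 'heard']
print([t.text for t in name.ancestors])   # ['under', 'mention', 'heard']
```

With a loaded model, the same chain_to_root function can be applied to the tokens of the doc object created in the recipe.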

In step 1, we import the spacy package. In step 2, we initialize the variable sentence that contains the sentence to be parsed. In step 3, we load the spacy engine and in step 4, we use the engine to process the sentence.

In step 5, we print out each token's dependency tag and use the spacy.explain function to see what those tags mean.

In step 6, we print out the ancestors of each token. The ancestors will start at the parent and go up until they reach the root. For example, the parent of him is mention, and the parent of mention is heard, so both mention and heard are listed as ancestors of him.

In step 7, we print the children of each token. Some tokens, such as have, do not have any children, while others have several. The one token that will always have children, unless the sentence consists of a single word, is the root of the sentence; in this case, heard.

In step 8, we print the subtree of each token. For example, the subtree of the word under is under any other name.
