Book Image

Python 3 Text Processing with NLTK 3 Cookbook

By : Jacob Perkins
Book Image

Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Overview of this book

Table of Contents (17 chapters)
Python 3 Text Processing with NLTK 3 Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Penn Treebank Part-of-speech Tags
Index

Filtering insignificant words from a sentence


Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase the movie was terrible, the most significant words are movie and terrible, while the and was are almost useless. You could get the same meaning if you took them out, that is, movie terrible or terrible movie. Either way, the sentiment is the same. In this recipe, we'll learn how to remove the insignificant words and keep the significant ones by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

Word

Tag

a

DT

all

PDT

an

DT

and

CC

or

CC

that

WDT

the

DT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag's suffix. Refer to Appendix A, Penn Treebank...