Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Looking up Synsets for a word in WordNet


WordNet is a lexical database for the English language. In other words, it's a dictionary designed specifically for natural language processing.

NLTK comes with a simple interface to look up words in WordNet. What you get is a list of Synset instances, which are groupings of synonymous words that express the same concept. Many words have only one Synset, but some have several. In this recipe, we'll explore a single Synset, and in the next recipe, we'll look at several in more detail.

Getting ready

Be sure you've unzipped the wordnet corpus at nltk_data/corpora/wordnet. This will allow WordNetCorpusReader to access it.
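
If you're not sure whether the corpus is in place, a quick check along the following lines may help. This is only a sketch: wordnet_is_installed is a hypothetical helper name, and it relies on nltk.data.find() raising LookupError when a resource cannot be found on the NLTK data path.

```python
def wordnet_is_installed():
    """Return True if NLTK can locate the wordnet corpus on its data path."""
    try:
        import nltk  # imported lazily so the check also works without NLTK
    except ImportError:
        return False
    try:
        # Searches the directories listed in nltk.data.path
        nltk.data.find('corpora/wordnet')
    except LookupError:
        return False
    return True


if not wordnet_is_installed():
    print("wordnet corpus not found; try nltk.download('wordnet')")
```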

How to do it...

Now we're going to look up the Synset for cookbook, and explore some of the properties and methods of a Synset using the following code:

>>> from nltk.corpus import wordnet
>>> syn = wordnet.synsets('cookbook')[0]
>>> syn.name()
'cookbook.n.01'
>>> syn.definition()
'a book of recipes and cooking directions'

How it works...

You can look up any word in WordNet using wordnet.synsets(word) to get a list of Synsets. The list may be empty if the word is not found. The list may also have quite a few elements, as some words can have many possible meanings, and, therefore, many Synsets.
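
Because the list can be empty, indexing it with [0] as we did previously fails for words WordNet doesn't know. The following sketch loops over whatever comes back instead. The name definitions_for and its lookup parameter are illustrative assumptions, not part of NLTK; lookup defaults to wordnet.synsets.

```python
def definitions_for(word, lookup=None):
    """Return (name, definition) pairs for every Synset of `word`.

    `lookup` is any callable that maps a word to a list of Synset-like
    objects; when omitted, NLTK's wordnet.synsets is used.
    """
    if lookup is None:
        # Imported lazily; requires NLTK and the wordnet corpus
        from nltk.corpus import wordnet
        lookup = wordnet.synsets
    # An unknown word gives an empty list, so the result is simply empty
    return [(syn.name(), syn.definition()) for syn in lookup(word)]
```

With the corpus installed, definitions_for('cookbook') should include the 'cookbook.n.01' name and definition shown earlier, and a word that isn't in WordNet should give an empty list rather than an IndexError.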

There's more...

Each Synset in the list has a number of methods you can use to learn more about it. The name() method will give you a unique name for the Synset, which you can use to get the Synset directly:

>>> wordnet.synset('cookbook.n.01')
Synset('cookbook.n.01')

The definition() method should be self-explanatory. The examples() method returns a list of phrases that use the word in context; for some Synsets, this list is empty:

>>> wordnet.synsets('cooking')[0].examples()
['cooking can be a great art', 'people are needed who have experience in cookery', 'he left the preparation of meals to his wife']
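
To gather examples across all the Synsets of a word, something like the following may be useful. It's a minimal sketch: examples_by_synset is a hypothetical helper, and it only assumes that each object passed in offers the name() and examples() methods shown in this recipe.

```python
def examples_by_synset(synsets):
    """Map each Synset name to its example phrases, skipping Synsets with none."""
    found = {}
    for syn in synsets:
        phrases = syn.examples()  # a list, possibly empty
        if phrases:
            found[syn.name()] = phrases
    return found

# Typical use, assuming the wordnet corpus is installed:
#   from nltk.corpus import wordnet
#   examples_by_synset(wordnet.synsets('cooking'))
```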

Working with hypernyms

Synsets are organized in a structure similar to that of an inheritance tree. More abstract terms are known as hypernyms and more specific terms are hyponyms. This tree can be traced all the way up to a root hypernym.

Hypernyms provide a way to categorize and group words based on their similarity to each other. The Calculating WordNet Synset similarity recipe details the functions used to calculate similarity based on the distance between two words in the hypernym tree. For now, you can move up and down the tree with the hypernyms(), hyponyms(), and root_hypernyms() methods:

>>> syn.hypernyms()
[Synset('reference_book.n.01')]
>>> syn.hypernyms()[0].hyponyms()
[Synset('annual.n.02'), Synset('atlas.n.02'), Synset('cookbook.n.01'), Synset('directory.n.01'), Synset('encyclopedia.n.01'), Synset('handbook.n.01'), Synset('instruction_book.n.01'), Synset('source_book.n.01'), Synset('wordbook.n.01')]
>>> syn.root_hypernyms()
[Synset('entity.n.01')]

As you can see, reference_book is a hypernym of cookbook, but cookbook is only one of the many hyponyms of reference_book. And all these types of books have the same root hypernym, which is entity, one of the most abstract terms in the English language. You can trace the entire path from entity down to cookbook using the hypernym_paths() method, as follows:

>>> syn.hypernym_paths()
[[Synset('entity.n.01'), Synset('physical_entity.n.01'), Synset('object.n.01'), Synset('whole.n.02'), Synset('artifact.n.01'), Synset('creation.n.02'), Synset('product.n.02'), Synset('work.n.02'), Synset('publication.n.01'), Synset('book.n.01'), Synset('reference_book.n.01'), Synset('cookbook.n.01')]]

The hypernym_paths() method returns a list of lists, where each list starts at the root hypernym and ends with the original Synset. Most of the time, you'll get only one nested list; a Synset with more than one hypernym somewhere along the way yields one list per path.
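
To see how hypernyms() relates to hypernym_paths(), you can climb the tree by hand. The following is a sketch under a simplifying assumption: it always follows the first hypernym, so it recovers only one path when several exist. The climb_to_root name is hypothetical.

```python
def climb_to_root(syn):
    """Follow the first hypernym of each Synset until none is left.

    Returns the chain ordered from the topmost Synset down to `syn`,
    in the same order as one entry of hypernym_paths().
    """
    chain = [syn]
    seen = {syn.name()}
    while True:
        parents = chain[-1].hypernyms()
        # Stop at the top, or if we would revisit a Synset (defensive guard)
        if not parents or parents[0].name() in seen:
            break
        chain.append(parents[0])
        seen.add(parents[0].name())
    chain.reverse()
    return chain
```

For the cookbook Synset, which has a single path, climb_to_root(syn) should give the same sequence as syn.hypernym_paths()[0].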

Part of speech (POS)

You can also look up a simplified part-of-speech tag as follows:

>>> syn.pos()
'n'

There are four common part-of-speech tags (or POS tags) found in WordNet, as shown in the following table:

Part of speech    Tag
Noun              n
Adjective         a
Adverb            r
Verb              v

These POS tags can be used to look up specific Synsets for a word. For example, the word 'great' can be used as a noun or an adjective. In WordNet, 'great' has 1 noun Synset and 6 adjective Synsets, as shown in the following code:

>>> len(wordnet.synsets('great'))
7
>>> len(wordnet.synsets('great', pos='n'))
1
>>> len(wordnet.synsets('great', pos='a'))
6
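
Instead of querying one tag at a time, you could tally the tags that the Synsets themselves report through pos(). This is a minimal sketch with a hypothetical helper name, count_by_pos. Note that WordNet also tags some adjective Synsets as 's' (satellite adjectives), so counts grouped by pos() may not line up one-to-one with a pos='a' lookup.

```python
from collections import Counter

def count_by_pos(synsets):
    """Count Synsets per WordNet POS tag, as reported by each Synset's pos()."""
    return Counter(syn.pos() for syn in synsets)

# Typical use, assuming the wordnet corpus is installed:
#   from nltk.corpus import wordnet
#   count_by_pos(wordnet.synsets('great'))
```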

These POS tags will be referenced more in the Using WordNet for tagging recipe in Chapter 4, Part-of-speech Tagging.

See also

In the next two recipes, we'll explore lemmas and how to calculate Synset similarity. And in Chapter 2, Replacing and Correcting Words, we'll use WordNet for lemmatization, synonym replacement, and then explore the use of antonyms.