Python 3 Text Processing with NLTK 3 Cookbook

Synsets are organized in a hypernym tree. This tree can be used for reasoning about the similarity between the Synsets it contains. The closer the two Synsets are in the tree, the more similar they are.

How to do it...

If you were to look at all the hyponyms of reference_book (which is the hypernym of cookbook), you'd see that one of them is instruction_book. This seems intuitively very similar to a cookbook, so let's see what WordNet similarity has to say about it with the help of the following code:

>>> from nltk.corpus import wordnet
>>> cb = wordnet.synset('cookbook.n.01')
>>> ib = wordnet.synset('instruction_book.n.01')
>>> cb.wup_similarity(ib)
0.9166666666666666

So they are over 91% similar!

How it works...

The wup_similarity method is short for Wu-Palmer Similarity, which is a scoring method based on how similar the word senses are and where the Synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two Synsets and their common hypernym:

>>> ref = cb.hypernyms()[0]
>>> cb.shortest_path_distance(ref)
1
>>> ib.shortest_path_distance(ref)
1
>>> cb.shortest_path_distance(ib)
2

So cookbook and instruction_book must be very similar, because they are only one step away from the same reference_book hypernym, and, therefore, only two steps away from each other.

There's more...

Let's look at two dissimilar words to see what kind of score we get. We'll compare dog with cookbook, two seemingly very different words.

>>> dog = wordnet.synsets('dog')[0]
>>> dog.wup_similarity(cb)
0.38095238095238093

Wow, dog and cookbook are apparently 38% similar! This is because they share common hypernyms further up the tree:

>>> sorted(dog.common_hypernyms(cb))
[Synset('entity.n.01'), Synset('object.n.01'), Synset('physical_entity.n.01'), Synset('whole.n.02')]

Comparing verbs

The previous comparisons were all between nouns, but the same can be done for verbs as well:

>>> cook = wordnet.synset('cook.v.01')
>>> bake = wordnet.0('bake.v.02')
>>> cook.wup_similarity(bake)
00.6666666666666666

The previous Synsets were obviously handpicked for demonstration, and the reason is that the hypernym tree for verbs has a lot more breadth and a lot less depth. While most nouns can be traced up to the hypernym object, thereby providing a basis for similarity, many verbs do not share common hypernyms, making WordNet unable to calculate the similarity. For example, if you were to use the Synset for bake.v.01 in the previous code, instead of bake.v.02, the return value would be None. This is because the root hypernyms of both the Synsets are different, with no overlapping paths. For this reason, you also cannot calculate the similarity between words with different parts of speech.

Path and Leacock Chordorow (LCH) similarity

Two other similarity comparisons are the path similarity and the LCH similarity, as shown in the following code:

>>> cb.path_similarity(ib)
0.3333333333333333
>>> cb.path_similarity(dog)
0.07142857142857142
>>> cb.lch_similarity(ib)
2.538973871058276
>>> cb.lch_similarity(dog)
0.9985288301111273

As you can see, the number ranges are very different for these scoring methods, which is why I prefer the wup_similarity method.

Python 3 Text Processing with NLTK 3 Cookbook

By : Jacob Perkins

Python 3 Text Processing with NLTK 3 Cookbook

By: Jacob Perkins

Overview of this book

Related Content you might be interested in

Current Title:

Python 3 Text Processing with NLTK 3 Cookbook

Calculating WordNet Synset similarity

How to do it...

How it works...

There's more...

Comparing verbs

Path and Leacock Chordorow (LCH) similarity

See also