So far in this chapter, we have learned how to generate vectors from text. These vectors are then fed to machine learning algorithms to perform various tasks. Beyond machine learning applications, we can also perform simple NLP tasks with these vectors. One such task is finding string similarity: a technique in which we measure how similar two strings are by converting them into vectors and comparing them. The technique is used mainly in full-text search.
There are different techniques for finding the similarity between two strings or texts. They are explained one by one here:

Figure 2.26: Cosine similarity
Here, A and B are two vectors, A.B is the dot product of the two vectors, and |A| and |B| are the magnitudes of the two vectors.
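Written out, the formula in the figure is:

\[
\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{|A|\,|B|}
\]

A value of 1 means the two vectors point in the same direction, while a value of 0 for non-negative vectors (such as TF-IDF vectors) means the texts share no terms at all.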
Jaccard similarity, on the other hand, is calculated as the ratio of the number of terms common to both texts to the total number of unique terms across both texts. Consider the following example. Suppose there are two texts:
Text 1: I like detective Byomkesh Bakshi.
Text 2: Byomkesh Bakshi is not a detective; he is a truth seeker.
The common terms are "Byomkesh," "Bakshi," and "detective."
The number of common terms in the texts is three.
The unique terms across both texts are "I," "like," "detective," "Byomkesh," "Bakshi," "is," "not," "a," "he," "truth," and "seeker." So, the number of unique terms is eleven.
Therefore, the Jaccard similarity is 3/11 ≈ 0.27.
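As a quick sanity check, the same numbers can be reproduced with plain Python sets (a minimal sketch; punctuation is stripped by hand here rather than by a tokenizer):

words1 = set("I like detective Byomkesh Bakshi".lower().split())
words2 = set("Byomkesh Bakshi is not a detective "
             "he is a truth seeker".lower().split())
common = words1 & words2   # {'byomkesh', 'bakshi', 'detective'}
unique = words1 | words2   # 11 unique terms in total
print(len(common) / len(unique))   # 0.2727...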
To get a better understanding of text similarity, we will complete an exercise.
In this exercise, we will calculate the Jaccard and cosine similarity for a given pair of texts. Follow these steps to complete this exercise:
Add the following code to import the necessary packages and create a lemmatizer instance:

import nltk
# word_tokenize and WordNetLemmatizer need these resources downloaded once
nltk.download('punkt')
nltk.download('wordnet')
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
lemmatizer = WordNetLemmatizer()
Now, create a function to calculate the Jaccard similarity between two texts:

def extract_text_similarity_jaccard(text1, text2):
    """
    This method will return the Jaccard similarity between two texts
    after lemmatizing them.
    :param text1: text1
    :param text2: text2
    :return: similarity measure
    """
    lemmatizer = WordNetLemmatizer()
    words_text1 = [lemmatizer.lemmatize(word.lower())
                   for word in word_tokenize(text1)]
    words_text2 = [lemmatizer.lemmatize(word.lower())
                   for word in word_tokenize(text2)]
    nr = len(set(words_text1).intersection(set(words_text2)))
    dr = len(set(words_text1).union(set(words_text2)))
    jaccard_sim = nr / dr
    return jaccard_sim
Declare three variables named pair1, pair2, and pair3, as follows:

pair1 = ["What you do defines you", "Your deeds define you"]
pair2 = ["Once upon a time there lived a king.",
         "Who is your queen?"]
pair3 = ["He is desperate", "Is he not desperate?"]
To check the Jaccard similarity between the texts in pair1, write the following code:

extract_text_similarity_jaccard(pair1[0], pair1[1])
The preceding code generates the following output:
0.14285714285714285
To check the Jaccard similarity between the texts in pair2, write the following code:

extract_text_similarity_jaccard(pair2[0], pair2[1])
The preceding code generates the following output:
0.0
To check the Jaccard similarity between the texts in pair3, write the following code:

extract_text_similarity_jaccard(pair3[0], pair3[1])
The preceding code generates the following output:
0.6
Now, to measure the cosine similarity, first create a function that uses the TfidfVectorizer() method to get the vectors of each text:

def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.fit_transform(corpus).todense()
    return tfidf_results
Create a corpus as a list of texts and get the TF-IDF vectors of each text:

corpus = [pair1[0], pair1[1], pair2[0],
          pair2[1], pair3[0], pair3[1]]
tf_idf_vectors = get_tf_idf_vectors(corpus)
To check the cosine similarity between the two texts of pair1, write the following code:

cosine_similarity(tf_idf_vectors[0], tf_idf_vectors[1])
The preceding code generates the following output:
array([[0.3082764]])
To check the cosine similarity between the two texts of pair2, write the following code:

cosine_similarity(tf_idf_vectors[2], tf_idf_vectors[3])
The preceding code generates the following output:
array([[0.]])
To check the cosine similarity between the two texts of pair3, write the following code:

cosine_similarity(tf_idf_vectors[4], tf_idf_vectors[5])
The preceding code generates the following output:
array([[0.80368547]])
So, in this exercise, we learned how to check the similarity between texts. As you can see, the texts "He is desperate" and "Is he not desperate?" returned similarity results of 0.80 (meaning they are highly similar), whereas sentences such as "Once upon a time there lived a king." and "Who is your queen?" returned zero as their similarity measure.
Note
To access the source code for this specific section, please refer to https://packt.live/2Eyw0JC.
You can also run this example online at https://packt.live/2XbGRQ3.
The Lesk algorithm is used for word sense disambiguation. Suppose we have a sentence such as "On the bank of river Ganga, there lies the scent of spirituality" and another sentence, "I'm going to withdraw some cash from the bank". Here, the same word, "bank," is used in two different contexts. For text processing results to be accurate, the context in which a word occurs needs to be taken into account.
In the Lesk algorithm, the candidate definitions (glosses) of an ambiguous word are stored in synsets. The definition that is closest to the meaning of the word as it is used in the sentence is taken as the right one. Let's perform a simple exercise to get a better idea of how we can implement this.
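For instance, the candidate senses of "bank" can be inspected through NLTK's WordNet interface (a small illustrative snippet; it assumes the wordnet corpus has already been downloaded with nltk.download('wordnet')):

from nltk.corpus import wordnet
# Each synset carries a gloss (definition) that Lesk compares with the context
for synset in wordnet.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())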
In this exercise, we are going to implement the Lesk algorithm step by step using the techniques we have learned so far. We will find the meaning of the word "bank" in the sentence, "On the banks of river Ganga, there lies the scent of spirituality." We will use TF-IDF vectorization and cosine similarity here. Follow these steps to complete this exercise:
Add the following code to import the necessary libraries:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
Define a method to get the TF-IDF vectors of a corpus:

def get_tf_idf_vectors(corpus):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_results = tfidf_vectorizer.fit_transform(corpus).todense()
    return tfidf_results
Define a method to convert a corpus to lowercase:

def to_lower_case(corpus):
    lowercase_corpus = [x.lower() for x in corpus]
    return lowercase_corpus
Define a method to find the definition whose vector is most similar to the sentence vector:

def find_sentence_definition(sent_vector, definition_vectors):
    """
    This method will find the cosine similarity of the sentence
    with the possible definitions and return the one with the
    highest similarity score, along with that score.
    """
    result_dict = {}
    for definition_id, def_vector in definition_vectors.items():
        sim = cosine_similarity(sent_vector, def_vector)
        result_dict[definition_id] = sim[0][0]
    definition = sorted(result_dict.items(),
                        key=lambda x: x[1],
                        reverse=True)[0]
    return definition[0], definition[1]

Define a corpus consisting of our sentence, the two candidate definitions of "bank," and the sentence pairs from the previous exercise:

corpus = ["On the banks of river Ganga, there lies the scent "
          "of spirituality",
          "An institute where people can store extra "
          "cash or money.",
          "The land alongside or sloping down to a river or lake",
          "What you do defines you",
          "Your deeds define you",
          "Once upon a time there lived a king.",
          "Who is your queen?",
          "He is desperate",
          "Is he not desperate?"]
Convert the corpus to lowercase and get the TF-IDF vectors of each text:

lower_case_corpus = to_lower_case(corpus)
corpus_tf_idf = get_tf_idf_vectors(lower_case_corpus)
The first vector represents our sentence and the next two represent the candidate definitions, so find the definition closest to the sentence:

sent_vector = corpus_tf_idf[0]
definition_vectors = {'def1': corpus_tf_idf[1],
                      'def2': corpus_tf_idf[2]}
definition_id, score = \
    find_sentence_definition(sent_vector, definition_vectors)
print("The definition of word {} is {} with similarity of {}".\
format('bank',definition_id,score))You will get the following output:
The definition of word bank is def2 with similarity of 0.14419130686278897
As we already know, def2 represents a riverbank. So, we have found the correct definition of the word here. In this exercise, we have learned how to use text vectorization and text similarity to find the right definition of ambiguous words.
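For comparison, NLTK ships a classic implementation of the Lesk algorithm that works directly on WordNet glosses. A minimal sketch, assuming the punkt and wordnet resources are available:

from nltk import word_tokenize
from nltk.wsd import lesk

sentence = "On the banks of river Ganga, there lies the scent of spirituality"
# lesk() returns the synset whose gloss overlaps most with the context words
sense = lesk(word_tokenize(sentence.lower()), 'bank')
print(sense, '-', sense.definition())

Since it relies on raw word overlap rather than TF-IDF weighting, its choice of sense may differ from the one we computed above.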
Note
To access the source code for this specific section, please refer to https://packt.live/39GzJAs.
You can also run this example online at https://packt.live/3fbxQwK.
Unlike numeric data, there are very few ways in which text data can be represented visually. The most popular way of visualizing text data is by using word clouds. A word cloud is a visualization of a text corpus in which the sizes of the tokens (words) represent the number of times they have occurred, as shown in the following image:
Figure 2.27: Example of a word cloud
In the following exercise, we will be using a Python library called wordcloud to build a word cloud from the 20newsgroups dataset.
Let's go through an exercise to understand this better.
In this exercise, we will visualize the most frequently occurring words in the first 1,000 articles from sklearn's fetch_20newsgroups text dataset using a word cloud. Follow these steps to complete this exercise:
import nltk
nltk.download('stopwords')
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 200
from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 200

Now, define the get_data() method to fetch the data:

def get_data(n):
    newsgroups_data_sample = fetch_20newsgroups(subset='train')
    text = str(newsgroups_data_sample['data'][:n])
    return text
Define a method that loads the standard English stop words and extends them with corpus-specific noise tokens:

def load_stop_words():
    other_stopwords_to_remove = ['\\n', 'n', '\\', '>',
                                 'nLines', 'nI', "n'"]
    stop_words = stopwords.words('english')
    stop_words.extend(other_stopwords_to_remove)
    stop_words = set(stop_words)
    return stop_words

Next, define the generate_word_cloud() method to generate a word cloud object:

def generate_word_cloud(text, stopwords):
    """
    This method generates a word cloud object
    with the given corpus, stop words, and dimensions
    """
    wordcloud = WordCloud(width=800, height=800,
                          background_color='white',
                          max_words=200,
                          stopwords=stopwords,
                          min_font_size=10).generate(text)
    return wordcloud
Now, fetch the first 1,000 articles of the 20newsgroups data, get the stop word list, generate a word cloud object, and finally plot the word cloud with matplotlib:

text = get_data(1000)
stop_words = load_stop_words()
wordcloud = generate_word_cloud(text, stop_words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

The preceding code generates the following output:

Figure 2.28: Word cloud representation of the first 1,000 articles
So, in this exercise, we learned what word clouds are and how to generate word clouds with Python's wordcloud library and visualize this with matplotlib.
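The WordCloud constructor also accepts styling parameters beyond the ones used above. The following is an illustrative variant (the background_color, colormap, and max_font_size values here are arbitrary choices, not part of the original exercise):

# Same data, different styling; colormap and max_font_size
# are standard WordCloud parameters
wordcloud = WordCloud(width=800, height=800,
                      background_color='black',
                      colormap='Pastel1',
                      max_words=200,
                      stopwords=stop_words,
                      max_font_size=120,
                      min_font_size=10).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()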
Note
To access the source code for this specific section, please refer to https://packt.live/30eaSRn.
You can also run this example online at https://packt.live/2EzqLJJ.
In the next section, we will explore other visualizations, such as dependency parse trees and named entities.
Apart from word clouds, there are various other ways of visualizing texts. Two of the most popular are dependency parse trees, which show the grammatical relations between the words of a sentence, and named entity highlighting, which marks up the people, places, and organizations mentioned in a text.
Let's go through the following exercise to understand this better.
In this exercise, we will look at two popular visualization methods: dependency parse trees and named entities. Follow these steps to complete this exercise:
Add the following code to import the necessary libraries, download the spaCy model, and load it:

import spacy
from spacy import displacy
!python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()
To visualize the dependency parse tree of a sentence, write the following code:

doc = nlp('God helps those who help themselves')
displacy.render(doc, style='dep', jupyter=True)

The preceding code generates the following output:

Figure 2.29: Dependency parse tree
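If you are working outside a Jupyter notebook, passing jupyter=False makes displacy.render() return the markup as a string instead of displaying it inline. A minimal sketch (the output filename is just an example):

# With jupyter=False, render() returns the SVG markup for the 'dep' style
svg = displacy.render(doc, style='dep', jupyter=False)
with open('parse_tree.svg', 'w', encoding='utf-8') as f:
    f.write(svg)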
To visualize the named entities in a text corpus, write the following code:

text = 'Once upon a time there lived a saint named '\
       'Ramakrishna Paramahansa. His chief disciple '\
       'Narendranath Dutta also known as Swami Vivekananda '\
       'is the founder of Ramakrishna Mission and '\
       'Ramakrishna Math.'
doc2 = nlp(text)
displacy.render(doc2, style='ent', jupyter=True)
The preceding code generates the following output:

Figure 2.30: Named entities
Note
To access the source code for this specific section, please refer to https://packt.live/313m4iD.
You can also run this example online at https://packt.live/3103fgr.
Now that you have learned about visualizations, we will solve an activity based on them to gain an even better understanding.
In this activity, you will create a word cloud for the 50 most frequent words in a dataset. The dataset we will use consists of random sentences that are not clean. First, we need to clean them and create a unique set of frequently occurring words.
Note
The text_corpus.txt file that's being used in this activity can be found at https://packt.live/2DiVIBj.
To implement this activity, you will need to clean the text, find the 50 most frequently occurring words, and generate a word cloud from them.
Note
The solution for this activity can be found via this link.
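A minimal sketch of one possible approach is shown below; the file path, tokenization, and cleaning choices here are illustrative assumptions, not the official solution:

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from wordcloud import WordCloud
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

# Read the raw corpus; the path assumes the file sits in the working directory
with open('text_corpus.txt', encoding='utf-8') as f:
    raw_text = f.read()

# Keep lowercase alphabetic tokens that are not stop words
stop_words = set(stopwords.words('english'))
tokens = [t.lower() for t in word_tokenize(raw_text)
          if t.isalpha() and t.lower() not in stop_words]

# Take the 50 most frequent words and build the cloud from their counts
top_50 = dict(FreqDist(tokens).most_common(50))
wordcloud = WordCloud(width=800, height=800, background_color='white')\
    .generate_from_frequencies(top_50)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()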