The Handbook of NLP with Gensim
Let’s start with BoW and TF-IDF. We learned how to prepare both in Chapter 2, Text Representation. BoW records how often each word occurs in a document, while its variation, TF-IDF, reweights those counts to reflect how important a word is to a document within a corpus.
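As a quick refresher on the difference between the two, here is a minimal sketch in plain Python. The toy corpus and variable names are assumptions for illustration, and the weighting used is a simplified textbook form (count times log(N / df)), not necessarily the exact formula Gensim applies:

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenized.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]

# BoW: raw count of each word within one document.
bow = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each word.
df = Counter(word for doc in docs for word in set(doc))
n_docs = len(docs)

# Simplified TF-IDF: count * log(N / df). Words that appear in fewer
# documents receive a larger weight for the same count.
tfidf = [
    {word: count * math.log(n_docs / df[word]) for word, count in counts.items()}
    for counts in bow
]

print(bow[1])
print(tfidf[0])
```

In the first document, "sat" occurs in only one document while "cat" occurs in two, so "sat" ends up with the higher TF-IDF weight even though both have a BoW count of 1.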
We will first use the Dictionary class to build and manage a dictionary of terms (words or tokens). It creates a mapping between the unique terms in a corpus and their integer IDs. This mapping is the vocabulary on which the BoW is built:
from gensim.corpora import Dictionary
gensim_dictionary = Dictionary()
Let’s examine the Dictionary object, gensim_dictionary. How many unique words does it contain? Its length gives the number of unique words:
len(gensim_dictionary)
We get the following output:
40360
So, there are 40,360 unique words!
Now, we will create the BoW corpus by passing each document to the dictionary's .doc2bow() method, which returns a list of (term ID, count) pairs:
bow_corpus = [gensim_dictionary.doc2bow...