We have seen the limitation of count vectorization: a highly frequent word can dominate the representation and drown out rarer, more informative words. The idea behind TF-IDF is therefore to penalize words that occur in most of the documents by assigning them a lower weight, while increasing the weight of words that appear in only a subset of documents. This is the principle upon which TF-IDF works.
TF-IDF is a measure of how important a term is with respect to a document and the entire corpus (collection of documents):
TF-IDF(term) = TF(term) * IDF(term)
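In practice, we rarely compute these weights by hand. The following is a minimal sketch using scikit-learn's TfidfVectorizer on a small illustrative corpus (the three documents here are made up for demonstration). Note that scikit-learn applies a smoothed IDF variant and L2 normalization by default, so its values will not exactly match the raw TF * IDF product defined above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A small illustrative corpus of three documents.
corpus = [
    "NLP is fun and NLP is useful",
    "machine learning powers NLP",
    "deep learning is a subset of machine learning",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Each row corresponds to a document; each column to a term's TF-IDF weight.
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(3))
```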
Term frequency (TF) is the number of times a word appears in a document divided by the total number of words in that document. For example, if a document contains 1,000 words and the word NLP appears 50 times in it, the TF of NLP is calculated as follows:
TF(NLP) = 50/1000 = 0.05
Hence, we can conclude the following:
TF(term) = number of times the term appears in the document / total number of terms in the document
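To make the TF formula concrete, here is a short pure-Python sketch that reproduces the worked example above (the function name term_frequency and the toy document are illustrative, not part of the original text):

```python
from collections import Counter

def term_frequency(term, document_tokens):
    """TF = count of the term / total number of tokens in the document."""
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# Hypothetical document: 1,000 tokens, in which "NLP" appears 50 times.
doc = ["NLP"] * 50 + ["other"] * 950
print(term_frequency("NLP", doc))  # 0.05, matching the worked example
```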
In the preceding example, which comprised three documents...