Term Frequency/Inverse Document Frequency (TF-IDF) is a simple ranking algorithm useful when working with text. Search engines, text classification, text summarization, and other applications rely on sophisticated models of TF-IDF. The algorithm is based on term frequency (the number of times the term t occurs in document d) and document frequency (the number of documents in which the term t occurs). Inverse document frequency is the log of the total number of documents, N, divided by the document frequency.
The basic idea is that common words, such as the word "the", should carry less weight than words that appear in fewer documents.
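To make the definitions concrete, here is a minimal sketch in Python. It assumes the usual scoring, in which the weight of a term in a document is its term frequency multiplied by its inverse document frequency, and it assumes the natural logarithm, since the text above does not fix a base. The function name tf_idf and the three-document toy corpus are hypothetical and are not taken from the book dataset.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score every term in every document.

    docs maps a document name to its list of tokens (hypothetical input shape).
    Returns a dict: document name -> {term: tf * log(N / df)}.
    """
    n_docs = len(docs)

    # Term frequency: raw count of each term t within each document d.
    tf = {name: Counter(tokens) for name, tokens in docs.items()}

    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    # Assumed scoring: tf * idf, with idf = natural log of N / df.
    return {
        name: {
            term: count * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        for name, counts in tf.items()
    }


# Toy corpus, invented for illustration only.
docs = {
    "a": "the child saw the cat".split(),
    "b": "the cat and the mouse".split(),
    "c": "the mouse ran".split(),
}

scores = tf_idf(docs)
print(scores["a"])
```

In this toy corpus, "the" occurs in all three documents, so its inverse document frequency is log(3/3) = 0 and it scores zero everywhere, while "child" occurs in only one document and receives the highest weight in document "a".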
We will use a collection of 62 books as an example dataset. A document consists of the title of the book and the actual text. For example:
In Book A (ASHPUTTEL), the word "the" is repeated 184 times and the word "child" two times. In Book B (Cat And Mouse In Partnership), the word "the" is repeated 73 times and...