Programming MapReduce with Scalding

Term Frequency/Inverse Document Frequency (TF-IDF) is a simple ranking algorithm that is useful when working with text. Search engines, text classification, text summarization, and other applications rely on more sophisticated variants of TF-IDF. The algorithm is based on two quantities: the term frequency (the number of times a term t occurs in a document d) and the document frequency (the number of documents in which the term t occurs). The inverse document frequency is the logarithm of the total number of documents, N, divided by the document frequency.
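The definitions in the previous paragraph can be written compactly as follows. The symbols tf, df, and idf are notation chosen here for illustration, and the final product is the commonly used form of the weight, which the text above implies but does not spell out. The base of the logarithm is not specified in the text; changing it only rescales every weight by the same constant.

```latex
\begin{align*}
\mathrm{tf}(t, d)    &= \text{number of times term } t \text{ occurs in document } d \\
\mathrm{df}(t)       &= \text{number of documents in which term } t \text{ occurs} \\
\mathrm{idf}(t)      &= \log \frac{N}{\mathrm{df}(t)} \\
\mathrm{tfidf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\end{align*}
```

A term that occurs in every document has df(t) = N, so idf(t) = log 1 = 0 and its weight is zero no matter how often it is repeated.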

The basic idea is that common words, such as "the", should carry less weight than words that appear less frequently across the documents.

We will use a collection of 62 books as an example dataset. A document consists of the title of the book and the actual text. For example:

In Book A (ASHPUTTEL), the word "the" occurs 184 times and the word "child" occurs twice. In Book B (Cat And Mouse In Partnership), the word "the" occurs 73 times and...
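To make the effect of the weighting concrete, here is a minimal sketch in plain Scala (it does not use the Scalding API) that applies the usual TF-IDF product to the counts quoted for Book A. The document frequencies are hypothetical, since the text does not give them: it assumes that "the" occurs in all 62 books and that "child" occurs in 10 of them. The names TfIdfSketch, idf, and tfIdf are illustrative only, and the natural logarithm is an assumption.

```scala
// Minimal TF-IDF sketch in plain Scala (no Scalding dependency).
object TfIdfSketch {

  /** Inverse document frequency: log(N / df). Natural log is an assumption. */
  def idf(totalDocs: Int, docFreq: Int): Double =
    math.log(totalDocs.toDouble / docFreq)

  /** TF-IDF weight: raw count of the term in the document times its idf. */
  def tfIdf(termCount: Int, totalDocs: Int, docFreq: Int): Double =
    termCount * idf(totalDocs, docFreq)

  val totalBooks = 62 // size of the example collection, from the text
  val dfThe      = 62 // hypothetical: "the" occurs in every book
  val dfChild    = 10 // hypothetical: "child" occurs in 10 books

  def main(args: Array[String]): Unit = {
    // Term counts for Book A (ASHPUTTEL), taken from the text.
    val theInBookA   = tfIdf(184, totalBooks, dfThe)
    val childInBookA = tfIdf(2, totalBooks, dfChild)

    println(s"tf-idf(the, Book A)   = $theInBookA")
    println(s"tf-idf(child, Book A) = $childInBookA")
  }
}
```

Under these assumptions, "the" scores exactly zero despite its 184 occurrences, because log(62/62) = 0, while "child" receives a positive weight from only two occurrences. This is the behaviour described above: a word found in every document says nothing about any one of them.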

By: Antonios Chalkiopoulos

Text similarity using TF-IDF