Programming MapReduce with Scalding

By: Antonios Chalkiopoulos
Text similarity using TF-IDF


Term Frequency/Inverse Document Frequency (TF-IDF) is a simple ranking algorithm useful when working with text. Search engines, text classification, text summarization, and other applications rely on more sophisticated variants of TF-IDF. The algorithm is based on term frequency (the number of times the term t occurs in document d) and document frequency (the number of documents in which the term t occurs). Inverse document frequency is the logarithm of the total number of documents, N, divided by the document frequency.
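
The definitions above can be written compactly as follows. The symbols tf, df, and idf are shorthand introduced here for illustration, and the final product is the common textbook form of the score, which may differ in detail from the variant used later in the book:

```latex
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this form, a term that occurs in every document has df(t) = N, so idf(t) = log 1 = 0 and its score vanishes regardless of how often it occurs.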

The basic idea is that common words, such as "the", should receive a lower weight than words that appear in fewer documents.
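
To make this idea concrete, here is a minimal sketch in plain Scala collections rather than Scalding. The three-document toy corpus, the object name `TfIdfSketch`, and the helpers `tokenize` and `tfIdf` are all hypothetical and are not taken from the book; the score is assumed to be the raw term count multiplied by the natural log of N divided by the document frequency.

```scala
object TfIdfSketch {
  // Hypothetical toy corpus: document id -> text (not the book's dataset).
  val docs: Map[String, String] = Map(
    "docA" -> "the child and the cat",
    "docB" -> "the cat and the mouse",
    "docC" -> "the mouse ran"
  )

  // Naive whitespace tokenizer; real text would also need punctuation handling.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // Term frequency: raw count of each term within each document.
  val tf: Map[String, Map[String, Int]] =
    docs.map { case (id, text) =>
      id -> tokenize(text).groupBy(identity).map { case (t, occ) => t -> occ.size }
    }

  // Document frequency: number of documents that contain each term.
  val df: Map[String, Int] =
    tf.values.toSeq
      .flatMap(_.keys)
      .groupBy(identity)
      .map { case (t, occ) => t -> occ.size }

  // Total number of documents, N.
  val n: Int = docs.size

  // Assumed scoring: tf(t, d) * ln(N / df(t)); terms unseen in the corpus score 0.
  def tfIdf(term: String, docId: String): Double = {
    val termCount = tf.getOrElse(docId, Map.empty[String, Int]).getOrElse(term, 0)
    df.get(term) match {
      case Some(d) => termCount * math.log(n.toDouble / d)
      case None    => 0.0
    }
  }

  def main(args: Array[String]): Unit = {
    println(tfIdf("the", "docA"))   // "the" occurs in every document, so ln(3/3) = 0
    println(tfIdf("child", "docA")) // "child" occurs in one document, so the score is ln(3)
  }
}
```

In this toy corpus "the" appears twice in docA yet scores zero because it occurs in all three documents, while "child" appears once and still receives a positive score.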

We will use a collection of 62 books as an example dataset. A document consists of the title of the book and the actual text. For example:

In Book A (ASHPUTTEL), the word "the" is repeated 184 times and the word "child" two times. In Book B (Cat And Mouse In Partnership), the word "the" is repeated 73 times and...