Book Image

Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Book Image

Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Overview of this book

Table of Contents (19 chapters)
Hadoop MapReduce v2 Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Creating TF and TF-IDF vectors for the text data


Most of the text analysis data-mining algorithms operate on vector data. We can use a vector space model to represent text data as a set of vectors. For example, we can build a vector space model by taking the set of all terms that appear in the dataset and by assigning an index to each term in the term set. The number of terms in the term set is the dimensionality of the resulting vectors, and each dimension of the vector corresponds to a term. For each document, the vector contains the number of occurrences of each term at the index location assigned to that particular term. This creates the vector space model using term frequencies in each document, which is similar to the result of the computation we performed in the Generating an inverted index using Hadoop MapReduce recipe of Chapter 8, Searching and Indexing.

The vectors can be seen as follows:

The term frequencies and the resulting document vectors

However, creating vectors using the...