
Apache Mahout Essentials

By: Jayani Withanawasam


Text clustering


Text clustering is a widely used application of clustering, found in areas such as records management systems, search, and business intelligence.

The vector space model and TF-IDF

In text clustering, the terms of the documents are treated as features. The vector space model is an algebraic model that maps the terms in a document into an n-dimensional linear space.

To evaluate the similarity between data points, however, we first need to represent the textual information (terms) numerically and build feature vectors from those numerical values.
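To make the idea concrete, the following is a minimal sketch in plain Python (not Mahout code) that turns documents into term-count vectors and compares them. The helper names `term_count_vector` and `cosine_similarity`, the sample documents, the whitespace tokenization, and the choice of cosine similarity as the measure are all assumptions made for illustration.

```python
import math
from collections import Counter


def term_count_vector(text, vocabulary):
    """Map a document to a vector with one dimension per vocabulary term."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents
docs = [
    "mahout clusters text documents",
    "mahout clusters documents",
    "stock prices fell sharply",
]

# The vocabulary fixes the order of the dimensions for every vector
vocabulary = sorted({term for doc in docs for term in doc.lower().split()})
vectors = [term_count_vector(doc, vocabulary) for doc in docs]

similar = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
```

Here the first two documents share most of their terms, so their vectors point in similar directions, while the third shares no terms with the first and scores zero.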

Each dimension of the feature vector represents a separate term. If a particular term is present in the document, the corresponding vector value is set using the Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF) calculation. TF indicates how often the term appears in a given document. TF-IDF refines TF by indicating how important a word is to a document: terms that appear in many documents of the collection are weighted down.
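The weighting can be sketched in a few lines of plain Python. This is an illustrative sketch, not Mahout's implementation: it assumes raw term counts for TF and the common textbook form log(N / df) for IDF, where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf_vectors` and the sample documents are hypothetical, and Mahout's own weighting may use a different variant of the formula.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumes TF = raw term count and IDF = log(N / df); other variants exist.
    """
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenization
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(t for tokens in tokenized for t in set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            [tf[t] * math.log(n_docs / df[t]) for t in vocabulary]
        )
    return vocabulary, vectors


# Hypothetical sample documents
docs = ["the cat sat", "the dog barked", "the cat purred"]
vocabulary, vectors = tf_idf_vectors(docs)

the_idx = vocabulary.index("the")
cat_idx = vocabulary.index("cat")
sat_idx = vocabulary.index("sat")
```

In this sample, "the" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing to any vector, whereas "sat" occurs in only one document and receives a higher weight there than "cat", which occurs in two.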

In order...