Preparing data for use with clustering techniques


Machine learning algorithms such as clustering and classification take their input data in vector form. We therefore need to map the features of an object to numeric values, which are then used to build the vectors. For text documents, this is done with the vector space model. Let's understand this model in more detail.

In this model, we first create a dictionary of all the words present in the documents and assign a particular index to each word. Some words, such as a, as, the, that, this, was, and so on, occur frequently in any document; these are called stop words, and since they do not help us distinguish documents, we ignore them. We now have a dictionary of indexed words. So, let's assume that we have two documents, Doc1 and Doc2, containing the following content:

  • Doc1: We will learn clustering in this book

  • Doc2: We have different clustering algorithms

Now, we will have the following index dictionary:

[learn, cluster, book, different, algorithm]

[1, 2, 3, 4, 5]

So, we have created a vector space for the documents, where the first term is learn, the second is cluster, and so on. Now, we will use the term frequency to represent each term in our vector space. The term frequency is simply the count of how many times each word in our dictionary appears in a document. So, Doc1 and Doc2 can each be represented by a five-dimensional vector, [1,1,1,0,0] and [0,1,0,1,1] respectively (the number of times the word at each index occurs in the document).
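To make this mapping concrete, here is a minimal plain Java sketch (illustrative only, not part of Mahout; the class and method names are our own) that indexes the dictionary above and counts term frequencies for Doc1 and Doc2:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class TermFrequencyExample {

        public static void main(String[] args) {
            // Dictionary built from the two example documents, after
            // removing stop words and stemming (clustering -> cluster, and so on)
            List<String> dictionary =
                Arrays.asList("learn", "cluster", "book", "different", "algorithm");

            // Tokens of Doc1 and Doc2 after the same preprocessing
            String[] doc1 = {"learn", "cluster", "book"};
            String[] doc2 = {"different", "cluster", "algorithm"};

            System.out.println(Arrays.toString(termFrequencies(doc1, dictionary)));
            // prints [1, 1, 1, 0, 0]
            System.out.println(Arrays.toString(termFrequencies(doc2, dictionary)));
            // prints [0, 1, 0, 1, 1]
        }

        // Counts how many times each dictionary term appears in a document
        static int[] termFrequencies(String[] docTokens, List<String> dictionary) {
            Map<String, Integer> index = new HashMap<>();
            for (int i = 0; i < dictionary.size(); i++) {
                index.put(dictionary.get(i), i);
            }
            int[] tf = new int[dictionary.size()];
            for (String token : docTokens) {
                Integer i = index.get(token);
                if (i != null) {
                    tf[i]++;
                }
            }
            return tf;
        }
    }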

When we try to find the similarity between documents using distance measures, we often run into a problem: the most frequently occurring words dominate the weights. Two documents are likely to be similar when they are about the same topic, and the words that identify a topic are usually rare, occurring in only a few documents. So, to give more weight to these words, we use a technique called inverse document frequency.

Inverse document frequency can be thought of as the boost a term gets for being rare. A term that occurs in every document is of no use for clustering. The fewer documents a term occurs in, the more significant it is likely to be in the documents where it does occur.

For a term t, the inverse document frequency is calculated as:

IDF (t) = 1 + log (total number of documents/ number of documents containing t)
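As a quick check with our two example documents (taking the logarithm as the natural log; the choice of base only scales the values), cluster occurs in both documents while learn occurs only in Doc1:

IDF (cluster) = 1 + log (2/2) = 1

IDF (learn) = 1 + log (2/1) ≈ 1.69

So the rarer term learn gets a higher boost than the common term cluster.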

TF-IDF is the most widely used improvement for weighting words. It is the product of the term frequency and the inverse document frequency.

TFIDF (t, d) = TF (t, d) * IDF (t)

Here, t is the term, d is the document, TF(t, d) is the frequency of term t in document d, and IDF(t) is the inverse document frequency explained earlier.
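The following is a minimal sketch that applies these two formulas to the example vectors from earlier (again assuming the natural logarithm; the class name and structure are illustrative, not Mahout's own implementation):

    public class TfIdfExample {

        public static void main(String[] args) {
            String[] dictionary = {"learn", "cluster", "book", "different", "algorithm"};
            int[][] tf = {
                {1, 1, 1, 0, 0},   // Doc1
                {0, 1, 0, 1, 1}    // Doc2
            };
            int totalDocs = tf.length;

            for (int doc = 0; doc < tf.length; doc++) {
                for (int t = 0; t < dictionary.length; t++) {
                    double weight = tf[doc][t] * idf(t, tf, totalDocs);
                    System.out.printf("TFIDF(%s, Doc%d) = %.2f%n",
                        dictionary[t], doc + 1, weight);
                }
            }
        }

        // IDF(t) = 1 + log(total number of documents / number of documents containing t)
        static double idf(int term, int[][] tf, int totalDocs) {
            int docsContainingTerm = 0;
            for (int[] doc : tf) {
                if (doc[term] > 0) {
                    docsContainingTerm++;
                }
            }
            return 1 + Math.log((double) totalDocs / docsContainingTerm);
        }
    }

Running it shows, for example, that TFIDF(learn, Doc1) ≈ 1.69 while TFIDF(cluster, Doc1) = 1.00, so the rarer term ends up with the higher weight.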

Sometimes, words make more sense as a group rather than individually. A group of words appearing in a sequence is called an n-gram. Mahout can identify words that have a high probability of occurring together: to decide whether two words occur together by chance or because they form a unit, Mahout passes the n-grams through a log-likelihood ratio test.

Mahout provides both sparse and dense implementations of its vector format. Among the sparse implementations, applications that require fast random reads use RandomAccessSparseVector, while applications that require fast sequential reads can use SequentialAccessSparseVector.
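Here is a small sketch of how these vector classes can be used, assuming the org.apache.mahout.math package (constructor signatures can differ slightly between Mahout versions):

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class MahoutVectorExample {

        public static void main(String[] args) {
            // Dense vector: every dimension is stored; useful when most values are non-zero
            Vector doc1 = new DenseVector(new double[]{1, 1, 1, 0, 0});

            // Sparse vector with fast random reads and writes; only non-zero entries are stored
            Vector doc2 = new RandomAccessSparseVector(5);
            doc2.set(1, 1);   // cluster
            doc2.set(3, 1);   // different
            doc2.set(4, 1);   // algorithm

            // Copy into a sequential-access sparse vector for fast sequential reads
            Vector doc2Sequential = new SequentialAccessSparseVector(doc2);

            System.out.println(doc1);
            System.out.println(doc2Sequential);
        }
    }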

Mahout uses sequence files to read and write vectors. Sequence files contain binary key-value pairs and are used by Hadoop for parallel execution. Mahout provides commands to convert a corpus of text documents into sequence files and then generate vectors from those sequence files. These vectors are used as inputs for the clustering algorithms.
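As a rough illustration, the conversion is typically done in two steps from the command line (the paths here are placeholders, and option names can vary between Mahout versions, so check mahout seq2sparse --help):

    # Convert a directory of text documents into a sequence file
    mahout seqdirectory -i /path/to/text-corpus -o /path/to/seqfiles -c UTF-8

    # Generate TF-IDF vectors; -ng 2 also considers bigrams, filtered by
    # the log-likelihood ratio threshold given with -ml
    mahout seq2sparse -i /path/to/seqfiles -o /path/to/vectors -wt tfidf -ng 2 -ml 50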