Book Image

Rapid - Apache Mahout Clustering designs

Book Image

Rapid - Apache Mahout Clustering designs

Overview of this book

Table of Contents (16 chapters)
Apache Mahout Clustering Designs
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Understanding distance measures


In each clustering problem, such as document clustering, protein clustering, genome sequence, galaxy image grouping, and so on, we need to calculate the distance between points. Let's start this problem with the points in a two-dimensional XY plane, and later in the section of preparing data for Mahout, we will discuss how to handle other types of documents, such as text.

Suppose that we have a set of points and know the points' coordinates (x and y position). We want to group the points into different clusters. How do we achieve this? For this type of problem, we will calculate the distance between points, the points close to each other will be part of one cluster, and the points; that are away from one group will be part of another. See the following figure:

Any clustering algorithm will divide the given points into two groups, and with great accuracy, but, to be honest, in a real scenario, we will not face such simple problems. As defined, we used a distance measure to calculate the distance between points. We will only discuss distance measuring techniques that are available in the Apache Mahout 0.9 releases. An introduction of Mahout is given later in this chapter. For numeric variables, distance measures are as follow:

  • ChebyshevDistanceMeasure

  • CosineDistanceMeasure

  • EuclideanDistanceMeasure

  • MahalanobisDistanceMeasure

  • ManhattanDistanceMeasure

  • MinkowskiDistanceMeasure

  • SquaredEuclideanDistanceMeasure

  • TanimotoDistanceMeasure

  • WeightedEuclideanDistanceMeasure

  • WeightedManhattanDistanceMeasure

Note that for categorical variables, distance is measured with matching categories. For example, say we have six categories of variables, three type of colors, and two sets of data:

 

Color 1

Color 2

Color 3

Color 4

Color 5

Color 6

Dataset 1

Blue

Red

Green

Red

Blue

Blue

Dataset 2

Red

Red

Green

Blue

Green

Green

We can see that we have two matching and four nonmatching items, so the distance will be 4/6 ~= 0.67.

Let's understand what these distance measures are, but, before that, here's a reminder about the vectors that you learn in your physics class. A vector is a quantity that has both magnitude and direction. As shown in the figure, we represent a vector on the Cartesian plane by showing coordinates.

EuclideanDistanceMeasure is the simplest among all the distance measures. Euclidean distance between two n-dimensional vectors (x1, x2, x3…xn) and (y1, y2, y3…yn) is calculated as summing the square root of the squared differences between each coordinate. Mathematically, it is represented as:

The Mahout implementation of the Euclidean distance is available as the EuclideanDistanceMeasure class under the org.apache.mahout.common.distance package.

SquaredEuclideanDistanceMeasure is the square of the Euclidean distance. It provides greater weights to points that are farther apart. It is implemented as the SquaredEuclideanDistanceMeasure class under the org.apache.mahout.common.distance package. Mathematically, it is represented as:

WeightedEuclideanDistanceMeasure is implemented as the WeightedEuclideanDistanceMeasure class under the org.apache.mahout.common.distance package. In Weighted Euclidean Distance, squared differences between the variables are multiplied by their corresponding weights. Weights can be defined as:

where is sample standard deviation of ith variable.

Mathematically, Weighted Euclidean Distance is represented as:

ManhattanDistanceMeasure is the sum of absolute difference between the coordinates. It is implemented as the ManhattanDistanceMeasure class under the same package, as other distance measure classes exist. Manhattan distance is also called taxicab geometry. It is based on grid-like street geography of Manhattan, New York. In this grid structure, distance is the sum of horizontal and vertical lines. As shown in the following figure, the Manhattan distance for point A and B from line 1 and 2 is equal to 9 blocks:

Mathematically, this is defined as:

The WeightedManhattanDistanceMeasure class is implemented as WeightedManhattanDistanceMeasure under the same distance package in Mahout. This class implements a Manhattan distance metric by summing the absolute values of the difference between each coordinate with weights, as in case of weighted Euclidean distance.

The ChebyshevDistanceMeasure distance measure is equivalent to a maximum of the absolute value of the difference between each coordinate. This distance is also known as chessboard distance because of the moves a king can make. Mathematically, this can be defined as:

So, let's say, for example, that we have two vectors, vector X (0,2,3,-4) and vector Y (6,5,4,1). The Chebyshev distance will be maximum (|0-6|,|2-5|,|3-4|,|-4-1|), which is 6.

This class is defined as ChebyshevDistanceMeasure under the same distance package in Mahout.

The CosineDistanceMeasure Cosine distance is a little different from the distance measures that we have studied so far. Cosine distance measures the similarity of the two vectors as the measure of the cosine of the angle between two vectors. Consider the following figure for more clarity. We have two vectors A and B, and the angle between them is θ.

Cosine of the angle will be close to 1 for a smaller angle and decreases for a larger angle. Cosine for 900 is 0 and cosine for 1800 is -1. So, vectors in the same directions have a similarity of 1, and those in the opposite direction have a similarity of -1. Mathematically, it is defined as:

Subtraction of 1 is used to provide a proper distance so vectors close to each other (00) will provide 0, and those opposite each other (1800) will provide 2. So, in this distance, instead of the vector length, their directions matters.

MinkowskiDistanceMeasure is a generalization of the Euclidean and Manhattan distance. This distance is defined in the MinkowskiDistanceMeasure class under the same distance package of Mahout. Mathematically, it is defined as:

If you see, for c=1, it is the Manhattan distance measure, and for c=2, it is the Euclidean distance measure.

TanimotoDistanceMeasure: In cosine distance, we take only the angle between the vectors into account but for the tanimoto distance measure, we also take the relative distance of vectors into account. Mathematically, it can be defined as:

This is defined as the TanimotoDistanceMeasure class under the org.apache.mahout.common.distance package.