Rapid - Apache Mahout Clustering designs

Since childhood, we have been grouping similar things together. You can see that kids demand something (such as candies, ice cream, chocolates, toys, a cycle, and so on) in return for a favor. We can see examples in our organizations where we group people (peers, vendors, clients, and so on) based on their work.

The idea of clustering is similar to this. It is a technique that tries to group items together based on some sort of similarity. These groups will be divided in a way that the items in one group have little or no similarity to the items in a different group.

Let's take an example. Suppose you have a folder in your computer that is filled with videos and you know nothing about the content of the videos. One of your friends asks you whether you have the science class lecture video. What will you do in this case? One way is to quickly check the content of the videos, and as soon as you find the relevant video you will share it. Another way of doing this is to create a subfolder inside your main folder and categorize the videos by movies, songs, education, and so on. You can further organize movies as action, romance, thriller, and so on. If you think again, you have actually solved one of your clustering problems—grouping similar items together in such a way that they are similar in one group but different from the other group.

Now, let's address a major question that beginners usually ask at this stage—how is it is different from classification? Take your video folder example again; in the case of classification, subfolders of movies, songs, and education will already be there, and based on the content, you will put your new videos into relevant folders. However, in the case of clustering, we were not aware of the folders (or labels in machine learning terms) based on the content, we divided the videos and later assigned the label to them.

As you can see in the following figure, we have different points and based on shapes, we group them into different clusters:

The clustering task can be divided into the following:

The pattern finding algorithm: We will discuss these algorithms throughout this book; they are K-means, spectral clustering, and so on. The basic idea is that we should have an algorithm that can detect patterns in given datasets.
The distance measuring technique: To calculate the closeness of different items, there are certain measures in place, such as Euclidean distance, cosine distance measure, and so on. We will discuss different techniques of distance measure in this chapter.
Grouping and Stopping: With the algorithm and distance measure techniques, items will be grouped in different clusters, and based on the conditions, such as the elbow method, cross validation, and so on, we will stop further grouping.
Analysis of Output: Once grouping is complete, we have a measuring technique to determine how well our model performed. We can use techniques such as F-measure, the Jaccard index, and so on to evaluate the cluster. We will discuss these techniques with the respective algorithm discussions.

Application of clustering

Clustering is used in a wide area of applications such as bioinformatics, web search, image pattern recognition, and sequence analysis. There are a number of fields where clustering is used. Some of them are as follows:

For marketing: Clustering can be useful to segment customers based on geographical location, age, and consumption patterns. Clustering can be used to create a 360 degree view of customers, which is useful for customer relationship management. This is useful in creating new customers, retaining existing customers, launching new products, and in product positioning.
For recommendations: Clusters are very useful in creating recommendation system applications. Recommender systems are applications that suggest new items to customers based on their previous search or based on the item purchased by similar customers. Clustering is useful to create groups of customers based on their preferences.
Image segmentation: Clustering is used to partition the image into multiple regions and pixel in each region shares the common properties.
In bioinformatics: Clustering is used in many areas of bio-informatics and one major area is human genome clustering—identifying patterns in a genome, which leads to discover the cure for disease.
Clustering in web search: Clustering is useful to group results, resulted after keyword search on the Web.

Clustering is useful in almost all industries, and clustering is studied and researched in different areas such as data mining, machine learning, statistics, databases, biology, astrophysics, and many other fields.

Now, we will move on to the next section where we will discuss the different distance measuring technique that we used to find the similarity, or dissimilarity, between two data points as numeric values.

Rapid - Apache Mahout Clustering designs

Rapid - Apache Mahout Clustering designs

Overview of this book

Related Content you might be interested in

Current Title:

Rapid - Apache Mahout Clustering designs

The clustering concept

Application of clustering