Understanding different clustering techniques


A number of clustering techniques are available in the fields of machine learning and data mining, and different algorithms are built on each of them. Let's look at these techniques:

Hierarchical methods

In this clustering method, the given data is divided hierarchically. To help you understand this, let's take the example of the animal class hierarchy: we have two groups, invertebrates and vertebrates, but we can combine them into one animal class. Hierarchical clustering can be done using two approaches:

  • The top-down approach: This is also called the divisive approach. In this approach, all the data points start in a single cluster, and in each iteration a cluster is further divided into sub-clusters. This process goes on until a termination condition is met.

  • The bottom-up approach: This is also called the agglomerative approach. It starts with each data point in its own cluster and successively merges the closest clusters until all sub-clusters are merged into one cluster (a minimal sketch of this approach follows the example below).

    We can take a real-life example from an organizational structure: given all the employees' data, we can divide them into clusters such as finance, HR, operations, and so on. The main pain point of hierarchical clustering is deciding on the merge or split points, because once a merge or a split is done, it cannot be undone.
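
The following is a minimal plain-Java sketch of the bottom-up (agglomerative) approach on one-dimensional points, using single-linkage distance (the distance between the closest pair of members across two clusters). The data values and the stopping condition are illustrative assumptions, not Mahout code:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class AgglomerativeSketch {
        public static void main(String[] args) {
            // Illustrative data: start with each point in its own cluster.
            double[] points = {1.0, 1.5, 5.0, 5.2, 9.9};
            List<List<Double>> clusters = new ArrayList<>();
            for (double p : points) clusters.add(new ArrayList<>(Arrays.asList(p)));

            // Successively merge the two closest clusters until two remain.
            while (clusters.size() > 2) {
                int bestI = 0, bestJ = 1;
                double best = Double.MAX_VALUE;
                for (int i = 0; i < clusters.size(); i++) {
                    for (int j = i + 1; j < clusters.size(); j++) {
                        double d = singleLinkage(clusters.get(i), clusters.get(j));
                        if (d < best) { best = d; bestI = i; bestJ = j; }
                    }
                }
                clusters.get(bestI).addAll(clusters.remove(bestJ));
            }
            System.out.println(clusters);   // prints the two remaining clusters
        }

        // Single linkage: distance between the closest pair of members.
        static double singleLinkage(List<Double> a, List<Double> b) {
            double min = Double.MAX_VALUE;
            for (double x : a)
                for (double y : b)
                    min = Math.min(min, Math.abs(x - y));
            return min;
        }
    }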

The partitioning method

In this method, we select a number, k, and create k clusters. Partitioning methods are generally distance-based.

The partitioning method involves a few steps to arrive at a good partitioning. First, it creates an initial partitioning; after that, it calculates the similarity of items using a distance measure. It then iteratively relocates objects to other groups based on the similarity it calculates. A partitioning is said to be good if it keeps similar items in the same cluster while items in different clusters are dissimilar to each other.

K-means is a very good example of this method. It is used in many areas, such as human genetic clustering, shopping cart item clustering, and so on. We will discuss these algorithms in the upcoming chapter.
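
To make the assign-and-relocate iteration concrete, here is a minimal one-dimensional k-means sketch in plain Java. The sample values, the choice of k = 2, the initial centroids, and the fixed iteration count are illustrative assumptions; Mahout's own k-means implementation is covered in the upcoming chapter:

    import java.util.Arrays;

    public class KMeansSketch {
        public static void main(String[] args) {
            // Illustrative data and an initial partitioning with k = 2.
            double[] points = {1.0, 1.5, 5.0, 5.2, 9.9, 10.1};
            double[] centroids = {1.0, 10.1};
            int[] assignment = new int[points.length];

            for (int iter = 0; iter < 10; iter++) {
                // Assignment step: relocate each point to its nearest centroid.
                for (int p = 0; p < points.length; p++) {
                    assignment[p] = Math.abs(points[p] - centroids[0])
                                  < Math.abs(points[p] - centroids[1]) ? 0 : 1;
                }
                // Update step: recompute each centroid as the mean of its members.
                for (int c = 0; c < centroids.length; c++) {
                    double sum = 0;
                    int count = 0;
                    for (int p = 0; p < points.length; p++) {
                        if (assignment[p] == c) { sum += points[p]; count++; }
                    }
                    if (count > 0) centroids[c] = sum / count;
                }
            }
            System.out.println(Arrays.toString(centroids));   // final cluster centres
            System.out.println(Arrays.toString(assignment));  // cluster of each point
        }
    }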

The density-based method

Density-based clustering groups together points that are closely packed. For each data point within a given cluster, a neighborhood of a given radius has to contain at least a minimum number of points; both the radius and the minimum count are given as input to the algorithm. Such a method can be used to filter out noise and outliers. DBSCAN is a very popular algorithm in this area: it can detect clusters of arbitrary shape, and it can detect outliers in the data. However, it should not be used for high-dimensional datasets, where distance-based neighborhoods become less meaningful.
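
The following is a condensed plain-Java sketch of the DBSCAN idea on one-dimensional points: a cluster grows outward from any point whose neighborhood is dense enough, and points that never qualify are reported as noise. The eps radius, minPts threshold, and sample data are illustrative assumptions, not Mahout's implementation:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class DbscanSketch {
        static final int NOISE = -1, UNVISITED = 0;

        public static void main(String[] args) {
            // Illustrative data: two dense groups and one isolated point.
            double[] points = {1.0, 1.2, 1.4, 5.0, 5.1, 5.3, 20.0};
            double eps = 0.5;   // neighborhood radius (given as input)
            int minPts = 2;     // minimum points for a dense neighborhood
            int[] label = new int[points.length];   // 0 = unvisited
            int cluster = 0;

            for (int p = 0; p < points.length; p++) {
                if (label[p] != UNVISITED) continue;
                List<Integer> seeds = neighbours(points, p, eps);
                if (seeds.size() < minPts) { label[p] = NOISE; continue; }
                label[p] = ++cluster;   // p is dense: start a new cluster
                // Expand the cluster from each density-reachable point.
                for (int i = 0; i < seeds.size(); i++) {
                    int q = seeds.get(i);
                    if (label[q] == NOISE) label[q] = cluster;   // border point
                    if (label[q] != UNVISITED) continue;
                    label[q] = cluster;
                    List<Integer> qn = neighbours(points, q, eps);
                    if (qn.size() >= minPts) seeds.addAll(qn);   // q is dense too
                }
            }
            System.out.println(Arrays.toString(label));   // [1, 1, 1, 2, 2, 2, -1]
        }

        // All indices within eps of point p (p itself included).
        static List<Integer> neighbours(double[] pts, int p, double eps) {
            List<Integer> n = new ArrayList<>();
            for (int i = 0; i < pts.length; i++)
                if (Math.abs(pts[i] - pts[p]) <= eps) n.add(i);
            return n;
        }
    }

In the printed output, the isolated point at 20.0 is labeled -1: no neighborhood of radius eps around it ever contains minPts points, so it is treated as noise rather than forced into a cluster.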

Probabilistic clustering

In probabilistic clustering, we take up problems where we know that the data comes from a mixture of different probability distributions. A probabilistic clustering algorithm takes a probabilistic model and tries to fit the data to that model.

The model we build is fitted to the given dataset. The model is said to be right if it fits the data well and the number of clusters it implies is in line with the given dataset.

We test whether a dataset fits the model by calculating the probability that the model generated the observed data vectors. Because each point receives a probability of membership in each cluster rather than a hard assignment, this is also called soft clustering.
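
To illustrate the soft-assignment idea, here is a compact plain-Java sketch of expectation-maximization for a two-component, one-dimensional Gaussian mixture. The data and the starting parameters are illustrative assumptions; the point is that every data point ends up with a probability of belonging to each cluster rather than a hard label:

    import java.util.Arrays;

    public class GmmSketch {
        public static void main(String[] args) {
            // Illustrative data drawn around two separated centres.
            double[] x = {1.0, 1.4, 1.8, 7.6, 8.0, 8.4};
            // Illustrative starting means, variances, and mixing weights.
            double[] mu = {2.0, 7.0}, var = {1.0, 1.0}, w = {0.5, 0.5};
            double[][] r = new double[x.length][2];   // soft memberships

            for (int iter = 0; iter < 50; iter++) {
                // E-step: probability that each point came from each component.
                for (int i = 0; i < x.length; i++) {
                    double p0 = w[0] * gauss(x[i], mu[0], var[0]);
                    double p1 = w[1] * gauss(x[i], mu[1], var[1]);
                    r[i][0] = p0 / (p0 + p1);
                    r[i][1] = p1 / (p0 + p1);
                }
                // M-step: refit each component to its softly assigned points.
                for (int k = 0; k < 2; k++) {
                    double nk = 0, mean = 0, sq = 0;
                    for (int i = 0; i < x.length; i++) {
                        nk += r[i][k];
                        mean += r[i][k] * x[i];
                    }
                    mean /= nk;
                    for (int i = 0; i < x.length; i++)
                        sq += r[i][k] * (x[i] - mean) * (x[i] - mean);
                    mu[k] = mean;
                    var[k] = Math.max(sq / nk, 1e-6);   // avoid variance collapse
                    w[k] = nk / x.length;
                }
            }
            System.out.println("means: " + Arrays.toString(mu));
            System.out.println("soft membership of x[0]: " + Arrays.toString(r[0]));
        }

        // Gaussian probability density at x for the given mean and variance.
        static double gauss(double x, double mu, double var) {
            return Math.exp(-(x - mu) * (x - mu) / (2 * var))
                   / Math.sqrt(2 * Math.PI * var);
        }
    }

The printed membership row for x[0] is a pair of probabilities summing to one: this is the soft assignment that distinguishes probabilistic clustering from the hard assignments produced by k-means.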