Book Image

Rapid - Apache Mahout Clustering designs

Book Image

Rapid - Apache Mahout Clustering designs

Overview of this book

Table of Contents (16 chapters)
Apache Mahout Clustering Designs
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Chapter 1. Understanding Clustering

Reduced hardware cost is giving us the opportunity to save a lot of data. We are now generating a lot of data, and this data can generate interesting patterns for various industries, which is why machine learning and data mining enthusiasts are responsible for this data.

The data from various industries can provide insights that can be very useful for the business. For example, sensor data on cars can be very useful for insurance majors. Data scientists can find out useful information from this data, such as driving speed, time of driving, mileage, breaking, and so on; they can also rate the driver, which in turn would be useful for the insurer to set up the premium.

In the health care industry, data collected from different patients is used to predict different diseases. A well-known use case is to predict whether a tumor will be cancerous or not based on the tumor size and other characteristics.

In bioinformatics, one well-known use case is grouping a homologous sequence into the gene family.

These problems are related to data gathering, finding useful pattern from data, and then enabling the machine to learn to identify patterns from new datasets. In the area of machine learning, learning can broadly be classified into the following three areas:

  • Supervised learning: In supervised learning, each sample consists of the input and the desired output, and we tried to predict the output from the given input set

  • Unsupervised learning: In unsupervised learning, we do not know anything about the output, but from the pattern of data, we can define different collections within the datasets

  • Reinforcement learning: In reinforcement learning, within a given context, machines and software agents automatically determine the ideal behavior

In this book, we will concentrate on unsupervised learning and the different methods applied to this. For the tool perspective, we will use Apache Mahout. You will learn what the algorithms available in Apache Mahout in the area of unsupervised learning are and how to use them.

In this chapter, we will explore following topics:

  • Understanding clustering

  • Applications of clustering

  • Understanding distance measures

  • Understanding different clustering techniques

  • Algorithm support in Apache Mahout

  • Installing Apache Mahout

  • Preparing data for use by clustering techniques