Apache Mahout Essentials

By: Jayani Withanawasam

Summary


Clustering is an unsupervised learning technique: it groups data without labeled training examples, so it requires minimal human effort. It has applications in many areas, such as medical image processing, market segmentation, and information retrieval.

Clustering techniques can be classified by different criteria into types such as hard, soft, flat, hierarchical, and model-based clustering.
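
To illustrate the difference between hard and soft clustering, here is a minimal sketch in plain Python. It is not Mahout code; the function names are hypothetical, and the soft memberships use the standard Fuzzy K-Means membership formula with an assumed fuzziness factor of m = 2. Hard clustering assigns a point to exactly one cluster, whereas soft clustering gives it a degree of membership in every cluster.

```python
import math

def distances(point, centers):
    # Euclidean distance from the point to every cluster center.
    return [math.dist(point, c) for c in centers]

def hard_assignment(point, centers):
    # Hard clustering: the point belongs to exactly one cluster (the nearest).
    d = distances(point, centers)
    return d.index(min(d))

def soft_memberships(point, centers, m=2.0):
    # Soft clustering: the point gets a membership degree for every cluster.
    # m > 1 is the fuzziness factor (m = 2.0 is an assumed default).
    d = distances(point, centers)
    if any(x == 0.0 for x in d):
        # The point coincides with a center: give it full membership there.
        zero = [1.0 if x == 0.0 else 0.0 for x in d]
        total = sum(zero)
        return [z / total for z in zero]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in d) for d_i in d]

centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

hard = hard_assignment(point, centers)   # index of the nearest center
soft = soft_memberships(point, centers)  # memberships that sum to 1
print(hard, soft)
```

The memberships always sum to 1, and the nearest center receives the largest share.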

Apache Mahout implements several clustering algorithms, which can be executed sequentially or in parallel (using MapReduce).

The K-Means algorithm is simple, fast, and widely applied. However, it does not suit every situation. For such scenarios, Apache Mahout implements other algorithms, such as canopy, Fuzzy K-Means, streaming K-Means, and spectral clustering.
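
As a reminder of why K-Means is considered simple, the following is a minimal sequential sketch in plain Python. It is not Mahout's implementation; the function name `kmeans`, the sample points, and the initial centers are all assumptions made for illustration. The algorithm alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points until the centers stop changing.

```python
import math

def kmeans(points, initial_centers, max_iterations=100):
    centers = [tuple(c) for c in initial_centers]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest center.
        assignments = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Update step: move every center to the mean of its members.
        new_centers = []
        for i, old_center in enumerate(centers):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                mean = tuple(sum(coords) / len(members) for coords in zip(*members))
                new_centers.append(mean)
            else:
                # Keep an empty cluster's center where it was.
                new_centers.append(old_center)
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, assignments

# Hypothetical sample data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, assignments = kmeans(points, initial_centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers, assignments)
```

The result depends on the initial centers and on the chosen number of clusters, which is one reason alternatives such as canopy clustering are useful for picking starting points.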

Text clustering is an important area of clustering that requires special preprocessing steps, such as stop word removal, stemming, and TF-IDF vector generation. Topic modeling is a special case of...
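
The text preprocessing steps mentioned above can be sketched in plain Python as follows. This is an illustration only, not Mahout's vectorization pipeline: the stop word list is a tiny hypothetical one, stemming is omitted for brevity, punctuation is not handled, and the weighting is the textbook tf × log(N / df) formula, which may differ from the exact formula Mahout applies.

```python
import math
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def tokenize(text):
    # Lowercase, split on whitespace, and drop stop words.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tfidf_vectors(documents):
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append(
            {term: count * math.log(n / df[term]) for term, count in tf.items()}
        )
    return vectors

# Hypothetical sample documents.
docs = [
    "the cat chased the mouse",
    "the dog chased the cat",
    "the mouse is small",
]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

Terms that occur in fewer documents receive higher weights, so in the second document "dog" outweighs "cat" and "chased". The resulting sparse vectors are the kind of input a clustering algorithm would then operate on.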