Rapid - Apache Mahout Clustering designs

Performance tuning for the job


A close look at a Mahout job shows that it can create both CPU and network bottlenecks. Distance computation and vectorization are CPU-bound activities, while transmitting centroids to the reducer is a network-bound activity. Monitoring the job's CPU, network, and disk usage makes it possible to spot these pitfalls and avoid them.
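To make the two bottlenecks concrete, here is a rough back-of-the-envelope sketch. It assumes a k-means-style iteration in which every point is compared with every centroid, and in which each map task sends one dense partial centroid per cluster to the reducer. The function names and the example figures are hypothetical and chosen only for illustration; they are not taken from Mahout.

```python
def distance_operations(n_points, k_clusters, dims):
    """Per-iteration element operations needed to compare every point
    with every centroid (assumption: one operation per dimension)."""
    return n_points * k_clusters * dims


def centroid_shuffle_bytes(map_tasks, k_clusters, dims, bytes_per_value=8):
    """Bytes sent to the reducer per iteration, assuming each map task
    emits one dense partial centroid of 8-byte doubles per cluster."""
    return map_tasks * k_clusters * dims * bytes_per_value


# Hypothetical job: 1,000,000 points, 100 clusters, 1,000 dimensions, 50 map tasks.
cpu_ops = distance_operations(1_000_000, 100, 1_000)
net_bytes = centroid_shuffle_bytes(50, 100, 1_000)

print(cpu_ops)    # element operations per iteration (CPU side)
print(net_bytes)  # bytes shuffled per iteration (network side)
```

Both figures grow with the number of clusters and the dimensionality, which is why these are the parameters worth watching when a job slows down.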

Mahout supports different vector representations of data, such as dense vectors and sparse vectors. By definition, a dense vector stores a value for every position, including a zero for each non-existing element. If the data is very sparse, a dense vector therefore serializes many unnecessary zeros and slows the job down, so in this case a sparse vector representation is the better choice. When selecting a sparse vector, also choose the implementation based on the distance measure. For example, Sequential Sparse Vector is best suited for the cosine distance measure because there is a need...
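The cost of storing zeros can be illustrated with a minimal sketch. The encoding below is an assumption made for illustration (8-byte doubles for the dense form, a 4-byte count followed by 4-byte index and 8-byte value pairs for the sparse form); it is not Mahout's actual serialization format, and the function names are hypothetical.

```python
import struct


def dense_bytes(values):
    """Serialize every element, zeros included, as big-endian 8-byte doubles."""
    return struct.pack(f">{len(values)}d", *values)


def sparse_bytes(values):
    """Serialize only non-zero elements: a 4-byte count, then one
    (4-byte index, 8-byte double) pair per non-zero element."""
    pairs = [(i, v) for i, v in enumerate(values) if v != 0.0]
    out = struct.pack(">i", len(pairs))
    for index, value in pairs:
        out += struct.pack(">id", index, value)
    return out


# A 1,000-dimensional vector with only three non-zero entries.
vector = [0.0] * 1000
vector[3], vector[250], vector[999] = 1.5, 2.0, 0.5

dense_size = len(dense_bytes(vector))
sparse_size = len(sparse_bytes(vector))
print(dense_size, sparse_size)  # prints "8000 40"
```

Under these assumptions the dense form is 200 times larger for the same information, and that difference is paid every time the vector is written to disk or sent over the network.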