In this chapter, we will cover the following topics:
Data preprocessing (extraction, cleaning, and format conversion) using Hadoop streaming and Python
De-duplicating data using Hadoop streaming
Loading large datasets into an Apache HBase data store – importtsv and bulkload
Creating TF and TF-IDF vectors for text data
Clustering text data using Apache Mahout
Topic discovery using Latent Dirichlet Allocation (LDA)
Document classification using the Mahout Naive Bayes classifier