Hadoop MapReduce v2 Cookbook - Second Edition: RAW

Chapter 10. Mass Text Data Processing

In this chapter, we will cover the following topics:

  • Data preprocessing (extract, clean, and format conversion) using Hadoop streaming and Python

  • De-duplicating data using Hadoop streaming

  • Loading large datasets to an Apache HBase data store – importtsv and bulkload

  • Creating TF and TF-IDF vectors for the text data

  • Clustering text data using Apache Mahout

  • Topic discovery using Latent Dirichlet Allocation (LDA)

  • Document classification using Mahout Naive Bayes Classifier
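As a rough illustration of the streaming model behind the de-duplication topic above, here is a minimal sketch of a mapper and reducer in Python. It is not the book's recipe: the function names (`mapper`, `reducer`, `dedupe`) and the normalization rule (lowercase, collapsed whitespace) are assumptions made for this example, and the sort step only simulates locally what Hadoop's shuffle phase would do between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Emit (key, record) pairs. The key is the record lowercased with
    whitespace collapsed (an assumed normalization rule)."""
    for line in lines:
        record = line.rstrip("\n")
        if not record.strip():
            continue  # skip blank records
        key = " ".join(record.lower().split())
        yield key, record


def reducer(sorted_pairs):
    """Keep the first record of each key group. The input must already be
    sorted by key, as it would be after the shuffle phase."""
    for _, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield next(group)[1]


def dedupe(lines):
    """Local stand-in for a streaming job: map, sort by key, reduce."""
    pairs = sorted(mapper(lines), key=lambda kv: kv[0])
    return list(reducer(pairs))


if __name__ == "__main__":
    sample = ["Hello  World\n", "hello world\n", "\n", "Other\n"]
    for record in dedupe(sample):
        print(record)
```

In an actual streaming job the mapper and reducer would typically be separate scripts that read from standard input and write tab-separated key/value lines to standard output; the in-process version here only keeps the logic easy to test.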