Hadoop MapReduce, together with its supporting set of projects, is a good framework of choice for processing large text datasets and for performing extract-transform-load (ETL) operations.
In this chapter, we'll explore how to use Hadoop Streaming to perform data preprocessing operations such as data extraction, format conversion, and de-duplication. We'll also use HBase as the data store and explore mechanisms for performing large bulk data loads into HBase with minimal overhead. Finally, we'll look at performing text analytics using Apache Mahout algorithms.
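As a taste of the streaming-based preprocessing covered later, one of the steps mentioned above, de-duplication, can be sketched as a Hadoop Streaming mapper/reducer pair. This is a minimal illustration, not the book's recipe: the script name and invocation are assumptions, and it relies only on the fact that Streaming sorts mapper output by key before the reduce phase, so duplicate records arrive adjacent to each other.

```python
#!/usr/bin/env python3
# dedup.py (hypothetical name) -- de-duplication via Hadoop Streaming.
# The mapper emits each record as a key; the reducer outputs each
# distinct key once, relying on the sorted reduce input.
import sys


def mapper(lines):
    # Emit the whole record as the key (no value needed).
    for line in lines:
        record = line.rstrip("\n")
        if record:
            yield record


def reducer(sorted_lines):
    # Reduce input is sorted by key, so duplicates are adjacent:
    # emit a record only when it differs from the previous one.
    previous = None
    for line in sorted_lines:
        record = line.rstrip("\n")
        if record != previous:
            yield record
            previous = record


if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    stream = mapper(sys.stdin) if stage == "map" else reducer(sys.stdin)
    for out in stream:
        print(out)
```

With a Hadoop installation, such a script would typically be launched through the streaming JAR (whose path varies by distribution), along the lines of `hadoop jar hadoop-streaming.jar -files dedup.py -mapper "dedup.py map" -reducer "dedup.py reduce" -input in -output out`.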
We will be using the following sample dataset for the recipes in this chapter:
The 20 Newsgroups dataset, available at http://qwone.com/~jason/20Newsgroups, contains approximately 20,000 newsgroup documents originally collected by Ken Lang.
Tip
Sample code
The example code files for this book are available on GitHub at https://github.com/thilg/hcb-v2. The chapter10 folder of the code repository...