The Apache HBase data store is very useful for storing large-scale data in a semi-structured manner, so that the data can be processed further using Hadoop MapReduce programs or served as random-access storage for client applications. In this recipe, we are going to import a large text dataset into HBase using the importtsv and completebulkload tools.
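The two tools divide the work: importtsv parses tab-separated input (and, with the -Dimporttsv.bulk.output option, writes HFiles instead of issuing puts), while completebulkload moves those HFiles into the table's region servers. As a quick illustration of the input layout importtsv expects, the following sketch (file name and row keys are hypothetical, not from the recipe) builds a two-row TSV file whose first column serves as the HBase row key:

```shell
# Build a tiny TSV sample in the layout importtsv expects:
# first column = HBase row key, remaining columns = cell values.
printf 'doc001\trec.sport.hockey\tFirst cleaned article body\n'  > sample.tsv
printf 'doc002\tsci.space\tSecond cleaned article body\n'       >> sample.tsv

# The row keys are simply the first tab-separated field:
cut -f1 sample.tsv
```

Against real data in HDFS, the corresponding invocation would look something like `hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,msg:group,msg:text 20news <hdfs-input-dir>` (the table name `20news` and the `msg` column family here are assumptions for illustration), followed by a completebulkload run if bulk output was used.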
Install and deploy Apache HBase in your Hadoop cluster.
Make sure Python is installed on your Hadoop compute nodes.
The following steps show you how to load the 20news dataset, converted to TSV (tab-separated values) format, into an HBase table:
Follow the Data preprocessing using Hadoop streaming and Python recipe to preprocess the data for this recipe. We assume that the output of step 4 of that recipe is stored in an HDFS folder named "20news-cleaned":

$ hadoop jar \
/usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
...
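One detail worth checking before the import: importtsv treats tab and newline characters as structural, so any tabs embedded in the article bodies will shift columns and corrupt rows. A minimal local sketch of such a clean-up (the file names and the sample record are hypothetical; in the recipe this kind of sanitization belongs inside the streaming job):

```shell
# Sample raw record whose body contains an embedded tab.
printf 'doc003\tBody with an\tembedded tab\n' > raw.tsv

# Keep the first tab (the key/value separator) and replace any
# further tabs in the value with single spaces.
awk -F'\t' '{ out = $2; for (i = 3; i <= NF; i++) out = out " " $i;
              print $1 "\t" out }' raw.tsv > clean.tsv

cat clean.tsv
```

After the clean-up each line has exactly two tab-separated fields, which maps cleanly onto a HBASE_ROW_KEY column plus one value column.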