This chapter explains how to set up an HBase cluster, from a basic standalone HBase instance to a fully distributed, highly available HBase cluster on Amazon EC2.
According to Apache HBase's home page:
HBase is the Hadoop database. Use HBase when you need random, real-time, read/write access to your Big Data. This project's goal is the hosting of very large tables—billions of rows X millions of columns—atop clusters of commodity hardware.
HBase can run against any filesystem. For example, you can run HBase on top of an EXT4 local filesystem, Amazon Simple Storage Service (Amazon S3), or the Hadoop Distributed File System (HDFS), which is the primary distributed filesystem for Hadoop. In most cases, a fully distributed HBase cluster runs on an instance of HDFS, so we will explain how to set up Hadoop before proceeding.
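The choice of filesystem is expressed through the hbase.rootdir property in hbase-site.xml, with hbase.cluster.distributed switching between standalone and distributed mode. As a minimal sketch, the following Python helper renders such a file for a local filesystem and for HDFS. The helper name hbase_site, the path /tmp/hbase, and the host namenode.example.com with port 8020 are illustrative assumptions, not values from this chapter.

```python
import xml.etree.ElementTree as ET


def hbase_site(rootdir: str, distributed: bool) -> str:
    """Render a minimal hbase-site.xml body as a string (hypothetical helper)."""
    conf = ET.Element("configuration")
    properties = (
        ("hbase.rootdir", rootdir),
        ("hbase.cluster.distributed", str(distributed).lower()),
    )
    for name, value in properties:
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")


# Standalone instance on a local filesystem (path is an assumption).
standalone = hbase_site("file:///tmp/hbase", False)

# Distributed cluster on HDFS (host and port are assumptions).
cluster = hbase_site("hdfs://namenode.example.com:8020/hbase", True)

print(standalone)
print(cluster)
```

Only the URI scheme and the distributed flag differ between the two configurations, which is why moving from a standalone instance to a cluster is largely a matter of pointing HBase at a running HDFS.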
Apache ZooKeeper is open source software that provides a highly reliable, distributed coordination service. A distributed HBase cluster depends on a running ZooKeeper cluster.
HBase, as a database that runs on Hadoop, keeps a large number of files open at the same time. We need to change some Linux kernel settings to run HBase smoothly.
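Before changing anything, it can help to see what the current open-file limit actually is. The following sketch uses Python's standard resource module (Unix only) to read the limit for the current process. The threshold of 10240 is an assumption used for illustration; pick the value that suits your cluster. The function names are hypothetical as well.

```python
import resource

# Assumed target for the open-file limit; adjust for your own cluster.
RECOMMENDED_NOFILE = 10240


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_sufficient(soft, recommended=RECOMMENDED_NOFILE):
    """True if the soft limit is unlimited or at least the recommended value."""
    return soft == resource.RLIM_INFINITY or soft >= recommended


soft, hard = nofile_limits()
print("soft limit:", soft, "hard limit:", hard)
if not is_sufficient(soft):
    print("Open-file limit may be too low for HBase; consider raising it.")
```

Run the check as the user that will start HBase and Hadoop, because limits are applied per user and per process.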
A fully distributed HBase cluster has one or more master nodes (HMaster), which coordinate the entire cluster, and many slave nodes (RegionServer), which handle the actual data storage and requests. The following diagram shows a typical HBase cluster structure:
HBase can run multiple master nodes at the same time, and uses ZooKeeper to monitor the masters and fail over between them. But because HBase uses HDFS as its underlying filesystem, if HDFS is down, HBase is down too. The master node of HDFS, which is called NameNode, is the Single Point Of Failure (SPOF) of HDFS, so it is also the SPOF of an HBase cluster. However, the NameNode software is very robust and stable. Moreover, the HDFS team is working hard on a real HA NameNode, which is expected to be included in Hadoop's next major release.
The first seven recipes in this chapter explain how we can get HBase and all its dependencies working together, as a fully distributed HBase cluster. The last recipe explains an advanced topic on how to avoid the SPOF issue of the cluster.
We will start by setting up a standalone HBase instance, and then demonstrate setting up a distributed HBase cluster on Amazon EC2.