HBase Administration Cookbook

By: Yifeng Jiang

Overview of this book

As an open source, distributed big data store, HBase scales to billions of rows and millions of columns, and runs on top of clusters of commodity machines. If you are looking for a way to store and access a huge amount of data in real time, look no further than HBase.

HBase Administration Cookbook provides practical examples and simple step-by-step instructions to help you administer HBase with ease. The recipes cover a wide range of processes for managing a fully distributed, highly available HBase cluster on the cloud. Working with such a huge amount of data means that an organized and manageable process is key, and this book will help you achieve that.

The recipes in this practical cookbook start from setting up a fully distributed HBase cluster and moving data into it. You will learn how to use all of the tools for day-to-day administration tasks, as well as how to efficiently manage and monitor the cluster to achieve the best performance possible. Understanding the relationship between Hadoop and HBase will allow you to get the best out of HBase, so the book also shows you how to set up Hadoop clusters, configure Hadoop to cooperate with HBase, and tune its performance.
Table of Contents (16 chapters)
HBase Administration Cookbook
Credits
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Preface

Setting up HBase


A fully distributed HBase instance has one or more master nodes (HMaster), and many slave nodes (RegionServer) running on HDFS. It uses a reliable ZooKeeper ensemble to coordinate all the components of the cluster, including masters, slaves, and clients.

It is not necessary to run HMaster on the same server as the HDFS NameNode, but for a small cluster it is typical to run them on the same server, for ease of management. RegionServers are usually configured to run on the same servers as the HDFS DataNodes. Running a RegionServer alongside a DataNode also has the advantage of data locality: eventually, the DataNode on that server will hold a local copy of all the data that the RegionServer serves.

This recipe describes the setup of a fully distributed HBase. We will set up one HMaster on master1, and three region servers (slave1 to slave3). We will also set up an HBase client on client1.

Getting ready

First, make sure Java is installed on all servers of the cluster.

We will use the hadoop user as the owner of all HBase daemons and files. All HBase files and data will be stored under /usr/local/hbase. Create this directory on all servers of your HBase cluster in advance.

We will set up one HBase client on client1. Therefore, the Java installation, hadoop user, and directory should be prepared on client1 too.

Make sure HDFS is running. You can ensure it started properly by accessing HDFS, using the following command:

hadoop@client1$ $HADOOP_HOME/bin/hadoop fs -ls /

MapReduce does not need to be started, as HBase does not normally use it.

We assume that you are managing your own ZooKeeper ensemble; in that case, start it and confirm that it is running properly by sending the ruok command to its client port:

hadoop@client1$ echo ruok | nc master1 2181
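A healthy server answers ruok with imok. If you want to script this check, here is a minimal sketch; the nc invocation in the comment mirrors the command above, with the host and port taken from this recipe:

```shell
# zk_ok tests a ZooKeeper four-letter-word reply: a healthy server
# answers "ruok" with exactly "imok".
zk_ok() {
  [ "$1" = "imok" ]
}

# Typical use against the quorum host from this recipe:
#   reply=$(echo ruok | nc -w 2 master1 2181)
#   zk_ok "$reply" && echo "ZooKeeper is healthy"
```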

How to do it...

To set up our fully distributed HBase cluster, we will download and configure HBase on the master node first, and then sync to all slave nodes and clients.

Get the latest stable HBase release from HBase's official site, http://www.apache.org/dyn/closer.cgi/hbase/.

At the time of writing this book, the current stable release was 0.92.1.

  1. Download the tarball and decompress it to our root directory for HBase. Also, set an HBASE_HOME environment variable to make the setup easier:

    hadoop@master1$ cd /usr/local/hbase
    hadoop@master1$ tar xfvz hbase-0.92.1.tar.gz
    hadoop@master1$ ln -s hbase-0.92.1 current
    hadoop@master1$ export HBASE_HOME=/usr/local/hbase/current
    
  2. We will use /usr/local/hbase/var as a temporary directory for HBase on the local filesystem. Remove it first if you created it for a standalone HBase installation, then recreate it:

    hadoop@master1$ rm -rf /usr/local/hbase/var
    hadoop@master1$ mkdir -p /usr/local/hbase/var
    
  3. To tell HBase where the Java installation is, set JAVA_HOME in the HBase environment setting file (hbase-env.sh):

    hadoop@master1$ vi $HBASE_HOME/conf/hbase-env.sh
    # The java implementation to use. Java 1.6 required.
    export JAVA_HOME=/usr/local/jdk1.6
    
  4. Set up HBase to use the independent ZooKeeper ensemble:

    hadoop@master1$ vi $HBASE_HOME/conf/hbase-env.sh
    # Tell HBase whether it should manage it's own instance of ZooKeeper or not.
    export HBASE_MANAGES_ZK=false
    
  5. Add these settings to HBase's configuration file (hbase-site.xml):

    hadoop@master1$ vi $HBASE_HOME/conf/hbase-site.xml
    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>hdfs://master1:8020/hbase</value>
      </property>
      <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
      </property>
      <property>
        <name>hbase.tmp.dir</name>
        <value>/usr/local/hbase/var</value>
      </property>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>master1</value>
      </property>
    </configuration>
    
  6. Configure the slave nodes of the cluster:

    hadoop@master1$ vi $HBASE_HOME/conf/regionservers
    slave1
    slave2
    slave3
    
  7. Link the HDFS configuration file (hdfs-site.xml) to HBase's configuration folder (conf), so that HBase can see the HDFS client configuration of your Hadoop cluster:

    hadoop@master1$ ln -s $HADOOP_HOME/conf/hdfs-site.xml $HBASE_HOME/conf/hdfs-site.xml
    
  8. Copy the hadoop-core and ZooKeeper JAR files, and their dependencies, from your Hadoop and ZooKeeper installations:

    hadoop@master1$ rm -i $HBASE_HOME/lib/hadoop-core-*.jar
    hadoop@master1$ rm -i $HBASE_HOME/lib/zookeeper-*.jar
    hadoop@master1$ cp -i $HADOOP_HOME/hadoop-core-*.jar $HBASE_HOME/lib/
    hadoop@master1$ cp -i $HADOOP_HOME/lib/commons-configuration-1.6.jar $HBASE_HOME/lib/
    hadoop@master1$ cp -i $ZK_HOME/zookeeper-*.jar $HBASE_HOME/lib/
    
  9. Sync all the HBase files under /usr/local/hbase from the master to the same directory on the client and slave nodes.

  10. Start the HBase cluster from the master node:

    hadoop@master1$ $HBASE_HOME/bin/start-hbase.sh
    
  11. Connect to your HBase cluster from the client node:

    hadoop@client1$ $HBASE_HOME/bin/hbase shell

    You can also access the HBase web UI from your browser. Make sure your master server's port 60010 is open. The URL is http://master1:60010/master.jsp.

  12. Stop the HBase cluster from the master node:

    hadoop@master1$ $HBASE_HOME/bin/stop-hbase.sh
    
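The sync in step 9 can be sketched as a loop over the client and every host listed in conf/regionservers. This sketch assumes rsync and passwordless SSH for the hadoop user; set DRY_RUN=1 to preview the commands before running them for real:

```shell
# Sketch of step 9: push /usr/local/hbase from the master to each node.
# Assumes rsync and passwordless SSH; any copy tool would work.
sync_hbase() {
  # $1 = path to the regionservers file; remaining args = extra hosts
  regionservers_file=$1
  shift
  for host in "$@" $(cat "$regionservers_file"); do
    cmd="rsync -az --delete /usr/local/hbase/ hadoop@$host:/usr/local/hbase/"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$cmd"    # preview only
    else
      $cmd
    fi
  done
}

# Example: DRY_RUN=1 sync_hbase $HBASE_HOME/conf/regionservers client1
```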

How it works...

Our HBase cluster is configured to use /hbase as its root directory on HDFS, by specifying the hbase.rootdir property. Because this is the first time HBase is started, it creates the directory automatically. You can see the files HBase created on HDFS from the client:

hadoop@client1$ $HADOOP_HOME/bin/hadoop fs -ls /hbase

We want our HBase to run on distributed mode, so we set hbase.cluster.distributed to true in hbase-site.xml.

We also set up the cluster to use an independent ZooKeeper ensemble by specifying HBASE_MANAGES_ZK=false in hbase-env.sh. The ZooKeeper ensemble is specified by the hbase.zookeeper.quorum property. You can use a clustered ZooKeeper ensemble by listing all the servers of the ensemble, such as zoo1,zoo2,zoo3.
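For example, a three-node ensemble would be configured in hbase-site.xml like this (the hostnames zoo1 to zoo3 are placeholders for your own ZooKeeper servers):

```xml
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zoo1,zoo2,zoo3</value>
</property>
```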

All region servers are configured in the $HBASE_HOME/conf/regionservers file. You should use one line per region server. When starting the cluster, HBase will SSH into each region server configured here, and start the HRegionServer daemon on that server.
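The host-list parsing can be sketched roughly as follows: one hostname per line, with blank lines ignored. Skipping comment lines is an assumption of this sketch; to be safe, keep the file to plain hostnames only:

```shell
# Sketch: read a regionservers-style file, one host per line,
# skipping blank lines and comment lines.
list_regionservers() {
  grep -v '^[[:space:]]*#' "$1" | grep -v '^[[:space:]]*$'
}

# Example: list_regionservers $HBASE_HOME/conf/regionservers
```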

By linking hdfs-site.xml under the $HBASE_HOME/conf directory, HBase will use all the client configurations you made for your HDFS in hdfs-site.xml, such as the dfs.replication setting.

HBase ships with prebuilt hadoop-core and ZooKeeper JAR files. They may be out of date compared to the ones used in your Hadoop and ZooKeeper installations. Make sure HBase uses the same versions of these JAR files as your Hadoop and ZooKeeper, to avoid unexpected problems.
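One quick way to compare versions is to strip them out of the JAR file names on both sides. A small sketch, assuming the usual hadoop-core-&lt;version&gt;.jar naming layout:

```shell
# jar_version extracts "1.0.3" from a path like .../hadoop-core-1.0.3.jar.
jar_version() {
  basename "$1" .jar | sed 's/^.*-\([0-9][0-9.]*\)$/\1/'
}

# Example comparison between the cluster's jar and the one under HBase's lib:
#   [ "$(jar_version $HADOOP_HOME/hadoop-core-*.jar)" = \
#     "$(jar_version $HBASE_HOME/lib/hadoop-core-*.jar)" ] || echo "version mismatch"
```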