HBase Administration Cookbook

By: Yifeng Jiang

Overview of this book

As an open source, distributed big data store, HBase scales to billions of rows with millions of columns, and sits on top of clusters of commodity machines. If you are looking for a way to store and access a huge amount of data in real time, then look no further than HBase.

HBase Administration Cookbook provides practical examples and simple step-by-step instructions for you to administer HBase with ease. The recipes cover a wide range of processes for managing a fully distributed, highly available HBase cluster on the cloud. Working with such a huge amount of data means that an organized and manageable process is key, and this book will help you achieve that.

The recipes in this practical cookbook start with setting up a fully distributed HBase cluster and moving data into it. You will learn how to use all of the tools for day-to-day administration tasks, as well as how to efficiently manage and monitor the cluster to achieve the best performance possible. Understanding the relationship between Hadoop and HBase will allow you to get the best out of HBase, so the book will also show you how to set up Hadoop clusters, configure Hadoop to cooperate with HBase, and tune its performance.

Setting up Hadoop


A fully distributed HBase runs on top of HDFS. In a fully distributed HBase cluster installation, the HBase master daemon (HMaster) typically runs on the same server as the HDFS master node (NameNode), while the HBase slave daemon (HRegionServer) runs on the same servers as the HDFS slave nodes, which are called DataNodes.

Hadoop MapReduce is not required by HBase, so the MapReduce daemons do not need to be started. We will cover the setup of MapReduce in this recipe too, in case you would like to run MapReduce on HBase. For a small Hadoop cluster, we usually run the MapReduce master daemon (JobTracker) on the NameNode server, and the MapReduce slave daemons (TaskTracker) on the DataNode servers.

This recipe describes the setup of Hadoop. We will have one master node (master1) running NameNode and JobTracker, and three slave nodes (slave1 to slave3), each running DataNode and TaskTracker.

Getting ready

You will need four small EC2 instances, which can be obtained by using the following command:

$ ec2-run-instances ami-77287b32 -t m1.small -n 4 -k your-key-pair

All these instances must be set up properly, as described in the previous recipe, Getting ready on Amazon EC2. Besides the NTP and DNS setups, Java must also be installed on all servers.
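As a quick sanity check (a sketch only; the Java path /usr/local/jdk1.6 is the location assumed later in this recipe), the following commands, run on each server, confirm the prerequisites:

$ /usr/local/jdk1.6/bin/java -version
$ ntpq -p
$ hostname -f

java -version should print the installed JDK version, ntpq -p should list the NTP servers the node synchronizes against, and hostname -f should resolve to the fully qualified name you configured during the DNS setup.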

We will use the hadoop user as the owner of all Hadoop daemons and files, and store all Hadoop files and data under /usr/local/hadoop. Add the hadoop user and create the /usr/local/hadoop directory on all the servers in advance.
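The exact commands depend on your own user management conventions; a minimal sketch, run as a user with sudo privileges on every server, might look like the following:

$ sudo useradd -m hadoop
$ sudo passwd hadoop
$ sudo mkdir -p /usr/local/hadoop
$ sudo chown -R hadoop:hadoop /usr/local/hadoop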

We will set up one Hadoop client node as well. We will use client1, which we set up in the previous recipe. Therefore, the Java installation, hadoop user, and directory should be prepared on client1 too.

How to do it...

Here are the steps to set up a fully distributed Hadoop cluster:

  1. In order to log in to all nodes of the cluster via SSH, generate an SSH key pair for the hadoop user on the master node:

    hadoop@master1$ ssh-keygen -t rsa -N ""
    
    • This command will create a public key for the hadoop user on the master node, at ~/.ssh/id_rsa.pub.

  2. On all slave and client nodes, add the hadoop user's public key to allow SSH login from the master node:

    hadoop@slave1$ mkdir ~/.ssh
    hadoop@slave1$ chmod 700 ~/.ssh
    hadoop@slave1$ cat >> ~/.ssh/authorized_keys
    
  3. Copy the hadoop user's public key you generated in step 1, and paste it into ~/.ssh/authorized_keys (the cat command above reads from standard input; press Ctrl + D when you have finished pasting). Then, change its permissions as follows:

    hadoop@slave1$ chmod 600 ~/.ssh/authorized_keys
    
  4. Get the latest stable, HBase-supported Hadoop release from Hadoop's official site, http://www.apache.org/dyn/closer.cgi/hadoop/common/. While this chapter was being written, the latest HBase-supported, stable Hadoop release was 1.0.2. Download the tarball and decompress it to our root directory for Hadoop (/usr/local/hadoop), then add a symbolic link and an environment variable:

    hadoop@master1$ tar xzf hadoop-1.0.2.tar.gz -C /usr/local/hadoop
    hadoop@master1$ cd /usr/local/hadoop
    hadoop@master1$ ln -s hadoop-1.0.2 current
    hadoop@master1$ export HADOOP_HOME=/usr/local/hadoop/current
    
  5. Create the following directories on the master node:

    hadoop@master1$ mkdir -p /usr/local/hadoop/var/dfs/name
    hadoop@master1$ mkdir -p /usr/local/hadoop/var/dfs/data
    hadoop@master1$ mkdir -p /usr/local/hadoop/var/dfs/namesecondary
    
  6. Create the MapReduce working directory as well; you can skip this step if you don't use MapReduce:

    hadoop@master1$ mkdir -p /usr/local/hadoop/var/mapred
    
  7. Set up JAVA_HOME in Hadoop's environment setting file (hadoop-env.sh):

    hadoop@master1$ vi $HADOOP_HOME/conf/hadoop-env.sh
    export JAVA_HOME=/usr/local/jdk1.6
    
  8. Add the hadoop.tmp.dir property to core-site.xml:

    hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml
    <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/var</value>
    </property>
    
  9. Add the fs.default.name property to core-site.xml:

    hadoop@master1$ vi $HADOOP_HOME/conf/core-site.xml
    <property>
    <name>fs.default.name</name>
    <value>hdfs://master1:8020</value>
    </property>
    
  10. If you need MapReduce, add the mapred.job.tracker property to mapred-site.xml:

    hadoop@master1$ vi $HADOOP_HOME/conf/mapred-site.xml
    <property>
    <name>mapred.job.tracker</name>
    <value>master1:8021</value>
    </property>
    
  11. Add a slave server list to the slaves file:

    hadoop@master1$ vi $HADOOP_HOME/conf/slaves
    slave1
    slave2
    slave3
    
  12. Sync all Hadoop files from the master node to the client and slave nodes. Don't sync ${hadoop.tmp.dir} after the initial installation:

    hadoop@master1$ rsync -avz /usr/local/hadoop/ client1:/usr/local/hadoop/
    hadoop@master1$ for i in 1 2 3
    do rsync -avz /usr/local/hadoop/ slave$i:/usr/local/hadoop/
    sleep 1
    done
    
  13. You need to format NameNode before starting Hadoop. Do it only for the initial installation:

    hadoop@master1$ $HADOOP_HOME/bin/hadoop namenode -format
    
  14. Start HDFS from the master node:

    hadoop@master1$ $HADOOP_HOME/bin/start-dfs.sh
    
  15. You can access your HDFS by typing the following command:

    hadoop@master1$ $HADOOP_HOME/bin/hadoop fs -ls /
    
    • You can also view the HDFS admin page in your browser. Make sure port 50070 is open. The HDFS admin page can be viewed at http://master1:50070/dfshealth.jsp.

  16. Start MapReduce from the master node, if needed:

    hadoop@master1$ $HADOOP_HOME/bin/start-mapred.sh
    
    • Now you can access the MapReduce admin page in your browser. Make sure port 50030 is open. The MapReduce admin page can be viewed at http://master1:50030/jobtracker.jsp.

  17. To stop HDFS, execute the following command from the master node:

    hadoop@master1$ $HADOOP_HOME/bin/stop-dfs.sh
    
  18. To stop MapReduce, execute the following command from the master node:

    hadoop@master1$ $HADOOP_HOME/bin/stop-mapred.sh
    

How it works...

To start and stop daemons on the remote slaves from the master node, a passwordless SSH login for the hadoop user is required. We set this up in steps 1 to 3.
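To confirm that the passwordless login works, an SSH command issued from the master node should complete without prompting for a password. If your system provides the ssh-copy-id utility, it can be used instead of the manual copy-and-paste in steps 2 and 3; both are sketched here:

    hadoop@master1$ ssh slave1 hostname          # should print "slave1" without a password prompt
    hadoop@master1$ ssh-copy-id hadoop@slave1    # optional alternative to editing authorized_keys by hand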

HBase must run on an HDFS that provides a durable sync implementation. If HBase runs on an HDFS without durable sync, it may lose data when its slave servers go down. Hadoop 0.20.205 and later releases, including the 1.0.2 release we have chosen, support this feature.
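Depending on the exact Hadoop release, durable sync may also have to be enabled explicitly. For the Hadoop 1.0.x line, the HBase documentation of that time recommended setting dfs.support.append to true in hdfs-site.xml on every node; treat this sketch as a pointer to verify against the documentation of the versions you deploy, not as a required part of this recipe's steps:

    hadoop@master1$ vi $HADOOP_HOME/conf/hdfs-site.xml
    <property>
    <name>dfs.support.append</name>
    <value>true</value>
    </property>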

HDFS and MapReduce use the local filesystem to store their data. We created the directories required by Hadoop in steps 5 and 6, and pointed Hadoop at them by setting hadoop.tmp.dir in core-site.xml in step 8.
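This works because, in the Hadoop 1.x default configuration, the storage locations are all derived from ${hadoop.tmp.dir}; roughly, the relevant defaults are:

    dfs.name.dir      = ${hadoop.tmp.dir}/dfs/name            (NameNode metadata)
    dfs.data.dir      = ${hadoop.tmp.dir}/dfs/data            (DataNode blocks)
    fs.checkpoint.dir = ${hadoop.tmp.dir}/dfs/namesecondary   (SecondaryNameNode checkpoints)
    mapred.local.dir  = ${hadoop.tmp.dir}/mapred/local        (MapReduce intermediate data)

If you later want to place the data elsewhere, for example on dedicated disks, these properties can be set explicitly instead of relying on hadoop.tmp.dir.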

In steps 9 to 11, we set up Hadoop so it could find HDFS, JobTracker, and slave servers. Before starting Hadoop, all Hadoop directories and settings need to be synced with the slave servers. The first time you start Hadoop (HDFS), you need to format NameNode. Note that you should only do this at the initial installation.

At this point, you can start/stop Hadoop using its start/stop script. Here we started/stopped HDFS and MapReduce separately, in case you don't require MapReduce. You can also use $HADOOP_HOME/bin/start-all.sh and stop-all.sh to start/stop HDFS and MapReduce using one command.
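As a quick smoke test after starting the cluster, you can write a file into HDFS and, if you started MapReduce, run one of the bundled example jobs. This is a sketch; the examples JAR name below assumes the Hadoop 1.0.2 tarball layout, so adjust it to the release you installed:

    hadoop@master1$ $HADOOP_HOME/bin/hadoop fs -mkdir /test
    hadoop@master1$ $HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/conf/core-site.xml /test/
    hadoop@master1$ $HADOOP_HOME/bin/hadoop fs -ls /test
    hadoop@master1$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-1.0.2.jar pi 2 10

The fs commands confirm that HDFS accepts reads and writes, and the pi job confirms that the JobTracker can schedule tasks on the TaskTrackers.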