There are some basic settings we should tune before moving forward. These are basic but important Hadoop (HDFS), ZooKeeper, and HBase settings that you should consider changing immediately after setting up your cluster.
Some of these settings are required for data durability or cluster availability and must be configured, while others are recommended configurations for running HBase smoothly.
Configuration settings depend on your hardware, data, and cluster size. We will describe a guideline in this recipe. You may need to change the settings to fit your environment.
Every time you make changes, you need to sync the configuration files to all clients and slave nodes, then restart the respective daemons to apply the changes.
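As a sketch, the sync can be scripted with a simple loop over the slave nodes. The hostnames and path below are assumptions; the rsync command is echoed as a dry run so you can review it before running it for real:

```shell
# Sketch: push the local HBase configuration to each slave node.
# "slave1 slave2 slave3" and HBASE_HOME are assumptions; replace them
# with your actual hostnames and installation path.
HBASE_HOME=${HBASE_HOME:-/usr/local/hbase}
SLAVES="slave1 slave2 slave3"
for host in $SLAVES; do
  # Dry run: prints the command. Remove 'echo' to actually sync.
  echo rsync -az "$HBASE_HOME/conf/" "$host:$HBASE_HOME/conf/"
done
```

The same loop works for the Hadoop and ZooKeeper configuration directories; only the paths change.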
The configurations that should be considered for change are as follows:
1. Turn on dfs.support.append for HDFS. The dfs.support.append property determines whether HDFS should support the append (sync) feature or not. The default value is false. It must be set to true, or you may lose data if a region server crashes:

hadoop$ vi $HADOOP_HOME/conf/hdfs-site.xml
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
2. Increase the dfs.datanode.max.xcievers value to have DataNode keep more threads open, to handle more concurrent requests:

hadoop$ vi $HADOOP_HOME/conf/hdfs-site.xml
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
3. Increase ZooKeeper's heap memory size so that it does not swap:

hadoop$ vi $ZK_HOME/conf/java.env
export JAVA_OPTS="-Xms1000m -Xmx1000m"
4. Increase ZooKeeper's maximum client connection number to handle more concurrent requests:
hadoop$ echo "maxClientCnxns=60" >> $ZK_HOME/conf/zoo.cfg
5. Increase HBase's heap memory size to run HBase smoothly:

hadoop$ vi $HBASE_HOME/conf/hbase-env.sh
export HBASE_HEAPSIZE=8000
6. Decrease the zookeeper.session.timeout value so that HBase can detect a crashed region server quickly and recover it in a short time:

hadoop$ vi $HBASE_HOME/conf/hbase-site.xml
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
7. To change the Hadoop/ZooKeeper/HBase log settings, edit the log4j.properties file and the hadoop-env.sh/hbase-env.sh files under the conf directory of the Hadoop/ZooKeeper/HBase installation. It is better to move the log directory outside of the installation folder. For example, the following configures HBase to generate its logs under the /usr/local/hbase/logs directory:

hadoop$ vi $HBASE_HOME/conf/hbase-env.sh
export HBASE_LOG_DIR=/usr/local/hbase/logs
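Hadoop and ZooKeeper support analogous settings. As a sketch (the directory paths are assumptions; pick a partition with enough free space), the corresponding environment variables are:

```shell
# hadoop-env.sh fragment: relocate the Hadoop daemon logs.
export HADOOP_LOG_DIR=/usr/local/hadoop/logs

# ZooKeeper reads ZOO_LOG_DIR from the environment of the user
# starting the server (where you set it depends on your distribution).
export ZOO_LOG_DIR=/usr/local/zookeeper/logs
```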
In step 1, by turning on dfs.support.append, the HDFS flush is enabled. With this feature enabled, a writer of HDFS can guarantee that data is persisted by invoking a flush call. Thus, HBase can guarantee that when a region server dies, data can be recovered and replayed on other region servers using its Write-Ahead Log (WAL).
To verify whether HDFS append is supported, check the HMaster log of the HBase startup. If append is not enabled, you will find a log entry like the following:
$ grep -i "HDFS-200" hbase-hadoop-master-master1.log
...syncFs -- HDFS-200 -- not available, dfs.support.append=false
For step 2, we configured the dfs.datanode.max.xcievers setting, which specifies the upper bound on the number of files an HDFS DataNode will serve at any one time.
Note
Note that the name is xcievers; it is a misspelled name. Its default value is 256, which is too low for running HBase on HDFS.
Steps 3 and 4 are about ZooKeeper settings. ZooKeeper is very sensitive to swapping, which will seriously degrade its performance. ZooKeeper's heap size is set in the java.env file. ZooKeeper also has an upper bound on the number of connections it will serve at any one time. Its default is 10, which is too low for HBase, especially when running MapReduce jobs over it. We suggest setting it to 60.
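To see how close you are to the connection limit, ZooKeeper's four-letter administrative command cons lists one client connection per line. The sketch below wraps the counting in a small function; the hostname master1 and the exact output format are assumptions you should verify against your ensemble:

```shell
# Count connections in ZooKeeper 'cons' output; each connection line
# contains a "queued=" statistics field.
count_cons() { grep -c 'queued=' ; }

# Against a live ensemble (hostname is an assumption):
#   echo cons | nc master1 2181 | count_cons
```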
In step 5, we configured HBase's heap memory size. HBase ships with a default heap size of 1 GB, which is too low for modern machines. A reasonable value for large machines is 8 GB or larger, but under 16 GB.
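Heap size is usually tuned together with garbage collection, since long GC pauses can cause the ZooKeeper session problems described in the next step. As a sketch (the GC flags and values are common recommendations for HBase of this generation, not part of this recipe), hbase-env.sh might also set:

```shell
# hbase-env.sh fragment (sketch): values are assumptions; tune for your hardware.
export HBASE_HEAPSIZE=8000
# CMS with an early initiating occupancy is commonly used to keep GC pauses
# short, so the region server's ZooKeeper session does not expire mid-pause.
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"
```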
In step 6, we changed ZooKeeper's session timeout to a lower value. A lower timeout means HBase can detect a crashed region server faster, and thus recover the crashed regions on other servers in a short time. On the other hand, with a very short session timeout, there is a risk that an HRegionServer daemon may kill itself under heavy load, because it may not be able to send a heartbeat to ZooKeeper before the timeout expires.