HBase High Performance Cookbook

HBase High Performance Cookbook

By : Ruchir Choudhry

Buy this Book

HBase High Performance Cookbook

By: Ruchir Choudhry

Buy this Book

Overview of this book

Apache HBase is a non-relational NoSQL database management system that runs on top of HDFS. It is an open source, disturbed, versioned, column-oriented store and is written in Java to provide random real-time access to big Data. We’ll start off by ensuring you have a solid understanding the basics of HBase, followed by giving you a thorough explanation of architecting a HBase cluster as per our project specifications. Next, we will explore the scalable structure of tables and we will be able to communicate with the HBase client. After this, we’ll show you the intricacies of MapReduce and the art of performance tuning with HBase. Following this, we’ll explain the concepts pertaining to scaling with HBase. Finally, you will get an understanding of how to integrate HBase with other tools such as ElasticSearch. By the end of this book, you will have learned enough to exploit HBase for boost system performance.

HBase High Performance Cookbook

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Free Chapter

Configuring HBase

Introduction

Configuring and deploying HBase

Using the filesystem

Administering clusters

Managing clusters

Loading Data from Various DBs

Introduction

Extracting data from Oracle

Loading data using Oracle Big data connector

Bulk utilities

Using Hive with Apache HBase

Using Sqoop

Working with Large Distributed Systems Part I

Introduction

Scaling elastically or Auto Scaling with built-in fault tolerance

Auto Scaling HBase using AWS

Works on different VM/physical, cloud hardware

Working with Large Distributed Systems Part II

Snappy

Working with Scalable Structure of tables

Introduction

HBase data model part 1

HBase data model part 2

How HBase truly scales on key and schema design

HBase Clients

Introduction

HBase REST and Java Client

Working with Apache Thrift

Working with Apache Avro

Working with Protocol buffer

Working with Pig and using Shell

Large-Scale MapReduce

Introduction

HBase Performance Tuning

Introduction

Working with infrastructure/operating systems

Working with Java virtual machines

Changing the configuration of components

Working with HDFS

Performing Advanced Tasks on HBase

Machine learning using Hbase

Real-time data analysis using Hbase and Mahout

Full text indexing using Hbase

Optimizing Hbase for Cloud

Introduction

Configuring Hbase for the Cloud

Connecting to an Hbase cluster using the command line

Backing up and restoring Hbase

Terminating an HBase cluster

Accessing HBase data with hive

Viewing the Hbase user interface

Monitoring HBase with CloudWatch

Monitoring Hbase with Ganglia

Case Study

Introduction

Configuring Lily Platform

Integrating elastic search with Hbase

Configuring

Index

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Administering clusters

It's pivotal at this time to know more about the HBase administrative process, as it stores petabyte of data in distributed locations and requires the system to work smoothly agnostic to the location of the data in a multithreaded, easy-to-manage environment in addition to the hardware, OS, JDK and other hardware and software components.

HbBase and administrative GUI provide the current state of the cluster and and plenty of command-line tools, which can really give us an in-depth knowledge about what is going on in the cluster.

Getting ready

We must have a HDFS/Hadoop setup in a cluster or fully distributed mode as the first step, which we did it in the earlier section. The second step is to have an HBase setup in a fully distributed mode.

Tip

It's assumed that we have a setup of password-less communication between the master and the region servers on the HBase side. If you are using the same nodes for Hadoop/Hdfs setup, we need to have the Hadoop user also to have a password-less setup from Namenode to Secondary NameNode, DataNodes, and so on.

We must have a full HBase cluster running on top of hadoop/HDFS cluster.

How to do it…

HBase provides multiple ways to do administration work on the fully distributed clustered environment:

WebUI-based environment
Command line-based environment
HBase Admin UI

The Master run on 60010 ports as default, the web interface is looked up on this port only. This needs to be started on the HMaster node.

The master UI gives a dashboard of what's going on in the HBase cluster.

The UI contains the following details:

HBase home
Table details
Local logs
Log levels
Debug dump
Metric dump
HBase configuration

We will go through it in the following sections.

HBase Home: It contains the dashboard that gives a holistic picture of the HBase cluster.
- The region server
- The backup master
- Tables
- A task
- Software attributes
Region Server: The image shows the region server and also provided a tab view of various metrics like (basic stats, Memory, Request, Storefiles, Compactions):

A Detailed discussion is out of scope at this point. For more details, see later sections. For our purpose, we will avoid to go for backup master:

Let's list all the user-created tables using Tables. It provides the details about User Tables (tables that are created by the users/actors), Catalog Tables (this contains HBase :meta and HBase :namespaces), which is seen in the following figure:

Clicking any table, as listed earlier, other important details are shown such as table attributes, table regions, and region-by-region server details. Actions as compaction and split can be taken using admin UI.

Task provides the details of talk, which is happening. We can see Monitored task, RPC Tasks, RPC Handler task, Active RPC Calls, Client Operations, and a JSON response:

The following Software Attributes page provides the details of the software used in the HBase cluster:

After clicking on zk_dump, it provides further details about the zookeeper Quorum stats as follows:

HBase provides various command-line tools for administrating, debugging, and doing an analysis on the HBase cluster.

The first tool is the HBase shell, and the details are as follows:

bin/HBase 
Usage: HBase [<options>] <command> [<args>]

Identify inconsistencies with hbck; this tool provides consistencies, checks for corruption of data, and it runs against the cluster.

bin/HBase  hbck

This runs against the cluster and provides the details of inconsistencies between the regions and masters.

bin/HBase  hbck –details

The -details provides the insight of splits which happens in all the tables.

  bin/HBase hbck MyClickStream

The preceding line enables us to vView the HFile content in a text format.

   bin/HBase org.apache.hadoop.HBase .io.hfile.HFile

Use FSFLogs for manual splitting and dumping:

HBase org.apache.hadoop.HBase .regionserver.wal.FSHLog --dump 
HBase org.apache.hadoop.HBase .regionserver.wal.FSHLog –split

Enable compressor tool:

HBase org.apache.hadoop.HBase .util.CompressionTest hdfs://host/path/to/HBase snappy

Enable compressor tool on the Column Family or while creating a tables:

HBase > disable 'MyClickStream'
HBase > alter 'MyClickStream', {NAME => 'cf', COMPRESSION => 'GZ'} 
HBase > enable 'MyClickStream'
HBase > create MyClickStream', {NAME =>'cf2', COMPRESSION => 'SNAPPY'}

Load Test too Usage below are some of the commands which can be used to do a quick load test on your compression performance:

bin/HBase org.apache.hadoop.HBase .util.LoadTestTool -h

usage: bin/HBase org.apache.hadoop.HBase .util.LoadTestTool <options>.

Options: includes –batchupdate

-compression: <arg> Compression type , arguments can be LZq,GZ,NONE and SNAPPY

We will limit ourselves with the above commands.

A good example will be as follows:

HBase org.apache.hadoop.HBase .util.LoadTestTool -write 2:20:20 -num_keys 500000 -read 60:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
-write (here we are passing 2 as ->avg_cols_per_key)
20 is the ->avg_data_size>:
20 is number of parallel threads to be used.
-num_key has an integer arguments as 500000, this is the number of keys to read and write.

Now let's look at a read:

-read 60 is the verify percent

30 is the number of threads

-num_tables a positve interger is passed which is the number of tables to be loaded in parallel
-data_block_encoding there are various encoding algorithms which can be passed as an argument, this allow the data block to be encoded based on the need. Some of them  are [NONE,PREFIX,DIFF,FAST_DIFF,PREFIX_TREE].

-tn is a table name prefix which exports the content and data to HDFS in a sequence file using this:

HBase org.apache.hadoop.HBase .mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

Note

You can configure HBase .client.scanner.caching in the job configuration; this is for all the scans.

Importing: This tool will load data that has been exported back into HBase :

This can be done by the following command:

HBase org.apache.hadoop.HBase .mapreduce.Import <tablename> <inputdir>

-tablename: Is the name of the table to be imported by the tool.

-inputdir: Is the input dir which will be used.

Utility to Replay WAL files into HBase :

HBase org.apache.hadoop.HBase .mapreduce.WALPlayer [options] <wal inputdir> <tables> [<tableMappings>]> 
HBase org.apache.hadoop.HBase .mapreduce.WALPlayer /backuplogdir MyClickStream newMyClickStream
walinputdi:  /backuplogdir
tables:  MyClickStream
tableMappings→newMyClickStream
 newMyClickStream

HBase clean: is dangerous and should be avoided in production setup.

HBase clean

Options: as parameters

  --cleanZk   cleans HBase related data from zookeeper.
  --cleanHdfs cleans HBase related data from hdfs.
  --cleanAll  cleans HBase related data from both zookeeper and hdfs.

HBase pe: This is a shortcut to run the performance evaluations tools

HBase ltt: This command is a shortcut provided to run the rg.apache.hadoop.HBase .util.LoadTestTool utility. It was introduced in 0.98 version.

View the details of the table as shown here:

Go to the log tab on the HBase Admin UI home page, and you will see the following details. Alternatively, you can log in to the directory using the Linux shell to tail the logs.

Directory: /logs/:

SecurityAuth.audit 0 bytes  Aug 29, 2014 7:22:00 PM
HBase -hadoop-master-rchoudhry-linux64.com.log 691391 bytes   Sep 2, 2014 11:01:34 AM
HBase -hadoop-master-rchoudhry-linux64.com.out 419 bytes 	Aug 29, 2014 7:31:21 PM
HBase -hadoop-regionserver-rchoudhry-linux64.com.log 1048281 bytes   Sep 2, 2014 11:01:23 AM
HBase -hadoop-regionserver-rchoudhry-linux64com.out 419 bytes   Aug 29, 2014 7:31:23 PM
HBase -hadoop-zookeeper-rchoudhry-linux64.log 149832 bytes   Aug 31, 2014 12:51:42 AM
HBase -hadoop-zookeeper-rchoudhry-linux64.com.out 419 bytes   Aug 29, 2014 7:31:19 PM
HBase -hadoop-zookeeper-rchoudhry-linux64.com.out.1 1146 bytes   Aug 29, 2014 7:26:29 PM

Get and set the log levels as required at runtime:

Log dump

Have a look at what is going in the cluster with Log dump:

Master status for HBase -hadoop-master-rchoudhry-linux64.com,60000,1409365881345 as of Tue Sep 02 11:17:33 PDT 2014
Version Info:
===========================================================
HBase 0.98.5-hadoop2
Subversion file:///var/tmp/0.98.5RC0/HBase -0.98.5 -r Unknown
Compiled by apurtell on Mon Aug  4 23:58:06 PDT 2014
Hadoop 2.2.0
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z


Tasks:
===========================================================
Task: RpcServer.reader=1,port=60000
Status: WAITING:Waiting for a call
Running for 315970s

Task: RpcServer.reader=2,port=60000
Status: WAITING:Waiting for a call
Running for 315969s

Metrics dump

Exposes the JMX details for the following components in a JSON format using Matrix dump:

Start-up progress
Balancer
Assignment Manager
Java Management extension details
Java Runtime Implementation System Properties
Note
These are various system properties; the discussion of it is beyond the scope of this book.

How it works…

When the browser points to the http address http://your-HBase-master:60010/master-status, the web interface of HBase admin is loaded.

Internally, it connects to the Zookeeper and can collect the details from the zookeeper interface, where in zookeeper tracks the Region server as per the region server configuration in the region server file. These are the values set in the HBase -site.xml. The data for hadoo/hdfs, region servers, zookeeper quorum details are continuously looked by the RPC calls, which the master makes via zookeeper.

In the above HBase -site.xml, the user/ sets the Master, Backup master, various other software attributes, the refresh time, memory allocations, storefiles, compactions,request, zk dumps etc

Node or Cluster view: In this, the user chooses either monitoring Hmaster or Region server data view. The HMaster view contains data and graphics about the node status. The Region server view is the main and the most important one because it allows monitoring of all the region server aspects.

You can point the http address to http://your-HBase-master:60030/rs-status. It loads the admin UI for the region servers.

The matrix that is captured here is as follows:

Region server Metrics (Base stats, Memory, request, Hlog, Storefiles, Queues) Tasks happening in Region servers (Monitored, RPC, RPC handler, active RPC .JSON, client operations)
Block Cache provides different options for on-heap and off-heap: LurBlockCache and Bucket are off-heap.

Backup master is a design/architecture choice, which needs careful considerations before enabling it. HBase by design is a fault-tolerant distributed system with assumes hardware failure in the network topology.

However, HBase does provide various options to for it such as:

Data center-level failure
Accidental deletion of records/data
For audit purpose

HBase High Performance Cookbook

By : Ruchir Choudhry

HBase High Performance Cookbook

By: Ruchir Choudhry

Overview of this book

Related Content you might be interested in

Current Title:

HBase High Performance Cookbook

Hadoop 2.x Administration Cookbook

Apache Hadoop 3 Quick Start Guide

Mastering Hadoop 3

Administering clusters

Getting ready

Tip

How to do it…

Note

Log dump

Metrics dump

Note

How it works…

See also