In the HBase ecosystem, it is essential to monitor the cluster in order to track and improve its performance and state as it grows. Because HBase sits on top of the Hadoop ecosystem and serves real-time user traffic, we need to see how the cluster is performing at any given point in time; this allows us to detect problems well in advance and take corrective action before they escalate.
It is important to know some of the details of Ganglia and its distributed components before we get into the details of managing clusters.
gmond is an acronym for the low-footprint Ganglia Monitoring Daemon. This service needs to be installed on each node from which we want to pull metrics. The daemon is the actual workhorse: it collects data about each host using a listen/announce protocol, and it gathers core metrics such as disk, active processes, network, memory, and CPU/VCPUs.
We will divide the how-to-do-it part into two sections. In the first section, we will talk about installing Ganglia on all the nodes.
Once that is done, we will integrate it with HBase so that the relevant metrics become available.
To install Ganglia, it is best to use the prebuilt binary packages available from the vendor distributions; this helps in dealing with the prerequisite libraries. Alternatively, it can be downloaded from the Ganglia website: http://sourceforge.net/projects/ganglia/files/latest/download?source=files.
If you are working from the command prompt rather than a browser, you can download it with the following command:
wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz
When running wget, keep the command on a single line in your shell. Use sudo if you don't have write privileges for the current directory, or download it to /tmp and copy it to the target location later.
tar -xzvf ganglia-3.0.7.tar.gz -C /opt/HBaseB
rm -rf ganglia-3.0.7.tar.gz
This deletes the tar file, which is no longer needed. Now let's install the dependencies:
sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev
The -y option means that apt-get won't wait for the user's confirmation; it assumes yes whenever a confirmation prompt would appear. Next, build and install from the downloaded and extracted sources:
cd /opt/HBaseB/ganglia-3.0.7
./configure
make
sudo make install
Once the preceding step is completed, you can generate a default configuration file with:
gmond --default_config > /etc/gmond.conf
Use sudo su - in case there is a privilege issue; it makes you the root user and allows gmond.conf to be written to the system location.
vi /etc/gmond.conf
and change the following, setting the user to ganglia in place of the default:
globals {
  user = ganglia
}
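If you prefer scripting the change, the user line can be switched with sed. This is a minimal sketch run against a sample file; the real target is /etc/gmond.conf, and the default user value (nobody) is an assumption:

```shell
# Build a sample globals block like the one gmond generates (default user assumed to be "nobody")
cat > /tmp/gmond-sample.conf <<'EOF'
globals {
  daemonize = yes
  user = nobody
}
EOF

# Point the user at the ganglia account used for monitoring
sed -i 's/^\( *user *= *\).*/\1ganglia/' /tmp/gmond-sample.conf

grep 'user' /tmp/gmond-sample.conf
```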
The recommendation is to create this user with the following command:
sudo adduser --disabled-login --no-create-home ganglia

Then set up the cluster block:
cluster {
  name = "HBase B"                                      # the name of your cluster
  owner = "HBase B Company"
  url = "http://yourHBasebMaster.ganglia-monitor.com/"  # URL of the main monitor or the CNAME
}
The default UDP multicast setup is good for fewer than 120 nodes. For more than 120 nodes, we have to switch to unicast.
The setup is as follows:
Change in /etc/gmond.conf:
udp_send_channel {
  # mcast_join = <the IP address to join>
  host = yourHBasebMaster.ganglia-monitor.com
  port = 8649
  # ttl = 1
}
udp_recv_channel {
  # mcast_join = <the IP address to join>
  port = 8649
  # bind = <the IP address to join>
}
Start the monitoring daemon with:
sudo gmond
We can test it with:
nc <hostname> 8649
or:
telnet <hostname> 8649
Now we have to install the Ganglia meta daemon (gmetad). A single gmetad instance is sufficient for a cluster of fewer than 100 nodes. It requires a powerful machine with decent compute power, as it is responsible for aggregating the data and creating the graphs.
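The gmond reachability check above can also be wrapped in a small helper. This sketch uses bash's /dev/tcp virtual device instead of nc or telnet, so it works even when neither tool is installed; the hostname and port are the ones configured earlier:

```shell
# Print OPEN if something is listening on host:port, CLOSED otherwise (uses bash's /dev/tcp)
check_gmond() {
  if ( exec 3<>"/dev/tcp/$1/$2" ) 2>/dev/null; then
    echo OPEN
  else
    echo CLOSED
  fi
}

check_gmond localhost 8649   # prints OPEN once gmond is running on this host, CLOSED otherwise
```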
Let's move ahead:
cd /u/HBaseB/ganglia-3.0.7
./configure --with-gmetad
make
sudo make install
sudo cp /u/HBaseB/ganglia-3.0.7/gmetad/gmetad.conf /etc/gmetad.conf
Open it using
sudo vi /etc/gmetad.conf
and change the following:
setuid_username "ganglia"
data_source "HBase B" yourHBasebMaster.ganglia-monitor.com
gridname "<your grid name, say HBase B Grid>"
Now we need to create directories, which will store data in a round-robin database (rrds):
mkdir -p /var/lib/ganglia/rrds
Now let's change the ownership to the ganglia user, so that it can read and write as needed:
chown -R ganglia:ganglia /var/lib/ganglia/
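The same mkdir/chown pattern can be rehearsed in /tmp before touching the real path. The directory below is a stand-in for /var/lib/ganglia/rrds; the actual chown to ganglia:ganglia needs root and requires the ganglia account to exist:

```shell
# Create the nested directory in one shot, as -p does for /var/lib/ganglia/rrds
mkdir -p /tmp/ganglia-demo/rrds

# Inspect the owner, which must match the user gmetad writes RRD files as
stat -c '%U' /tmp/ganglia-demo/rrds
```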
Let's start the daemon:
gmetad
Now, let's focus on Ganglia web.
sudo apt-get -y install rrdtool apache2 php5-mysql libapache2-mod-php5 php5-gd
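Before copying the web files, it is worth confirming that the binaries from these packages landed on the PATH. A small sketch; the binary names are the Debian/Ubuntu ones and may differ on other distributions:

```shell
# Report which pieces of the web stack are installed
for bin in rrdtool apache2 php5; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: found"
  else
    echo "$bin: missing"
  fi
done
```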
Copy the PHP-based files to the following location:
cp -r /u/HBaseB/ganglia-3.0.7/web /var/www/ganglia
sudo /etc/init.d/apache2 restart
(Other arguments that can be used are status and stop.)
Point your browser at http://HBasebMaster.ganglia-monitor.com/ganglia; you should start seeing the basic graphs, though no HBase metrics yet, as the HBase setup is still not done. Now let's integrate HBase and Ganglia:
vi /u/HBaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties
Change the following parameters to get the different statuses into Ganglia:
hbase.extendedperiod = 3600
hbase.class = org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period = 5
hbase.servers = master2:8649

# The jvm context provides memory used, thread count in the JVM, and so on.
jvm.class = org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period = 5
jvm.servers = master2:8649

# Enable the rpc context to see metrics on each HBase RPC method invocation.
rpc.class = org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period = 5
rpc.servers = master2:8649
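Note that newer HBase releases use the metrics2 framework, where the Ganglia hookup is configured as a sink rather than a context. A hedged sketch of the equivalent metrics2-style entries (the sink name "ganglia" and the period value here are illustrative):

```properties
hbase.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
hbase.sink.ganglia.servers=master2:8649
hbase.sink.ganglia.period=10
```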
Copy /u/HBaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties to all the nodes, and restart the HMaster and all the region servers.
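Distributing the file can be scripted. Below is a dry run that only prints the commands; the node names are placeholders for your region server hosts, and dropping the leading echo performs the real copy via scp:

```shell
# Hypothetical list of region server hostnames
NODES="region1 region2 region3"
CONF=/u/HBaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties

# Echo the scp command for each node instead of executing it (dry run)
for node in $NODES; do
  echo scp "$CONF" "$node:$CONF"
done
```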
As the system grows from a few nodes to tens or hundreds, or becomes a very large cluster with more than a hundred nodes, it is pivotal to have a holistic view, a drill-down view, and a historical view of the logs at any given point in time, in a graphical representation. In a large or very large installation, administrators are more concerned about redundancy, which avoids a single point of failure. HBase and the underlying HDFS are designed to handle node failures gracefully, but it is equally important to monitor these failures, as they can bring down the cluster if corrective action is not taken in time. HBase exposes various metrics to JMX and Ganglia, such as HMaster and region server statistics, JVM (Java Virtual Machine) details, RPC (remote procedure call) details, and Hadoop/HDFS and MapReduce details. Taking into consideration all these points and various other salient and powerful features, we chose Ganglia.
Ganglia provides the following advantages:
It provides near-real-time monitoring for all the vital information of a very large cluster.
It runs on commodity hardware and is suited to most popular operating systems.
It's open source and relatively easy to install.
It integrates easily with traditional monitoring systems.
It provides an overall view of all nodes in a grid and all nodes in the cluster.
The monitored data is available in both textual and graphic format.
It works on a multicast listen/announce protocol.
It works with open standards, such as:
JSON
XML
XDR
RRDTool
APR (Apache Portable Runtime)
Apache HTTPD server
PHP-based web interface
HBase works only with Ganglia versions 3.0.x and higher, hence we used version 3.0.7.
In step 4, we installed the library dependencies required to compile Ganglia.
In step 5, we compiled and installed Ganglia by running the configure command, followed by make and then make install.
In step 6, we created the gmond.conf file, and later, in step 7, we changed its settings to point to the HBase master node. We also configured the port to 8649, with a ganglia user who can read from the cluster. By commenting out the multicast address and the TTL (time to live), we changed the default UDP-based multicasting to unicasting, which enables us to expand the cluster beyond 120 nodes. We also added a master gmond node in this config file.
In step 8, we started gmond and got some core monitoring data, such as the CPU, disk, network, memory, and load average of the nodes.
In step 9, we went back to /u/HBaseB/ganglia-3.0.7/ and reran the configuration, but this time we added the --with-gmetad option to configure so that it compiles gmetad as well.
In step 11, we copied gmetad.conf from /u/HBaseB/ganglia-3.0.7/gmetad/gmetad.conf to /etc/gmetad.conf.
In step 12, we added the ganglia user and the master details in the data_source line: data_source "HBase B" yourHBasebMaster.ganglia-monitor.com.
In steps 13 and 14, we created the rrds directory, which holds the data in round-robin databases; later on, we started the gmetad daemon on the master node.
In step 15, we installed all the dependencies required to run the web interface.
In step 16, we copied the PHP-based web files from the existing location (/u/HBaseB/ganglia-3.0.7/web) to /var/www/ganglia.
In step 17, we restarted the Apache instance and saw all the basic graphs, which provide details of the nodes and hosts but not of HBase. We also copied the configuration to all the nodes so that they are configured alike and the Ganglia master receives data from the child nodes.
In step 18, we changed the settings in hadoop-metrics2-hbase.properties so that HBase starts collecting metrics and sending them to the Ganglia servers on port 8649. The classes configured through the hbase.class, jvm.class, and rpc.class properties, together with their period and servers settings, are responsible for publishing these details.
Now we point at the URL of the master, and once the page is rendered, it starts showing the graphs, as described by the image HBase-Ganglia-MasterAndRegion01-01.png. It starts showing the following graphs:
Memory and CPU usage
JVM details (GC cycle, memory consumed by JVM, threads used, heap consumed, and so on)
HBase Master details
HBase Region compaction queue details
Region server flush queue utilizations
Region servers IO
Ganglia is used for monitoring very large clusters, and in the world of Hadoop/HBase it can be very useful, as it provides the following:
JVM
HDFS
MapReduce
Region compaction time
Region store files
Region block cache hit ratio
Master split size
Master split number of operations
Region block free
NameNode activities
Secondary NameNode details
Disk status