
Managing clusters


In an HBase ecosystem, it is essential to monitor the cluster in order to control and improve its performance and state as it grows. Because HBase sits on top of the Hadoop ecosystem and serves real-time user traffic, we need to see the performance of the cluster at any given point in time; this allows us to detect problems early and take corrective action before they escalate.

Getting ready

It is important to know some details about Ganglia and its distributed components before we get into the details of managing clusters.

gmond

This is the Ganglia Monitoring Daemon, a low-footprint service that needs to be installed on each node from which we want to pull metrics. This daemon is the actual workhorse; it collects data about each host using a listen/announce protocol and gathers core metrics such as disk, active processes, network, memory, and CPU/vCPUs.
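
Once gmond is running on a node, you can see the metrics it announces by reading the XML it serves on its default TCP port, 8649; a quick check, assuming nc (netcat) is installed:

    # dump the XML that gmond serves on TCP port 8649 (run this on any node with gmond installed)
    nc localhost 8649 | head -n 40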

gmetad

This is the Ganglia Meta Daemon, a service that collects data from other gmetad and gmond instances and aggregates it into a single meta-cluster image. The data is stored in RRD (round-robin database) files and exposed as XML, which is what client applications browse.
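
gmetad, in turn, serves the aggregated grid XML on its own TCP port (8651 by default) and writes its RRD files under the rrds directory that we create later in this recipe; a quick sketch of inspecting both, assuming nc is installed:

    nc localhost 8651 | head -n 40    # aggregated grid and cluster XML served by gmetad
    ls /var/lib/ganglia/rrds          # one sub-directory per cluster, then one per host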

gweb

It's a PHP-based web interface that provides a view of the data collected by the two services described earlier. It requires the following (a quick way to verify these follows the list):

  • Apache web server

  • PHP 5.2 or later

  • The PHP json extension
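
    Before moving on, you can verify these prerequisites on a Debian/Ubuntu host (the same distribution family assumed in the installation steps below); a minimal check, assuming the php5-cli package is installed:

    apache2 -v              # Apache web server version
    php -v                  # should report PHP 5.2 or later
    php -m | grep -i json   # confirms the PHP json extension is available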

How to do it…

We will divide this recipe into two parts. In the first part, we will install Ganglia on all the nodes.

Once that is done, we will integrate it with HBase so that the relevant metrics are available.

Ganglia setup

To install Ganglia, it is best to use the prebuilt binary packages available from the vendor distributions; this helps with the prerequisite libraries. Alternatively, it can be downloaded from the Ganglia website at http://sourceforge.net/projects/ganglia/files/latest/download?source=files.

If you prefer to use the command prompt instead of a browser, you can download it with the following command:

wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz

When running wget, enter the command as a single line in your shell. Use sudo in case you don't have write privileges for the current directory, or download the file into /tmp and later copy it to the respective location.
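
For example, a minimal sketch of downloading into /tmp and then copying the tarball to the working directory used in the next steps (the /opt/HbaseB path matches the layout assumed below):

cd /tmp
wget http://downloads.sourceforge.net/project/ganglia/ganglia%20monitoring%20core/3.0.7%20%28Fossett%29/ganglia-3.0.7.tar.gz
sudo cp /tmp/ganglia-3.0.7.tar.gz /opt/HbaseB/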

  1. tar -xzvf ganglia-3.0.7.tar.gz -C /opt/HbaseB

  2. rm -rf ganglia-3.0.7.tar.gz    # deletes the tar file, which is no longer needed

  3. Now let's install the dependencies:

    sudo apt-get -y install build-essential libapr1-dev libconfuse-dev libexpat1-dev python-dev
    

    The -y option means that apt-get won't wait for the user's confirmation; it assumes yes whenever a confirmation question would appear.

  4. Build and install the downloaded and extracted sources:

    cd /opt/HbaseB/ganglia-3.0.7
    ./configure        # the standard configure script on a Linux environment
    make
    sudo make install
    
  5. Once the preceding step is completed, you can generate a default configuration file with:

    gmond --default_config > /etc/gmond.conf    # use "sudo su -" in case there is a privilege issue

    Running sudo su - makes you the root user and allows /etc/gmond.conf and the required system locations to be written.
    
  6. Open /etc/gmond.conf with vi and change the following in the globals section:

    globals {
      user = ganglia    # the user that the gmond daemon runs as
    }
    

    Note

    In case you are using a specific user to perform the Ganglia tasks, change the user value above to that user (we create a dedicated ganglia user in the next step).

  7. The recommendation is to create this user with the following command, and then add a cluster section to /etc/gmond.conf:

    sudo adduser --disabled-login --no-create-home ganglia

    cluster {
      name = "HbaseB"                                         # the name of your cluster
      owner = "HbaseB Company"
      url = "http://yourHbaseBMaster.ganglia-monitor.com/"    # URL of the main monitor or its CNAME
    }
    
  8. The default setup uses UDP multicast, which is good for fewer than 120 nodes. For more than 120 nodes, we have to switch to unicast.

    The setup is as follows:

    Change the following in /etc/gmond.conf:

    udp_send_channel {
      # mcast_join = <your multicast IP address>    # commented out to switch to unicast
      host = yourHbaseBMaster.ganglia-monitor.com
      port = 8649
      # ttl = 1
    }
    udp_recv_channel {
      # mcast_join = <your multicast IP address>
      port = 8649
      # bind = <your multicast IP address>
    }
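
    For comparison, this is roughly what the stock multicast channels look like before the change (a sketch using Ganglia's shipped defaults; the 239.2.11.71 group and TTL of 1 are the upstream default values):

    udp_send_channel {
      mcast_join = 239.2.11.71    # default multicast group
      port = 8649
      ttl = 1
    }
    udp_recv_channel {
      mcast_join = 239.2.11.71
      port = 8649
      bind = 239.2.11.71
    }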
  9. Start the monitoring daemon with:

    sudo gmond
    

    We can test it with nc <hostname> 8649 or telnet <hostname> 8649.

    Note

    To stop the daemon, you have to kill it. Running ps -ef | grep gmond will give you the process ID; then execute:

    sudo kill -9 <PID>
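
    As a shorter alternative, assuming the standard procps/pkill tools are present, you can stop it in one step:

    sudo pkill gmond    # or: sudo kill -9 $(pgrep gmond)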

  10. Now we have to install the Ganglia meta daemon (gmetad). A single instance is sufficient for a cluster of fewer than 100 nodes. This is the workhorse of the setup and requires a machine with decent compute power, as it is responsible for creating the graphs.

  11. Let's move ahead:

    cd /u/HbaseB/ganglia-3.0.7
    ./configure --with-gmetad
    make
    sudo make install
    sudo cp /u/HbaseB/ganglia-3.0.7/gmetad/gmetad.conf /etc/gmetad.conf
    
  12. Open the file with sudo vi /etc/gmetad.conf and change the following:

    setuid_username "ganglia"
    data_source "HbaseB" yourHbaseBMaster.ganglia-monitor.com
    gridname "<your grid name, say HbaseB Grid>"
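
    The data_source line also accepts an optional polling interval (in seconds) and a space-separated list of hosts, which gives gmetad a fallback if the first host is unreachable; a hedged sketch (the backup host name is a placeholder):

    # data_source "<cluster name>" [polling interval] host1[:port] host2[:port] ...
    data_source "HbaseB" 15 yourHbaseBMaster.ganglia-monitor.com:8649 backup-monitor.ganglia-monitor.com:8649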
  13. Now we need to create the directories that will store the data in round-robin databases (rrds):

    mkdir -p /var/lib/ganglia/rrds
    

    Now let's change the ownership to the ganglia user, so that it can read and write as needed.

    chown -R ganglia:ganglia /var/lib/ganglia/
    
  14. Let's start the daemon:

    gmetad
    

    Note

    To stop the daemon, you have to kill it. Running ps -ef | grep gmetad will give you the process ID; then execute:

    sudo kill -9 <PID>

  15. Now, let's focus on Ganglia web.

    sudo apt-get -y install rrdtool apache2 php5-mysql libapache2-mod-php5 php5-gd
    

    Tip

    Note that this will install rrdtool (the round-robin database tool), Apache (httpd), the PHP 5 Apache module, the php5-mysql drivers, and so on.

  16. Copy the PHP-based web files to the following location:

    cp -r /u/HbaseB/ganglia-3.0.7/web /var/www/ganglia
    sudo /etc/init.d/apache2 restart    # other arguments that can be used are status and stop
    
  17. Point your browser to http://HbaseBMaster.ganglia-monitor.com/ganglia. You should start seeing the basic graphs; the HBase-specific metrics will not appear yet, as the HBase setup is still not done.
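
    You can also confirm that the web interface responds from the command line; a quick check, assuming curl is installed:

    curl -sI http://HbaseBMaster.ganglia-monitor.com/ganglia/ | head -n 1    # expect an HTTP 200 response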

  18. Integrate HBase and Ganglia by editing the metrics configuration file:

    vi /u/HbaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties
    
  19. Change the following parameters to publish the different statuses to Ganglia:

    # push HBase metrics (HMaster and region server statistics, plus the jvm
    # context with memory used and thread counts, and the rpc context with
    # metrics on each HBase RPC method invocation) to the Ganglia gmond at master2:8649;
    # GangliaSink30 matches the Ganglia 3.0.x version installed in this recipe
    hbase.extendedperiod=3600
    hbase.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink30
    hbase.sink.ganglia.period=5
    hbase.sink.ganglia.servers=master2:8649
    
  20. Copy /u/HbaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties to all the nodes and restart the HMaster and all the region servers.
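
    A minimal sketch of distributing the file and restarting from the master node, assuming passwordless SSH to the region servers and a hypothetical region-servers.txt file listing their hostnames:

    for host in $(cat region-servers.txt); do
      scp /u/HbaseB/hbase-0.98.5-hadoop2/conf/hadoop-metrics2-hbase.properties \
          "$host":/u/HbaseB/hbase-0.98.5-hadoop2/conf/
    done
    # restart the HMaster and region servers with the scripts shipped with HBase
    /u/HbaseB/hbase-0.98.5-hadoop2/bin/stop-hbase.sh
    /u/HbaseB/hbase-0.98.5-hadoop2/bin/start-hbase.sh
    # after the restart, HBase metric names should start appearing in the gmond XML
    nc master2 8649 | grep -i -m 5 regionserver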

How it works…

As the system grows from a few nodes to tens or hundreds, or becomes a very large cluster of more than a few hundred nodes, it is pivotal to have a holistic view, a drill-down view, and a historical view of the metrics at any given point in time, in a graphical representation. In a large or very large installation, administrators are also concerned about redundancy, which avoids a single point of failure. HBase and the underlying HDFS are designed to handle node failures gracefully, but it is equally important to monitor these failures, as they can pull down the cluster if corrective action is not taken in time. HBase exposes various metrics to JMX and Ganglia, such as HMaster and region server statistics, JVM (Java Virtual Machine) details, RPC (remote procedure call) metrics, and Hadoop/HDFS and MapReduce details. Taking all these points and various other salient and powerful features into consideration, we chose Ganglia.
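
The same metrics are also visible over JMX through the HBase web UIs, which is a handy cross-check that HBase is publishing what Ganglia should be graphing; a quick sketch, assuming the default HBase 0.98 info ports (60010 for the master, 60030 for a region server) and that curl is installed:

    curl -s http://master2:60010/jmx | head -n 20    # JMX metrics from the HMaster
    curl -s http://master2:60030/jmx | head -n 20    # JMX metrics from a region server on the same host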

Ganglia provides the following advantages:

  1. It provides near-real-time monitoring for all the vital information of a very large cluster.

  2. It runs on commodity hardware and is suited to most popular operating systems.

  3. It is open source and relatively easy to install.

  4. It integrates easily with traditional monitoring systems.

  5. It provides an overall view of all nodes in a grid and all nodes in the cluster.

  6. The monitored data is available in both textual and graphic format.

  7. It works on a multicast listen/announce protocol.

  8. It works with open standards:

    • JSON

    • XML

    • XDR

    • RRDTool

    • APR – Apache portable runtime

    • Apache HTTPD server

    • PHP-based web interface

HBase works only with Ganglia versions 3.0.x and higher; hence, we used version 3.0.7.

In step 3, we installed the library dependencies that are required to compile Ganglia.

In step 4, we compiled and installed Ganglia by running the configure command, followed by make and then make install.

In step 5, we generated a default gmond.conf file, and in steps 6 to 8 we changed its settings to point to the HBase master node. We also configured port 8649 and the ganglia user that can read from the cluster. By commenting out the multicast address and the TTL (time to live), we changed the default UDP-based multicasting to unicasting, which enables us to expand the cluster beyond 120 nodes, and we added the master gmond node to this config file.

In step 9, we started gmond and got core monitoring metrics such as CPU, disk, network, memory, and load average of the nodes.

In step 11, we went back to /u/HbaseB/ganglia-3.0.7/ and reran the configuration, but this time we added --with-gmetad so that gmetad is also compiled.

In the same step, we also copied the configuration file from /u/HbaseB/ganglia-3.0.7/gmetad/gmetad.conf to /etc/gmetad.conf.

In step 12, we added the ganglia user and the master details in the data_source line: data_source "HbaseB" yourHbaseBMaster.ganglia-monitor.com.

In steps 13 and 14, we created the rrds directory that holds the data in round-robin databases, and later on we started the gmetad daemon on the master node.

In step 15, we installed all the dependencies required to run the web interface.

In step 16, we copied the PHP web files from the existing location (/u/HbaseB/ganglia-3.0.7/web) to /var/www/ganglia.

In steps 16 and 17, we restarted the Apache instance and saw the basic graphs, which provide details of the nodes and hosts but not the HBase details yet. We also copied the configuration to all the nodes so that they share a similar setup and the Ganglia master receives data from the child nodes.

In steps 18 and 19, we changed the settings in hadoop-metrics2-hbase.properties so that HBase starts collecting metrics and sending them to the Ganglia server on port 8649. The class responsible for pushing these metrics to Ganglia is the Ganglia metrics sink (org.apache.hadoop.metrics2.sink.ganglia.GangliaSink30) and its properties.

Now we point the browser at the URL of the master, and once the page is rendered, it starts showing the master and region server graphs, including the following:

  • Memory and CPU usage

  • JVM details (GC cycle, memory consumed by JVM, threads used, heap consumed, and so on)

  • HBase Master details

  • HBase Region compaction queue details

  • Region server flush queue utilizations

  • Region servers IO

There's more…

Ganglia is used for monitoring very large clusters, and in the world of Hadoop/HBase it can be very useful, as it provides the following:

  • JVM

  • HDFS

  • MapReduce

  • Region compaction time

  • Region store files

  • Region block cache hit ratio

  • Master split size

  • Master split number of operations

  • Region block free

  • NameNode activities

  • Secondary NameNode details

  • Disk status