Learning HBase

By: Shashwat Shriparv

HBase in the Hadoop ecosystem


Let's see where HBase sits in the Hadoop ecosystem, where it provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem:

HBase can run as a separate entity on the local file system (which is not really effective, as no distribution is provided) or in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services: a distributed file system (HDFS) for storage and a MapReduce framework for parallel processing. Most programmers are already familiar with structured data (data in the form of tables, rows, and columns), but they found it difficult to process such data when it was stored on HDFS in an unstructured flat file format. This led to the evolution of HBase, which provides a way to store data in a structured form.

Consider a CSV file stored on HDFS that we need to query. We would have to write Java code to read and filter the file, which isn't a good option. It would be better if we could specify a key and fetch the matching data from that file. To do this, we can create an HBase table with the same structure as the CSV file, store the CSV data in that table, and then query it by key using the HBase APIs or the HBase shell.
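The difference between scanning a flat file and fetching by key can be sketched in a few lines of Python. This is only an in-memory illustration, not the HBase API: the CSV content, the `id` key column, and the helper names `find_by_scan` and `load_into_table` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical CSV content; in practice this file would live on HDFS.
CSV_TEXT = """id,name,city
u001,Asha,Pune
u002,Ravi,Delhi
u003,Meera,Chennai
"""


def find_by_scan(csv_text, key):
    """Flat-file approach: read records one by one until the key matches."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["id"] == key:
            return dict(row)
    return None


def load_into_table(csv_text, key_column):
    """Table approach: index every record by its row key once, up front."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = dict(row)
        row_key = record.pop(key_column)  # the key column becomes the row key
        table[row_key] = record
    return table


table = load_into_table(CSV_TEXT, "id")
print(find_by_scan(CSV_TEXT, "u002"))  # walks the file from the top
print(table["u002"])                   # direct lookup by row key
```

Once the data has been loaded, each lookup goes straight to the row key instead of rereading the file, which is the convenience the keyed table is meant to provide.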

Data representation in HBase

Let's look at how rows and columns are represented in an HBase table:

An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data.
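One way to picture this layout is as nested maps: row key, then column family, then column, then cell value. The following is a minimal in-memory sketch, not HBase client code. The `ToyTable` class, the `info` and `contact` families, and the sample values are hypothetical, and the sketch assumes (as is usual in HBase) that column families are declared with the table while columns may differ from row to row.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase table (illustrative only)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> column family -> column -> cell value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column):
        # Use .get() at every level so a lookup never creates empty rows.
        return self.rows.get(row_key, {}).get(family, {}).get(column)


users = ToyTable(["info", "contact"])
users.put("u001", "info", "name", "Asha")
users.put("u001", "contact", "email", "asha@example.com")
users.put("u002", "info", "name", "Ravi")  # this row has no contact columns

print(users.get("u001", "info", "name"))
print(users.get("u002", "contact", "email"))  # None: no such cell in this row
```

Rows in this model do not need to carry the same columns, which is one reading of the flexible data model attributed to HBase later in this section.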

So, we have been through the introduction of HBase; now, let's briefly see what Hadoop and its components are. It is assumed here that you are already familiar with Hadoop; if not, the following brief introduction to Hadoop will help you understand it.

Hadoop

Hadoop is the underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework that supports the storage of large datasets. It provides a distributed file system and MapReduce, a distributed programming framework, which together form a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on systems ranging from tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage and rapid data access. Hadoop has the following submodules:

  • Hadoop Common: This is the core component, a set of common utilities that supports the other Hadoop modules. It provides the shared libraries that the different modules rely on for communication and coordination.

  • Hadoop Distributed File System (HDFS): This is the underlying distributed file system, layered on top of the local file systems of the nodes, which provides high-throughput read and write access to data on Hadoop.

  • Hadoop YARN: This is the framework shipped with newer releases of Hadoop (Hadoop 2). It provides job scheduling and cluster resource management.

  • Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large datasets.
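The MapReduce model named in the last item can be illustrated with a word count written in plain Python. This is a single-process sketch of the map, shuffle, and reduce steps, not Hadoop code. The function names and the two sample input splits are assumptions made for the example; on a real cluster, each map call and each reduce call could run on a different node.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by key across the map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    mapped = [map_phase(split) for split in splits]  # independent per split
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


splits = ["hbase runs on hadoop", "hadoop stores data on hdfs"]
counts = word_count(splits)
print(counts)
```

Because each split is mapped independently and each key is reduced independently, both phases can be spread across machines, which is where the parallelism comes from.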

Other projects related to Hadoop include HBase, Hive, Ambari, Avro, Cassandra, Mahout, Pig, Spark, ZooKeeper, and so on. Not all of these are Hadoop subprojects: Cassandra is a related project that solves similar problems in a different way, and ZooKeeper is a dependency shared by many distributed systems. Each of these serves a different purpose, and together they form the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

  • NameNode: This stores and manages all the metadata about the data present on the cluster, so it is the single point of contact for Hadoop. In newer releases of Hadoop, more than one NameNode can be configured for high availability.

  • JobTracker: This runs on a master node (often the same machine as the NameNode on small clusters) and schedules and coordinates the MapReduce jobs submitted to the cluster.

  • SecondaryNameNode: This periodically merges the NameNode's record of file system changes into a checkpoint of the metadata. It keeps a copy of that metadata, but it is not a hot standby for the NameNode.

  • DataNode: This stores the actual data.

  • TaskTracker: This performs the tasks assigned by the JobTracker on the local data.

The preceding are the daemons in Hadoop v1 or earlier. In newer versions of Hadoop, we have the ResourceManager instead of the JobTracker, NodeManagers instead of TaskTrackers, and the YARN framework instead of a simple MapReduce framework. The following table compares the daemons in Hadoop 1 and Hadoop 2:

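|            | Hadoop 1                               | Hadoop 2                                                            |
|------------|----------------------------------------|---------------------------------------------------------------------|
| HDFS       | NameNode, Secondary NameNode, DataNode | NameNode (more than one, active/standby), Checkpoint node, DataNode |
| Processing | MapReduce v1, JobTracker, TaskTracker  | YARN (MRv2), ResourceManager, NodeManager, Application Master       |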

Comparing HBase with Hadoop

Now that we know what HBase and Hadoop are, let's compare HDFS and HBase for a better understanding:

| Hadoop/HDFS                                                                      | HBase                                                       |
|----------------------------------------------------------------------------------|-------------------------------------------------------------|
| Provides a file system for distributed storage                                   | Provides tabular, column-oriented data storage              |
| Optimized for storing very large files, with no random read/write of these files | Optimized for tabular data, with random read/write access   |
| Uses flat files                                                                  | Uses key-value pairs of data                                |
| The data model is not flexible                                                   | Provides a flexible data model                              |
| Uses a file system and a processing framework                                    | Uses tabular storage with built-in Hadoop MapReduce support |
| Mostly optimized for write-once, read-many access                                | Optimized for both read-many and write-many access          |