Comparing functional differences between RDBMS and HBase


Lately, we have been hearing a lot about NoSQL databases such as HBase, so let's understand what HBase offers and what it lacks in comparison to the conventional relational databases that have existed for so long. The following table highlights the differences:

| Relational database | HBase |
| --- | --- |
| This supports scale-up. In other words, when more disk or memory processing power is needed, we have to upgrade to a more powerful server. | This supports scale-out. In other words, when more disk or memory processing power is needed, we need not upgrade the server; instead, we add new servers to the cluster. |
| This uses SQL queries for reading records from tables. | This uses APIs and MapReduce for accessing data from HBase tables (see the sketch after this table). |
| This is row oriented, that is, each row is a contiguous unit on the page. | This is column oriented, that is, each column is a contiguous unit on the page. |
| The amount of data depends on the configuration of the server. | The amount of data does not depend on a particular machine but on the number of machines. |
| Its schema is more restrictive. | Its schema is flexible and less restrictive. |
| This has ACID support. | There is no built-in ACID support in HBase. |
| This is suited for structured data. | This is suited to both structured and unstructured data. |
| A conventional relational database is mostly centralized. | This is always distributed. |
| This mostly guarantees transaction integrity. | There is no transaction guarantee in HBase. |
| This supports JOINs. | This does not support JOINs. |
| This supports referential integrity. | There is no built-in support for referential integrity. |
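
To make the row about APIs concrete, here is a minimal sketch of writing and reading a single cell with the HBase Java client API (HBase 1.x+ style). The table name `employee`, column family `cf`, and column `salary` are hypothetical and used only for illustration:

```java
// A minimal sketch: the HBase equivalents of an SQL INSERT and a SELECT of one column.
// Table, family, and column names below are hypothetical examples.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseAccessSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // picks up hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Write: a Put on a row key plays the role of INSERT/UPDATE.
            Put put = new Put(Bytes.toBytes("emp-1001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("salary"), Bytes.toBytes("55000"));
            table.put(put);

            // Read: roughly SELECT salary FROM employee WHERE id = 'emp-1001'.
            Get get = new Get(Bytes.toBytes("emp-1001"));
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("salary"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("salary"))));
        }
    }
}
```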

So, with these differences, both have their own use cases. When we have a small amount of data that an RDBMS can accommodate without performance lagging, we can go with an RDBMS.

When we need Online Transaction Processing (OLTP) and transactional guarantees, an RDBMS is the easier choice. When we have a huge amount of data (terabytes or petabytes), we should look towards HBase, which is better for aggregation on columns and faster processing.

We came across the term column-oriented database in the previous introduction; now let's discuss the difference between column-oriented databases and row-oriented databases, the latter being the traditional relational databases.

Column-oriented database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems on the analytical workloads found in data warehouses, decision support systems, and business intelligence applications. They are more I/O-efficient for write-once, read-many queries.

Logical view of row-oriented databases

The following figure shows how data is represented in relational databases:

Logical view of column-oriented databases

The following figure shows how we can logically represent NoSQL/column-oriented databases such as HBase:

Row-oriented data stores store each row as a contiguous unit on a page, and as many rows as fit are packed into a page. They are much faster for reading and writing small numbers of rows but slower for aggregation. Column-oriented data stores, on the contrary, store each column as a contiguous unit on a page; a column may extend to millions of entries, so it can span many pages. These are much faster for aggregation and analytics. The roots of column-oriented database systems can be traced back to the 1970s, when transposed files first appeared. Column-oriented data stores also compress better than row-oriented data stores. The following is a comparison between the two:

| Row-oriented data stores | Column-oriented data stores |
| --- | --- |
| These are efficient for the addition/modification of records | These are efficient for reading data |
| They read pages containing entire rows | They read only the needed columns |
| These are best for OLTP | These are not so optimized for OLTP yet |
| These serialize all the values in a row together, then the values in the next row, and so on (see the sketch after this table) | These serialize all the values of a column together, then the values of the next column, and so on |
| Rows are stored in contiguous pages in memory or on disk | Columns are stored in contiguous pages in memory or on disk |
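
To make the serialization rows above concrete, the following sketch lays out the same three hypothetical records both row-wise and column-wise; the field names and values are purely illustrative, and real stores of course work at the byte and page level rather than on strings:

```java
// Illustrative sketch: serializing the same three records row-wise and column-wise.
// The records and fields are hypothetical; this only demonstrates the layout difference.
import java.util.Arrays;
import java.util.List;

public class SerializationLayout {
    public static void main(String[] args) {
        List<String[]> records = Arrays.asList(
            new String[]{"emp-1001", "Alice", "55000"},
            new String[]{"emp-1002", "Bob",   "48000"},
            new String[]{"emp-1003", "Carol", "61000"});

        // Row-oriented: all values of a row are stored together, then the next row.
        StringBuilder rowLayout = new StringBuilder();
        for (String[] record : records) {
            rowLayout.append(String.join(",", record)).append(";");
        }
        System.out.println("Row-oriented   : " + rowLayout);
        // emp-1001,Alice,55000;emp-1002,Bob,48000;emp-1003,Carol,61000;

        // Column-oriented: all values of a column are stored together, then the next column.
        StringBuilder colLayout = new StringBuilder();
        for (int col = 0; col < records.get(0).length; col++) {
            for (String[] record : records) {
                colLayout.append(record[col]).append(",");
            }
            colLayout.append(";");
        }
        System.out.println("Column-oriented: " + colLayout);
        // emp-1001,emp-1002,emp-1003,;Alice,Bob,Carol,;55000,48000,61000,;
    }
}
```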

Suppose the records of a table are stored in pages on disk. When they need to be accessed, these pages are brought into primary memory, if they are not already present there.

If a row occupies a page and we need one specific column, such as salary or rate of interest, from all the rows for some kind of analytics, every page containing that column has to be brought into memory; this paging in and out results in a lot of I/O, which can lead to prolonged processing time.

In column-oriented databases, each column is stored in its own pages. If we need to fetch a specific column, there is less I/O, as only the pages that contain the specified column need to be brought into main memory and read; we need not bring all the pages containing whole rows/records into memory. So queries in which we need to fetch only specific columns, and not whole records, are served best by a column-oriented database. This is useful for analytics, where we fetch a few columns and perform mathematical operations on them, such as sum and average.
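
In HBase terms, this column-selective access corresponds to restricting a Scan to just the columns we need. The following is a minimal sketch, again assuming the hypothetical employee table with a cf:salary column from the earlier example, that sums and averages one column without reading whole rows:

```java
// A minimal sketch of a column-selective scan: only cf:salary is requested,
// so whole rows need not be read back to the client.
// Table, family, and column names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SalarySumSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("salary")); // fetch only this column

            long sum = 0, count = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("salary"));
                    if (value != null) {
                        sum += Long.parseLong(Bytes.toString(value));
                        count++;
                    }
                }
            }
            System.out.println("sum=" + sum + ", avg=" + (count == 0 ? 0 : sum / count));
        }
    }
}
```

Because the scan requests only cf:salary, HBase reads from the store files of that column family alone, which is exactly the I/O saving described above.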

Pros and cons of column-oriented databases

The following are the pros of column-oriented databases:

  • These have built-in support for efficient data compression.

  • These support fast data retrieval.

  • Administration and configuration are simplified. They can be scaled out and hence are very easy to expand.

  • These give high performance on aggregation queries (such as COUNT, SUM, AVG, MIN, and MAX).

  • These are efficient for partitioning, as they provide an automatic sharding mechanism that splits bigger regions into smaller ones.

The following are the cons of column-oriented databases:

  • Queries with JOINs that combine data from many tables are not optimized.

  • Updates and deletes are recorded as new entries, so frequent compactions and splits are needed, which reduces storage efficiency.

  • Partitioning or indexing schemes can be difficult to design, as relational concepts are not implicit.