If you are thinking about using HBase in production, you will probably want to understand its backup options and practices. The challenge is that the dataset you need to back up might be huge, so the backup solution must be efficient: it should scale to hundreds of terabytes of storage and restore the data in a reasonable time frame.
There are two strategies for backing up HBase:
Backing it up with a full cluster shutdown
Backing it up on a live cluster
A full shutdown backup first stops HBase (or disables all tables), then uses Hadoop's distcp command to copy the contents of the HBase directory either to another directory on the same HDFS or to a different HDFS. To restore from a full shutdown backup, copy the backed-up files back to the HBase directory using distcp.
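As a rough illustration of the full shutdown approach, the sketch below builds the two distcp command lines, one for the backup and one for the restore. The host names, ports, and paths are hypothetical placeholders, not values from any real cluster, and the helper name distcp_command is introduced here only for illustration. The script only prints the commands; it assumes HBase has already been stopped (or all tables disabled) before they are run.

```python
# Minimal sketch: build (but do not run) distcp commands for a
# full shutdown backup and restore. All locations are assumptions.

def distcp_command(src, dst):
    """Return the argv for copying one HDFS directory to another."""
    return ["hadoop", "distcp", src, dst]

# Hypothetical locations; replace with your own cluster's values.
HBASE_ROOT = "hdfs://master1:8020/hbase"
BACKUP_DIR = "hdfs://backup1:8020/backup/hbase"

# Backup: copy the HBase directory to the backup location.
backup_cmd = distcp_command(HBASE_ROOT, BACKUP_DIR)

# Restore: copy the backed-up files back to the HBase directory.
restore_cmd = distcp_command(BACKUP_DIR, HBASE_ROOT)

print(" ".join(backup_cmd))
print(" ".join(restore_cmd))
```

The restore is simply the backup with source and destination swapped, which is why a single helper covers both directions. How distcp lays out files at the destination can depend on whether the target directory already exists, so it is worth checking the result on a test cluster first.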
There are several approaches for a live cluster backup:
Using the CopyTable utility to copy data from one table to another
Exporting an HBase table to HDFS files...