distcp (distributed copy) is a tool provided by Hadoop for copying large datasets within the same HDFS cluster or between different HDFS clusters. It uses MapReduce to copy files in parallel, handle errors and recovery, and report the job status.
Because HBase stores all its files, including system files, on HDFS, we can simply use distcp to copy the HBase directory either to another directory on the same HDFS or to a different HDFS, thereby backing up the source HBase cluster.
Note that this is a full shutdown backup solution. The distcp tool works only because the HBase cluster is shut down (or all tables are disabled), so there are no edits to files during the copy. Do not use distcp on a live HBase cluster. This solution is therefore for environments that can tolerate a periodic full shutdown of their HBase cluster, for example, a cluster that is used for backend batch processing and does not serve frontend requests.
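As a rough illustration of the copy described above, the following sketch only assembles the `hadoop distcp` command line and prints it; it does not run anything. The NameNode hosts, ports, and directory paths are hypothetical placeholders, and the helper `build_distcp_command` is an illustrative name, not part of Hadoop or HBase. It assumes HBase has already been shut down.

```python
def build_distcp_command(source_uri, backup_uri, max_maps=None):
    """Return the argument list for a `hadoop distcp` copy.

    source_uri and backup_uri are full HDFS URIs. max_maps, if given,
    is passed through distcp's -m option to cap the number of
    simultaneous map tasks used for the copy.
    """
    command = ["hadoop", "distcp"]
    if max_maps is not None:
        command += ["-m", str(max_maps)]
    command += [source_uri, backup_uri]
    return command


# Hypothetical locations: replace with your own NameNodes and paths.
SOURCE = "hdfs://source-nn:8020/hbase"
BACKUP = "hdfs://backup-nn:8020/backup/hbase"

# Assumption: the HBase cluster is fully shut down before this is run.
command = build_distcp_command(SOURCE, BACKUP, max_maps=10)
print(" ".join(command))
```

To actually start the copy, the list could be handed to `subprocess.run(command, check=True)` on a host with a configured Hadoop client. Building the command separately from running it makes it easy to review the source and destination before any data is moved.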
We will describe how to use distcp to back up a fully shut down HBase...