As Hadoop clusters mature, the data residing in them grows, and maintaining a backup copy of that data becomes an important responsibility of the Hadoop administrator. Backing up data from a distributed environment is challenging because the volume of data keeps increasing. Setting up backup operations is an important step toward being able to restore data if the entire cluster fails. This chapter discusses the various backup and data protection options and covers the following topics:
Understanding backups
Understanding HDFS backups
Using the distributed copy (DistCp)
Configuring backups using Cloudera Manager