Cloudera Administration Handbook
Distributed copy (DistCp) is a Hadoop utility used to copy data in parallel within and between clusters. It uses Hadoop's MapReduce to perform the copy operation. DistCp is the most widely used data transfer tool in Hadoop clusters. For example:
$ hadoop distcp hdfs://namenode1/src hdfs://namenode2/dest
The preceding command copies the src folder and all its contents from the cluster managed by namenode1 to the cluster managed by namenode2 as the dest folder. By default, DistCp does not overwrite files at the target location; if a file already exists there, it is skipped. However, existing files can be forced to be overwritten by using the -overwrite flag.
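As a minimal sketch of the overwrite behavior described above, the following script reuses the example URIs and passes either -overwrite or the related -update flag, which copies only files that are missing at the target or differ from the source. The run_distcp wrapper and the DRY_RUN variable are hypothetical helpers introduced for illustration and are not part of Hadoop; by default the script only prints the commands so that they can be reviewed before running them against a real cluster.

```shell
#!/bin/sh
# Cluster URIs reuse the example above; substitute your own NameNode addresses.
SRC="hdfs://namenode1/src"
DEST="hdfs://namenode2/dest"

# run_distcp is a hypothetical wrapper, not part of Hadoop. With DRY_RUN=1
# (the default here) it only prints the command it would run.
run_distcp() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "hadoop distcp $* $SRC $DEST"
    else
        hadoop distcp "$@" "$SRC" "$DEST"
    fi
}

run_distcp -overwrite   # replace files that already exist at the target
run_distcp -update      # copy only files that are missing or differ at the target
```

Setting DRY_RUN=0 would execute the commands instead of printing them. Note that with -update or -overwrite, DistCp copies the contents of the source directory into the target rather than the source directory itself, so it is worth checking the resulting layout on a small test path first.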
Several other options can be used with the hadoop distcp command. Details of these options can be found at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_distcp_data_cluster_migrate.html.