Cloudera Administration Handbook

By: Rohit Menon

Using the distributed copy (DistCp)


Distributed copy (DistCp) is a Hadoop utility for copying data in parallel, either within a cluster or between clusters. It runs the copy as a MapReduce job, so the work is spread across the nodes of the cluster. DistCp is one of the most widely used data transfer tools in Hadoop clusters. For example:

$ hadoop distcp hdfs://namenode1/src hdfs://namenode2/dest

The preceding command copies the src folder and all its contents from the cluster managed by namenode1 to the cluster managed by namenode2, where it is stored as the dest folder. By default, DistCp does not overwrite files at the target location: if a file already exists there, it is skipped. To force existing files to be overwritten, pass the -overwrite flag.
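
The following is a minimal sketch of how the default, -overwrite, and -update behaviors map onto the command line. The distcp_cmd helper and its mode names (skip, overwrite, update) are hypothetical and exist only for illustration. The helper prints the command instead of running it, so it can be tried on a machine without Hadoop. The -update flag is not discussed above; it is assumed here to behave as in the Apache DistCp documentation, copying only files that differ between source and target.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): prints the distcp command
# line for a given copy mode without executing it.
#   skip      - default behavior, existing target files are left alone
#   overwrite - adds -overwrite, existing target files are replaced
#   update    - adds -update, only files that differ are copied
distcp_cmd() {
    mode="$1"
    src="$2"
    dest="$3"
    case "$mode" in
        skip)      flag="" ;;
        overwrite) flag="-overwrite" ;;
        update)    flag="-update" ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
    # ${flag:+$flag } expands to the flag plus a space only when flag is set
    echo "hadoop distcp ${flag:+$flag }$src $dest"
}

# Print the three variants for the clusters used in the example above.
distcp_cmd skip      hdfs://namenode1/src hdfs://namenode2/dest
distcp_cmd overwrite hdfs://namenode1/src hdfs://namenode2/dest
distcp_cmd update    hdfs://namenode1/src hdfs://namenode2/dest
```

To run a copy for real, paste the printed line into a shell on a cluster node. Note that, according to the Apache DistCp documentation, -update and -overwrite also change how the source path is interpreted: the contents of src are copied into dest rather than src itself, so it is worth confirming the resulting layout against the documentation for your release before relying on it.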

The hadoop distcp command accepts several other options. Details of these options can be found at http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_distcp_data_cluster_migrate.html.