In this section, we look at some of the best practices you should follow when using EMR.
If you need to read a lot of data from S3, it is recommended that you use the S3DistCp utility to copy the data into the local HDFS first, rather than reading directly from S3, because analysis against local HDFS generally performs better. The S3DistCp utility is provided by AWS, and it can be scheduled as the first step of your Job Flow so that the data copied from S3 to the local HDFS is available to the subsequent jobs in the Job Flow.
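As a rough illustration of scheduling S3DistCp as the first step, the following Python sketch builds a step definition in the dictionary shape that the EMR API accepts. It assumes a recent EMR release where S3DistCp is invoked as `s3-dist-cp` through `command-runner.jar`; older releases used a different JAR path. The helper name `build_s3distcp_step`, the bucket name, and the HDFS path are hypothetical and exist only for this example.

```python
def build_s3distcp_step(src, dest, name="Copy input from S3 to HDFS"):
    """Return an EMR step definition that runs S3DistCp from src to dest.

    Hypothetical helper for illustration; it only builds the dictionary
    and does not contact AWS.
    """
    if not src or not dest:
        raise ValueError("src and dest must both be non-empty")
    return {
        "Name": name,
        # Stop later steps if the copy fails, but keep the cluster running.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # Assumption: a release where command-runner.jar is available.
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", "--src", src, "--dest", dest],
        },
    }


# Placeholder locations; substitute your own bucket and HDFS directory.
copy_step = build_s3distcp_step(
    "s3://example-bucket/input/",
    "hdfs:///data/input/",
)

# Place copy_step first in the list of steps so that later jobs read from
# HDFS. With boto3 it could be submitted to a running cluster roughly as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[copy_step])
```

Keeping the step as plain data makes it easy to put the copy ahead of the analysis steps, which can then read from the HDFS destination instead of from S3.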
If you have a large amount of data to move from the local HDFS to S3, either for persistence or to save results before terminating a transient cluster, then look at the JetS3t toolkit. It provides various tools, including data synchronization, for moving data from local directories to S3, which makes it well suited to performing data backups to S3.
Also, Aspera Direct-to-S3 is a toolkit based on a proprietary file transfer implementation that uses UDP to move large amounts of data over the Internet at very high speeds....