Learning Big Data with Amazon Elastic MapReduce

By: Amarkant Singh, Vijay Rayapati

Overview of this book

Amazon Elastic MapReduce is a web service used to process and store vast amounts of data, and it is one of the largest Hadoop operators in the world. With the increase in the amount of data generated and collected by many businesses, and the arrival of cost-effective cloud-based solutions for distributed computing, it has become far more feasible to crunch large amounts of data and gain deep insights within a short span of time.

This book will get you started with AWS so that you can quickly create your own account and explore the services provided, many of which you might be delighted to use. This book covers the architectural details of the MapReduce framework, Apache Hadoop, various job models on EMR, how to manage clusters on EMR, and the command-line tools available with EMR. Each chapter builds on the knowledge of the previous one, leading to the final chapter where you will learn about solving a real-world use case using Apache Hadoop and EMR. This book will, therefore, get you up and running with major Big Data technologies quickly and efficiently.

EMR best practices


In this section, we will see some of the best practices you should follow while using EMR.

Data transfer

If you need to read a lot of data from S3, it is recommended that you use the S3DistCp utility to copy the data into the local HDFS for analysis, rather than reading it directly from S3, as this improves performance. S3DistCp is provided by AWS and can be scheduled as the first step of your Job Flow to copy data from S3 into the local HDFS, where it can then be analyzed by the subsequent jobs in the Job Flow.
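For illustration, here is a minimal sketch of scheduling an S3DistCp copy as a step on an existing cluster using the AWS CLI; the cluster ID, bucket name, and HDFS path are placeholders, and the exact invocation depends on your EMR release (on older AMI versions you point the step at the S3DistCp JAR shipped on the cluster instead of using command-runner.jar):

# Add an S3DistCp step that stages input data from S3 into HDFS
# before the analysis jobs in the Job Flow run (IDs and paths are placeholders)
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="CopyInputFromS3",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["s3-dist-cp","--src","s3://my-input-bucket/logs/","--dest","hdfs:///input/logs/"]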

If you have a large amount of data to be moved from the local HDFS to S3 for persistence, or you want to save results before terminating a transient cluster, then have a look at the JetS3t toolkit. It provides various tools, including a data synchronization utility, to move data from local directories to S3, and it is well suited for performing data backups to S3.
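As a rough sketch, JetS3t ships with a command-line Synchronize application that can upload a local directory to an S3 bucket; the bucket name and local path below are placeholders, the script name and options vary by JetS3t version, and your AWS credentials are read from the toolkit's synchronize.properties configuration:

# Upload (UP) a local results directory to an S3 bucket with JetS3t's Synchronize tool
# (placeholders shown; run from the bin directory of the JetS3t distribution)
./synchronize.sh UP my-results-bucket/emr-output /mnt/results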

Also, Aspera Direct-to-S3 is a toolkit based on a proprietary UDP file transfer implementation that can move large amounts of data over the Internet at very high speeds....