Mastering Hadoop

By: Sandeep Karanth

Amazon AWS S3


S3, short for Simple Storage Service, is Amazon's storage-as-a-service offering. It stores data redundantly to provide reliable storage. Consumers are charged according to the amount of data they store on S3. Downloading data from S3 is also charged, but uploading data and transferring it between AWS properties are free of charge. This makes it extremely attractive to run EMR (Elastic MapReduce) on AWS with the data stored on S3.
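The charging model described above can be sketched as simple arithmetic. This is only an illustration of that simplified model: the function name, its parameters, and the rates in the usage lines are hypothetical placeholders, not actual AWS prices, and real bills may include items not mentioned here.

```python
def estimate_s3_charge(stored_gb, downloaded_gb, uploaded_gb,
                       storage_rate_per_gb, download_rate_per_gb):
    """Estimate a charge under the simplified model in the text.

    Storage and downloads are billed per GB; uploads are free.
    All rates are supplied by the caller (hypothetical values here).
    """
    storage_cost = stored_gb * storage_rate_per_gb
    download_cost = downloaded_gb * download_rate_per_gb
    upload_cost = uploaded_gb * 0.0  # uploads are free in this model
    return storage_cost + download_cost + upload_cost


# Placeholder rates chosen for easy arithmetic, NOT real AWS prices.
charge = estimate_s3_charge(stored_gb=1000, downloaded_gb=200, uploaded_gb=500,
                            storage_rate_per_gb=0.5, download_rate_per_gb=0.25)
print(charge)  # 1000 * 0.5 + 200 * 0.25 = 550.0
```

Note that the 500 GB upload contributes nothing to the total, which is why loading data into S3 and processing it inside AWS is inexpensive relative to pulling the data back out.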

S3 can be used as the input and output data store for MapReduce jobs, while the intermediate files are stored on local disks or on the HDFS of the EMR cluster. Keeping input and results on S3 also makes them easy to share among different people in the organization, with high data security and without fear of data loss. If an EMR cluster is terminated accidentally, all of its HDFS data is lost unless it has been moved out beforehand. Using S3 for input and output mitigates this risk.
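As a rough sketch of this layout, the helper below plans the paths for a chain of MapReduce jobs so that only the first input and the final output live on S3, while the hops in between stay on the cluster's HDFS. It reads "intermediate files" as the data passed between chained jobs, which is an assumption. The function name, the bucket name, the `s3://` scheme, and the scratch directory are all hypothetical; use whatever URI scheme your Hadoop distribution expects.

```python
def plan_stage_paths(input_uri, output_uri, scratch_dir, num_stages):
    """Return (input, output) path pairs for a chain of MapReduce jobs.

    The first stage reads from input_uri and the last stage writes to
    output_uri (both intended to be durable S3 locations). Every hop in
    between is placed under scratch_dir (intended to be cluster HDFS).
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    scratch = scratch_dir.rstrip("/")
    pairs = []
    current_input = input_uri
    for stage in range(1, num_stages + 1):
        is_last = stage == num_stages
        # Only the final stage writes to the durable output location.
        stage_output = output_uri if is_last else f"{scratch}/stage-{stage}"
        pairs.append((current_input, stage_output))
        current_input = stage_output
    return pairs


# Hypothetical bucket and scratch directory, for illustration only.
plan = plan_stage_paths("s3://example-bucket/input/",
                        "s3://example-bucket/output/",
                        "hdfs:///tmp/pipeline", 3)
for src, dst in plan:
    print(src, "->", dst)
```

With this arrangement, an accidental cluster termination loses only the scratch data under HDFS; the original input and any completed final output remain on S3.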

However, S3 is significantly slower because it does not provide...