Learning Hadoop 2
Hive and Amazon Web Services


With Elastic MapReduce (EMR) as the AWS Hadoop-on-demand service, it is of course possible to run Hive on an EMR cluster. But it is also possible to use Amazon storage services, particularly S3, from any Hadoop cluster, whether it is within EMR or a local cluster of your own.

Hive and S3

As mentioned in Chapter 2, Storage, it is possible to specify a default filesystem other than HDFS for Hadoop, and S3 is one option. It doesn't have to be an all-or-nothing choice, however; it is possible to have specific tables stored in S3. The data for these tables will be retrieved into the cluster for processing, and any resulting data can be written either to a different S3 location (the same table cannot be both the source and the destination of a single query) or onto HDFS.
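As an illustrative sketch of such a table, the following HiveQL declares an external table whose data lives in S3 rather than HDFS. The column layout and the `s3n://` path here are assumptions for illustration (they are not taken from the text), and the bucket name is a placeholder:

```sql
-- Hypothetical external table backed by S3 instead of HDFS.
-- The schema and path are illustrative only.
CREATE EXTERNAL TABLE tweets (
    created_at  STRING,
    tweet_id    STRING,
    tweet_text  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3n://<bucket-name>/tweets/';
```

Because the table is EXTERNAL, dropping it in Hive removes only the metadata; the underlying files in S3 are left untouched.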

We can take a file of our tweet data and place it onto a location in S3 with a command such as the following:

$ aws s3 cp tweets.tsv s3://<bucket-name>/tweets/

We first need to specify the access key and secret access key that...
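One common way to supply those credentials to Hadoop (a sketch, assuming the `s3n` filesystem typical of the Hadoop 2 era; deployments using `s3a` would set `fs.s3a.access.key` and `fs.s3a.secret.key` instead) is via properties in `core-site.xml`:

```xml
<!-- core-site.xml: AWS credentials for the s3n filesystem.
     The values shown are placeholders, not real keys. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```

Storing keys in a cluster-wide configuration file makes them visible to anyone who can read it, so in practice tighter mechanisms (such as instance roles on EC2) are generally preferable.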