
Learning Big Data with Amazon Elastic MapReduce

By : Amarkant Singh, Vijay Rayapati

Overview of this book

Amazon Elastic MapReduce is a web service used to process and store vast amounts of data, and Amazon is one of the largest Hadoop operators in the world. With the increase in the amount of data generated and collected by many businesses, and the arrival of cost-effective cloud-based solutions for distributed computing, it has become far more feasible to crunch large amounts of data and gain deep insights within a short span of time.

This book will get you started with AWS so that you can quickly create your own account and explore the services provided, many of which you might be delighted to use. It covers the architectural details of the MapReduce framework, Apache Hadoop, the various job models on EMR, how to manage clusters on EMR, and the command-line tools available with EMR. Each chapter builds on the knowledge of the previous one, leading to the final chapter, where you will learn about solving a real-world use case using Apache Hadoop and EMR. This book will, therefore, get you up and running with the major Big Data technologies quickly and efficiently.
Table of Contents (18 chapters)
Learning Big Data with Amazon Elastic MapReduce
Credits
About the Authors
Acknowledgments
About the Reviewers
www.PacktPub.com
Preface
Index

Getting started with Amazon S3


S3 is a service that frees developers and businesses from worrying about having enough storage available. It is a robust and reliable service that lets you store any amount of data and ensures that your data is available whenever you need it.

Creating an S3 bucket

Creating an S3 bucket is just a matter of a few clicks and setting a few parameters, such as the name of the bucket. Let's walk through the simple steps required to create an S3 bucket from the AWS management console:

  1. Go to the S3 dashboard and click on Create Bucket.

  2. Enter a bucket name of your choice and select the AWS region in which you want to create your bucket.

  3. That's all; just click on Create and you are done.

Bucket naming

The bucket name you choose must be unique across all existing bucket names in Amazon S3. Because bucket names form part of the URL used to access their objects over HTTP, they are required to follow DNS naming conventions.

The DNS naming conventions include the following rules:

  • It must be at least three and no more than 63 characters long.

  • It must be a series of one or more labels. Adjacent labels are separated by a single period (.).

  • It can contain lowercase letters, numbers, and hyphens.

  • Each individual label within a name must start and end with a lowercase letter or a number.

  • It must not be formatted as an IP address.

Some examples of valid and invalid bucket names are listed in the following table:

Invalid bucket name   Valid bucket name
TheAwesomeBucket      the.awesome.bucket
.theawesomebucket     theawesomebucket
the..awesomebucket    the.awesomebucket
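The naming rules and examples above can be checked programmatically. Here is a minimal sketch in Python using only the standard library; the function name `is_valid_bucket_name` is my own and is not part of any AWS SDK:

```python
import re

# One DNS label: starts and ends with a lowercase letter or digit,
# and may contain hyphens in between.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

def is_valid_bucket_name(name: str) -> bool:
    """Check a bucket name against the DNS naming rules listed above."""
    # Must be at least 3 and no more than 63 characters long.
    if not 3 <= len(name) <= 63:
        return False
    # Must not be formatted as an IP address, e.g. 192.168.0.1.
    if re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", name):
        return False
    # One or more labels separated by single periods.
    return re.fullmatch(rf"{LABEL}(?:\.{LABEL})*", name) is not None
```

Running the names from the table through this helper reproduces the valid/invalid classification shown: `TheAwesomeBucket` fails because of the uppercase letters, `.theawesomebucket` and `the..awesomebucket` fail because of the empty labels.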

Now, you can easily upload your files into this bucket by clicking on the bucket name and then clicking on Upload. You can also create folders inside the bucket.

Note

Apart from the AWS management console, there are many independently developed S3 browsers available for various operating systems: for Windows there is CloudBerry, and for Linux there is Bucket Explorer. There are also handy plugins available for Chrome and Firefox.

S3cmd

S3cmd is a free command-line tool to upload, retrieve, and manage data on Amazon S3. It offers advanced features such as multipart uploads, encryption, incremental backups, and S3 sync, among others. You can use S3cmd to automate your S3-related tasks.

You can download the latest version of S3cmd from http://s3tools.org, where you will also find installation instructions. Note that this is a separate open source tool that is not developed by Amazon.

In order to use S3cmd, you will need to first configure your S3 credentials. To configure credentials, you need to execute the following command:

s3cmd --configure

You will be prompted for two keys: an Access Key and a Secret Key. You can get these keys from the IAM dashboard of your AWS management console. You may keep the default values for the other settings.
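After configuration, s3cmd saves your settings in a configuration file, typically ~/.s3cfg in your home directory on Linux. The credential entries look roughly like this (the key values below are placeholders):

```ini
[default]
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
```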

Now you can access and manage your S3 buckets using a few intuitive commands, which are listed in the following table:

Task                            Command
List all the buckets            s3cmd ls
Create a bucket                 s3cmd mb s3://my.awesome.unique.bucket
List the contents of a bucket   s3cmd ls s3://my.awesome.unique.bucket
Upload a file into a bucket     s3cmd put /myfilepath/myfilename.abc s3://my.awesome.unique.bucket
Download a file                 s3cmd get s3://my.awesome.unique.bucket/myfilename.abc /myfilepath/
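Since these are ordinary command-line invocations, they are easy to drive from a script when automating S3 tasks. Below is a minimal Python sketch that shells out to s3cmd via the standard library; the helper names `build_s3cmd` and `run_s3cmd` are my own, and the sketch assumes s3cmd is installed and already configured:

```python
import subprocess

def build_s3cmd(action: str, *args: str) -> list[str]:
    """Build an s3cmd command line, e.g. build_s3cmd("put", src, dest)."""
    return ["s3cmd", action, *args]

def run_s3cmd(action: str, *args: str) -> str:
    """Run s3cmd and return its stdout; raises CalledProcessError on failure."""
    cmd = build_s3cmd(action, *args)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout

# Example (requires a configured s3cmd installation):
# run_s3cmd("put", "/myfilepath/myfilename.abc", "s3://my.awesome.unique.bucket")
```

Using `check=True` makes a failed upload raise an exception instead of failing silently, which is usually what you want in an automated backup job.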