Learning Big Data with Amazon Elastic MapReduce

By: Amarkant Singh, Vijay Rayapati
Overview of this book

Amazon Elastic MapReduce is a web service used to process and store vast amounts of data, and it is one of the largest Hadoop operators in the world. With the growth in the amount of data generated and collected by businesses, and the arrival of cost-effective cloud-based solutions for distributed computing, it has become far more feasible to crunch large amounts of data and extract deep insights within a short span of time.

This book will get you started with AWS so that you can quickly create your own account and explore the services it provides, many of which you might be delighted to use. It covers the architectural details of the MapReduce framework, Apache Hadoop, the various job models on EMR, how to manage clusters on EMR, and the command-line tools available with EMR. Each chapter builds on the knowledge of the previous one, leading to the final chapter, where you will learn how to solve a real-world use case using Apache Hadoop and EMR. This book will, therefore, get you up and running with the major Big Data technologies quickly and efficiently.

Chapter 6. Executing Hadoop Jobs on an Amazon EMR Cluster

In this chapter, we will see how to launch an EMR cluster via the AWS management console and then execute, on that cluster, the solution we created in the previous chapter. Of the various ways to program a solution on EMR that we saw in Chapter 4, Amazon EMR – Hadoop on Amazon Web Services, we chose the custom JAR technique, so we will use the JAR we built in the previous chapter.
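Although this chapter walks through the AWS management console, the same launch can be scripted with the AWS CLI. The following is a minimal sketch of launching a cluster with a custom JAR step; the cluster name, key pair, all s3:// paths, instance type, and release label are placeholders you would replace with your own values, and older EMR versions use `--ami-version` instead of `--release-label`:

```shell
# Launch a three-node EMR cluster and queue one custom JAR step.
# Every name and s3:// path below is a placeholder -- substitute your own.
aws emr create-cluster \
    --name "WordCount-Cluster" \
    --release-label emr-5.36.0 \
    --applications Name=Hadoop \
    --ec2-attributes KeyName=my-emr-keypair \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --log-uri s3://my-emr-bucket/logs/ \
    --use-default-roles \
    --steps Type=CUSTOM_JAR,Name="WordCountStep",ActionOnFailure=TERMINATE_CLUSTER,Jar=s3://my-emr-bucket/jars/wordcount.jar,Args=[s3://my-emr-bucket/input/,s3://my-emr-bucket/output/]
```

The command prints the new cluster's ID, which you can pass to `aws emr describe-cluster --cluster-id <id>` to poll the cluster's state as it provisions and runs the step.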

Before you go ahead and launch your EMR cluster, make sure that the following two prerequisites are in place:

  • You need to have an EC2 key pair. If you do not have one, you can generate one from your AWS management console. You will need it to SSH into the master node of the EMR cluster.

  • You need to upload the input files and the custom JAR we created in the previous chapter to Amazon S3. EMR will fetch both the input and the program to be executed (the JAR file) from S3.
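If you prefer the command line to the console, both prerequisites can be scripted with the AWS CLI. This is a rough sketch, assuming the CLI is already configured with your credentials; the key pair name, bucket name, and file names are all placeholders:

```shell
# 1. Create an EC2 key pair and save the private key locally.
#    The resulting .pem file is what you use to SSH into the master node.
aws ec2 create-key-pair \
    --key-name my-emr-keypair \
    --query 'KeyMaterial' \
    --output text > my-emr-keypair.pem
chmod 400 my-emr-keypair.pem   # SSH refuses keys with open permissions

# 2. Create an S3 bucket and upload the input files and the custom JAR.
aws s3 mb s3://my-emr-bucket
aws s3 cp input/ s3://my-emr-bucket/input/ --recursive
aws s3 cp wordcount.jar s3://my-emr-bucket/jars/wordcount.jar
```

Note that S3 bucket names are globally unique, so `my-emr-bucket` will need to be replaced with a name of your own.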