Learning Big Data with Amazon Elastic MapReduce

By: Amarkant Singh, Vijay Rayapati

Overview of this book

Amazon Elastic MapReduce is a web service used to process and store vast amounts of data, and it is one of the largest Hadoop operators in the world. With the increase in the amount of data generated and collected by many businesses, and the arrival of cost-effective cloud-based solutions for distributed computing, it has become far more feasible to crunch large amounts of data and extract deep insights within a short span of time.

This book will get you started with AWS so that you can quickly create your own account and explore the services provided, many of which you might be delighted to use. It covers the architectural details of the MapReduce framework, Apache Hadoop, various job models on EMR, how to manage clusters on EMR, and the command-line tools available with EMR. Each chapter builds on the knowledge of the previous one, leading to the final chapter, where you will learn how to solve a real-world use case using Apache Hadoop and EMR. This book will, therefore, get you up and running with major Big Data technologies quickly and efficiently.
Table of Contents (18 chapters)
Learning Big Data with Amazon Elastic MapReduce
Credits
About the Authors
Acknowledgments
About the Reviewers
www.PacktPub.com
Preface
Index

Chapter 2. MapReduce

We will get into the what and how of MapReduce in a bit, but first let's say you have a simple counting problem at hand. Say, you need to count the number of hits to your website per country or per city. The only hurdle in solving this is the sheer amount of input data: your website is quite popular, and huge amounts of access logs are generated per day. Also, you need to put a system in place that sends a daily report to the top management showing the total number of views per country.

Had it been a few hundred MBs of access logs, or even a few GBs, you could easily create a standalone application that would crunch this data and count the views per country in a few hours. But what do you do when the input data runs into hundreds of GBs?
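To make the standalone case concrete, here is a minimal sketch of such a counter in Python. It assumes a simplified, hypothetical log format in which each line already carries the country code followed by a tab and the requested URL; a real access log would first need IP-to-country resolution, which is omitted here.

```python
from collections import Counter

def count_views_by_country(log_lines):
    """Count page views per country from access-log lines.

    Assumes each line looks like 'country<TAB>url' -- a hypothetical,
    pre-parsed format used only for illustration.
    """
    counts = Counter()
    for line in log_lines:
        country, _, _url = line.partition("\t")
        counts[country] += 1
    return counts

logs = ["IN\t/home", "US\t/home", "IN\t/about"]
print(count_views_by_country(logs))  # Counter({'IN': 2, 'US': 1})
```

For a few GBs of logs, a single process like this is perfectly adequate; the approach only breaks down when one machine can no longer read and count the data in an acceptable amount of time.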

The best way to handle this would be to create a processing system that can work on parts of the input data in parallel and ultimately combine all the results. This...
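The divide-and-combine idea described above can be sketched on a single machine using Python's multiprocessing module: split the input into chunks, count each chunk in parallel, and merge the partial counts at the end. The same hypothetical 'country<TAB>url' line format is assumed; this is an illustration of the pattern, not how Hadoop itself is implemented.

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(lines):
    # The "map"-like step: count views per country in one slice of the log.
    counts = Counter()
    for line in lines:
        country = line.split("\t", 1)[0]
        counts[country] += 1
    return counts

def parallel_count(log_lines, workers=4):
    # Split the input into roughly equal chunks, process them in
    # parallel, then combine the partial results (the "reduce"-like step).
    chunk_size = max(1, len(log_lines) // workers)
    chunks = [log_lines[i:i + chunk_size]
              for i in range(0, len(log_lines), chunk_size)]
    with Pool(workers) as pool:
        partials = pool.map(count_chunk, chunks)
    total = Counter()
    for part in partials:
        total += part
    return total

if __name__ == "__main__":
    logs = ["IN\t/home", "US\t/home", "IN\t/about", "DE\t/home"]
    print(parallel_count(logs, workers=2))
```

MapReduce generalizes exactly this pattern: instead of worker processes on one machine, the chunks are distributed across many machines, and the framework takes care of scheduling, data movement, and failure handling.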