Book Image

Optimizing Hadoop for MapReduce

By : Khaled Tannir
Book Image

Optimizing Hadoop for MapReduce

By: Khaled Tannir

Overview of this book

Table of Contents (15 chapters)

Preface

MapReduce is an important parallel processing model for large-scale, data-intensive applications such as data mining and web indexing. Hadoop, an open source implementation of MapReduce, is widely applied to support cluster computing jobs that require low response time.

Most of the MapReduce programs are written for data analysis and they usually take a long time to finish. Many companies are embracing Hadoop for advanced data analytics over large datasets that require time completion guarantees. Efficiency, especially the I/O costs of MapReduce, still needs to be addressed for successful implications. The experience shows that a misconfigured Hadoop cluster can noticeably reduce and significantly downgrade the performance of MapReduce jobs.

In this book, we address the MapReduce optimization problem, how to identify shortcomings, and what to do to get using all of the Hadoop cluster's resources to process input data optimally. This book starts off with an introduction to MapReduce to learn how it works internally, and discusses the factors that can affect its performance. Then it moves forward to investigate Hadoop metrics and performance tools, and identifies resource weaknesses such as CPU contention, memory usage, massive I/O storage, and network traffic.

This book will teach you, in a step-by-step manner based on real-world experience, how to eliminate your job bottlenecks and fully optimize your MapReduce jobs in a production environment. Also, you will learn to calculate the right number of cluster nodes to process your data, to define the right number of mapper and reducer tasks based on your hardware resources, and how to optimize mapper and reducer task performances using compression technique and combiners.

Finally, you will learn the best practices and recommendations to tune your Hadoop cluster and learn what a MapReduce template class looks like.

What this book covers

Chapter 1, Understanding Hadoop MapReduce, explains how MapReduce works internally and the factors that affect MapReduce performance.

Chapter 2, An Overview of the Hadoop Parameters, introduces Hadoop configuration files and MapReduce performance-related parameters. It also explains Hadoop metrics and several performance monitoring tools that you can use to monitor Hadoop MapReduce activities.

Chapter 3, Detecting System Bottlenecks, explores Hadoop MapReduce performance tuning cycle and explains how to create a performance baseline. Then you will learn to identify resource bottlenecks and weaknesses based on Hadoop counters.

Chapter 4, Identifying Resource Weaknesses, explains how to check the Hadoop cluster's health and identify CPU and memory usage, massive I/O storage, and network traffic. Also, you will learn how to scale correctly when configuring your Hadoop cluster.

Chapter 5, Enhancing Map and Reduce Tasks, shows you how to enhance map and reduce task execution. You will learn the impact of block size, how to reduce spilling records, determine map and reduce throughput, and tune MapReduce configuration parameters.

Chapter 6, Optimizing MapReduce Tasks, explains when you need to use combiners and compression techniques to optimize map and reduce tasks and introduces several techniques to optimize your application code.

Chapter 7, Best Practices and Recommendations, introduces miscellaneous hardware and software checklists, recommendations, and tuning properties in order to use your Hadoop cluster optimally.

What you need for this book

Apache Hadoop framework (http://hadoop.apache.org/) with access to a computer running Hadoop on a Linux operating system.

Who this book is for

If you are an experienced MapReduce user or developer, this book will be great for you. The book can also be a very helpful guide if you are a MapReduce beginner or user who wants to try new things and learn techniques to optimize your applications. Knowledge of creating a MapReduce application is not required, but will help you to grasp some of the concepts quicker and become more familiar with the snippets of MapReduce class template code.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive."

A block of code is set as follows:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

[default]
exten => s,1,Dial(Zap/1|30)
exten => s,2,Voicemail(u100)
exten => s,102,Voicemail(b100)
exten => i,1,Voicemail(s0)

Any command-line input or output is written as follows:

# cp /usr/src/asterisk-addons/configs/cdr_mysql.conf.sample
     /etc/asterisk/cdr_mysql.conf

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes appear in the text like this: "clicking on the Next button moves you to the next screen."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.