Hadoop Real-World Solutions Cookbook

By: Jonathan R. Owens, Jon Lentz, Brian Femiano

Overview of this book

This book helps developers become more comfortable and proficient at solving problems in the Hadoop space, and familiarizes them with a wide variety of Hadoop-related tools and best practices for implementation.

Hadoop Real-World Solutions Cookbook teaches readers how to build solutions using tools such as Apache Hive, Pig, MapReduce, Mahout, Giraph, HDFS, Accumulo, Redis, and Ganglia.

Hadoop Real-World Solutions Cookbook provides in-depth explanations and code examples. Each chapter contains a set of recipes that pose, then solve, technical challenges, and can be completed in any order. A recipe breaks a single problem down into discrete steps that are easy to follow. The book covers loading and unloading data to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine learning approaches with Mahout, debugging and troubleshooting MapReduce, and columnar storage and retrieval of structured data using Apache Accumulo.

Hadoop Real-World Solutions Cookbook gives readers the examples they need to apply Hadoop technology to their own problems.

Enabling MapReduce jobs to skip bad records


When working with the amounts of data that Hadoop was designed to process, it is only a matter of time before even the most robust job runs into unexpected or malformed data. If not handled properly, bad data can easily cause a job to fail. By default, Hadoop will not skip bad data. For some applications, it may be acceptable to skip a small percentage of the input data. Hadoop provides a way to do just that. Even if skipping data is not acceptable for a given use case, Hadoop's skipping mechanism can be used to pinpoint the bad data and log it for review.

How to do it...

  1. To enable skipping of up to 100 bad records in the map phase of a job, add the following to the run() method where the job configuration is set up:

    SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
  2. To enable skipping of up to 100 bad record groups in the reduce phase, add the following to the run() method where the job configuration is set up (a complete run() sketch follows these steps):

    SkipBadRecords.setReducerMaxSkipGroups(conf, 100);
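
  For context, here is a minimal sketch of a driver whose run() method applies both settings. The SkipDemo class name, job name, and argument handling are illustrative assumptions, not code from this recipe; the sketch uses the older org.apache.hadoop.mapred API (where the SkipBadRecords class lives) and relies on the default identity mapper and reducer so that it stays self-contained:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Hypothetical driver class, used only to show where the skip settings belong.
    public class SkipDemo extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            JobConf conf = new JobConf(getConf(), SkipDemo.class);
            conf.setJobName("skip-bad-records-demo");

            // The old-API defaults (identity mapper/reducer, TextInputFormat)
            // are used here; substitute your own mapper and reducer classes.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Skip up to 100 bad records around a failure in the map phase,
            // and up to 100 bad key groups in the reduce phase.
            SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
            SkipBadRecords.setReducerMaxSkipGroups(conf, 100);

            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new SkipDemo(), args));
        }
    }

  Note that skipping mode only turns on after a task attempt has already failed a configurable number of times, and that skipped records are written back to HDFS so they can be reviewed later.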

How it works...