When working with the amounts of data that Hadoop was designed to process, it is only a matter of time before even the most robust job runs into unexpected or malformed data. If not handled properly, bad data can easily cause a job to fail. By default, Hadoop will not skip bad data. For some applications, it may be acceptable to skip a small percentage of the input data. Hadoop provides a way to do just that. Even if skipping data is not acceptable for a given use case, Hadoop's skipping mechanism can be used to pinpoint the bad data and log it for review.
To enable the skipping of up to 100 bad records in a map task, add the following to the run() method where the job configuration is set up:

SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
To enable the skipping of up to 100 bad record groups in a reduce task, add the following to the run() method where the job configuration is set up:

SkipBadRecords.setReducerMaxSkipGroups(conf, 100);