Mastering Hadoop

By: Karanth

Overview of this book

Do you want to broaden your Hadoop skill set and take your knowledge to the next level? Do you wish to enhance your knowledge of Hadoop to solve challenging data processing problems? Are your Hadoop jobs, Pig scripts, or Hive queries not working as fast as you intend? Are you looking to understand the benefits of upgrading Hadoop? If the answer is yes to any of these, this book is for you. It assumes novice-level familiarity with Hadoop.
Table of Contents (15 chapters)

MapReduce output


The number of output files a job writes depends on the number of Reduce tasks in the job, because each Reduce task writes its own output file. Some guidelines to optimize outputs are as follows:

  • Compress outputs to save on storage. Compression also helps increase HDFS write throughput, because fewer bytes have to be written and replicated.

  • Avoid writing out-of-band side files as outputs in the Reduce task. If statistical data needs to be collected, the use of Counters is better. Collecting statistics in side files would require an additional step of aggregation.

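  • Choose the compression format with the consumer of the job's output files in mind. If another MapReduce job will read the files, a splittable compression format could be appropriate, because it lets several Map tasks process a single large file in parallel.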

  • Writing large HDFS files with larger block sizes can help subsequent consumers of the data reduce the number of Map tasks they need. This is particularly useful when we cascade MapReduce jobs, where the outputs of one job become the inputs to the next. Writing large files with large block sizes eliminates the need for specialized processing of Map inputs in subsequent jobs.
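
The effect of block size on the number of Map tasks in a downstream job can be illustrated with some simple arithmetic. The following is a minimal sketch, not Hadoop code: the class `SplitEstimate` and the method `estimateMapTasks` are hypothetical names, and the calculation assumes a splittable file that yields exactly one input split (and so one Map task) per HDFS block. Real input formats may compute splits differently.

```java
public class SplitEstimate {

    // Rough number of Map tasks needed to read one splittable file,
    // ASSUMING one input split per HDFS block (a simplification).
    public static long estimateMapTasks(long fileSizeBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        if (fileSizeBytes <= 0) {
            return 0;
        }
        // Ceiling division: a partially filled last block still needs a task.
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        final long mib = 1024L * 1024L;
        final long fileSize = 10L * 1024L * mib; // an illustrative 10 GiB output file

        // Compare a few illustrative block sizes for the same file.
        for (long blockMiB : new long[] {64, 128, 256, 512}) {
            long tasks = estimateMapTasks(fileSize, blockMiB * mib);
            System.out.println(blockMiB + " MiB blocks -> " + tasks + " Map tasks");
        }
    }
}
```

Under these assumptions, each doubling of the block size halves the number of Map tasks the next job in the cascade has to schedule for the same file.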

Speculative execution of tasks

Stragglers are slow-running...