A typical MapReduce job uses parallel map tasks to load data from external storage devices, such as hard drives, into main memory. At the end of the job, the reduce tasks write the results back to the hard drive. Thus, over the life cycle of a MapReduce job, many data copies are made as data is relayed between the hard drive and main memory, and sometimes the data is also copied over the network from a remote node.
Copying data to and from hard drives and transferring it over the network are expensive operations. To reduce their cost, Hadoop supports compression of the data.
Data compression in Hadoop is handled by a compression codec, a program that encodes and decodes data streams. Although compression and decompression add CPU overhead, the savings in disk I/O and network traffic usually far outweigh this cost.
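To make the codec idea concrete, the following is a minimal, self-contained sketch of how a codec encodes and decodes a byte stream through Hadoop's CompressionCodec interface. It uses GzipCodec, which ships with Hadoop; the class and variable names here are illustrative, not part of any prescribed configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    import java.io.*;

    public class CodecExample {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // Instantiate a codec; GzipCodec is bundled with Hadoop.
            CompressionCodec codec =
                    ReflectionUtils.newInstance(GzipCodec.class, conf);

            byte[] original = "hello, hadoop compression".getBytes("UTF-8");

            // Encode: wrap an output stream with the codec's compressor.
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (OutputStream out = codec.createOutputStream(compressed)) {
                out.write(original);
            }

            // Decode: wrap an input stream with the codec's decompressor.
            ByteArrayInputStream in =
                    new ByteArrayInputStream(compressed.toByteArray());
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            try (InputStream decompressed = codec.createInputStream(in)) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = decompressed.read(buf)) != -1) {
                    restored.write(buf, 0, n);
                }
            }
            System.out.println(restored.toString("UTF-8"));
        }
    }

The same createOutputStream and createInputStream calls are what the framework performs internally when it compresses map output or job output with the configured codec.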
In this section, we will outline steps to configure data compression on a Hadoop cluster.
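Before the step-by-step configuration, the following sketch gives a feel for the kind of job-level properties involved. It assumes the Hadoop 2.x (MRv2) property names; older releases use the mapred.* equivalents, and SnappyCodec additionally requires the native Snappy library to be installed on the cluster nodes.

    import org.apache.hadoop.conf.Configuration;

    public class CompressionConfigSketch {
        public static Configuration withCompression() {
            Configuration conf = new Configuration();
            // Compress intermediate map output to cut shuffle traffic.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.set("mapreduce.map.output.compress.codec",
                     "org.apache.hadoop.io.compress.SnappyCodec");
            // Compress the final output written by the reduce tasks.
            conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
            conf.set("mapreduce.output.fileoutputformat.compress.codec",
                     "org.apache.hadoop.io.compress.GzipCodec");
            return conf;
        }
    }

The same properties can instead be set cluster-wide in the configuration files, which is what the steps in this section cover.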