Data compression involves encoding information using fewer bits than the original representation. Compression plays an important role in big data technologies because it makes both the storage and the transport of data more efficient.
Compressed data occupies fewer bytes, so it saves storage space and makes both disk I/O and network I/O faster. Every optimization has a cost, however, and the cost of compression is the extra CPU cycles spent compressing and decompressing the data.
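The trade-off can be seen with a minimal sketch using Python's standard-library `zlib` as a stand-in codec (Snappy and LZO require third-party libraries); the payload and compression level here are illustrative:

```python
import time
import zlib

# Hypothetical payload: repetitive data, which compresses well.
payload = b"big data big data big data " * 10_000

start = time.perf_counter()
compressed = zlib.compress(payload, level=6)
elapsed = time.perf_counter() - start

# Fewer bytes to store and ship, paid for in CPU time.
print(f"original:   {len(payload)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"cpu time:   {elapsed:.4f}s spent compressing")

# Decompression restores the original bytes, at further CPU cost.
assert zlib.decompress(compressed) == payload
```

The smaller the compressed output relative to the original, the more I/O time is saved per CPU cycle spent, which is why fast codecs are preferred for I/O-bound big data workloads.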
Hadoop splits data into blocks irrespective of whether the data is compressed. However, only a few compression formats are splittable, meaning that each block can be decompressed on its own without reading the rest of the file.
The two most popular compression formats for big data workloads are Lempel-Ziv-Oberhumer (LZO) and Snappy. LZO is splittable, while Snappy is not; Snappy, on the other hand, is a much faster format.
If the compression format is splittable, like LZO, the input file is first split into blocks and then compressed. Since compression happened...
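The difference between the two layouts can be sketched in Python, again using `zlib` as a stand-in codec; the tiny block size and sample data are hypothetical (HDFS blocks are typically 128 MB):

```python
import zlib

BLOCK_SIZE = 64  # hypothetical tiny block size for illustration
data = b"example record " * 100

# Splittable-style layout: split first, then compress each block
# independently, so each block is a self-contained compressed stream.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
compressed_blocks = [zlib.compress(b) for b in blocks]

# Any worker can decompress its own block without the others.
mid = len(compressed_blocks) // 2
assert zlib.decompress(compressed_blocks[mid]) == blocks[mid]

# Non-splittable layout: the whole file is one compressed stream,
# so a reader cannot start decompressing from an arbitrary offset.
whole_stream = zlib.compress(data)
```

Because each block decompresses independently in the first layout, map tasks can process the blocks in parallel; with the single-stream layout, one reader must decompress the entire file.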