Spark Cookbook

By : Rishi Yadav

Using compression to improve performance


Data compression involves encoding information using fewer bits than the original representation. Compression has an important role to play in big data technologies. It makes both storage and transport of data more efficient.

When data is compressed, it becomes smaller, so both disk I/O and network I/O become faster. It also saves storage space. Every optimization has a cost, and the cost of compression comes in the form of added CPU cycles to compress and decompress data.
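The size-versus-CPU tradeoff can be seen with any general-purpose codec. A minimal, self-contained sketch using Python's standard `zlib` module (an illustrative codec, not one of the Hadoop codecs discussed below); the sample log line is invented for the demo:

```python
import time
import zlib

# Sample payload: repetitive text, like typical log data, compresses well.
data = b"2015-05-01 INFO request served in 12ms\n" * 10_000

start = time.perf_counter()
compressed = zlib.compress(data, level=6)
elapsed = time.perf_counter() - start

print(f"original:   {len(data)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"ratio:      {len(data) / len(compressed):.1f}x")
print(f"cpu time:   {elapsed * 1000:.2f} ms")  # the CPU cost of the optimization

# Decompression also spends CPU cycles before the data is usable again.
assert zlib.decompress(compressed) == data
```

Fewer bytes mean less disk and network I/O, but every read and write now pays the compress/decompress cost shown above.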

Hadoop needs to split data into blocks regardless of whether the data is compressed. Only a few compression formats are splittable.

The two most popular compression formats for big data workloads are LZO and Snappy. LZO is splittable, while Snappy is not; Snappy, on the other hand, is a much faster format.
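Spark exposes the codec choice through configuration. A hedged PySpark sketch, assuming the Spark 1.x property name `spark.io.compression.codec` (which governs compression of internal data such as shuffle output) and a hypothetical output path; the Hadoop codec class passed to `saveAsTextFile` controls the compression of the written files:

```python
from pyspark import SparkConf, SparkContext

# Snappy trades compression ratio for speed, which usually pays off for
# short-lived internal data like shuffle output.
conf = (SparkConf()
        .setAppName("compression-demo")
        .set("spark.io.compression.codec", "snappy"))
sc = SparkContext(conf=conf)

# Saving an RDD with an explicit Hadoop output codec:
rdd = sc.parallelize(range(1000)).map(str)
rdd.saveAsTextFile(
    "/tmp/out-snappy",  # hypothetical path
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")
```

Using LZO for the written files instead would require the `hadoop-lzo` libraries to be installed on the cluster.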

If the compression format is splittable, like LZO, the input file is first split into blocks and then compressed. Since compression happens at the block level, decompression can happen...
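Splittability directly affects parallelism: a non-splittable file (for example, a plain `.gz` file) cannot be divided, so the whole file lands in a single partition and is processed by a single task, while a splittable format keeps one task per block. A sketch with hypothetical input paths, assuming an LZO-indexed file and the `hadoop-lzo` libraries on the cluster:

```python
from pyspark import SparkContext

sc = SparkContext(appName="splittability-demo")

# A gzip-compressed text file is not splittable: however large it is,
# Spark reads it as a single partition, serializing the work.
gz = sc.textFile("/data/logs.gz")    # hypothetical path
print(gz.getNumPartitions())          # typically 1

# A splittable format keeps one partition per block, so each block can be
# decompressed and processed by a separate task in parallel.
lzo = sc.textFile("/data/logs.lzo")  # hypothetical path, LZO-indexed
print(lzo.getNumPartitions())         # roughly one per HDFS block
```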