Mastering Hadoop

By: Sandeep Karanth

File formats

Hive supports a number of file formats out of the box. In this section, we will inspect some of these file formats and their utilities.

Compressed files

For some use cases, storing files in a compressed format within HDFS is advantageous. This strategy not only uses less storage but can also reduce query times, since less data has to be read from disk. Hive can import files stored in GZIP and BZIP2 formats directly into tables. During query execution, these files are decompressed and passed as input to Map tasks. However, a GZIP file cannot be split, so it is processed in its entirety by a single Map task; a BZIP2 file can be split only when the Hadoop BZIP2 codec in use supports splitting.

In practice, data from files stored in these compressed formats is loaded into a table whose underlying data format is a Sequence file. Sequence files can be split even when they are compressed, so their contents can be distributed across different Map tasks.
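
The loading pattern described above can be sketched in HiveQL roughly as follows. This is a minimal illustration, not the book's own listing: the table names raw_logs and raw_logs_seq, the single STRING column, and the file path /tmp/access.log.gz are hypothetical placeholders, and the sketch assumes a Hive version that accepts the io.seqfile.compression.type property discussed in this section.

```sql
-- Staging table stored as plain text. Hive reads the .gz file directly,
-- but each GZIP file is handled by a single Map task.
-- (Table name, column, and path are hypothetical.)
CREATE TABLE raw_logs (line STRING)
  ROW FORMAT DELIMITED LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/access.log.gz' INTO TABLE raw_logs;

-- Target table backed by a Sequence file, which remains splittable
-- even when its contents are compressed.
CREATE TABLE raw_logs_seq (line STRING)
  STORED AS SEQUENCEFILE;

-- Ask Hive to compress query output and choose the Sequence file
-- compression type before copying the rows across.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

INSERT OVERWRITE TABLE raw_logs_seq
SELECT * FROM raw_logs;
```

Subsequent queries would then run against raw_logs_seq rather than the staging table, so that the work can be spread over several Map tasks.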


The io.seqfile.compression.type property tells Hive how the Sequence file should be compressed. It can take two values: RECORD, where each record is compressed, and BLOCK...