Apache Hive Essentials

By: Dayong Du

Data file optimization


Data file optimization covers performance improvements to the data files themselves, in terms of file format, compression, and storage.

File format

Hive supports the TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats. There are three ways to specify the file format:

  • CREATE TABLE ... STORED AS <File_Format>

  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT <File_Format>

  • SET hive.default.fileformat=<File_Format> --default fileformat for table

Here, <File_Format> is one of TEXTFILE, SEQUENCEFILE, RCFILE, ORC, or PARQUET.
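The three options above can be sketched in HiveQL as follows. This is a minimal illustration, not a recommended setup: the table names (staff_orc, staff_log, staff_default), the columns, and the partition column year are hypothetical, and staff_log is assumed to already exist as a table partitioned by year.

```sql
-- All table and column names here are hypothetical, for illustration only.

-- 1. Declare the file format when the table is created.
CREATE TABLE staff_orc (
  name STRING,
  dept STRING
)
STORED AS ORC;

-- 2. Change the format recorded for an existing table, or for one partition.
--    Assumes staff_log already exists and is partitioned by year.
--    Note: this updates table metadata only; files already stored
--    are not rewritten into the new format.
ALTER TABLE staff_log SET FILEFORMAT SEQUENCEFILE;
ALTER TABLE staff_log PARTITION (year = 2015) SET FILEFORMAT RCFILE;

-- 3. Set the default format for the session, then create a table
--    without a STORED AS clause so that it picks up the default.
SET hive.default.fileformat=ORC;
CREATE TABLE staff_default (
  name STRING
);
```

Running DESCRIBE FORMATTED on one of these tables is one way to check which input and output formats Hive has recorded for it.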

We can load a text file directly into a table stored in the TEXTFILE format. To load data into a table that uses any other file format, we need to load the data into a TEXTFILE format table first, because LOAD DATA only copies or moves files and does not convert them. Then, we use INSERT OVERWRITE TABLE <target_file_format_table> SELECT * FROM <text_format_source_table> to convert the data and insert it into the table with the expected file format.
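The load-then-convert workflow just described might look like the following sketch. The table names (employee_text, employee_orc), the columns, the '|' delimiter, and the local path /tmp/employee.txt are all assumptions made for illustration.

```sql
-- Hypothetical staging table in TEXTFILE format.
-- The '|' field delimiter is an assumption about the input file.
CREATE TABLE employee_text (
  name STRING,
  dept STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Load the raw text file as-is; the path is a placeholder.
LOAD DATA LOCAL INPATH '/tmp/employee.txt'
OVERWRITE INTO TABLE employee_text;

-- Hypothetical target table in the desired file format.
CREATE TABLE employee_orc (
  name STRING,
  dept STRING
)
STORED AS ORC;

-- Rewrite the rows into the target format through a query.
INSERT OVERWRITE TABLE employee_orc
SELECT * FROM employee_text;
```

SELECT * works here only because both tables declare the same columns in the same order; if the schemas differ, the columns would need to be listed explicitly.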

The file formats supported by Hive and their optimizations are as follows...