Summary
Hive, through its query language HiveQL, brings in SQL and Relational database concepts to Hadoop. The primary use case for Hive is data warehousing and analytical querying for applications such as Business Intelligence. The supporting components of Hive are built to assist this use case. For example, row-columnar file formats are very efficient when performing aggregations on columns.
The key takeaways from this chapter are as follows:
In Hive, a close look has to be kept on the file format used by the underlying table. Text files can be inefficient. Sequence files are better off as they are compressed. Specialized files such as RC and ORC are more suited both in terms of I/O and query performance.
Compression brings in efficiency. Both intermediate and final outputs can be compressed. It is better to avoid compression techniques such as GZIP that cannot be split. Snappy is an alternative compression technique that can be split.
Partitioning large tables is good practice. This helps...