Apache Hive Essentials

By: Dayong Du

Hive partitions

By default, a simple query in Hive scans the whole table, which slows down queries against large tables. Hive partitions, which are very similar to partitions in an RDBMS, address this problem. Each partition corresponds to a particular value (or combination of values) of the predefined partition column(s) and is stored as a subdirectory of the table's directory in HDFS. When a query filters on the partition columns, only the required partitions (directories) are read, so the I/O and the query time are greatly reduced. It is easy to define partitions when the table is created and then check which partitions exist, as follows:

--Create partitions when creating tables
jdbc:hive2://> CREATE TABLE employee_partitioned
. . . . . . .> (
. . . . . . .>   name string,
. . . . . . .>   work_place ARRAY<string>,
. . . . . . .>   sex_age STRUCT<sex:string,age:int>,
. . . . . . .>   skills_score MAP<string,int>,
. . . . . . .>   depart_title...
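
The partition mechanism described above can also be sketched on a smaller, self-contained example. The table page_views_partitioned and its columns (user_id, url, view_year, view_month) are hypothetical names chosen only for illustration; they do not come from the original text. The directory path in the comment assumes Hive's usual key=value naming for partition subdirectories.

```sql
-- Hypothetical table used only for illustration.
-- The partition columns are declared in PARTITIONED BY, not in the column list.
CREATE TABLE page_views_partitioned (
  user_id string,
  url string
)
PARTITIONED BY (view_year int, view_month int);

-- Register two partitions explicitly. Each one is expected to map to a
-- subdirectory under the table's directory, for example
-- .../page_views_partitioned/view_year=2014/view_month=11
ALTER TABLE page_views_partitioned ADD
  PARTITION (view_year=2014, view_month=11)
  PARTITION (view_year=2014, view_month=12);

-- List the partitions that currently exist for the table.
SHOW PARTITIONS page_views_partitioned;

-- Filtering on the partition columns lets Hive read only the matching
-- subdirectory instead of scanning the whole table.
SELECT user_id, url
FROM page_views_partitioned
WHERE view_year = 2014 AND view_month = 12;
```

The partition columns behave like ordinary columns in queries, but their values are taken from the directory names rather than stored in the data files, which is why a filter on them can skip entire directories.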