Apache Hive Essentials

By: Dayong Du

Sampling

When the data volume is very large, we may need to work on a subset of the data to speed up analysis. Sampling is a technique for selecting and analyzing a subset of data in order to identify patterns and trends. In Hive, there are three ways of sampling data: random sampling, bucket table sampling, and block sampling.

Random sampling uses the RAND() function and the LIMIT keyword to obtain a sample of the data, as shown in the following example. The DISTRIBUTE BY and SORT BY keywords are used here to make sure the data is also randomly distributed among mappers and reducers efficiently. The ORDER BY RAND() statement can achieve the same purpose, but its performance is poor:

SELECT * FROM <Table_Name> DISTRIBUTE BY RAND() SORT BY RAND()
LIMIT <N rows to sample>;
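
As a concrete illustration of the template above, the following sketch fills in the placeholders. The table name employee and the sample size of 5 are hypothetical and chosen only for illustration; they do not come from the original text. The second statement is the ORDER BY RAND() alternative mentioned earlier, included for comparison.

```sql
-- Assumption: a table named employee exists (hypothetical name).
-- Draw 5 rows at random. DISTRIBUTE BY RAND() spreads rows randomly
-- across reducers, and SORT BY RAND() shuffles rows within each reducer.
SELECT *
FROM employee
DISTRIBUTE BY RAND()
SORT BY RAND()
LIMIT 5;

-- Alternative with the same purpose. ORDER BY imposes a global ordering,
-- which Hive performs in a single reducer, so this is typically slower
-- on large tables.
SELECT *
FROM employee
ORDER BY RAND()
LIMIT 5;
```

Because RAND() is unseeded here, each run is expected to return a different set of rows.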

Bucket table sampling is a special sampling method optimized for bucket tables, as shown in the following syntax and example. The colname value specifies the column on which to sample the data. The RAND() function can also be...