Apache Hive Essentials

By: Dayong Du

Job and query optimization

Job and query optimization covers techniques for improving performance in four areas: job-running mode, JVM reuse, parallel job execution, and JOIN query optimization.

Local mode

Hadoop can run in standalone, pseudo-distributed, and fully distributed modes. Most of the time, Hadoop is configured to run in fully distributed mode. When the data to process is small, distributed processing adds overhead, because launching a fully distributed job can take longer than processing the data itself. Since Hive 0.7.0, Hive can automatically convert a job to run in local mode with the following settings:

jdbc:hive2://> SET hive.exec.mode.local.auto=true; --default false
jdbc:hive2://> SET hive.exec.mode.local.auto.inputbytes.max=50000000;
jdbc:hive2://> SET hive.exec.mode.local.auto.input.files.max=5; --default 4
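
As a rough illustration of how these properties might be used in a session, the sketch below first prints the current values (in Hive, SET with a property name and no value displays the current setting) and then enables automatic local mode before running a small query. The table name small_lookup is hypothetical and is assumed to be small enough to fall under the thresholds; it is not part of the original example.

```sql
-- Show the current value of each local-mode property
SET hive.exec.mode.local.auto;
SET hive.exec.mode.local.auto.inputbytes.max;
SET hive.exec.mode.local.auto.input.files.max;

-- Enable automatic local mode for the current session only
SET hive.exec.mode.local.auto=true;

-- small_lookup is a hypothetical table, assumed to hold only a few small files
SELECT COUNT(*) FROM small_lookup;

-- Optionally switch back to the default behavior afterwards
SET hive.exec.mode.local.auto=false;
```

Because SET applies to the current session, other sessions keep their own values for these properties.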

A job must satisfy the following conditions to run in local mode:

The total input size of the job is lower...