Apache Hive Essentials


By: Dayong Du

SerDe

SerDe stands for Serializer and Deserializer. It is the mechanism that Hive uses to process records and map them to the column data types of Hive tables. To explain when a SerDe is used, we first need to understand how Hive reads and writes data.

The process to read data is as follows:

Data is read from HDFS.
Data is processed by the INPUTFORMAT implementation, which defines the input data splits and the key/value records. In Hive, we can use CREATE TABLE ... STORED AS <FILE_FORMAT> (see Chapter 7, Performance Considerations, for available file formats) to specify which INPUTFORMAT the table uses.
The Java Deserializer class defined in the SerDe is called to format the data into a record that maps to the columns and data types of the table.

As an example of reading data, we can use a JSON SerDe to read TEXTFILE format data from HDFS and translate the JSON attributes and values on each line into a row of a Hive table with the correct schema.
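
The read path described above can be sketched in HiveQL as follows. This is a minimal illustration, not a definitive setup: the table name employee_json and its columns are hypothetical, and it assumes that the HCatalog JSON SerDe class (org.apache.hive.hcatalog.data.JsonSerDe) is available on Hive's classpath. Depending on the Hive version and distribution, the corresponding JAR may first have to be registered with ADD JAR.

```sql
-- Hypothetical table and columns, for illustration only.
-- Each line of the underlying text file is assumed to hold one JSON object,
-- for example: {"name":"Michael","age":30,"skills":["Hive","HDFS"]}
CREATE TABLE employee_json (
  name   STRING,
  age    INT,
  skills ARRAY<STRING>
)
-- The SerDe whose Deserializer maps each JSON record to the columns above.
-- Assumes this class is on the classpath.
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
-- TEXTFILE selects the INPUTFORMAT that splits the file into line records.
STORED AS TEXTFILE;

-- Show which SerDe library and INPUTFORMAT the table is bound to.
DESCRIBE FORMATTED employee_json;

-- On read, every JSON line is deserialized into one row with typed columns.
SELECT name, age, skills[0] FROM employee_json;
```

Here STORED AS TEXTFILE covers the INPUTFORMAT step, while ROW FORMAT SERDE covers the deserialization step. JSON attributes are matched to columns by name, so the column names in the table definition should correspond to the keys in the data.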

The process to write data is as follows:

Data (such as using an INSERT...