Apache Hive Cookbook

Overview of this book

Hive was developed at Facebook and later open sourced through the Apache community. It provides an SQL-like interface, known as HiveQL, for running queries on Big Data frameworks. HiveQL supports many SQL capabilities, including analytical functions, which are in high demand in today’s Big Data world. This book provides easy installation steps for the different types of metastores supported by Hive, along with simple, easy-to-learn recipes for configuring Hive clients and services. You will also learn about Hive optimizations, including partitioning and bucketing, and the book explains the source code of the latest Hive version. HiveQL is also used by other frameworks, including Spark. Toward the end, you will cover the integration of Hive with these frameworks.

Hive packages

This recipe describes the modules that make up the Hive source packages.

Getting ready

Hive source consists of different modules categorized by the features they provide or as a submodule of some other module.
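
One way to see this module layout for yourself, assuming a Maven-based checkout of the Hive source, is to read the `<modules>` section of the top-level `pom.xml`. The following is a minimal sketch rather than part of Hive itself: the function name `list_modules` and the trimmed `SAMPLE_POM` are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def list_modules(pom_xml):
    """Return the module names declared in a POM's top-level <modules> section."""
    root = ET.fromstring(pom_xml)
    return [
        module.text.strip()
        for module in root.findall("m:modules/m:module", POM_NS)
        if module.text
    ]


# Hypothetical, heavily trimmed POM used only for illustration;
# a real Hive pom.xml declares many more modules and settings.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modules>
    <module>beeline</module>
    <module>cli</module>
    <module>common</module>
  </modules>
</project>
"""

if __name__ == "__main__":
    for name in list_modules(SAMPLE_POM):
        print(name)
```

To try it against a real checkout, read the contents of the root `pom.xml` into a string and pass that string to `list_modules`. Directories that are not Maven modules, such as script or configuration folders, would not appear in the result.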

How to do it...

The following is the list of Hive modules and their usage in Hive:

accumulo-handler: Apache Accumulo is a distributed key-value datastore based on Google's Bigtable. This package includes the components responsible for mapping a Hive table to an Accumulo table. AccumuloStorageHandler and AccumuloPredicateHandler are the main classes responsible for mapping tables. For more information, refer to the official integration documentation available at https://cwiki.apache.org/confluence/display/Hive/AccumuloIntegration.
ant: This tool is used to build earlier versions of Hive source. Ant is also needed to configure the Hive Web Interface server.
beeline: A Hive client used to connect with HiveServer2 and run Hive queries.
bin: This package includes scripts to start Hive clients and services.
cli: This is a Hive Command-line Interface implementation.
common: These are utility classes used by other modules.
conf: This contains default configurations and user-defined configuration objects.
contrib: This contains SerDes, generic UDFs, and file formats contributed to Hive by third parties.
hbase-handler: This module allows Hive SQL statements to access HBase tables for SELECT and INSERT commands. It also provides interfaces to access HBase and Hive tables for join and union in a single query. More information is available at https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration.
hcatalog: This is a table management framework that helps other frameworks such as Pig or MapReduce to access the Hive metastore and table schema.
hwi: This module provides an implementation of a web interface to run Hive queries. Also, the WebHCat APIs provide REST APIs to access the Hive metastore.
jdbc: This is the connector that accepts JDBC connections and calls in order to execute Hive queries on the cluster.
metastore: This is the API that provides access to metastore entities, including databases, tables, schemas, and SerDes.
odbc: This module implements the Open Database Connectivity (ODBC) API, enabling ODBC applications to connect and execute queries over Hive.
ql: This module provides an interface to clients that checks for query semantics and provides an implementation for driver, parser, and query planner.
serde: This module implements the serializers and deserializers that Hive uses to read and write data. It helps in validating and parsing record and field types.
shims: This is the module that transparently intercepts and modifies calls to the Hive API, usually for compatibility purposes.
spark-client: This module provides an interface to execute Hive SQL queries on the Spark framework.
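
To make the serde entry in the list above more concrete, here is a toy sketch of what a serializer/deserializer pair does: it converts between a stored text record and a typed row, validating field counts and types along the way. This is not Hive's SerDe API, which is written in Java and is far more general. The schema and the function names `deserialize` and `serialize` are hypothetical; the only detail taken from Hive is that its default text format separates fields with the Ctrl-A character (`\x01`).

```python
# Hive's default text format separates fields with Ctrl-A (\x01).
FIELD_DELIM = "\x01"

# Hypothetical two-column schema: (column name, Python type used to parse it).
SCHEMA = [("id", int), ("name", str)]


def deserialize(line, schema=SCHEMA, delim=FIELD_DELIM):
    """Turn one stored text record into a typed row (a dict keyed by column name)."""
    fields = line.rstrip("\n").split(delim)
    if len(fields) != len(schema):
        raise ValueError(
            "expected %d fields, got %d" % (len(schema), len(fields))
        )
    # Applying each column's type validates the raw text;
    # int("abc") raises ValueError, for example.
    return {name: typ(raw) for (name, typ), raw in zip(schema, fields)}


def serialize(row, schema=SCHEMA, delim=FIELD_DELIM):
    """Turn a typed row back into one stored text record."""
    return delim.join(str(row[name]) for name, _ in schema)


if __name__ == "__main__":
    row = deserialize("1\x01alice\n")
    print(row)
    print(repr(serialize(row)))
```

Keeping the schema separate from the parsing logic mirrors the idea that the same stored bytes can be read through different table definitions, though the real serde module handles many more types and formats than this sketch does.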