Apache Hive Cookbook

Overview of this book

Hive was developed by Facebook and later open sourced through the Apache community. It provides a SQL-like interface for running queries on Big Data frameworks. Its query language, HiveQL, offers SQL capabilities such as analytical functions, which are in heavy demand in today's Big Data world. This book provides easy installation steps for the different types of metastores supported by Hive, along with simple recipes for configuring Hive clients and services. You will also learn about Hive optimizations, including partitioning and bucketing. The book also explains the source code of the latest Hive version. Because the Hive Query Language is used by other frameworks, including Spark, the book closes by covering the integration of Hive with these frameworks.

Installing Hive

We will now take a look at installing Hive along with all the prerequisites.

Getting ready

Let's download the stable version from one of the mirrors:

$ wget http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz

How to do it…

Hive can be installed in three ways, which differ in how the metastore is set up: embedded, local, or remote.

Hive with an embedded metastore

Once you have downloaded the Hive tarball, installing and setting up Hive is simple and straightforward. Extract the compressed tar file:

$ tar -xzvf apache-hive-1.2.1-bin.tar.gz

Export the location where Hive is extracted as the environment variable HIVE_HOME:

$ cd apache-hive-1.2.1-bin
$ export HIVE_HOME=$(pwd)

Hive keeps all its scripts in the $HIVE_HOME/bin directory. Add this location to the PATH environment variable so that you can run the scripts directly from any location on the command line:

$ export PATH=$HIVE_HOME/bin:$PATH
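Before starting Hive, it can help to confirm that both exports took effect. The following is a minimal sketch, not part of the Hive distribution: check_hive_env is a hypothetical helper name, and it assumes the launcher script sits at bin/hive under the directory you pass in.

```shell
# Hypothetical helper (not shipped with Hive): sanity-check the two exports.
check_hive_env() {
  hive_home="$1"
  if [ -z "$hive_home" ]; then
    echo "HIVE_HOME is not set"
    return 1
  fi
  # Assumption: the launcher script lives at <HIVE_HOME>/bin/hive.
  if [ ! -x "$hive_home/bin/hive" ]; then
    echo "no executable launcher at $hive_home/bin/hive"
    return 1
  fi
  # Wrap PATH in colons so the first and last entries match as well.
  case ":$PATH:" in
    *":$hive_home/bin:"*) echo "ok" ;;
    *) echo "$hive_home/bin is not on PATH"; return 1 ;;
  esac
}
```

After defining the function in your shell, call it as check_hive_env "$HIVE_HOME"; it prints ok when the launcher exists and its directory is on the PATH.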

Alternatively, to set the Hive path permanently for the user, add the Hive environment variables to the .bashrc or .bash_profile file in the user's home folder (create the file if it does not exist):

Add the following to ~/.bash_profile:

export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin

Here, hduser is the name of the user you are logged in as, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file. Run Hive from a terminal:
```
hive
```
Make sure that the Hive node can reach the Hadoop cluster: either install Hive on one of the Hadoop nodes, or make the Hadoop configuration available on the node's classpath.
This installation uses the embedded Derby database and stores the metastore data on the local filesystem. Only one Hive session can be open on the node at a time.
If another user tries to run the Hive shell while a session is open, the second session fails with the Failed to start database 'metastore_db' error.
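To see whether an embedded metastore is already in use before starting a second shell, you could look for Derby's lock file. This is a rough sketch under two assumptions: with the default embedded configuration, the metastore_db directory is created in the directory from which hive was launched, and Derby keeps a db.lck file inside it while the database is open (a stale file may remain after a crash). The function name derby_session_state is hypothetical.

```shell
# Hypothetical helper: guess whether an embedded Derby metastore is in use.
# Assumption: Derby holds metastore_db/db.lck while a JVM has the database open.
derby_session_state() {
  dir="${1:-.}"   # directory from which hive was started; defaults to "."
  if [ ! -d "$dir/metastore_db" ]; then
    echo "no metastore_db in $dir"
  elif [ -e "$dir/metastore_db/db.lck" ]; then
    echo "in use"
  else
    echo "free"
  fi
}
```

Running derby_session_state from the directory where Hive was started prints in use, free, or a message saying that no metastore_db directory exists there.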

Run a few Hive queries to test the installation:

hive> SHOW TABLES;
hive> CREATE TABLE sales(id INT, product String, age INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
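The same check can be run without opening the interactive shell by passing the statements to the hive command's -e option. The wrapper below is only a sketch: hive_smoke_test is a hypothetical name, and the IF NOT EXISTS clause and the closing DESCRIBE statement are additions that make the check safe to repeat.

```shell
# Hypothetical wrapper: run the test statements non-interactively.
hive_smoke_test() {
  if ! command -v hive >/dev/null 2>&1; then
    echo "hive not found on PATH"
    return 1
  fi
  # -e executes the quoted statements and then exits.
  # IF NOT EXISTS (an addition to the recipe) keeps repeated runs from failing.
  hive -e "SHOW TABLES;
CREATE TABLE IF NOT EXISTS sales(id INT, product STRING, age INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
DESCRIBE sales;"
}
```

Call hive_smoke_test after the PATH has been exported; if the launcher cannot be found, the function reports that instead of failing with a shell error.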

Logs are generated on a per-user basis in the /tmp/<username> folder.

Hive with a local metastore

Follow these steps to configure Hive with the local metastore. Here, we are using the MySQL database as a metastore:

Add the following to ~/.bash_profile:
```
export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
```
Here, hduser is the user name, and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.
Install a SQL database such as MySQL on the same machine where you want to run Hive.
On Ubuntu, MySQL can be installed by running the following command in the node's terminal:
```
sudo apt-get install mysql-server
```
In the case of MySQL, Hive needs the mysql-connector jar. Download the mysql-connector archive from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz, extract it, and copy the connector jar it contains to the lib folder of your Hive home directory.
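Once the archive has been downloaded, unpacking it and copying the jar can be scripted. The following is a minimal sketch: install_connector is a hypothetical helper, and it looks for any file matching mysql-connector-java-*.jar rather than assuming the exact jar name inside the archive.

```shell
# Hypothetical helper: unpack a Connector/J archive and copy its jar to Hive.
install_connector() {
  archive="$1"      # path to the downloaded .tar.gz
  hive_home="$2"    # Hive installation directory
  workdir=$(mktemp -d) || return 1
  status=1
  if tar -xzf "$archive" -C "$workdir"; then
    # Take the first connector jar found, whatever its exact file name is.
    jar=$(find "$workdir" -type f -name 'mysql-connector-java-*.jar' | head -n 1)
    if [ -n "$jar" ] && [ -d "$hive_home/lib" ]; then
      cp "$jar" "$hive_home/lib/" && echo "copied $(basename "$jar")" && status=0
    else
      echo "connector jar or $hive_home/lib not found"
    fi
  fi
  rm -rf "$workdir"   # clean up the temporary extraction directory
  return "$status"
}
```

For example, install_connector mysql-connector-java-5.1.35.tar.gz "$HIVE_HOME" prints the name of the jar that was copied, or an explanation if nothing was copied.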

Create a file, hive-site.xml, in the conf folder of Hive and add the following entries to it:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hduser</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwd</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>
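The configuration above assumes that a MySQL account matching ConnectionUserName and ConnectionPassword already exists and is allowed to create and use the metastore_db database. The recipe does not show that step, so the following is only a sketch of one way to do it: metastore_grant_sql is a hypothetical helper that prints the SQL statements, and the 'localhost' host part is an assumption that fits the local metastore setup.

```shell
# Hypothetical helper: print the SQL that creates the metastore account.
# Assumption: Hive connects from localhost, as in the local metastore setup.
metastore_grant_sql() {
  user="$1"
  password="$2"
  cat <<EOF
CREATE USER '$user'@'localhost' IDENTIFIED BY '$password';
GRANT ALL PRIVILEGES ON metastore_db.* TO '$user'@'localhost';
EOF
}
```

Review the output first, then pipe it to the MySQL client as an administrative user, for example metastore_grant_sql hduser passwd | mysql -u root -p.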

Run Hive from the terminal:
```
hive
```

Note

There is a known JLine jar conflict between Hadoop 2.6.0 and Hive 1.2.1. If you get the error "unable to load class jline.terminal", remove the older version of the jline jar from the YARN lib folder using the following command:

sudo rm -r $HADOOP_PREFIX/share/hadoop/yarn/lib/jline-0.9.94.jar
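Before deleting anything, you may want to see which jline jars are actually present on each side. The helper below is a small sketch with a hypothetical name, list_jline_jars; it simply lists files matching jline-*.jar in the directories you give it.

```shell
# Hypothetical helper: list jline jars found under the given directories.
list_jline_jars() {
  for dir in "$@"; do
    [ -d "$dir" ] || continue   # silently skip directories that do not exist
    find "$dir" -type f -name 'jline-*.jar'
  done
}
```

A typical call would be list_jline_jars "$HADOOP_PREFIX/share/hadoop/yarn/lib" "$HIVE_HOME/lib", which shows the competing versions side by side.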

Hive with a remote metastore

Follow these steps to configure Hive with a remote metastore.

Download the latest version of Hive from http://a.mbbsindia.com/hive/hive-1.2.1/apache-hive-1.2.1-bin.tar.gz.

Extract the package:

tar -xzvf apache-hive-1.2.1-bin.tar.gz

Open ~/.bash_profile in an editor (for example, with nano ~/.bash_profile) and add the following:
```
export HIVE_HOME=/home/hduser/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
```
Here, hduser is the user name and apache-hive-1.2.1-bin is the Hive directory extracted from the tar file.
Install a SQL database such as MySQL on a remote machine to be used for the metastore.
For Ubuntu, MySQL can be installed with the following command:
```
sudo apt-get install mysql-server
```
In the case of MySQL, Hive needs the mysql-connector jar file. Download the mysql-connector archive from http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.35.tar.gz, extract it, and copy the connector jar it contains to the lib folder of your Hive home directory.

Add the following entries to hive-site.xml:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://<ip_of_remote_host>:3306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hduser</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>passwd</value>
    <description>password for connecting to mysql server</description>
  </property>
</configuration>
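The local and remote configurations differ only in the database host, so the file can be generated from a few parameters. The following is a minimal sketch: write_hive_site is a hypothetical helper, the description elements are left out for brevity, and the values are written as-is, so a password containing XML special characters such as & or < would need escaping first.

```shell
# Hypothetical helper: generate hive-site.xml for a MySQL-backed metastore.
# Values are inserted verbatim; XML special characters are not escaped.
write_hive_site() {
  host="$1"; user="$2"; password="$3"; target="$4"
  cat > "$target" <<EOF
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://$host:3306/metastore_db?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>$user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>$password</value>
  </property>
</configuration>
EOF
}
```

For instance, write_hive_site 192.0.2.10 hduser passwd "$HIVE_HOME/conf/hive-site.xml" writes the remote variant, where 192.0.2.10 is a placeholder address standing in for your database host.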

From the $HIVE_HOME directory, start the Hive metastore service in the background:
```
bin/hive --service metastore &
```
Run Hive from the terminal:
```
hive
```
By default, the Hive metastore service listens on port 9083. Check that the port is open:
```
netstat -an | grep 9083
```
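A plain grep for 9083 also matches established connections and unrelated lines that merely contain those digits. A slightly stricter check can be sketched as follows; is_listening is a hypothetical helper that reads netstat output on standard input and only accepts lines in the LISTEN state.

```shell
# Hypothetical helper: read `netstat -an` output on stdin and report
# whether some socket is in state LISTEN on the given port.
is_listening() {
  port="$1"
  # The port must end an address field (":9083" or ".9083") and the
  # line must carry the LISTEN state.
  if grep -E "[.:]${port}[[:space:]].*LISTEN" >/dev/null; then
    echo "listening"
  else
    echo "not listening"
    return 1
  fi
}
```

Use it as netstat -an | is_listening 9083 once the metastore service has had a few seconds to start.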
Start the Hive shell and make sure that Data Definition Language (DDL) and Data Manipulation Language (DML) operations work by creating tables in Hive.

Note

There is a known JLine jar conflict between Hadoop 2.6.0 and Hive 1.2.1. If you get the error "unable to load class jline.terminal", remove the older version of the jline jar from the YARN lib folder using the following command:

sudo rm -r $HADOOP_PREFIX/share/hadoop/yarn/lib/jline-0.9.94.jar
