
Learning Cloudera Impala

By: Avkash Chauhan

Overview of this book

If you have always wanted to crunch billions of rows of raw data on Hadoop in a couple of seconds, then Cloudera Impala is the number one choice for you. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

In this practical, example-oriented book, you will learn everything you need to know about Cloudera Impala so that you can get started on your very own project. The book covers everything about Cloudera Impala, from installation, administration, and query processing all the way to connectivity with third-party applications. With this book in hand, you will find yourself empowered to play with your data in Hadoop.

As a reader of this book, you will learn about the origin of Impala and the technology behind it that allows it to run on thousands of machines. You will learn how to install, run, manage, and troubleshoot Impala in your own Hadoop cluster using the step-by-step guidance provided in the book. The book covers the tenets of data processing, such as loading data stored in Hadoop into Impala tables and querying data with Impala SQL statements, all with various code illustrations and a real-world example.

The book is written to get you started with Impala by providing rich information so you can understand what Impala is, what it can do for you, and finally how you can use it to achieve your objectives.
Table of Contents (15 chapters)
Learning Cloudera Impala
Credits
About the Author
About the Reviewer
www.PacktPub.com
Preface
Index

The Impala execution architecture


Previously, we discussed the Impala daemon, statestore, and metastore in detail to understand how they work together. Essentially, Impala daemons receive queries from a variety of sources and distribute the query load to Impala daemons running on other nodes. While doing so, each daemon interacts with the statestore for node-specific updates and accesses the metastore, either in the centralized database or in the local cache. To complete the picture of Impala execution, we will now discuss how Impala interacts with the other components, that is, Hive, HDFS, and HBase.

Working with Apache Hive

We discussed earlier that Impala uses a centralized database (MySQL or PostgreSQL) as its metastore, and Hive stores the same kind of metadata in that same database. Impala also provides the same SQL-like query interface used in Apache Hive. Because Impala and Hive share the same metastore database, Impala can access Hive table definitions, as long as the Hive table uses file formats, compression codecs, and column data types that Impala supports.
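As a sketch of this sharing in practice, the following statements create a table in Hive and then query it from Impala. The table and column names are illustrative, not from the book, and the metadata-reload statement assumes an Impala release that supports INVALIDATE METADATA (earlier releases used REFRESH):

```sql
-- In the Hive shell: create a table whose definition lands in the shared metastore.
CREATE TABLE page_visits (
  visit_id   BIGINT,
  page_url   STRING,
  visit_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- In impala-shell: ask Impala to reload metadata from the shared metastore,
-- then query the Hive-defined table directly.
INVALIDATE METADATA;
SELECT COUNT(*) FROM page_visits;
```

Because both systems read the same metastore, no table definition has to be repeated on the Impala side.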

Apache Hive extends the range of file formats available to Impala. When formats other than a text file are used, that is, RCFile, Avro, or SequenceFile, the data must first be loaded through Hive; Impala can then query the data in these file formats. Impala can read more types of data with the SELECT statement than it can write with the INSERT statement. The ANALYZE TABLE statement in Hive generates useful table and column statistics, and Impala uses these valuable statistics to optimize queries.
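A minimal sketch of this workflow, with illustrative names (`page_visits` is assumed to be an existing text-format table): the non-text table is created and populated in Hive, statistics are gathered there, and Impala then reads the result.

```sql
-- In the Hive shell: create and populate a SequenceFile table
-- (a format Impala can read but cannot write).
CREATE TABLE visits_seq STORED AS SEQUENCEFILE
AS SELECT * FROM page_visits;

-- Still in Hive: generate table and column statistics for the optimizer.
ANALYZE TABLE visits_seq COMPUTE STATISTICS;
ANALYZE TABLE visits_seq COMPUTE STATISTICS FOR COLUMNS page_url;

-- In impala-shell: reload metadata, then read the Hive-loaded data.
INVALIDATE METADATA;
SELECT page_url, COUNT(*) AS visits FROM visits_seq GROUP BY page_url;
```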

Working with HDFS

Impala table data is stored as regular data files in HDFS, and Impala uses HDFS as its primary data storage medium. As soon as a data file or a collection of files is available in the directory of a table, Impala reads all of the files regardless of their names; when Impala itself writes new data, it controls the names of the files it creates. HDFS provides data redundancy through its replication factor, and Impala relies on this redundancy to access data on other DataNodes when it is not available on a specific DataNode. As we learned earlier, Impala also maintains information on the physical location of the blocks of its data files in HDFS, which helps data access in case of node failure.
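Because a table is ultimately just a directory of files in HDFS, an external table can be pointed at an existing location. A hedged sketch, with an illustrative path and table name:

```sql
-- In impala-shell: an external table whose data is whatever files sit
-- in the given HDFS directory (the path is illustrative).
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/user/impala/raw_logs';

-- After copying new files into that directory (for example, with hdfs dfs -put),
-- a per-table refresh makes them visible to queries:
REFRESH raw_logs;
SELECT COUNT(*) FROM raw_logs;
```

The files keep whatever names they arrived with; Impala simply reads everything in the directory.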

Working with HBase

HBase is a distributed, scalable, big data storage system that provides random, real-time read and write access to data stored on HDFS. HBase, a database storage system, sits on top of HDFS; however, unlike traditional database storage systems, HBase does not provide built-in SQL support. Third-party applications can provide such functionality.

To use HBase, the user first defines tables in Impala and maps them to the equivalent HBase tables. Once this table relationship is established, users can submit queries against the HBase table through Impala. Join operations can also be performed that combine HBase tables with Impala tables.
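The mapping itself is typically defined through Hive's HBase storage handler. A hedged sketch follows; the table names, column family, and qualifiers are illustrative, and `visits` stands in for some existing HDFS-backed table with a matching `user_id` column:

```sql
-- In the Hive shell: map a table onto an existing HBase table named 'users'.
CREATE EXTERNAL TABLE hbase_users (
  user_id STRING,   -- mapped to the HBase row key
  name    STRING,
  email   STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key,info:name,info:email'
)
TBLPROPERTIES ('hbase.table.name' = 'users');

-- In impala-shell: after INVALIDATE METADATA, the HBase-backed table can be
-- queried directly or joined with an HDFS-backed table.
SELECT v.page_url, u.name
FROM visits v JOIN hbase_users u ON (v.user_id = u.user_id);
```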

Note

To learn more about using HBase with Impala, please visit the Cloudera website at the following link for extensive documentation:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_hbase.html