Learning Cloudera Impala

By: Avkash Chauhan

Overview of this book

If you have always wanted to crunch billions of rows of raw data on Hadoop in a couple of seconds, then Cloudera Impala is the number one choice for you. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

In this practical, example-oriented book, you will learn everything you need to know about Cloudera Impala so that you can get started on your very own project. The book covers everything about Cloudera Impala from installation, administration, and query processing, all the way to connectivity with other third party applications. With this book in your hand, you will find yourself empowered to play with your data in Hadoop.

As a reader of this book, you will learn about the origin of Impala and the technology behind it that allows it to run on thousands of machines. You will learn how to install, run, manage, and troubleshoot Impala in your own Hadoop cluster using the step-by-step guidance provided in the book. The book covers tenets of data processing such as loading data stored in Hadoop into Impala tables and querying data using Impala SQL statements, all with various code illustrations and a real-world example.

The book is written to get you started with Impala by providing rich information so you can understand what Impala is, what it can do for you, and finally how you can use it to achieve your objective.

Real-time query with Impala on Hadoop


Cloudera, the developer of Impala, markets it as a product that can run real-time queries on Hadoop. Impala is an open source implementation based on the previously mentioned Google Dremel technology, and it is free for anyone to use. You can install it as a packaged product or compile it from source. Impala runs queries in memory, which is what makes them real time. In some cases, depending on the type of data, using the Parquet file format as the input data source can make query processing several times faster.

Real-time query subscriptions with Impala

Cloudera provides a Real-time Query (RTQ) subscription as an add-on to a Cloudera Enterprise subscription. You can still use Impala as a free, open source product; however, the RTQ subscription gives you access to Cloudera's paid service, which extends Impala's usability and resilience. With the RTQ subscription, you not only get access to Cloudera technical support, but you can also work with the Impala development team and provide feedback that helps shape the product's design and implementation.