Learning Cloudera Impala

By: Avkash Chauhan

Overview of this book

If you have always wanted to crunch billions of rows of raw data on Hadoop in a couple of seconds, then Cloudera Impala is the number one choice for you. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

In this practical, example-oriented book, you will learn what you need to know about Cloudera Impala to get started on your own project. The book covers Cloudera Impala from installation, administration, and query processing all the way to connectivity with third-party applications. With this book in your hand, you will find yourself empowered to play with your data in Hadoop.

As a reader of this book, you will learn about the origin of Impala and the technology behind it that allows it to run on thousands of machines. You will learn how to install, run, manage, and troubleshoot Impala in your own Hadoop cluster using the step-by-step guidance provided in the book. The book covers tenets of data processing such as loading data stored in Hadoop into Impala tables and querying data using Impala SQL statements, all with various code illustrations and a real-world example. The book is written to get you started with Impala by explaining what Impala is, what it can do for you, and how you can use it to achieve your objective.

Learning Cloudera Impala

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

Getting Started with Impala

Impala requirements

Installing Impala

Configuring Impala after installation

Impala core components

The Impala execution architecture

Impala security

Impala security guidelines for a higher level of protection

Summary

The Impala Shell Commands and Interface

Using Cloudera Manager for Impala

Launching Impala shell

Connecting impala-shell to the remotely located impalad daemon

Impala-shell command-line options with brief explanations

Impala-shell command reference

Summary

The Impala Query Language and Built-in Functions

Impala SQL language statements

Data types

Operators

Functions

Clauses

Query-specific SQL statements in Impala

Defining VIEWS in Impala

Loading data from HDFS using the LOAD DATA statement

Comments in Impala SQL statements

Built-in function support in Impala

Unsupported SQL statements in Impala

Summary

Impala Walkthrough with an Example

Creating an example scenario

Commands for loading data into Impala tables

Launching the Impala shell

SQL queries against the example database

SQL join operation with the example database

Summary

Impala Administration and Performance Improvements

Impala administration

Impala High Availability

Single point of failure in Impala

Improving performance

Testing query performance

Choosing an appropriate file format and compression type for better performance

Fine-tuning Impala performance

Summary

Troubleshooting Impala

Troubleshooting various problems

Using Cloudera Manager to troubleshoot problems

Summary

Advanced Impala Concepts

Impala and MapReduce

Impala and Hive

Impala and Extract, Transform, Load (ETL)

Why Impala is faster than Hive in query processing

Impala processing strategy

Impala and HBase

File formats and compression types supported in Impala

Processing different file and compression types in Impala

The unsupported features in Impala

Impala resources

Summary

Technology Behind Impala and Integration with Third-party Applications

Technology behind Impala

Data visualization using Impala

Real-time query with Impala on Hadoop

What is new in Impala 1.2.0 (Beta)

Index

What is new in Impala 1.2.0 (Beta)

At the time of writing, Impala 1.2.0 Beta was available for testing with CDH 5.0 (Beta). Impala 1.2.0 adds several user-visible features, along with many under-the-hood changes that improve performance, security, and flexibility. A few notable features are as follows:

Impala supports user-defined functions (UDFs) natively: users can write scalar UDFs and user-defined aggregate functions (UDAs). Functions written in C++ and Java can work with Impala as they are.
Before this release, a REFRESH statement was required after every table-specific SQL command, such as CREATE TABLE, ALTER TABLE, DROP TABLE, INSERT, and LOAD DATA, to propagate the change to the whole cluster. Impala 1.2.0 adds an automatic synchronization mechanism, so REFRESH and INVALIDATE METADATA are no longer needed after such commands issued through Impala. A newly created service (the catalog service) takes charge of pushing table and metadata changes to the whole Impala cluster as they become available.
Another major update is integration with YARN: Impala can use the YARN resource management framework to allocate resources during query processing.
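
The difference between manual REFRESH and the automatic synchronization mechanism described above can be pictured with a small toy model. This is only a conceptual sketch in Python, not Impala code: the `Node` and `Cluster` classes, the `auto_sync` flag, and the table name `sales` are all hypothetical, and the broadcast loop is merely a stand-in for whatever the new service actually does internally.

```python
class Node:
    """Toy stand-in for one Impala daemon with its own metadata cache."""

    def __init__(self, name):
        self.name = name
        self.tables = set()  # table names this node currently knows about

    def refresh(self, metastore):
        # Manual step: reload the local cache from the shared metastore.
        self.tables = set(metastore)


class Cluster:
    """Toy cluster; auto_sync=True models the behavior described for 1.2.0."""

    def __init__(self, node_names, auto_sync):
        self.metastore = set()  # shared source of truth for table names
        self.nodes = {name: Node(name) for name in node_names}
        self.auto_sync = auto_sync

    def create_table(self, via, table):
        # The statement runs on one node, which learns about the table at once.
        self.metastore.add(table)
        self.nodes[via].tables.add(table)
        if self.auto_sync:
            # Hypothetical stand-in for the new service: push to every node.
            for node in self.nodes.values():
                node.refresh(self.metastore)

    def can_see(self, node_name, table):
        return table in self.nodes[node_name].tables


# Without automatic synchronization, other nodes need an explicit refresh.
manual = Cluster(["node1", "node2"], auto_sync=False)
manual.create_table("node1", "sales")
print(manual.can_see("node2", "sales"))  # False
manual.nodes["node2"].refresh(manual.metastore)
print(manual.can_see("node2", "sales"))  # True

# With automatic synchronization, the change is visible everywhere at once.
auto = Cluster(["node1", "node2"], auto_sync=True)
auto.create_table("node1", "sales")
print(auto.can_see("node2", "sales"))  # True
```

In this model the only thing that changes between the two clusters is who triggers the cache reload: the user (by refreshing each node) or a cluster-wide service.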

Tip

According to Cloudera, Impala 1.2.0 Beta is packaged with Cloudera CDH 5.0 (Beta) and only works with Cloudera CDH 5.0. Please visit the following URL for more details:

http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/1.2.0-beta/Cloudera-Impala-Release-Notes/cirn_new_features.html
