Learning Cloudera Impala

By: Avkash Chauhan

Overview of this book

If you have always wanted to crunch billions of rows of raw data on Hadoop in a couple of seconds, then Cloudera Impala is the number one choice for you. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

In this practical, example-oriented book, you will learn everything you need to know about Cloudera Impala so that you can get started on your very own project. The book covers everything about Cloudera Impala from installation, administration, and query processing, all the way to connectivity with other third party applications. With this book in your hand, you will find yourself empowered to play with your data in Hadoop.

As a reader of this book, you will learn about the origin of Impala and the technology behind it that allows it to run on thousands of machines. You will learn how to install, run, manage, and troubleshoot Impala in your own Hadoop cluster using the step-by-step guidance provided in the book. The book covers tenets of data processing such as loading data stored in Hadoop into Impala tables and querying data using Impala SQL statements, all with various code illustrations and a real-world example.

The book is written to get you started with Impala by providing rich information so you can understand what Impala is, what it can do for you, and finally how you can use it to achieve your objective.

Preface

The changing landscape of Big Data, and the tools created to make sense of it, have become crucial in today's tech industry. The ability to understand and become familiar with such tools allows individuals to make decisions creatively, intelligently, and with precision. If you've always wanted to crunch billions of rows of raw data on Hadoop in a couple of seconds, Cloudera Impala is, hands down, the top choice for you. Cloudera Impala provides a way to ingest various formats of data stored on Hadoop and provides a query engine to process that data and extract important insight.

In this book, Learning Cloudera Impala, you are going to learn everything you need to know about Cloudera Impala so that you can start your project. The book covers Cloudera Impala from installation, administration, and query processing, all the way up to connectivity with other third-party applications. With this book in your hand, you will find yourself empowered to play with your data in Hadoop, and getting insight from your data will look like an interesting game to you.

What this book covers

Chapter 1, Getting Started with Impala, covers information on Impala, its core components, and its inner workings in detail. We will cover the Impala execution architecture, including the daemon and statestore, and how they interact with the other components. Impala metadata and the metastore are also discussed here to explain how Impala maintains its information. Finally, we will study various ways to interface with Impala.

Chapter 2, The Impala Shell Commands and Interface, explains the various command options to interact with Impala, mainly using command-line references. In this chapter, we cover the Impala command-line interface and explain the various ways the Impala shell can connect to the Impala daemon. Once the connection between the Impala shell and impalad is established, we can use the various commands discussed in the chapter to interact with Impala.

Chapter 3, The Impala Query Language and Built-in Functions, teaches us how to use the Impala shell to interact with data through the Impala Query Language, which is based on SQL and provides a great degree of compatibility with HiveQL. Because both Hive and Impala statements are based on SQL, we will learn several similarities and differences between them. Along with the Impala Query Language, we will also learn various Impala built-in functions through examples.

Chapter 4, Impala Walkthrough with an Example, applies most of what was learned in the previous chapter in detail. This way you can see a real-world scenario used with Impala and understand how and where to use Impala statements in real-world applications. I have created this detailed example by first creating automobile-specific datasets, and then using most of the SQL statements with the built-in functions we discussed in the previous chapter.

Chapter 5, Impala Administration and Performance Improvements, covers two important topics, Impala administration and performance improvements. Within the Impala administration section, I will first show you how you can administer Impala using Cloudera Manager. After that, I will teach you how to verify Impala-specific information for its correctness using a debugging web server. We will see Impala logs and Impala daemons through the statestore UI. The next part of Impala admin is about Impala High Availability, where we will learn the key traits for keeping Impala running in the event of a problem.

Chapter 6, Troubleshooting Impala, teaches you how to troubleshoot various Impala issues in different categories. Besides troubleshooting, in the latter part, I will show you how to utilize Impala logging to learn more about Impala execution, query processing, and possible issues. My objective is to provide you with some critical information on troubleshooting and log analysis, so you can manage the Impala cluster effectively and make it useful for yourself and your team.

Chapter 7, Advanced Impala Concepts, teaches you more about Impala; however, this information is more advanced in nature and is meant to help you excel at data processing in your project through Impala. I have described how Impala works side by side with MapReduce in the same cluster without using it. I have also explained why Impala has an edge over Hive, even though Impala depends on Hive as a key component. Finally, we cover details on using HBase with Impala and processing various Big Data input files on Hadoop with Impala.

Appendix, Technology Behind Impala and Integration with Third-party Applications, covers the detailed technology behind Impala and real-time query concepts with Impala. I have also described a few third-party data visualization applications, from Tableau, Zoomdata, and Microsoft Excel to MicroStrategy, which connect with Impala to provide effective data visualization.

What you need for this book

You must have a Hadoop cluster (single-node experimental or multinode production) up and running, either with Impala already installed or ready for you to install Impala on it. Cloudera CDH 4.3 or above is preferred for installing Impala. If you decide to install Cloudera Impala in your Hadoop cluster, you can download it from the following link:

https://www.cloudera.com/content/support/en/downloads/download-components.html

If you do not have an active Hadoop cluster and still want to learn and try Impala, you have the option of downloading a Cloudera QuickStart Virtual Machine including everything from Cloudera, at the following link:

https://www.cloudera.com/content/support/en/downloads.html

Who this book is for

This book is for those who really want to take full advantage of their Hadoop cluster by processing extremely large amounts of raw data in Hadoop at real-time speed. You may be using Hadoop as your raw data storage medium or using Hive to process your data. You will learn everything you need to start using Impala, to make the best use of your Hadoop cluster, and to leverage any Business Intelligence tools you have in order to gain insight from your data using Impala.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Copy hdfs-site.xml and core-site.xml from Hadoop cluster to each Impala node into the Impala configuration folder, /etc/impala/conf."

Keywords in the text are shown as follows: "Impala statements support data manipulation statements similar to DML (Data Manipulation Language)."

Impala shell commands or Impala SQL statements are written as follows:

CREATE TABLE table_name (column_name data_type)
PARTITIONED BY (partition_name partition_type);
ALTER TABLE table_name ADD PARTITION (partition_name='value');
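
As a concrete instance of the template above, the following is a minimal sketch. The table name automobile_sales and the columns make, model, and sale_year are hypothetical names chosen only for illustration; they are not taken from the examples used later in the book.

```sql
-- Hypothetical table and column names, used only to illustrate the template.
-- The partition column (sale_year) is declared in PARTITIONED BY,
-- not in the regular column list.
CREATE TABLE automobile_sales (make STRING, model STRING)
PARTITIONED BY (sale_year INT);

-- Add one partition by giving the partition column a concrete value.
ALTER TABLE automobile_sales ADD PARTITION (sale_year=2013);
```

Note that an integer partition value is written without quotes, whereas a STRING partition column would take a quoted value.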

When an Impala command or Impala SQL statement is used to show an example, the console output or query output is also displayed for complete understanding. In this scenario, the command or query is shown in bold as follows:

[Hadoop.testdomain:21000] > select count(distinct(make)) from automobiles;
Query finished, fetching results ...
+----------------------+
| count(distinct make) |
+----------------------+
| 10                   |
+----------------------+
Returned 1 row(s) in 0.48s 

Another example is as follows:

[cloudera@localhost ~]$ hdfs dfs -ls /user/cloudera/automobiles/
Found 1 items
-rw-r--r--   3 cloudera cloudera        985 2013-10-15 19:17 /user/cloudera/automobiles/automobiles.txt 

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title through the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt Publishing book, we have a number of things to help you to get the most from your purchase.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt Publishing, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.