The technology behind Impala was inspired by a Google research project named Dremel. Dremel is a scalable, interactive ad hoc query system for the analysis of read-only nested data. Dremel-based implementations can run aggregation queries over trillions of rows in seconds by combining multi-level execution trees with a columnar data layout. Dremel does not use MapReduce at its core; rather, it complements MapReduce. Impala is considered a native Massively Parallel Processing (MPP) query engine running on Apache Hadoop. Depending on the type of query and the configuration, Impala can outperform traditional SQL-on-Hadoop applications, such as Hive, and processing frameworks, such as MapReduce, for the following key reasons:
Distributed, scalable aggregation algorithms.
Reduced CPU load relative to alternative engines, which increases the aggregate I/O bandwidth available for query processing.
Use of a columnar binary storage format on Hadoop, which speeds up query processing because a query reads only the columns it needs. Impala does this by taking advantage of Parquet files as an input source.
Impala extends its reach beyond Dremel by supporting various other popular Hadoop file formats in addition to Parquet, broadening the range of data it can query.
Impala uses the available memory on a machine as a table cache, which means queries can process data that is already in memory rather than on disk; this can speed up execution by as much as 90 times compared with conventional processing that reads data from disk.
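To make the columnar-layout point above concrete, here is a minimal, self-contained Python sketch (illustrative only, not Impala code): it stores the same table both row-wise and column-wise and shows that an aggregation over a single column needs to touch only that column's values in the columnar layout, while the row layout materializes every field of every row.

```python
# Illustrative sketch: why a columnar layout speeds up aggregations.
# A hypothetical table with three fields per row: (id, name, amount).
rows = [(i, f"user{i}", i * 1.5) for i in range(1000)]

# Columnar layout: one contiguous list per column.
ids = [r[0] for r in rows]
names = [r[1] for r in rows]
amounts = [r[2] for r in rows]

# Row-oriented aggregation: every row (all three fields) is scanned
# even though the query only needs the 'amount' field.
total_row = sum(r[2] for r in rows)

# Columnar aggregation: only the 'amount' column is scanned; the
# 'id' and 'name' columns are never read.
total_col = sum(amounts)

assert total_row == total_col
print(total_col)
```

In a real columnar file format such as Parquet, this difference translates directly into less data read from disk, since each column is stored contiguously and unneeded columns are skipped entirely.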
You can learn more about Google Dremel by referring to the research paper at the following URL: