Apache Hive Essentials - Second Edition

By: Dayong Du

Overview of this book

In this book, we prepare you for your journey into big data by first introducing you to the background of the big data domain, along with the process of setting up and getting familiar with your Hive working environment. Next, the book guides you through discovering and transforming the value of big data with the help of examples. It also hones your skills in using the Hive language efficiently. Toward the end, the book focuses on advanced topics, such as performance, security, and extensions in Hive, which will guide you on exciting adventures on this worthwhile big data journey. By the end of the book, you will be familiar with Hive and able to work efficiently to find solutions to big data problems.
Table of Contents (12 chapters)

Overview of the Hadoop ecosystem

Apache released Hadoop Version 1.0.0 in 2011, which contained only HDFS and MapReduce. From the very beginning, Hadoop was designed as both a computing (MapReduce) and a storage (HDFS) platform. As the need for big data analysis grew, Hadoop attracted many other software projects that address big data problems, and together they merged into a Hadoop-centric big data ecosystem. The following diagram gives a brief overview of the Hadoop big data ecosystem in the Apache stack:

Apache Hadoop ecosystem

In the current Hadoop ecosystem, HDFS is still the major option for hard disk storage, and Alluxio provides a virtual distributed in-memory alternative. On top of HDFS, the Parquet, Avro, and ORC data formats can be used along with the Snappy compression algorithm to optimize computing and storage. YARN, as the first general-purpose resource manager for Hadoop, is designed for better resource management and scalability. Spark and Ignite, as in-memory computing engines, can also run on YARN to work closely with Hadoop.
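As a rough illustration of pairing a columnar format with Snappy compression, the following HiveQL sketch declares one ORC table and one Parquet table. The table and column names are hypothetical and used only for illustration; the `orc.compress` and `parquet.compression` table properties are the standard Hive settings for choosing a codec.

```sql
-- Hypothetical table and column names, for illustration only.

-- ORC table compressed with Snappy (ORC defaults to ZLIB otherwise).
CREATE TABLE page_views_orc (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Parquet table compressed with Snappy.
CREATE TABLE page_views_parquet (
  user_id BIGINT,
  url     STRING
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");
```

Snappy generally trades some compression ratio for speed, so whether it is the right codec depends on the workload.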

On the other hand, Kafka, Flink, and Storm dominate stream processing. HBase is a leading NoSQL database, especially on Hadoop clusters. For machine learning, the main options are Spark MLlib and MADlib, along with a new Mahout. Sqoop is still one of the leading tools for exchanging data between Hadoop and relational databases. Flume is a mature, distributed, and reliable log-collecting tool for moving or collecting data into HDFS. Impala and Drill can launch interactive SQL queries directly against the data on Hadoop. In addition, Hive over Spark/Tez, along with Live Long And Process (LLAP), lets users run queries in long-lived processes on computing frameworks other than MapReduce, with in-memory data caching. As a result, Hive plays a more important role in the ecosystem than ever. We are also glad to see that Ambari, as a new generation of cluster-management tools, provides more powerful cluster management and coordination in addition to ZooKeeper. For scheduling and workflow management, we can use either Airflow or Oozie. Finally, an open source governance and metadata service, Atlas, has come into the picture; it supports compliance and lineage for big data in the ecosystem.
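To make the choice of computing framework concrete, here is a minimal sketch of session-level settings that switch Hive from MapReduce to Tez and ask for work to run in LLAP daemons. It assumes a cluster where Tez and LLAP are already installed and running; the queried table name is hypothetical.

```sql
-- Run queries on Tez instead of MapReduce (other values: mr, spark).
SET hive.execution.engine=tez;

-- Ask Hive to run all eligible work inside LLAP daemons.
-- Assumes LLAP daemons are already deployed on the cluster.
SET hive.llap.execution.mode=all;

-- Hypothetical table, for illustration only.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

These settings apply only to the current session; cluster-wide defaults are normally set by an administrator in the Hive configuration.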