Apache Oozie Essentials

By: Jagat Singh

Overview of this book

As more and more organizations discover big data analytics, interest in platforms that provide storage, computation, and analytic capabilities is booming. This calls for data management, and Hadoop caters to that need. Oozie, in turn, fills the need for a scheduler of Hadoop jobs, acting much like cron so that data can be processed and analyzed on a schedule. Apache Oozie Essentials starts with the basics, from installing and configuring Oozie from source code on your Hadoop cluster to managing complex workflows. You will learn how to create data ingestion and machine learning workflows. The book is sprinkled with examples and exercises to help you take your big data learning to the next level. You will discover how to write workflows that run your MapReduce, Pig, Hive, and Sqoop scripts, and how to schedule them to run at a specific time, or on a specific business requirement, using a coordinator. Engaging real-life exercises and examples get you into the thick of things. Lastly, you will get a grip on how to embed Spark jobs, which can be used to run your machine learning models on Hadoop. By the end of the book, you will have a good knowledge of Apache Oozie and will be capable of using it to handle large Hadoop workflows and even improve the availability of your Hadoop environment.
Table of Contents (16 chapters)
Apache Oozie Essentials
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

HCatalog


HCatalog provides the table and storage management layer for Hadoop and brings various tools in the Hadoop ecosystem together. Using the HCatalog interface, tools such as Hive, Pig, and MapReduce can read and write data on Hadoop, all of them sharing the schema and datatypes that HCatalog provides. Sharing the same mechanism for reading and writing makes it easy to consume the output of one tool in another.

So how does HCatalog fit into the discussion of Datasets? So far, we have seen HDFS folder-based Datasets, where the presence of a success flag tells us that data is available. Using HCatalog-based Datasets, we can trigger Oozie jobs when the data in a given Hive partition becomes available for consumption. This takes Oozie to the next level of job dependency: we can consume data as and when it lands in Hive.
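As a rough sketch of what such a dependency looks like, a coordinator dataset can point at a Hive partition through an hcat:// URI instead of an HDFS path; the metastore host, database, table, and partition key below are hypothetical placeholders:

```xml
<!-- Coordinator dataset backed by an HCatalog (Hive) partition.
     metastore-host, mydb, logs, and dt are placeholders. -->
<datasets>
  <dataset name="daily_logs" frequency="${coord:days(1)}"
           initial-instance="2015-01-01T00:00Z" timezone="UTC">
    <!-- Oozie checks the Hive metastore; the instance is considered
         available once the dt=YYYY-MM-DD partition exists -->
    <uri-template>hcat://metastore-host:9083/mydb/logs/dt=${YEAR}-${MONTH}-${DAY}</uri-template>
  </dataset>
</datasets>
```

Compared to an hdfs:// URI with a _SUCCESS flag, the dependency here is the partition itself, so the coordinator fires as soon as Hive registers the partition.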

To quickly see an example of this interoperability, let's look at how Pig can use Hive tables and how HCatalog brings all the tools together.
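A minimal sketch of that interoperability, assuming a Hive table named mydb.logs with a partition column dt already exists (both names are placeholders), is a Pig script that reads the table through HCatLoader, which ships with HCatalog, rather than parsing raw HDFS files:

```pig
-- Read a Hive table via HCatalog; the schema and datatypes
-- come from the metastore, so no AS clause is needed.
A = LOAD 'mydb.logs' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Partition columns appear as ordinary fields, so we can filter on them
B = FILTER A BY dt == '2015-01-01';
DUMP B;
```

Such a script is typically launched as pig -useHCatalog script.pig so that the HCatalog jars are placed on the classpath.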