Throughout this book, we will work through a case study that revolves around various concepts of Oozie.
One of the main use cases of Hadoop is ETL data processing.
Suppose we work for a large consulting company and have won a project to set up a Big Data cluster inside the customer's data center. At a high level, the requirement is to set up an environment that supports the following flow:
Get data from various sources into Hadoop (file-based loads and Sqoop-based loads).
Preprocess it with various scripts (Pig, Hive, and MapReduce).
Insert that data into Hive tables for use by analysts and data scientists.
Data scientists then build machine learning models on that data (Spark).
We will use Oozie as our scheduling system to orchestrate all of the preceding tasks. Since writing the actual Hive, Sqoop, MapReduce, Pig, and Spark code is outside the scope of this book, I will not dive into the business logic of those jobs and have kept them very simple.
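To make the flow concrete before we dive into the details, here is a minimal sketch of how the ingestion-to-Hive part of this pipeline might look as an Oozie workflow definition. The action names, the orders table, the ${jdbcUrl} property, the target directory, and the script names (preprocess.pig, load_orders.hql) are all hypothetical placeholders, not part of the case study's actual code:

<workflow-app name="etl-case-study" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-import"/>

    <!-- Pull data from a relational source into HDFS (Sqoop-based load) -->
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${jdbcUrl} --table orders --target-dir /data/landing/orders</command>
        </sqoop>
        <ok to="pig-preprocess"/>
        <error to="fail"/>
    </action>

    <!-- Clean and transform the raw data with a Pig script -->
    <action name="pig-preprocess">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>preprocess.pig</script>
        </pig>
        <ok to="hive-load"/>
        <error to="fail"/>
    </action>

    <!-- Load the processed data into a Hive table for analysts and data scientists -->
    <action name="hive-load">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_orders.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>ETL flow failed at [${wf:lastErrorNode()}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Each step in the flow becomes an action node, and the ok/error transitions wire them into a single scheduled pipeline; this is the pattern we will keep coming back to as the case study grows.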
In our architecture, we have one landing...