Pulling it all together


Let's review what we have discussed so far and see how we can use Oozie to pull all of these techniques together into a series of workflows that implement a data life cycle management approach.

First, it's important to define clear responsibilities and to implement the parts of the system following good design and separation of concerns principles. Applying this, we end up with several different workflows:

  • A subworkflow to ensure the environment (mainly HDFS and Hive metadata) is correctly configured

  • A subworkflow to perform data validation

  • The main workflow that triggers both the preceding subworkflows and then pulls new data through a multistep ingest pipeline

  • A coordinator that executes the preceding workflows every 10 minutes (a minimal sketch of such a coordinator follows this list)

  • A second coordinator that ingests reference data that will be useful to the application pipeline
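
To make the scheduling piece concrete, the following is a minimal sketch of what such a 10-minute coordinator could look like. The application name and the ${workflowAppPath}, ${startTime}, and ${endTime} properties are illustrative placeholders rather than values taken from this chapter:

    <!-- Hypothetical coordinator: runs the main ingest workflow every 10 minutes -->
    <coordinator-app name="ingest-every-10-min"
                     frequency="${coord:minutes(10)}"
                     start="${startTime}" end="${endTime}" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.4">
        <action>
            <workflow>
                <!-- HDFS path of the main workflow application -->
                <app-path>${workflowAppPath}</app-path>
                <configuration>
                    <property>
                        <!-- Pass the scheduled time to the workflow, for example to select input partitions -->
                        <name>nominalTime</name>
                        <value>${coord:nominalTime()}</value>
                    </property>
                </configuration>
            </workflow>
        </action>
    </coordinator-app>

The second coordinator, which ingests reference data, would follow the same pattern with a different frequency and a different workflow application path.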

We also define all our tables with Avro schemas and use them wherever possible to help manage schema evolution and changing data formats...
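
To illustrate the schema evolution point, here is a minimal sketch of an Avro record schema; the namespace, record name, and fields are hypothetical and not taken from this chapter. The optional field with a default value shows the kind of change (adding a new column) that Avro allows without breaking readers of older data:

    {
      "namespace": "com.example.avro",
      "type": "record",
      "name": "IngestedRecord",
      "doc": "Hypothetical record used only to illustrate Avro schema evolution",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "payload", "type": "string"},
        {"name": "source", "type": ["null", "string"], "default": null}
      ]
    }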