Workflow Management with Airflow
So far, we have learned how to create data pipelines and different types of workflows, both linear and non-linear. We defined and implemented workflows in Python scripts and used Bash to automate them. However, that is not enough to manage workflows at scale. We are going to take workflow management to the next level by addressing the following questions:
- Can we find a standardized way to define workflow dependencies instead of writing a custom Bash script for each workflow?
- Can we define data operations with a consistent interface instead of writing a Python program with its own custom CLI?
- Can we have a standardized way to log the pipeline's running status?
- Can we monitor a running workflow? Can we schedule workflows?
The answer to all of these questions is Airflow. Airflow is a horizontally scalable, distributed workflow management system that allows us to specify complex workflows using Python code...
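To make the first two questions concrete, here is a minimal sketch of what "dependencies as data" and "a consistent task interface" can look like in plain Python. It is not Airflow's API: it uses only the standard library's graphlib module (Python 3.9 or later), and the task names and the run_workflow helper are hypothetical, chosen purely for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task name maps to the set of tasks it depends on.
# "clean" and "validate" both need "extract"; "load" needs both of them,
# so this is a small non-linear workflow.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "validate": {"extract"},
    "load": {"clean", "validate"},
}


def run_workflow(deps, tasks):
    """Run every task once, only after all of its upstream tasks have run.

    Returns the list of task names in the order they were executed.
    """
    executed = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()  # consistent interface: every task is a no-argument callable
        executed.append(name)
    return executed


# Stand-in task bodies that only record a status line (illustrative only).
status_log = []
tasks = {
    name: (lambda n=name: status_log.append(f"{n}: success"))
    for name in dependencies
}

executed = run_workflow(dependencies, tasks)
print(executed)
print(status_log)
```

The dependency table replaces the ordering logic that would otherwise live in a Bash script, and the uniform callable replaces a per-script CLI. A workflow manager such as Airflow builds on the same idea and adds the scheduling, logging, and monitoring that this sketch leaves out.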