Here, we will develop a mini variant analysis pipeline with Airflow. The objective is not to get the scientific part right (we cover that in other chapters) but to see how to create components with Airflow. Our mini-pipeline will download HapMap data, sub-sample it at 10% and 1%, run a simple PCA, and plot the result.
You will need PLINK installed. Remember that we are not using a conda environment, so you have to make sure it is available for Airflow.
We will define the following tasks:
- Downloading data
- Uncompressing it
- Sub-sampling at 10%
- Sub-sampling at 1%
- Computing PCA on the 1% sub-sample
- Charting the PCA
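Before wiring these tasks into Airflow, it can help to write down the order they must run in and to confirm that PLINK is reachable. The following is a minimal sketch using only the Python standard library, not Airflow itself. The task names, the dependency edges, and the `plink` executable name are assumptions inferred from the list above, not taken from the recipe's code. Note also that `shutil.which` only inspects the `PATH` of the current process, which may differ from the environment Airflow runs tasks in.

```python
import shutil
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
# These names and edges are assumptions inferred from the task list,
# not the identifiers used in the recipe's actual code.
DEPENDENCIES = {
    "download_data": set(),
    "uncompress_data": {"download_data"},
    "subsample_10pct": {"uncompress_data"},
    "subsample_1pct": {"uncompress_data"},
    "compute_pca": {"subsample_1pct"},
    "chart_pca": {"compute_pca"},
}


def execution_order(dependencies):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


def plink_available(executable="plink"):
    """Check whether the executable is on the PATH of the current process."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    print(" -> ".join(execution_order(DEPENDENCIES)))
    print("PLINK found on PATH:", plink_available())
```

The two sub-sampling tasks depend only on the uncompressed data, so a scheduler is free to run them in either order or in parallel; only the 1% sub-sample feeds the PCA and the chart.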
Our pipeline recipe will have two parts: coding the pipeline, and then getting it to execute.
The code for this can be found in Chapter08/pipelines/airflow/create_tasks.py.