Data pipeline
In a typical data-processing system, batch or streaming, that consists of multiple tightly coupled processing steps, it is extremely important to trace data lineage. Tracking lineage establishes accountability by showing exactly which step modified the data from one state to another, and it preserves operational metadata that may be needed for auditing and other purposes.
Complex workflows are usually represented as a directed acyclic graph (DAG), as described in the Distributed Processing section. As you might have guessed, the fundamental building block of a DAG is the data pipeline, in which the output of one process is fed as input to the next.
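To make this concrete, here is a minimal sketch, assuming Python and its standard-library graphlib; the step names extract, transform, and load are hypothetical placeholders for real processing steps.

```python
from graphlib import TopologicalSorter

# Three hypothetical processing steps.
def extract():
    return [1, 2, 3]

def transform(records):
    return [r * 10 for r in records]

def load(records):
    print(f"loaded {len(records)} records: {records}")

# DAG declared as {node: set of upstream dependencies}:
# extract -> transform -> load
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
steps = {"extract": extract, "transform": transform, "load": load}

# Run the steps in topological order; the output of each upstream
# step becomes the input of the step that depends on it.
outputs = {}
for node in TopologicalSorter(dag).static_order():
    inputs = [outputs[dep] for dep in sorted(dag[node])]
    outputs[node] = steps[node](*inputs)
```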
Information exchange between two nodes in a DAG can take two forms:
- Dumb exchange: In this case, the output from one node is given directly as input to the next node; only the data itself crosses the boundary, with no accompanying operational metadata (see the sketch below).
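A minimal sketch of a dumb exchange, again assuming Python; the clean and enrich step names are hypothetical. Nothing but the raw payload crosses the node boundary, which is exactly why lineage is hard to reconstruct after the fact.

```python
# Upstream node: strips blanks and emits bare records, nothing else.
def clean(raw_rows):
    return [row.strip() for row in raw_rows if row.strip()]

# Downstream node: sees only the data itself; it cannot tell which
# step produced the records or how they were modified along the way.
def enrich(rows):
    return [f"{row}!" for row in rows]

payload = clean(["  a  ", "", "b"])  # only raw data is handed over
result = enrich(payload)             # ["a!", "b!"]
```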