So far, we have only talked about real-time analysis of tracing data. Occasionally, it may be useful to run the same analysis over historical trace data, assuming it is still within your data store's retention period. As an example, if we come up with a new type of aggregation, the streaming job we discussed earlier will only start generating it for new data, so we would have no historical basis for comparison.
Fortunately, big data frameworks are very flexible about how data is sourced for analysis: they can read it from databases, from HDFS, or from other types of warm and cold storage. In particular, Flink's documentation states that it is fully compatible with the Hadoop MapReduce APIs and can use Hadoop input formats as a data source. So, we can potentially take the same job we implemented here and simply give it a different data source in order to process historical datasets.
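As a rough illustration of that idea, the sketch below wires a Hadoop input format into a Flink batch job using Flink's Hadoop compatibility module. This is not the book's actual job: the HDFS path is a placeholder, and the example assumes the `flink-hadoop-compatibility` and Hadoop client dependencies are on the classpath.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.hadoopcompatibility.HadoopInputs;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;

public class HistoricalTraceJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read historical trace records from cold storage via a Hadoop
        // TextInputFormat. The path is a placeholder for wherever the
        // trace archive actually lives.
        DataSet<Tuple2<LongWritable, Text>> input = env.createInput(
            HadoopInputs.readHadoopFile(
                new TextInputFormat(),
                LongWritable.class,
                Text.class,
                "hdfs:///traces/archive"));

        // From here, the same transformations used in the streaming job
        // could, in principle, be applied to the historical records.
        DataSet<String> records = input.map(t -> t.f1.toString());
        records.print();
    }
}
```

The key point is that only the source changes; the downstream aggregation logic could stay the same, which is exactly what makes backfilling a newly added aggregation over old data feasible.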
While these integrations are possible, as of the time of writing, there are not very many open source...