We often need to execute multiple MapReduce applications in a workflow-like manner to achieve an objective. Hadoop's ControlledJob and JobControl classes provide a mechanism to execute a simple workflow graph of MapReduce jobs by specifying the dependencies between them.
In this recipe, we execute the log-grep MapReduce computation followed by the log-analysis MapReduce computation on an HTTP server log dataset. The log-grep computation filters the input data based on a regular expression. The log-analysis computation analyzes the filtered data. Hence, the log-analysis computation is dependent on the log-grep computation. We use the ControlledJob class to express this dependency and the JobControl class to execute both of the related MapReduce computations.
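
The wiring described above can be sketched roughly as follows. This is a minimal illustration rather than the recipe's actual driver: the class name LogWorkflowSketch, the group name, and the three path arguments are hypothetical, and the mapper and reducer classes for both jobs are left out, so Hadoop's identity defaults would apply. It assumes the Hadoop 2.x or later client libraries are on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogWorkflowSketch {

    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical arguments: raw logs, filtered intermediate data, final results.
        Path input = new Path(args[0]);
        Path filtered = new Path(args[1]);
        Path output = new Path(args[2]);

        Configuration conf = new Configuration();

        // log-grep job: reads the raw logs and writes the filtered records.
        // Mapper/reducer setup is omitted here (assumption: configured elsewhere).
        Job grepJob = Job.getInstance(conf, "log-grep");
        grepJob.setJarByClass(LogWorkflowSketch.class);
        FileInputFormat.addInputPath(grepJob, input);
        FileOutputFormat.setOutputPath(grepJob, filtered);

        // log-analysis job: consumes the output directory of log-grep.
        Job analysisJob = Job.getInstance(conf, "log-analysis");
        analysisJob.setJarByClass(LogWorkflowSketch.class);
        FileInputFormat.addInputPath(analysisJob, filtered);
        FileOutputFormat.setOutputPath(analysisJob, output);

        // Wrap each job and declare that log-analysis depends on log-grep.
        ControlledJob controlledGrep = new ControlledJob(grepJob, null);
        ControlledJob controlledAnalysis = new ControlledJob(analysisJob, null);
        controlledAnalysis.addDependingJob(controlledGrep);

        // JobControl submits each job once its dependencies have succeeded.
        JobControl jobControl = new JobControl("log-workflow");
        jobControl.addJob(controlledGrep);
        jobControl.addJob(controlledAnalysis);

        // JobControl is a Runnable, so run it on its own thread and poll.
        Thread runner = new Thread(jobControl, "job-control");
        runner.setDaemon(true);
        runner.start();

        while (!jobControl.allFinished()) {
            Thread.sleep(500);
        }

        boolean failed = !jobControl.getFailedJobList().isEmpty();
        jobControl.stop();
        System.exit(failed ? 1 : 0);
    }
}
```

Because the dependency is declared on the ControlledJob wrapper, the order in which the two jobs are added to the JobControl instance does not determine execution order; log-analysis waits until log-grep has completed successfully.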