Another interesting view of a dataset is a histogram. A histogram makes sense only for a continuous dimension (for example, access time or file size). It divides that dimension into ranges and counts the number of occurrences of an event that fall into each range. For example, in this recipe, we take the access time as the dimension and group the accesses by the hour of the day.
The following figure shows the execution summary of this computation. The Mapper emits the hour of the access as the key and 1 as the value. Each reduce function invocation then receives all the occurrences for one hour of the day and sums them to get the total number of accesses for that hour.
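The map and reduce logic described above can be sketched locally in plain Python before running it on a cluster. This is only an in-memory illustration, not the recipe's Hadoop code: the function names (`map_line`, `reduce_hour`, `hourly_histogram`) are hypothetical, the sample lines are made up, and the regular expression assumes the log uses Common Log Format timestamps such as `[01/Jul/1995:00:00:01 -0400]`.

```python
import re
from collections import defaultdict

# Assumption: timestamps look like [01/Jul/1995:00:00:01 -0400].
# The single capture group is the two-digit hour of the day.
TIMESTAMP = re.compile(r"\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2} [+-]\d{4}\]")


def map_line(line):
    """Map step: yield (hour, 1) for a parsable line, nothing otherwise."""
    match = TIMESTAMP.search(line)
    if match:
        yield int(match.group(1)), 1


def reduce_hour(hour, counts):
    """Reduce step: sum every 1 emitted for a single hour of the day."""
    return hour, sum(counts)


def hourly_histogram(lines):
    """Run map, group values by key (the shuffle), then reduce, in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for hour, one in map_line(line):
            grouped[hour].append(one)
    return dict(reduce_hour(hour, counts) for hour, counts in sorted(grouped.items()))


# Made-up sample lines for illustration only; they are not from the dataset.
sample = [
    'host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /a.html HTTP/1.0" 200 100',
    'host2.example.com - - [01/Jul/1995:00:45:10 -0400] "GET /b.html HTTP/1.0" 200 200',
    'host3.example.com - - [01/Jul/1995:13:05:59 -0400] "GET /c.html HTTP/1.0" 404 0',
    "a malformed line without a timestamp",
]

print(hourly_histogram(sample))  # prints {0: 2, 13: 1}
```

Lines that do not match the pattern are skipped rather than counted, which is one reasonable way to handle malformed records. In a real job the grouping step is done by the framework's shuffle, so only the map and reduce functions would need to be ported.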
The following steps show how to calculate and plot a histogram:
Download the weblog dataset from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and extract it.
Upload the...