A frequency distribution is the list of URLs together with the number of hits each one received, sorted in ascending order of hits. We already calculated the number of hits for each URL in the earlier recipe. This recipe sorts that list by the number of hits.
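To make the sorting idea concrete, here is a minimal local sketch in plain Java with no Hadoop dependency. It imitates one common way to sort in MapReduce: emit the hit count as the map output key so that the framework's shuffle delivers the records in key order. Whether FrequencyDistributionMapReduce is implemented exactly this way is an assumption. The class name FrequencyDistributionSketch, the method sortByHits, the tab-separated "URL, hits" input format, and the sample URLs are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local sketch only: imitates the map, shuffle-sort, and reduce phases in memory.
// Class and method names are hypothetical, not taken from the recipe's sample code.
public class FrequencyDistributionSketch {

    // Takes lines of the assumed form "<url>\t<hits>" and returns them
    // ordered by hits, ascending.
    public static List<String> sortByHits(List<String> hitCountLines) {
        // "Map" phase: emit (hits, url). The TreeMap keeps its keys sorted,
        // standing in for the sort that the shuffle performs on map output keys.
        TreeMap<Long, List<String>> shuffled = new TreeMap<>();
        for (String line : hitCountLines) {
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                continue; // skip lines that do not match the assumed format
            }
            long hits = Long.parseLong(fields[1].trim());
            shuffled.computeIfAbsent(hits, k -> new ArrayList<>()).add(fields[0]);
        }

        // "Reduce" phase: keys arrive in ascending order; swap each pair back
        // to (url, hits) for the final output.
        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, List<String>> entry : shuffled.entrySet()) {
            for (String url : entry.getValue()) {
                result.add(url + "\t" + entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample input, not real output from the previous recipe.
        List<String> sample = Arrays.asList(
                "/index.html\t42",
                "/about.html\t7",
                "/faq.html\t19");
        for (String line : sortByHits(sample)) {
            System.out.println(line);
        }
    }
}
```

Running main prints the three sample URLs from the fewest hits to the most. In a real job with more than one reducer, each reducer's output is sorted only within its own partition, so a globally sorted result would need a single reducer or a range partitioner.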
This recipe assumes that you have a working Hadoop installation. It uses the results from the Performing GROUP BY using MapReduce recipe of this chapter. Follow that recipe first if you have not done so already.
The following steps show how to calculate frequency distribution using MapReduce:
Run the MapReduce job using the following command. We assume that the data/hit-count-out path contains the output of the HitCountMapReduce computation of the previous recipe:
$ bin/hadoop jar hcb-c5-samples.jar \
    chapter5.weblog.FrequencyDistributionMapReduce \
    data/hit-count-out data/freq-dist-out
Read the results by running the following command:
$ hdfs dfs -cat data/freq-dist...