Another useful tool for analyzing data is the scatter plot, which reveals the relationship between two measurements (dimensions) by plotting them against each other.
For example, this recipe analyzes the data to find the relationship between the size of a web page and the number of hits that page receives.
The following image shows the execution summary of this computation. Here, the map function calculates and emits the message size (rounded to a multiple of 1024 bytes) as the key and one as the value. Then, the reducer calculates the number of occurrences for each message size:
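The map and reduce stages described above can be sketched in plain Python, without Hadoop, to make the data flow concrete. This is only an illustration under stated assumptions: the function names (`map_line`, `reduce_counts`, `run`) are hypothetical, the sample log lines are made up in the common log format, and "rounded to 1024 bytes" is interpreted as rounding down to a multiple of 1024.

```python
from collections import defaultdict

BUCKET = 1024  # bucket width in bytes, taken from the text


def map_line(line):
    """Map stage: emit (size bucket, 1) for one log line.

    Assumes the response size is the last whitespace-separated field.
    Lines with a missing or non-numeric size (e.g. "-") emit nothing.
    """
    fields = line.split()
    if not fields or not fields[-1].isdigit():
        return []
    size = int(fields[-1])
    # Assumption: round down to a multiple of 1024 bytes.
    return [((size // BUCKET) * BUCKET, 1)]


def reduce_counts(pairs):
    """Reduce stage: sum the ones emitted for each size bucket."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def run(lines):
    """Run the map stage over every line, then reduce the emitted pairs."""
    pairs = []
    for line in lines:
        pairs.extend(map_line(line))
    return reduce_counts(pairs)


# Made-up sample lines in the common log format (not taken from the dataset).
sample = [
    'host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /a.html HTTP/1.0" 200 6245',
    'host2.example.com - - [01/Jul/1995:00:00:02 -0400] "GET /b.html HTTP/1.0" 200 6500',
    'host3.example.com - - [01/Jul/1995:00:00:03 -0400] "GET /c.html HTTP/1.0" 200 3985',
    'host4.example.com - - [01/Jul/1995:00:00:04 -0400] "GET /d.html HTTP/1.0" 304 -',
    'host5.example.com - - [01/Jul/1995:00:00:05 -0400] "GET /e.gif HTTP/1.0" 200 100',
]

if __name__ == "__main__":
    for bucket, hits in sorted(run(sample).items()):
        print(bucket, hits)
```

Each resulting (size bucket, count) pair corresponds to one point in the scatter plot, with the size on one axis and the number of occurrences on the other.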
The following steps show how to use MapReduce to calculate the correlation between the two measurements:
Download the weblog dataset from ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz and extract it.
Upload the extracted data to HDFS by running the following...