Let's design the map function now. A map function takes a <key,value> pair as input and emits zero or more <key,value> pairs. The function operates on each input value in isolation, that is, it does not depend on any other input value, which makes the map function stateless. This is desirable, because map functions can then be executed against many inputs in parallel.
In our case, the map function can take one line of the access log as the input value (the key can be either null or an auto-incrementing integer), find the country to which the requesting IP address belongs, and emit <Country,1> as output. So, for our set of input lines, the map function emits the following output:
| Input access log | Map output |
|---|---|
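The map step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of any particular framework: the names `map_line`, `lookup_country`, and `GEO_TABLE` are hypothetical, the IP ranges are reserved documentation ranges with made-up country assignments, the sample log lines are invented, and the requesting IP is assumed to be the first whitespace-separated field of each line.

```python
import ipaddress
from typing import Iterator, Optional, Tuple

# Hypothetical lookup table, for illustration only. These are reserved
# documentation ranges, and the country assignments are made up.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "India"),
    (ipaddress.ip_network("198.51.100.0/24"), "Japan"),
    (ipaddress.ip_network("203.0.113.0/24"), "Brazil"),
]


def lookup_country(ip: str) -> str:
    """Return the country for an IP, or "Unknown" if nothing matches."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return "Unknown"  # the field was not a valid IP address
    for network, country in GEO_TABLE:
        if address in network:
            return country
    return "Unknown"


def map_line(key: Optional[int], value: str) -> Iterator[Tuple[str, int]]:
    """Map one access-log line to a single <Country,1> pair.

    The key is ignored. The function reads only its own input value and
    keeps no state between calls, so calls can run in parallel.
    Assumption: the requesting IP is the first field of the line.
    """
    fields = value.split()
    if not fields:
        return  # blank line: emit nothing
    yield lookup_country(fields[0]), 1


if __name__ == "__main__":
    # Invented sample lines, not taken from the original access log.
    sample = [
        '192.0.2.10 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '203.0.113.7 - - [01/Jan/2024:10:00:01 +0000] "GET /a HTTP/1.1" 200 128',
        '192.0.2.99 - - [01/Jan/2024:10:00:02 +0000] "GET /b HTTP/1.1" 404 0',
    ]
    for offset, line in enumerate(sample):
        for country, count in map_line(offset, line):
            print(f"<{country},{count}>")
```

Because `map_line` depends only on the line it is given, the calls in the loop could be distributed across many workers without any coordination. A real job would replace the toy table with an actual IP geolocation database.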