This recipe will use Pig to group the IP addresses contained in the ip_to_country dataset and count the number of IP addresses listed for each country.
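Before turning to Pig, it may help to see the same group-and-count logic on a few rows in plain Python. The following is only an illustrative sketch, not part of the recipe: the function name count_ips_by_country and the sample rows are hypothetical, and it assumes each line holds an IP address and a country name separated by a tab.

```python
from collections import Counter


def count_ips_by_country(lines):
    """Count how many (ip, country) rows appear for each country.

    Assumes each line is tab-separated: ip<TAB>country.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        _ip, country = line.split("\t", 1)
        counts[country] += 1  # one more row seen for this country
    return dict(counts)


# Hypothetical sample rows, not taken from the book's dataset.
sample = [
    "192.0.2.1\tUnited States",
    "192.0.2.2\tUnited States",
    "198.51.100.7\tFrance",
]
print(count_ips_by_country(sample))
```

Grouping by country and counting the rows in each group is what the Pig script in this recipe does, except that Pig runs the work across the cluster instead of in a single process.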
Make sure you have access to a pseudo-distributed or fully distributed Hadoop cluster, with Apache Pig 0.9.2 installed on your client machine and on the environment path for the active user account. This recipe depends on the ip_to_country dataset included with the book being loaded into HDFS at the absolute path /input/weblog_ip/ip_to_country.txt.
Carry out the following steps to perform a SELECT and GROUP BY operation in Pig:
Open a text editor of your choice, ideally one with SQL syntax highlighting.
Add the following inline creation syntax:
ip_countries = LOAD '/input/weblog_ip/ip_to_country.txt' AS (ip: chararray, country: chararray);
country_grpd = GROUP ip_countries BY country;
country_counts = FOREACH country_grpd GENERATE FLATTEN(group...