With both the batch and real-time infrastructure in place, we can focus on the analytics. First, we will take a look at the processing in Pig, and then we will translate the Pig script into a Storm topology.
For the batch analysis, we use Pig. The Pig script calculates the effectiveness of a campaign by computing the ratio of the number of distinct customers who have clicked through to the total number of impressions.
The Pig script is shown in the following code snippet:
click_thru_data = LOAD '../click_thru_data.txt' USING PigStorage(' ')
    AS (cookie_id:chararray, campaign_id:chararray, product_id:chararray, click:chararray);
click_thrus = FILTER click_thru_data BY click == 'true';
distinct_click_thrus = DISTINCT click_thrus;
distinct_click_thrus_by_campaign = GROUP distinct_click_thrus BY campaign_id;
count_of_click_thrus_by_campaign = FOREACH distinct_click_thrus_by_campaign GENERATE group, COUNT($1);
-- dump...
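To make the calculation concrete, here is a minimal Python sketch of the same effectiveness ratio. It is an illustration, not part of the Pig script or the topology. The function name campaign_effectiveness and the sample rows are hypothetical, and the sketch assumes that every row of the input file represents one impression. Like the DISTINCT statement in the Pig script, it deduplicates whole rows, so a customer who clicks through on two different products in the same campaign is counted twice.

```python
from collections import defaultdict


def campaign_effectiveness(lines):
    """Return {campaign_id: distinct click-thru rows / impressions}.

    Assumption: each input line is one impression with four
    space-separated fields, matching the PigStorage(' ') schema
    (cookie_id, campaign_id, product_id, click).
    """
    impressions = defaultdict(int)   # total rows seen per campaign
    click_thrus = defaultdict(set)   # distinct click-thru rows per campaign

    for line in lines:
        fields = line.strip().split(' ')
        if len(fields) != 4:
            continue  # skip malformed rows
        cookie_id, campaign_id, product_id, click = fields
        impressions[campaign_id] += 1
        if click == 'true':
            # Mirrors FILTER followed by DISTINCT over the whole row.
            click_thrus[campaign_id].add((cookie_id, product_id))

    return {
        campaign: len(click_thrus[campaign]) / total
        for campaign, total in impressions.items()
    }


# Hypothetical sample data in the same layout as click_thru_data.txt.
sample = [
    "c1 camp1 p1 true",
    "c1 camp1 p1 true",   # duplicate click-thru, counted once
    "c2 camp1 p1 false",
    "c3 camp1 p2 true",
    "c1 camp2 p1 false",
]
print(campaign_effectiveness(sample))  # {'camp1': 0.5, 'camp2': 0.0}
```

In the sample, camp1 has four impressions and two distinct click-thru rows, which gives a ratio of 0.5. The camp2 campaign has one impression and no click-thrus, so its ratio is 0.0.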