Storm Blueprints: Patterns for Distributed Real-time Computation

Performing the analytics


With both the batch and real-time infrastructure in place, we can focus on the analytics. First, we will take a look at the processing in Pig, and then we will translate the Pig script into a Storm topology.

Executing the batch analysis

For the batch analysis, we use Pig. The Pig script calculates the effectiveness of a campaign by computing the ratio of the number of distinct customers who clicked through to the total number of impressions.
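To make the metric concrete before turning to Pig, here is a minimal sketch of the same ratio in plain Java. It is an illustration under stated assumptions, not the book's implementation: the class name `CampaignEffectiveness` and the method `effectiveness` are hypothetical, each input line is assumed to be one impression in the space-delimited form `cookie_id campaign_id product_id click`, and duplicate click-thru rows are collapsed in the way a DISTINCT over whole rows would collapse them.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
final class CampaignEffectiveness {

    // Returns, for each campaign_id, the number of distinct click-thru rows
    // divided by the total number of rows (impressions) for that campaign.
    static Map<String, Double> effectiveness(List<String> lines) {
        Map<String, Integer> impressions = new HashMap<>();
        Map<String, Set<String>> clickThrus = new HashMap<>();

        for (String line : lines) {
            String[] fields = line.trim().split(" ");
            if (fields.length != 4) {
                continue; // assumption: silently skip malformed rows
            }
            String campaignId = fields[1];

            // Every well-formed row counts as one impression.
            impressions.merge(campaignId, 1, Integer::sum);

            // Keep click-thru rows in a set so duplicates are counted once.
            if ("true".equals(fields[3])) {
                clickThrus
                    .computeIfAbsent(campaignId, k -> new HashSet<>())
                    .add(String.join(" ", fields));
            }
        }

        // TreeMap keeps the campaigns in a stable, sorted order.
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : impressions.entrySet()) {
            int distinct = clickThrus.getOrDefault(e.getKey(), new HashSet<>()).size();
            result.put(e.getKey(), (double) distinct / e.getValue());
        }
        return result;
    }
}
```

For example, a campaign with four impressions, of which three are click-thrus and two of those three are identical rows, has two distinct click-thrus and therefore an effectiveness of 2 / 4 = 0.5.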

The Pig script is shown in the following code snippet:

click_thru_data = LOAD '../click_thru_data.txt' using PigStorage(' ')
  AS (cookie_id:chararray,
      campaign_id:chararray,
      product_id:chararray,
      click:chararray);

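-- Keep only the impressions that resulted in a click.
click_thrus = FILTER click_thru_data BY click == 'true';
-- Remove duplicate rows so that a repeated click is counted only once.
distinct_click_thrus = DISTINCT click_thrus;
-- Count the distinct click-thrus for each campaign.
distinct_click_thrus_by_campaign = GROUP distinct_click_thrus BY campaign_id;
count_of_click_thrus_by_campaign = FOREACH distinct_click_thrus_by_campaign GENERATE group, COUNT($1);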
-- dump...