
Deploying the analytics


With Hadoop in place, we can now focus on the distributed processing frameworks that we will use for analysis.

Performing a batch analysis with the Pig infrastructure

The first of the distributed processing frameworks that we will examine is Pig. Pig is a framework for data analysis that allows the user to express an analysis in a simple, high-level scripting language; these scripts are then compiled down to MapReduce jobs.
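To give a sense of the language, here is a minimal sketch of a Pig script that counts click-thrus per URL. The field names and the tab-delimited layout are assumptions made for illustration, not the actual schema of the click-thru file, and the path matches the HDFS location we will create in a moment:

-- Load the click-thru data; the schema here is assumed for illustration
clicks = LOAD '/user/bone/temp/click_thru_data.txt' USING PigStorage('\t')
         AS (userid:chararray, url:chararray);
-- Group the records by URL and count the clicks in each group
grouped = GROUP clicks BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(clicks) AS hits;
-- Print the per-URL counts to the console
DUMP counts;

When this script runs, Pig translates each of these relational operations into one or more MapReduce jobs and submits them to the cluster.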

Although Pig can read data from a few different systems (for example, S3), we will use HDFS as our data storage mechanism in this example. Thus, the first step in our analysis is to copy the data into HDFS.

To do this, we issue the following Hadoop commands:

hadoop fs -mkdir /user/bone/temp
hadoop fs -copyFromLocal click_thru_data.txt /user/bone/temp/

The preceding commands create a directory for the data file and copy the click-thru data file into that directory.
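Before moving on, it is worth confirming that the file landed where we expect it; a simple listing suffices:

hadoop fs -ls /user/bone/temp/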

To execute a Pig script against that data, we will need to install Pig. For this, we simply download Pig...
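As a rough sketch of that install, the steps typically look like the following. The release version and download URL here are placeholders; use whichever Pig release matches your Hadoop installation:

wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz
tar -xzf pig-0.17.0.tar.gz
export PIG_HOME=$(pwd)/pig-0.17.0
export PATH=$PATH:$PIG_HOME/bin
pig -version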