Learning Hadoop 2

Building a tweet analysis capability


In earlier chapters, we used various implementations of Twitter data analysis to illustrate several concepts. Here we take that capability further and treat it as a major case study.

In this chapter, we will build a data ingest pipeline, constructing a production-ready dataflow that is designed with reliability and future evolution in mind.

We'll build out the pipeline incrementally throughout the chapter. At each stage, we'll highlight what has changed, but we can't include full listings every time without trebling the size of the chapter. The source code for this chapter, however, contains every iteration in full.

Getting the tweet data

The first thing we need to do is get the actual tweet data. As in previous examples, we can pass the -j and -n arguments to stream.py to dump JSON tweets to stdout:

$ stream.py -j -n 10000 > tweets.json
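Before feeding a dump like this into a pipeline, it can be worth checking that it parses cleanly. The following is a minimal sketch, not part of the book's source code: it assumes that stream.py writes one JSON object per line, and the function name summarize_dump is hypothetical.

```python
import json


def summarize_dump(lines):
    """Count valid and invalid records in an iterable of text lines.

    Assumption: each non-blank line should hold exactly one JSON object
    (one tweet). Anything that fails to parse, or parses to a non-object
    such as a list or a number, is counted as invalid.
    """
    valid = 0
    invalid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            invalid += 1
            continue
        if isinstance(record, dict):
            valid += 1
        else:
            invalid += 1
    return valid, invalid
```

To run it against the file produced above, open tweets.json in text mode and pass the file object to summarize_dump; the function returns a (valid, invalid) pair. A nonzero invalid count suggests that the one-object-per-line assumption does not hold, or that the dump was truncated.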

Since we have this tool that can create a batch of sample tweets on demand, we could start our ingest...