Storm Real-time Processing Cookbook

By: Quinton Anderson

Overview of this book

Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm Real-time Processing Cookbook offers basic to advanced recipes on Storm for real-time computation.

The book begins with setting up the development environment and then teaches log stream processing. It then covers a real-time payments workflow, distributed RPC, integration with other software such as Hadoop and Apache Camel, and more.

Creating a URL stream using a Twitter filter


There are many approaches to sourcing input documents for the TF-IDF implementation. This recipe will present an approach using Twitter.

Twitter provides a streaming API that delivers a sample of all the tweets posted on Twitter. A sample is more than sufficient for most applications, because more data is unlikely to improve your results in any meaningful way relative to the costs involved. Sampling is also the only way Twitter allows you to consume the data without special agreements in place.

Tweet status streams can be filtered using the Twitter streaming API, so that only the subset of tweets matching the filter is sampled and delivered in the stream. This lets you listen for tweets on a particular topic. Furthermore, tweets often have links attached to them, and the linked documents hold the bulk of the information, given the small character limit on the tweet itself.
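
The idea of filtering by topic and then keeping only the attached links can be sketched locally, without touching the Twitter API. The following is a minimal illustration under stated assumptions: the class name TweetUrlFilter, its method names, the keyword matching rule, and the simplified URL pattern are all hypothetical and are not taken from the recipe or from any Twitter client library. A real stream would deliver status objects with already parsed link entities rather than raw text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local stand-in for a filtered tweet stream; no Twitter API is called here.
// All names in this class are hypothetical and exist only for illustration.
final class TweetUrlFilter {

    // Simplified URL pattern (an assumption): a scheme followed by non-whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // True if the tweet text contains any tracked keyword, ignoring case.
    static boolean matchesTrack(String text, List<String> track) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String keyword : track) {
            if (lower.contains(keyword.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    // Returns every URL found in the tweet text, in order of appearance.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Keeps only tweets on the tracked topics and collects the links they carry.
    static List<String> urlStream(List<String> tweets, List<String> track) {
        List<String> out = new ArrayList<>();
        for (String tweet : tweets) {
            if (matchesTrack(tweet, track)) {
                out.addAll(extractUrls(tweet));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "Storm topology tips http://example.com/storm",
            "Lunch was great",
            "Unrelated link http://example.com/other");
        System.out.println(urlStream(tweets, Arrays.asList("storm")));
    }
}
```

In this sketch, tweets that do not mention a tracked keyword are dropped before any link extraction happens, which mirrors the order described above: filter the stream by topic first, then treat the attached links as the documents of interest.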

The approach for this recipe is therefore to subscribe...