Storm Real-time Processing Cookbook

By: Quinton Anderson

Overview of this book

Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

Storm Real-time Processing Cookbook offers basic to advanced recipes on Storm for real-time computation.

The book begins with setting up the development environment and then teaches log stream processing. It then covers a real-time payments workflow, distributed RPC, integrating Storm with other software such as Hadoop and Apache Camel, and more.

Introduction


This chapter will present the implementation of a very well-known data processing algorithm, Term Frequency–Inverse Document Frequency (TF-IDF), using Storm's Trident API. TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection of documents. This is often a key concern in search engines but is also an important starting point in sentiment mining, as the trend of the important words within textual content can be an extremely useful predictor or an analytical tool.
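
As a rough illustration of the statistic itself, the following sketch computes one common TF-IDF variant in plain Java: the raw count of a term in a document multiplied by the natural log of (total documents / documents containing the term). The class name TfIdfSketch, its method names, and the three-document corpus are assumptions made for this example. It is not the Trident implementation developed in this chapter, and Lucene's Similarity class applies its own weighting and normalization.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only; not part of Storm or Trident.
final class TfIdfSketch {

    // Term frequency: raw count of `term` in one tokenized document.
    static long tf(String term, List<String> doc) {
        return doc.stream().filter(term::equals).count();
    }

    // Document frequency: number of documents that contain `term` at least once.
    static long df(String term, List<List<String>> corpus) {
        return corpus.stream().filter(d -> d.contains(term)).count();
    }

    // Inverse document frequency: ln(N / df), or 0 if the term appears nowhere.
    static double idf(String term, List<List<String>> corpus) {
        long docFreq = df(term, corpus);
        return docFreq == 0 ? 0.0 : Math.log((double) corpus.size() / docFreq);
    }

    // TF-IDF score of `term` for one document within the corpus.
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        // Assumed toy corpus: each inner list is one already-tokenized document.
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("storm", "processes", "streams"),
            Arrays.asList("hadoop", "processes", "batches"),
            Arrays.asList("storm", "storm", "trident"));

        List<String> doc = corpus.get(2);
        for (String term : Arrays.asList("storm", "trident", "hadoop")) {
            System.out.println(term + " -> " + tfIdf(term, doc, corpus));
        }
    }
}
```

In this variant, a term that occurs in every document scores 0 because ln(1) = 0, while a term concentrated in a few documents scores higher in those documents. That is the sense in which the statistic reflects the importance of a word to a document within a collection.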

Tip

TF-IDF drives many search engines, such as Apache Lucene. If you want the details of how it is used in this context, please read the documentation for the Similarity class in Apache Lucene at http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/search/Similarity.html.

According to the Storm project wiki (https://github.com/nathanmarz/storm/wiki/Trident-tutorial), Trident is a new high-level abstraction for doing real-time computing on top of Storm. It allows...