Storm Blueprints: Patterns for Distributed Real-time Computation

Examining our use case


Now, let's apply this pattern to the field of Natural Language Processing (NLP). In this use case, we will search Twitter for tweets relevant to a phrase (for example, "Apple Jobs"). The system will then process those tweets to find the most relevant words. Using Druid to aggregate the terms, we will be able to track how the most relevant words trend over time.

Let's define the problem more precisely. Given a search phrase p, we will use the Twitter API to find the most relevant set of tweets, T. For each tweet t in T, we will count the occurrences of each word w. We will then compare the frequency of that word in the tweets with its frequency in a sample of English text, E. The system will rank the words by this comparison and display the top 20 results.
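
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Storm and Druid implementation: it assumes the comparison is a ratio of relative frequencies, it skips words that never occur in E, and the function names (tokenize, relative_frequencies, rank_words) and sample strings are hypothetical stand-ins for real tweets and a real English corpus.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(documents):
    """Map each word to its share of all word occurrences in the documents."""
    counts = Counter(word for doc in documents for word in tokenize(doc))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def rank_words(tweets, english_sample, top_n=20):
    """Rank words by tweet frequency relative to baseline English frequency."""
    tweet_freq = relative_frequencies(tweets)
    english_freq = relative_frequencies(english_sample)
    scores = {
        word: freq / english_freq[word]
        for word, freq in tweet_freq.items()
        if word in english_freq  # assumption: ignore words absent from E
    }
    # Highest ratio first; keep only the top_n entries.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical stand-ins for T (tweets) and E (a sample of English text).
tweets = [
    "Apple hires for new jobs",
    "Jobs at Apple are open",
    "the new Apple store",
]
english_sample = [
    "the cat sat on the mat",
    "new jobs are open at the store",
    "an apple for the teacher",
]

top_words = rank_words(tweets, english_sample)
for word, score in top_words:
    print(f"{word}: {score:.3f}")
```

In this toy data, "apple" is far more frequent in the tweets than in the baseline text, so it ranks first, while a common word such as "the" falls to the bottom. A real system would need a policy for words missing from E (for example, smoothing) rather than simply dropping them.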

Mathematically, this equates to the following:

Here, the frequency of a word w in a corpus C is as follows:
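
One way to write down the comparison described above, assuming it is taken as a ratio of relative frequencies (the names rank, f, and count are notation chosen for this sketch and may differ from the book's own):

```latex
% Assumed form: score of word w is its frequency in the tweets T
% relative to its frequency in the English sample E.
\[
\mathrm{rank}(w) = \frac{f(w, T)}{f(w, E)}
\]

% Assumed form: relative frequency of word w in a corpus C,
% that is, its count divided by the total number of words in C.
\[
f(w, C) = \frac{\mathrm{count}(w, C)}{\sum_{w' \in C} \mathrm{count}(w', C)}
\]
```

Under this reading, both denominators in f are totals that do not depend on the particular word w.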

Since we are only concerned with the relative frequency, and the total counts of words in T and in E are constant...