With data volumes growing exponentially and data-capturing capabilities becoming ever more advanced, enterprises face the challenge of making sense of this mountain of raw data. On the batch processing front, Hadoop has emerged as the go-to framework for dealing with Big Data. Until recently, however, there was a void in frameworks for building real-time stream processing applications. Such applications have become an integral part of many businesses because they enable a swift response to events and adaptation to changing situations. Examples include monitoring social media to analyze the public response to a newly launched product, and predicting the outcome of an election from the sentiment of election-related posts.
Apache Storm has emerged as the platform of choice for industry leaders developing such distributed, real-time data processing applications. It provides a set of primitives for building applications that process very large amounts of data in real time in a highly scalable manner.
Storm is to real-time processing what Hadoop is to batch processing. It is open source software, currently being incubated at the Apache Software Foundation. Being in incubation does not mean that it is not ready for production use. Indeed, it has been deployed to meet real-time processing needs by companies such as Twitter, Yahoo!, and Flipboard. Storm was first developed by Nathan Marz at BackType, a company that provided social search applications. BackType was later acquired by Twitter, and Storm is now a critical part of Twitter's infrastructure. Storm can be used for the following use cases:
Stream processing: Storm is used to process a stream of data and update a variety of databases in real time. To keep up, the processing rate must match the rate at which input data arrives.
Continuous computation: Storm can perform continuous computation on data streams and stream the results to clients in real time. This might require processing each message as it arrives or grouping messages into small batches over a short time window. An example of continuous computation is streaming trending topics on Twitter to browsers.
Distributed RPC: Storm can parallelize a computationally intensive query so that its result can be computed in real time.
Real-time analytics: Storm can analyze and respond to data from different sources as it arrives, in real time.
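To make the difference between processing each message as it arrives and processing small batches more concrete, here is a minimal sketch in plain Java. It does not use the Storm API at all; the class TrendingTopicsSketch and its methods onMessage, onBatch, topTopic, and countOf are hypothetical names chosen for illustration, and counting hashtags is only an assumed stand-in for a real trending-topics computation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a single-process stand-in for continuous computation.
// None of these names come from Storm.
public class TrendingTopicsSketch {
    // Running hashtag counts, updated as messages arrive.
    private final Map<String, Integer> counts = new HashMap<>();

    // Per-message processing: update the state immediately for one message.
    public void onMessage(String message) {
        for (String word : message.split("\\s+")) {
            if (word.startsWith("#") && word.length() > 1) {
                counts.merge(word.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
    }

    // Micro-batch processing: consume a small buffered batch as one unit,
    // then report the current top topic once for the whole batch.
    public String onBatch(List<String> batch) {
        for (String message : batch) {
            onMessage(message);
        }
        return topTopic();
    }

    // Returns the most frequent hashtag seen so far, or null if none.
    // Ties are broken alphabetically so the result is deterministic.
    public String topTopic() {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            boolean higher = e.getValue() > bestCount;
            boolean tieButEarlier = e.getValue() == bestCount
                    && best != null && e.getKey().compareTo(best) < 0;
            if (higher || tieButEarlier) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Returns how many times the given (lowercase) hashtag has been seen.
    public int countOf(String topic) {
        return counts.getOrDefault(topic, 0);
    }

    public static void main(String[] args) {
        TrendingTopicsSketch sketch = new TrendingTopicsSketch();
        sketch.onMessage("Loving the new phone #launch");
        String top = sketch.onBatch(
                Arrays.asList("#Launch day!", "Voting today #election"));
        System.out.println("Top topic: " + top);
    }
}
```

In a real deployment this state would be partitioned across many workers and the input would be an unbounded stream; the sketch only shows the two processing styles side by side.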
In this chapter, we will cover the following topics:
Features of Storm
Various components of a Storm cluster
What is a Storm topology
Local and remote operational modes to execute Storm topologies
Setting up a development environment to develop a Storm topology
Developing a sample topology
Setting up a single-node Storm cluster and its prerequisites
Deploying the sample topology