Building Python Real-Time Applications with Storm

By: Kartik Bhatnagar, Barry Hart

Overview of this book

Big data is a trending concept that everyone wants to learn about. With its ability to process all kinds of data in real time, Storm is an important addition to your big data “bag of tricks.” At the same time, Python is one of the fastest-growing programming languages today. It has become a top choice for both data science and everyday application development. Together, Storm and Python enable you to build and deploy real-time big data applications quickly and easily. You will begin with some basic command tutorials to set up Storm and learn about its configuration in detail. You will then go through the requirement scenarios for creating a Storm cluster. Next, you’ll get an overview of Petrel, followed by an example Twitter topology and persistence using Redis and MongoDB. Finally, you will build a production-quality Storm topology using development best practices.
Table of Contents (14 chapters)

A physical view of a Storm cluster

The next figure shows where each process runs in the cluster. There can be only one Nimbus. However, more than one ZooKeeper node is run to support failover, and each machine runs one supervisor.
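The placement rules above can be written down as a small check. This is only an illustrative sketch in plain Python, not part of Storm or Petrel: the host names, the `cluster` dictionary, and the `check_layout` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical host names, purely for illustration.
# Each host maps to the list of Storm-related processes it runs.
cluster = {
    "nimbus-host": ["nimbus"],
    "zk-host-1": ["zookeeper"],
    "zk-host-2": ["zookeeper"],
    "zk-host-3": ["zookeeper"],
    "worker-host-1": ["supervisor"],
    "worker-host-2": ["supervisor"],
}


def check_layout(layout):
    """Return a list of rule violations for a host -> processes mapping."""
    problems = []
    totals = Counter(p for procs in layout.values() for p in procs)
    # Rule 1: exactly one Nimbus in the whole cluster.
    if totals["nimbus"] != 1:
        problems.append("expected exactly one Nimbus")
    # Rule 2: more than one ZooKeeper node, so that failover is possible.
    if totals["zookeeper"] < 2:
        problems.append("expected more than one ZooKeeper node")
    # Rule 3: at most one supervisor per machine.
    for host, procs in layout.items():
        if procs.count("supervisor") > 1:
            problems.append("more than one supervisor on " + host)
    return problems


print(check_layout(cluster))
```

An empty list means the layout follows all three rules. Adding a second `"nimbus"` entry to any host would produce a violation message instead.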

Stream grouping

A stream grouping controls the flow of tuples from spout to bolt or from bolt to bolt. Storm provides several types of groupings; the four main ones are described here. Shuffle and field grouping are the most commonly used:

  • Shuffle grouping: Each tuple is delivered to a randomly chosen task of the downstream bolt, so the load is spread evenly across tasks

  • Field grouping: A tuple with a particular field key is always delivered to the same task of the downstream bolt

  • All grouping: Sends the same tuple to all tasks of the downstream bolt

  • Global grouping: Tuples from all upstream tasks are delivered to a single task of the downstream bolt
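To make the four groupings concrete, here is a minimal simulation in plain Python. It is a sketch under stated assumptions, not Storm's implementation or the Petrel API: the function names are hypothetical, tuples are modeled as dictionaries, tasks are numbered from 0, and the CRC32 hash used for field grouping is just one convenient choice of stable hash. Each function returns the list of downstream task indexes that would receive the tuple.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks):
    """Send the tuple to one randomly chosen downstream task."""
    return [random.randrange(num_tasks)]


def field_grouping(tup, num_tasks, key_field):
    """Hash the key field so that equal keys always reach the same task."""
    key = str(tup[key_field]).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """Send a copy of the tuple to every downstream task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send every tuple to one task (task 0 here, an illustrative choice)."""
    return [0]


first = {"word": "storm", "count": 1}
second = {"word": "storm", "count": 7}

# Same key, same task, regardless of the other fields.
print(field_grouping(first, 4, "word") == field_grouping(second, 4, "word"))
print(all_grouping(first, 4))
print(global_grouping(first, 4))
```

Field grouping is what makes per-key state such as a word count safe to keep inside a single task, whereas shuffle grouping only balances load and gives no such guarantee.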

The following figure illustrates all four types of groupings:

Fault tolerance in Storm

The supervisor runs a synchronization thread that gets assignment information (which part of the topology it is supposed to run) from ZooKeeper and writes it to the local...