Building Python Real-Time Applications with Storm

Overview of this book

Big data is a trending concept that everyone wants to learn about. With its ability to process all kinds of data in real time, Storm is an important addition to your big data “bag of tricks.” At the same time, Python is one of the fastest-growing programming languages today. It has become a top choice for both data science and everyday application development. Together, Storm and Python enable you to build and deploy real-time big data applications quickly and easily. You will begin with some basic command tutorials to set up Storm and learn about its configuration in detail. You will then go through the requirement scenarios for creating a Storm cluster. Next, you’ll get an overview of Petrel, followed by an example Twitter topology and persistence using Redis and MongoDB. Finally, you will build a production-quality Storm topology using development best practices.

Tuning parallelism in Storm – scaling a distributed computation


To explain parallelism in Storm, we will configure three parameters:

  • The number of workers

  • The number of executors

  • The number of tasks

The following figure gives a diagrammatic explanation of an example where we have a topology with just one spout and one bolt. In this case, we will set different values for the numbers of workers, executors, and tasks at the spout and bolt levels, and see how parallelism works in each case:

// Assume two workers in total for the topology
// (set through the topology.workers configuration key).
topology.workers: 2

// Just one executor for the spout.
builder.setSpout("spout-sentence", new TwitterStreamSpout(), 1);

// Two executors for the bolt.
builder.setBolt("bolt-split", new SplitSentenceBolt(), 2)
    // Four tasks for the bolt, shared by its two executors.
    .setNumTasks(4)
    .shuffleGrouping("spout-sentence");

For this configuration, we will have two workers, which will run in separate JVMs (worker 1 and worker 2).
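The arithmetic behind this configuration can be checked with a small Python sketch. It does not use Storm at all. The helper `split_tasks` and the `components` dictionary are hypothetical names introduced only for illustration, and the even split of tasks across executors is a simplifying assumption about how Storm assigns them:

```python
def split_tasks(num_tasks, num_executors):
    """Return how many tasks each executor runs, split as evenly as possible.

    Hypothetical helper for illustration; assumes an even split.
    """
    if num_executors < 1 or num_tasks < num_executors:
        raise ValueError("need at least one task per executor")
    base, extra = divmod(num_tasks, num_executors)
    return [base + 1 if i < extra else base for i in range(num_executors)]


# Parallelism settings mirroring the example: name -> (executors, tasks).
# The spout's task count is left at its default of one.
components = {
    "spout-sentence": (1, 1),
    "bolt-split": (2, 4),
}
num_workers = 2

# Tasks per executor for every component.
layout = {
    name: split_tasks(tasks, executors)
    for name, (executors, tasks) in components.items()
}
total_executors = sum(len(per_executor) for per_executor in layout.values())
total_tasks = sum(sum(per_executor) for per_executor in layout.values())

for name, per_executor in layout.items():
    print(name, per_executor)
print("executors:", total_executors, "tasks:", total_tasks, "workers:", num_workers)
```

Under these assumptions, the bolt's two executors run two tasks each, and the topology has three executors and five tasks in total to place on the two workers. Which executor lands on which worker is decided by Storm's scheduler and is not modelled here.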

For the spout, there is one executor, and the default number of tasks is one, which makes...