Building Python Real-Time Applications with Storm

Overview of this book

Big data is a trending concept that everyone wants to learn about. With its ability to process all kinds of data in real time, Storm is an important addition to your big data “bag of tricks.” At the same time, Python is one of the fastest-growing programming languages today. It has become a top choice for both data science and everyday application development. Together, Storm and Python enable you to build and deploy real-time big data applications quickly and easily. You will begin with some basic command tutorials to set up Storm and learn about its configuration in detail. You will then go through the requirement scenarios to create a Storm cluster. Next, you’ll be provided with an overview of Petrel, followed by an example of a Twitter topology and persistence using Redis and MongoDB. Finally, you will build a production-quality Storm topology using development best practices.

Storm administration over a cluster


There are many tools available that can create multiple virtual machines, install predefined software and even manage the state of that software.

Introducing supervisord

Supervisord is a process control system. It is a client-server system that allows its users to monitor and control a number of processes on Unix-like operating systems. For details, visit http://supervisord.org/.

Supervisord components

The server piece of the supervisor is known as supervisord. It is responsible for starting child programs upon its own invocation, responding to commands from clients, restarting crashed or exited subprocesses, logging its subprocess stdout and stderr output, and generating and handling "events" corresponding to points in subprocess lifetimes. The server process uses a Windows-INI style configuration file, typically located at /etc/supervisord.conf. It is important to keep this file secure via proper filesystem permissions because it might contain unencrypted usernames and passwords. The other components, and the machines and software versions used in this section, are as follows:

  • supervisorctl: The command-line client piece of the supervisor is known as supervisorctl. It provides a shell-like interface for the features provided by supervisord. From supervisorctl, a user can connect to different supervisord processes, get the status of the subprocesses each one controls, stop and start those subprocesses, and list the processes running under a supervisord. The command-line client talks to the server across a Unix domain socket or an Internet (TCP) socket. The server can require that the user of a client present authentication credentials before it allows them to use commands. The client process typically uses the same configuration file as the server, but any configuration file with a [supervisorctl] section in it will work.

  • Web server: A (sparse) web user interface with functionality comparable to supervisorctl may be accessed via a browser if you start supervisord against an Internet socket. Visit the server URL (for example, http://localhost:9001/) to view and control the process status through the web interface after activating the configuration file's [inet_http_server] section.

  • XML-RPC interface: The same HTTP server that serves the web UI also serves an XML-RPC interface that can be used to interrogate and control the supervisor and the programs it runs. For details, see the XML-RPC API documentation on the supervisord website.

  • Machines: Let's assume that we have two EC2 machines with the IP addresses 172.31.19.62 and 172.31.36.23. We will install supervisord on both machines and later configure which Storm services run on each machine.

  • Storm and Zookeeper setup: Let's run Zookeeper, Nimbus, supervisor, and the UI on machine 172.31.36.23 and only the supervisor on 172.31.19.62.

  • Zookeeper version: zookeeper-3.4.6.tar.gz.

  • Storm version: apache-storm-0.9.5.tar.gz.
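
As a rough illustration of the XML-RPC interface described in the list above, the following sketch reads process states from supervisord using Python's standard xmlrpc.client module. The URL and credentials are assumptions that mirror the [inet_http_server] settings shown later in this section, and process_states and connect are hypothetical helper names. The supervisor.getAllProcessInfo call is part of supervisord's XML-RPC API.

```python
import xmlrpc.client


def connect(url="http://user:123@172.31.36.23:9001/RPC2"):
    """Build an XML-RPC proxy for supervisord (assumed URL and credentials).

    Creating the proxy does not open a connection; the first remote call does.
    """
    return xmlrpc.client.ServerProxy(url)


def process_states(proxy):
    """Return {process name: state name} for every program supervisord manages."""
    # getAllProcessInfo() returns one dict per process; each dict includes
    # the keys 'name' and 'statename' (for example, RUNNING or STOPPED).
    return {info["name"]: info["statename"]
            for info in proxy.supervisor.getAllProcessInfo()}
```

With a running supervisord, calling process_states(connect()) should return a dictionary keyed by program name, such as storm-nimbus.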

Here is the process of the Zookeeper server setup and configuration:

  1. Download Zookeeper (version 3.4.6 is used here) and extract it:

    tar -xvf zookeeper-3.4.6.tar.gz
  2. Configure zoo.cfg in the conf directory to start Zookeeper in cluster mode.

  3. Use the following settings in zoo.cfg:

    server.1=172.31.36.23:2888:3888
    tickTime=2000
    initLimit=10
    syncLimit=5
    # the directory where the snapshot is stored.
    dataDir=/home/ec2-user/zookeeper-3.4.6/tmp/zookeeper
    clientPort=2181
  4. Make sure that the directory specified in dataDir is created and the user has read and write permissions on it.

  5. Then, go to the Zookeeper bin directory and start the zookeeper server using the following command:

    [ec2-user@ip-172-31-36-23 bin]$ ./zkServer.sh start
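
The check in step 4 can also be scripted. The following is a minimal sketch, assuming zoo.cfg contains only key=value lines and # comments as in the listing above. parse_zoo_cfg and ensure_data_dir are hypothetical helper names, not part of Zookeeper.

```python
import os


def parse_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg text, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def ensure_data_dir(settings):
    """Create dataDir if it is missing and report whether the current user
    can read and write it."""
    data_dir = settings["dataDir"]
    os.makedirs(data_dir, exist_ok=True)
    return os.access(data_dir, os.R_OK | os.W_OK)
```

Run it as the user that will start Zookeeper, passing in the contents of conf/zoo.cfg, so that the permission check reflects that user's access.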

Storm server setup and configuration:

  1. Download Storm (version 0.9.5 is used here) from the Apache Storm website and extract it:

    tar -xvf apache-storm-0.9.5.tar.gz
  2. Here is the configuration (in conf/storm.yaml) for both the Storm Nimbus machine and the supervisor-only machine. Only the added or changed settings are shown:

    storm.zookeeper.servers:
        - "172.31.36.23"
    
    nimbus.host: "172.31.36.23"
    
    nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
    
    ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
    
    supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
    
    worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
    
    storm.local.dir: "/home/ec2-user/apache-storm-0.9.5/local"
    
    supervisor.slots.ports:
        - 6700
        - 6701
        - 6702
        - 6703
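
Because the same settings go onto both machines, it can help to generate them from one place. The following is a minimal sketch that renders a subset of the settings above as YAML text using plain string formatting. render_storm_yaml is a hypothetical helper, and the choice of which settings to include is an assumption made for brevity.

```python
def render_storm_yaml(zookeeper_servers, nimbus_host, local_dir, slot_ports):
    """Render a few storm.yaml settings as YAML text.

    List-valued settings are written one item per line, each prefixed by '- '.
    """
    lines = ["storm.zookeeper.servers:"]
    lines += ['    - "%s"' % host for host in zookeeper_servers]
    lines.append('nimbus.host: "%s"' % nimbus_host)
    lines.append('storm.local.dir: "%s"' % local_dir)
    # Each port listed here allows one worker process on the machine.
    lines.append("supervisor.slots.ports:")
    lines += ["    - %d" % port for port in slot_ports]
    return "\n".join(lines) + "\n"
```

For example, render_storm_yaml(["172.31.36.23"], "172.31.36.23", "/home/ec2-user/apache-storm-0.9.5/local", [6700, 6701, 6702, 6703]) produces text that can be appended to conf/storm.yaml on each machine.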

Supervisord installation

Supervisord can be installed in either of the following two ways:

  1. Installing on a system with Internet access:

    Install setuptools and use the easy_install method.

  2. Installing on a system without Internet access:

    Download all dependencies, copy to each machine, and then install it.

We will follow the second method, which does not require Internet access on the servers. We will download supervisord and all of its dependencies, and copy them to the servers.

Supervisord [supervisor-3.1.3.tar.gz] requires the following dependencies to be installed: setuptools, meld3, and elementtree.

Let's install supervisord and the necessary dependencies on both machines, 172.31.36.23 and 172.31.19.62.

The following are the steps for installing the dependencies:

  1. setuptools:

    • Unzip the .zip file using this command:

      [ec2-user@ip-172-31-19-62 ~]$ unzip setuptools-17.1.1.zip
    • Go to the setuptools-17.1.1 directory and run the installation command with sudo:

      [ec2-user@ip-172-31-19-62 setuptools-17.1.1]$ sudo python setup.py install
  2. meld3:

    • Extract the .tar.gz file using the following command:

      [ec2-user@ip-172-31-19-62 ~]$ tar -xvf meld3-0.6.5.tar.gz
    • Go to the meld3-0.6.5 directory and run this command:

      [ec2-user@ip-172-31-19-62 meld3-0.6.5]$ sudo python setup.py install
  3. elementtree:

    • Extract the .tar.gz file:

      [ec2-user@ip-172-31-19-62 ~]$ tar -xvf elementtree-1.2-20040618.tar.gz
    • Go to elementtree-1.2-20040618 and run the following command:

      [ec2-user@ip-172-31-19-62 elementtree-1.2-20040618]$ sudo python setup.py install

The following are the steps for installing supervisord itself:

  • Extract supervisor-3.1.3 using this command:

    [ec2-user@ip-172-31-19-62 ~]$ tar -xvf supervisor-3.1.3.tar.gz
  • Go to the supervisor-3.1.3 directory and run the following command:

    [ec2-user@ip-172-31-19-62 supervisor-3.1.3]$ sudo python setup.py install

Note

The same supervisord setup is required on the other machine, that is, 172.31.36.23.
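
After installing on each machine, a quick way to confirm that nothing was missed is to ask Python whether each package can be found. The following is a minimal sketch. missing_modules is a hypothetical helper, and the import names in the usage note are assumed to match the package names used above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot find."""
    # find_spec() returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_modules(["setuptools", "meld3", "elementtree", "supervisor"]) with the same interpreter that ran setup.py should return an empty list if every step succeeded.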

Configuration of supervisord.conf

Let's configure services on the 172.31.36.23 machine, assuming that supervisord is installed as explained previously. Once supervisord is installed, you can build the supervisord.conf file used by the supervisord and supervisorctl commands:

  • Create the supervisord.conf file and put it in the /etc directory.

  • You can print a sample supervisord.conf using the following command:

    [ec2-user@ip-172-31-36-23 ~]$ echo_supervisord_conf

Take a look at the supervisord.conf file:

[unix_http_server]
file = /home/ec2-user/supervisor.sock
chmod = 0777

[inet_http_server]         ; inet (TCP) server disabled by default
port=172.31.36.23:9001        ; (ip_address:port specifier, *:port for all iface)
username=user              ; (default is no username (open server))
password=123               ; (default is no password (open server))

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisord]
logfile_backups=10           ; (num of main logfile rotation backups;default 10)
logfile=/home/ec2-user/supervisord.log ; (main log file;default $CWD/supervisord.log)
logfile_maxbytes=50MB        ; (max main logfile bytes b4 rotation;default 50MB)
pidfile=/home/ec2-user/supervisord.pid ; (supervisord pidfile;default supervisord.pid)
nodaemon=false               ; (start in foreground if true;default false)
minfds=1024                  ; (min. avail startup file descriptors;default 1024)

[supervisorctl]
;serverurl = unix:///home/ec2-user/supervisor.sock
serverurl=http://172.31.36.23:9001 ; use an http:// url to specify an inet socket
username=user                ; should be same as http_username if set
password=123                 ; should be same as http_password if set

[program:storm-nimbus]
command=/home/ec2-user/apache-storm-0.9.5/bin/storm nimbus
user=ec2-user
autostart=false
autorestart=false
startsecs=10
startretries=999
redirect_stderr=true         ; send stderr to the stdout log file
stdout_logfile=/home/ec2-user/storm/logs/nimbus.out
stdout_logfile_maxbytes=20MB
stdout_logfile_backups=10

[program:storm-ui]
command=/home/ec2-user/apache-storm-0.9.5/bin/storm ui
user=ec2-user
autostart=false
autorestart=false
startsecs=10
startretries=999
redirect_stderr=true         ; send stderr to the stdout log file
stdout_logfile=/home/ec2-user/storm/logs/ui.out
stdout_logfile_maxbytes=20MB
stdout_logfile_backups=10

[program:storm-supervisor]
command=/home/ec2-user/apache-storm-0.9.5/bin/storm supervisor
user=ec2-user
autostart=false
autorestart=false
startsecs=10
startretries=999
redirect_stderr=true         ; send stderr to the stdout log file
stdout_logfile=/home/ec2-user/storm/logs/supervisor.out
stdout_logfile_maxbytes=20MB
stdout_logfile_backups=10
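
Since supervisord.conf is an INI-style file, Python's standard configparser module can read it, which is handy for checking what a configuration will launch before starting supervisord. The following is a minimal sketch. program_commands is a hypothetical helper, and the parser options are assumptions chosen to cope with the ";" inline comments used in the file above.

```python
import configparser


def program_commands(conf_text):
    """Return {program name: command} for every [program:...] section."""
    parser = configparser.ConfigParser(
        interpolation=None,             # leave any % characters in values untouched
        delimiters=("=",),              # values such as host:port contain ':'
        inline_comment_prefixes=(";",), # strip trailing '; ...' comments
    )
    parser.read_string(conf_text)
    return {
        section.split(":", 1)[1]: parser[section]["command"]
        for section in parser.sections()
        if section.startswith("program:")
    }
```

Passing in the text of /etc/supervisord.conf from this section should yield entries for storm-nimbus, storm-ui, and storm-supervisor.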

Start the supervisor server first:

[ec2-user@ip-172-31-36-23 ~]$ sudo /usr/bin/supervisord -c /etc/supervisord.conf

Then, check the status of the processes and start them all using supervisorctl. The jps command confirms that the Java processes are running:

[ec2-user@ip-172-31-36-23 ~]$ sudo /usr/bin/supervisorctl -c /etc/supervisord.conf status
storm-nimbus                     STOPPED   Not started
storm-supervisor                 STOPPED   Not started
storm-ui                         STOPPED   Not started
[ec2-user@ip-172-31-36-23 ~]$ sudo /usr/bin/supervisorctl -c /etc/supervisord.conf start all
storm-supervisor: started
storm-ui: started
storm-nimbus: started
[ec2-user@ip-172-31-36-23 ~]$ jps
14452 Jps
13315 QuorumPeerMain
14255 nimbus
14233 supervisor
14234 core
[ec2-user@ip-172-31-36-23 ~]$
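
The status output shown above is simple enough to parse in a script, for example to check that every Storm service is running. The following is a minimal sketch, assuming the column layout shown in the transcript (name, state, description). parse_status and read_status are hypothetical helpers, and the supervisorctl path and configuration file are the ones used in this section.

```python
import subprocess


def parse_status(output):
    """Turn `supervisorctl status` output into {process name: state}."""
    states = {}
    for line in output.splitlines():
        parts = line.split(None, 2)  # name, state, free-form description
        if len(parts) >= 2:
            states[parts[0]] = parts[1]
    return states


def read_status(conf="/etc/supervisord.conf"):
    """Run supervisorctl (assumed to be at /usr/bin) and parse its status output."""
    result = subprocess.run(
        ["/usr/bin/supervisorctl", "-c", conf, "status"],
        capture_output=True, text=True,
    )
    return parse_status(result.stdout)
```

A monitoring script could call read_status() periodically and raise an alert when any value differs from RUNNING.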

We can view the supervisord web UI and control processes in the browser. 52.11.193.108 is the public IP address of the 172.31.36.23 machine, so the UI is available at http://52.11.193.108:9001.

Configuration of supervisord.conf on 172.31.19.62

Keep only the following sections in the configuration file:

[unix_http_server]
[rpcinterface:supervisor]
[supervisord]
[supervisorctl]
[program:storm-supervisor]

After that, you can start the supervisor server and all processes using supervisorctl on the 172.31.19.62 machine.
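
Rather than editing the file by hand, the second machine's configuration can be derived from the first one. The following is a minimal sketch using Python's standard configparser module. keep_sections is a hypothetical helper, and comments in the original file are dropped when it is rewritten. Note that a copied [supervisorctl] serverurl would still point at 172.31.36.23, so you would probably switch it to the local Unix socket on this machine.

```python
import configparser
import io


def keep_sections(conf_text, wanted):
    """Return INI text containing only the sections named in `wanted`."""
    parser = configparser.ConfigParser(
        interpolation=None,
        delimiters=("=",),
        inline_comment_prefixes=(";",),
    )
    parser.optionxform = str  # preserve the original case of option names
    parser.read_string(conf_text)
    for section in parser.sections():
        if section not in wanted:
            parser.remove_section(section)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Here, wanted would be the five section names listed above: unix_http_server, rpcinterface:supervisor, supervisord, supervisorctl, and program:storm-supervisor.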