Storm Real-time Processing Cookbook

By: Quinton Anderson

Overview of this book

Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use! Storm Real Time Processing Cookbook will have basic to advanced recipes on Storm for real-time computation. The book begins with setting up the development environment and then teaches log stream processing. This will be followed by real-time payments workflow, distributed RPC, integrating it with other software such as Hadoop and Apache Camel, and more.

Creating a Storm cluster – provisioning Storm

Once you have a base set of virtual machines that are ready for application provisioning, you need to install and configure the appropriate packages on each node.

How to do it…

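Create a new project named storm-puppet. The steps that follow place site.pp in a manifests folder at the project root and the storm module under modules/storm, which has its own manifests and templates folders.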

The entry point into the Puppet execution on the provisioned node is site.pp. Create it in the manifests folder:

```
node 'storm.nimbus' {
  $cluster = 'storm1'
  include storm::nimbus
  include storm::ui
}

node /storm.supervisor[1-9]/ {
  $cluster = 'storm1'
  include storm::supervisor
}

node /storm.zookeeper[1-9]/ {
  include storm::zoo
}
```
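
The regular expressions in the node definitions above are unanchored, and an unescaped dot matches any character, so they can match more hostnames than intended. The following is a sketch of a stricter variant that would replace the two regex definitions, not the author's code. The `node default` block and its message are additions for illustration only.

```puppet
# Stricter node patterns for site.pp (illustrative sketch).
# ^ and $ anchor the match to the whole node name; \. matches a literal dot.
node /^storm\.supervisor[1-9]$/ {
  $cluster = 'storm1'
  include storm::supervisor
}

node /^storm\.zookeeper[1-9]$/ {
  include storm::zoo
}

# Fallback for any host that matches none of the definitions above
# (not part of the original manifest).
node default {
  notify { 'No Storm role is defined for this node': }
}
```

With the anchors in place, `[1-9]` accepts exactly one digit, so a host such as storm.supervisor10 would fall through to the default node. Widen the character class if you plan to run more than nine supervisors.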

Next, you need to define the storm module. A module lives in the modules folder and has its own manifests and templates folders, much like the structure found at the root level of the Puppet project. Within the storm module, create the required manifests (modules/storm/manifests), starting with the init.pp file:
```
class storm {
  include storm::install
  include storm::config
}
```

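The installation of the Storm application is the same on each storm node; only the configurations are adjusted where required, via templating. Next, create the install.pp file, which will download the required binaries and install them:

```
class storm::install {

  # Variable names are lowercase because current Puppet releases reject
  # names that start with an uppercase letter.
  $base_url   = 'https://bitbucket.org/qanderson/storm-deb-packaging/downloads/'
  $zmq_file   = 'libzmq0_2.1.7_amd64.deb'
  $jzmq_file  = 'libjzmq_2.1.7_amd64.deb'
  $storm_file = 'storm_0.8.1_all.deb'

  # Assumption: download into /tmp. The original relied on the working
  # directory of the Puppet run, which dpkg cannot resolve reliably.
  $download_dir = '/tmp'

  package { 'wget': ensure => latest }

  # Fetch each file once; 'creates' skips the download when the file exists.
  exec { 'wget_storm':
    command => "/usr/bin/wget ${base_url}${storm_file}",
    cwd     => $download_dir,
    creates => "${download_dir}/${storm_file}",
    require => Package['wget'],
  }
  exec { 'wget_zmq':
    command => "/usr/bin/wget ${base_url}${zmq_file}",
    cwd     => $download_dir,
    creates => "${download_dir}/${zmq_file}",
    require => Package['wget'],
  }
  exec { 'wget_jzmq':
    command => "/usr/bin/wget ${base_url}${jzmq_file}",
    cwd     => $download_dir,
    creates => "${download_dir}/${jzmq_file}",
    require => Package['wget'],
  }

  # Install the packages in dependency order: libzmq0, libjzmq, storm.
  package { 'libzmq0':
    ensure   => installed,
    provider => dpkg,
    source   => "${download_dir}/${zmq_file}",
    require  => Exec['wget_zmq'],
  }
  package { 'libjzmq':
    ensure   => installed,
    provider => dpkg,
    source   => "${download_dir}/${jzmq_file}",
    require  => [Exec['wget_jzmq'], Package['libzmq0']],
  }
  package { 'storm':
    ensure   => installed,
    provider => dpkg,
    source   => "${download_dir}/${storm_file}",
    require  => [Exec['wget_storm'], Package['libjzmq']],
  }
}
```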

Tip

The install manifest here assumes that Debian packages for Ubuntu exist. These were built using scripts and can be tweaked based on your requirements. The binaries and creation scripts can be found at https://bitbucket.org/qanderson/storm-deb-packaging.

The installation consists of the following packages:

Storm
ZeroMQ: http://www.zeromq.org/
Java-ZeroMQ

The configuration of each node is done through the template-based generation of the configuration files. In the storm manifests, create config.pp:

```
class storm::config {
  require storm::install
  include storm::params
  file { '/etc/storm/storm.yaml':
    require => Package['storm'],
    content => template('storm/storm.yaml.erb'),
    owner   => 'root',
    group   => 'root',
    mode    => '0644'
  }
  file { '/etc/default/storm':
    require => Package['storm'],
    content => template('storm/default.erb'),
    owner   => 'root',
    group   => 'root',
    mode    => '0644'
  }
}
```

All the storm parameters are defined using Hiera, with the Hiera configuration invoked from params.pp in the storm manifests:
```
class storm::params {
  #_ STORM DEFAULTS _#
  $java_library_path = hiera_array('java_library_path', 
      ['/usr/local/lib', '/opt/local/lib', '/usr/lib'])
}
```

Tip

Due to the sheer number of properties, the file has been truncated here. For the complete file, please refer to the Git repository at https://bitbucket.org/qanderson/storm-puppet/src.
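
The node classes later in this recipe read `$storm::params::nimbus_mem`, `$storm::params::supervisor_mem`, and `$storm::params::ui_mem`, which fall in the truncated part of the file. As a rough sketch of how such entries might look inside `storm::params`, assuming the Hiera keys share the variable names (the keys and the default values below are invented for illustration and are not taken from the repository):

```puppet
class storm::params {
  # hiera(key, default) returns the first value found in the hierarchy,
  # or the default when no data source defines the key.
  # Key names and default sizes are assumptions, not the author's values.
  $nimbus_mem     = hiera('nimbus_mem', '1024m')
  $supervisor_mem = hiera('supervisor_mem', '1024m')
  $ui_mem         = hiera('ui_mem', '768m')

  # hiera_array merges every array found for the key across the hierarchy.
  $java_library_path = hiera_array('java_library_path',
      ['/usr/local/lib', '/opt/local/lib', '/usr/lib'])
}
```

Keeping every lookup in one params class means the node classes never call Hiera directly, so a default can be changed in a single place. Newer Puppet releases replace the `hiera_*` functions with `lookup()`.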

Each class of node is then specified; here we will specify the nimbus class:

```
class storm::nimbus {
  require storm::install
  include storm::config
  include storm::params

  # Install nimbus /etc/default
  storm::service { 'nimbus':
    start      => 'yes',
    jvm_memory => $storm::params::nimbus_mem
  }

}
```

Specify the supervisor class:

```
class storm::supervisor {
  require storm::install
  include storm::config
  include storm::params

  # Install supervisor /etc/default
  storm::service { 'supervisor':
    start      => 'yes',
    jvm_memory => $storm::params::supervisor_mem
  }

}
```

Specify the ui class:

```
class storm::ui {
  require storm::install
  include storm::config
  include storm::params
  # Install ui /etc/default
  storm::service { 'ui':
    start      => 'yes',
    jvm_memory => $storm::params::ui_mem
  }

}
```

And finally, specify the zoo class (for a zookeeper node):

```
class storm::zoo {
  package {['zookeeper','zookeeper-bin','zookeeperd']:
    ensure => latest,
  }
}
```
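
The nimbus, supervisor, and ui classes above each declare a `storm::service` resource, which is a defined type that this recipe does not list. The following is a hypothetical sketch of what such a type could look like, based only on the `start` and `jvm_memory` parameters and the "/etc/default" comments in those classes. The file path, file contents, service name, and parameter defaults are all assumptions; the real definition is in the storm-puppet repository and may differ.

```puppet
# HYPOTHETICAL sketch of modules/storm/manifests/service.pp.
define storm::service (
  $start      = 'no',
  $jvm_memory = '768m'
) {
  # Map the yes/no flag onto a service state.
  $service_ensure = $start ? {
    'yes'   => 'running',
    default => 'stopped',
  }

  # Per-daemon defaults file, for example /etc/default/storm-nimbus
  # (assumed path and assumed variable names inside the file).
  file { "/etc/default/storm-${name}":
    require => Package['storm'],
    content => "ENABLE=\"${start}\"\nSTORM_JVM_MEMORY=\"${jvm_memory}\"\n",
    owner   => 'root',
    group   => 'root',
    mode    => '0644',
  }

  # Manage the daemon and restart it when its settings change
  # (the service name storm-<daemon> is an assumption).
  service { "storm-${name}":
    ensure    => $service_ensure,
    subscribe => [
      File["/etc/default/storm-${name}"],
      File['/etc/storm/storm.yaml'],
    ],
  }
}
```

A defined type suits this case because the same resources are needed once per daemon, with only the title and memory setting changing between declarations.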

Once all the files have been created, initialize the Git repository and push it to bitbucket.org.
In order to actually run the provisioning, navigate to the vagrant-storm-cluster folder and run the following command:
```
vagrant up
```
If you would like to ssh into any of the nodes, run the following command:
```
vagrant ssh nimbus
```
Replace nimbus with your required node name.

How it works…

There are various patterns that can be applied when using Puppet. The simplest is a distributed model, in which nodes provision themselves, as opposed to a centralized model built around a Puppet Master. In the distributed model, updating server configuration simply requires that you update your provisioning manifests and push them to your central Git repository. The various nodes then pull and apply this configuration. This can be achieved through cron jobs, triggers, or a Continuous Delivery tool such as Jenkins, Bamboo, or Go. In the development environment, Vagrant invokes provisioning explicitly through the following line in the Vagrantfile:

```
config.vm.provision :shell, :inline => "puppet apply /tmp/storm-puppet/manifests/site.pp --verbose --modulepath=/tmp/storm-puppet/modules/ --debug"
```

Puppet then applies the manifest. Puppet is declarative, in that each language element specifies the desired state together with methods for getting there. This means that, when the system is already in the required state, that particular provisioning step is skipped, which avoids the adverse effects of duplicate provisioning.
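
As a small illustration of that behavior, the sketch below contrasts a resource that is declarative by nature with an `exec` that has to be guarded by hand. The directory path and the /tmp download location are assumptions made for the example and do not come from the recipe.

```puppet
# A file resource describes an end state. Puppet compares it with the
# node on every run and acts only when something has drifted.
file { '/var/log/storm-example':   # hypothetical path
  ensure => directory,
  owner  => 'root',
  group  => 'root',
  mode   => '0755',
}

# An exec is an imperative command, so Puppet cannot infer its end state.
# 'creates' supplies one: the command is skipped when the file exists.
exec { 'fetch_storm_package':
  command => '/usr/bin/wget https://bitbucket.org/qanderson/storm-deb-packaging/downloads/storm_0.8.1_all.deb',
  cwd     => '/tmp',                         # assumed download directory
  creates => '/tmp/storm_0.8.1_all.deb',
}
```

Without a guard such as `creates`, `unless`, or `onlyif`, an `exec` runs on every Puppet run, which is the kind of duplicate provisioning the declarative model is meant to avoid.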

The storm-puppet project is therefore cloned onto the node and then the manifest is applied locally. Each node only applies provisioning for itself, based on the hostname specified in the site.pp manifest, for example:

```
node 'storm.nimbus' {
  $cluster = 'storm1'
  include storm::nimbus
  include storm::ui
}
```

In this case, the nimbus node uses the Hiera configuration for the storm1 cluster, and the nimbus and ui classes are applied to it. Any combination of classes can be included in the node definition, thus allowing the complete environment to be succinctly defined.
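
For instance, a single machine could take on every role by including all four classes. This is a sketch only; the hostname `storm.standalone` is hypothetical and does not appear in the recipe's cluster.

```puppet
# Hypothetical all-in-one node for local experiments.
node 'storm.standalone' {
  $cluster = 'storm1'
  include storm::zoo         # ZooKeeper
  include storm::nimbus      # master daemon
  include storm::supervisor  # worker daemon
  include storm::ui          # web UI
}
```

Because `include` can be repeated safely, the shared `storm::config` and `storm::params` classes pulled in by each role are still evaluated only once.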
