Storm Real-time Processing Cookbook

By: Quinton Anderson

Overview of this book

Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use! Storm Real-time Processing Cookbook offers basic to advanced recipes on Storm for real-time computation. The book begins with setting up the development environment and then teaches log stream processing. It then covers a real-time payments workflow, distributed RPC, integration with other software such as Hadoop and Apache Camel, and more.

Credits

About the Author

About the Reviewers

www.packtpub.com

Preface

Setting Up Your Development Environment

Introduction

Setting up your development environment

Distributed version control

Creating a "Hello World" topology

Creating a Storm cluster – provisioning the machines

Creating a Storm cluster – provisioning Storm

Deriving basic click statistics

Unit testing a bolt

Implementing an integration test

Deploying to the cluster

Log Stream Processing

Introduction

Creating a log agent

Creating the log spout

Rule-based analysis of the log stream

Indexing and persisting the log data

Counting and persisting log statistics

Creating an integration test for the log stream cluster

Creating a log analytics dashboard

Calculating Term Importance with Trident

Introduction

Creating a URL stream using a Twitter filter

Deriving a clean stream of terms from the documents

Calculating the relative importance of each term

Distributed Remote Procedure Calls

Introduction

Using DRPC to complete the required processing

Integration testing of a Trident topology

Implementing a rolling window topology

Simulating time in integration testing

Polyglot Topology

Introduction

Implementing the multilang protocol in Qt

Implementing the SplitSentence bolt in Qt

Implementing the count bolt in Ruby

Defining the word count topology in Clojure

Integrating Storm and Hadoop

Introduction

Implementing TF-IDF in Hadoop

Persisting documents from Storm

Integrating the batch and real-time views

Real-time Machine Learning

Introduction

Implementing a transactional topology

Creating a Random Forest classification model using R

Operational classification of transactional streams using Random Forest

Creating an association rules model in R

Creating a recommendation engine

Real-time online machine learning

Continuous Delivery

Introduction

Setting up a CI server

Setting up system environments

Defining a delivery pipeline

Implementing automated acceptance testing

Storm on AWS

Introduction

Deploying Storm on AWS using Pallet

Setting up a Virtual Private Cloud

Deploying Storm into Virtual Private Cloud using Vagrant

Index

Preface

Open source has changed the software landscape in many fundamental ways. There are many arguments that can be made for and against using open source in any given situation, largely in terms of support, risk, and total cost of ownership. Open source is more popular in certain settings than others, such as research institutions versus large institutional financial service providers. Within the emerging areas of web service providers, content provision, and social networking, open source is dominating the landscape. This is true for many reasons, cost being a large one among them. These solutions that need to grow to "Web scale" have been classified as "Big Data" solutions, for want of a better term. These solutions serve millions of requests per second with extreme levels of availability, all the while providing customized experiences for customers across a wide range of services.

Designing systems at this scale requires us to think about problems differently, architect solutions differently, and learn where to accept complexity and where to avoid it. As an industry, we have come to grips with designing batch systems that scale. Large-scale computing clusters following MapReduce, Bulk Synchronous Parallel, and other computational paradigms are widely implemented and well understood. The surge of innovation has been driven and enabled by open source, leaving even the top software vendors struggling to integrate Hadoop into their technology stack, never mind trying to implement some level of competition to it.

Customers, however, have developed an insatiable desire for more: more data, more services, more value, and more convenience, and they want it now and at lower cost. As the sheer volume of data increases, the demand for real-time response increases too. The next phase of computational platforms has arrived, and it is focused on real time, at scale. It presents many unique challenges, both theoretical and practical.

This cookbook will help you master one such platform, the Storm processor. The Storm processor is an open source, real-time computational platform written by Nathan Marz at BackType, a social analytics company. BackType was later acquired by Twitter, and Storm was released as open source. It has since thrived in an ever-expanding open source community of users, contributors, and success stories within production sites. At the time of writing this preface, the project had more than 6,000 stars on GitHub and 3,000 Twitter followers, had been benchmarked at over a million transactions per second per node, and had almost 80 reference customers with production instances of Storm.

These are extremely impressive figures. Moreover, you will find by the end of this book that it is also extremely enjoyable to deliver systems based on Storm, using whichever language is congruent with your way of thinking and delivering solutions.

This book is designed to teach you Storm with a series of practical examples. These examples are grounded in real-world use cases, and introduce various concepts as the book unfolds. Furthermore, the book is designed to promote DevOps practice around the Storm technology, enabling the reader to develop Storm solutions and deliver them reliably into production, where they create value.

An introduction to the Storm processor

A common criticism of open source projects is their lack of documentation. Storm does not suffer from this particular issue; the documentation for the project is excellent, well written, and well supplemented by the vibrant user community. This cookbook does not seek to duplicate this documentation but rather to supplement it, driven largely by examples, with conceptual and theoretical discussion where required. It is highly recommended that the reader take the time to read the Storm introductory documentation before proceeding to Chapter 1, Setting Up Your Development Environment, in particular the introductory pages of the Storm wiki.

What this book covers

Chapter 1, Setting Up Your Development Environment, will demonstrate the process of setting up a local development environment for Storm; this includes all required tooling and suggested development workflows.

Chapter 2, Log Stream Processing, will lead the reader through the process of creating a log stream processing solution, complete with a base statistics dashboard and log-searching capability.

Chapter 3, Calculating Term Importance with Trident, will introduce the reader to Trident, a data-flow abstraction that works on top of Storm to enable highly productive enterprise data pipelines.

Chapter 4, Distributed Remote Procedure Calls, will teach the user how to implement distributed remote procedure calls.

Chapter 5, Polyglot Topology, will guide the reader through developing a polyglot topology and adding new technologies to the list of those already supported.

Chapter 6, Integrating Storm and Hadoop, will guide the user through the process of integrating Storm with Hadoop, thus creating a complete Lambda architecture.

Chapter 7, Real-time Machine Learning, will provide the reader with a very basic introduction to machine learning as a topic, and will present various approaches to implementing it in real-time projects based on Storm.

Chapter 8, Continuous Delivery, will demonstrate how to set up a Continuous Delivery pipeline and deliver a Storm cluster reliably into an environment.

Chapter 9, Storm on AWS, will guide the user through various approaches to automated provisioning of a Storm cluster into the Amazon Web Services (AWS) cloud.

What you need for this book

This book assumes a base environment of Ubuntu or Debian. The first chapter will guide the reader through the process of setting up the remaining required tooling. If the reader does not use Ubuntu or Debian as a development operating system, any *nix-based system is preferred, as all the recipes assume the existence of a bash command interface.

Who this book is for

Storm Real-time Processing Cookbook is ideal for developers who would like to learn real-time processing or would like to learn how to use Storm for real-time processing. It's assumed that you are a Java developer. Clojure, C++, and Ruby experience would be useful but is not essential. It would also be useful to have some experience with Hadoop or similar technologies.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "You must then create your first spout by creating a new class named HelloWorldSpout, which extends from BaseRichSpout and is located in the storm.cookbook package."

A block of code is set as follows:

<repositories>
  <repository>
    <id>github-releases</id>
    <url>http://oss.sonatype.org/content/repositories/github-releases/</url>
  </repository>
  <repository>
    <id>clojars.org</id>
    <url>http://clojars.org/repo</url>
  </repository>
  <repository>
    <id>twitter4j</id>
    <url>http://twitter4j.org/maven2</url>
  </repository>
</repositories>

Any command-line input or output is written as follows:

mkdir FirstGitProject
cd FirstGitProject
git init

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Uncheck the Use default location checkbox."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Open source versions of the code are maintained by the author at his Bitbucket account: https://bitbucket.org/qanderson.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
