Mastering Apache Storm

By: Ankit Jain

Overview of this book

Apache Storm is a real-time Big Data processing framework that processes large amounts of data reliably, guaranteeing that every message will be processed. Storm lets you scale your processing as your data grows, making it an excellent platform for solving big data problems. This extensive guide takes you from the basics to the advanced topics of Storm. The book begins with a detailed introduction to real-time processing and to where Storm fits in. You’ll learn to deploy Storm on a cluster and to write a basic Storm Hello World example. Next, we’ll introduce Trident, and you’ll get a clear understanding of how to develop and deploy a Trident topology. We cover topics such as monitoring, Storm parallelism, the scheduler, and log processing in an easy-to-understand manner. You will also learn how to integrate Storm with other well-known Big Data technologies such as HBase, Redis, Kafka, and Hadoop to realize the full potential of Storm. With real-world examples and clear explanations, this book will give you a thorough mastery of Apache Storm, which you can use to develop efficient, distributed real-time applications that meet your business needs.

Title Page

Credits

About the Author

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

Real-Time Processing and Storm Introduction

Programming languages

Summary

Storm Deployment, Topology Development, and Topology Options

Storm prerequisites

Setting up the Storm cluster

Developing the hello world example

The different options of the Storm topology

Walkthrough of the Storm UI

Dynamic log level settings

Summary

Storm Parallelism and Data Partitioning

Parallelism of a topology

Rebalance the parallelism of a topology

Different types of stream grouping in the Storm cluster

Guaranteed message processing

Tick tuple

Summary

Trident Introduction

Trident introduction

Understanding Trident's data model

Writing Trident functions, filters, and projections

Trident repartitioning operations

Trident aggregator

Utilizing the groupBy operation

When to use Trident

Summary

Trident Topology and Uses

Trident groupBy operation

Non-transactional topology

Trident hello world topology

Trident state

Distributed RPC

When to use Trident

Summary

Storm Scheduler

Introduction to Storm scheduler

Default scheduler

Isolation scheduler

Resource-aware scheduler

Custom scheduler

Summary

Monitoring of Storm Cluster

Cluster statistics using the Nimbus thrift client

Monitoring the Storm cluster using JMX

Monitoring the Storm cluster using Ganglia

Summary

Integration of Storm and Kafka

Introduction to Kafka

Kafka architecture

Installation of Kafka brokers

Share ZooKeeper between Storm and Kafka

Kafka producers and publishing data into Kafka

Kafka Storm integration

Deploy the Kafka topology on Storm cluster

Summary

Storm and Hadoop Integration

Introduction to Hadoop

Installation of Hadoop

Write Storm topology to persist data into HDFS

Integration of Storm with Hadoop

Setting up Storm-YARN

Storm-Starter topologies on Storm-YARN

Summary

Storm Integration with Redis, Elasticsearch, and HBase

Integrating Storm with HBase

Integrating Storm with Redis

Integrating Storm with Elasticsearch

Integrating Storm with Esper

Summary

Apache Log Processing with Storm

Apache log processing elements

Producing Apache log in Kafka using Logstash

Splitting the Apache log line

Identifying country, operating system type, and browser type from the log file

Calculate the search keyword

Persisting the process data

Kafka spout and define topology

Deploy topology

MySQL queries

Summary

Twitter Tweet Collection and Machine Learning

Exploring machine learning

Twitter sentiment analysis

Kafka spout, sentiments bolt, and HDFS bolt

Summary

Preface

Real-time data processing is no longer a luxury exercised by a few big companies but has become a necessity for businesses that want to compete, and Apache Storm is one of the de facto standards for developing real-time processing pipelines. The key features of Storm are that it is horizontally scalable, is fault tolerant, and provides guaranteed message processing. Storm can solve various types of analytic problems: machine learning, log processing, graph analysis, and so on.

Mastering Apache Storm serves both as a getting-started guide for inexperienced developers and as a reference for experienced developers implementing advanced use cases with Storm. In the first two chapters, you will learn the basics of a Storm topology and the various components of a Storm cluster. In the later chapters, you will learn how to build a Storm application that can interact with various other big data technologies and how to create transactional topologies. Finally, the last two chapters cover case studies for log processing and machine learning. We also cover how to use the Storm scheduler to assign dedicated work to dedicated machines.

What this book covers

Chapter 1, Real-Time Processing and Storm Introduction, gives an introduction to Storm and its components.

Chapter 2, Storm Deployment, Topology Development, and Topology Options, covers deploying Storm on a cluster, deploying the sample topology on a Storm cluster, monitoring the Storm pipeline using the Storm UI, and dynamically changing the log level settings.

Chapter 3, Storm Parallelism and Data Partitioning, covers the parallelism of a topology, how to configure parallelism at the code level, guaranteed message processing, and the tuples that Storm generates internally.

Chapter 4, Trident Introduction, covers an introduction to Trident, an understanding of the Trident data model, and how we can write Trident filters and functions. This chapter also covers repartitioning and aggregation operations on Trident tuples.

Chapter 5, Trident Topology and Uses, introduces Trident tuple grouping, non-transactional topology, and a sample Trident topology. The chapter also introduces Trident state and distributed RPC.

Chapter 6, Storm Scheduler, covers different types of scheduler available in Storm: the default scheduler, isolation scheduler, resource-aware scheduler, and custom scheduler.

Chapter 7, Monitoring of the Storm Cluster, covers monitoring Storm by writing custom monitoring UIs using the stats published by Nimbus. We explain the integration of Ganglia with Storm using JMXTrans. This chapter also covers how we can configure Storm to publish JMX metrics.

Chapter 8, Integration of Storm and Kafka, shows the integration of Storm with Kafka. This chapter starts with an introduction to Kafka, covers the installation of Kafka, and ends with the integration of Storm with Kafka to solve real-world problems.

Chapter 9, Storm and Hadoop Integration, covers an overview of Hadoop, writing the Storm topology to publish data into HDFS, an overview of Storm-YARN, and deploying the Storm topology on YARN.

Chapter 10, Storm Integration with Redis, Elasticsearch, and HBase, teaches you how to integrate Storm with various other big data technologies.

Chapter 11, Apache Log Processing with Storm, covers a sample log processing application in which we parse Apache web server logs and generate some business information from log files.

Chapter 12, Twitter Tweet Collection and Machine Learning, walks you through a case study implementing a machine learning topology in Storm.
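
As a small taste of the data partitioning idea that Chapter 3 covers, the following is a minimal, self-contained sketch of how tuples with the same key can be routed to the same task. It does not use the Storm API and is not Storm's actual implementation: the class name FieldsGroupingSketch and the methods chooseTask and partition are hypothetical names introduced only for this illustration, and the hash-modulo rule is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain Java, no Storm classes involved.
class FieldsGroupingSketch {

    // Map a key to a task index in [0, numTasks).
    // Equal keys always produce the same index, so they reach the same task.
    static int chooseTask(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Distribute the words over numTasks buckets, one bucket per task.
    static List<List<String>> partition(List<String> words, int numTasks) {
        List<List<String>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new ArrayList<>());
        }
        for (String word : words) {
            tasks.get(chooseTask(word, numTasks)).add(word);
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<String> words =
                Arrays.asList("storm", "kafka", "storm", "redis", "kafka", "storm");
        List<List<String>> tasks = partition(words, 3);
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("task " + i + " -> " + tasks.get(i));
        }
    }
}
```

Running the sketch prints one line per task. Every occurrence of a given word lands in the same bucket, which is the property a per-key computation such as a word count relies on.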

What you need for this book

All of the code in this book has been tested on CentOS 6.5. It will run on other variants of Linux and Windows as well with appropriate changes in commands.

We have tried to keep the chapters self-contained, and the setup and installation of all the software used in each chapter are included in the chapter itself. These are the software packages used throughout the book:

CentOS 6.5
Oracle JDK 8
Apache ZooKeeper 3.4.6
Apache Storm 1.0.2
Eclipse or Spring Tool Suite
Elasticsearch 2.4.4
Hadoop 2.2.2
Logstash 5.4.1
Kafka 0.9.0.1
Esper 5.3.0

Who this book is for

If you are a Java developer and are keen to enter into the world of real-time stream processing applications using Apache Storm, then this book is for you. No previous experience in Storm is required as this book starts from the basics. After finishing this book, you will be able to develop not-so-complex Storm applications.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Add the following line in the storm.yaml file of the Nimbus machine to enable JMX on the Nimbus node."

A block of code is set as follows:

<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-core</artifactId>
  <version>1.0.2</version>
  <scope>provided</scope>
</dependency>

Any command-line input or output is written as follows:

cd $ZK_HOME/conf
touch zoo.cfg

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Now, click on the Connect button to view the metrics of the supervisor node."

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important to us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support, and register to have the files e-mailed directly to you. You can download the code files by following these steps:

1. Log in or register on our website using your e-mail address and password.
2. Hover the mouse pointer over the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Apache-Storm. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringApacheStorm_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or the code, we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support, and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
