Learning Apache Apex

By: Thomas Weise, Ananth Gundabattula, Munagala V. Ramanath, David Yan, Kenneth Knowles

Overview of this book

Apache Apex is a next-generation stream processing framework designed to operate on data at large scale, with minimum latency, maximum reliability, and strict correctness guarantees. Half of the book consists of Apex applications, showing you key aspects of data processing pipelines such as connectors for sources and sinks, and common data transformations. The other half of the book is evenly split between explaining the Apex framework and tuning, testing, and scaling Apex applications. Much of our economic world depends on growing streams of data, such as social media feeds, financial records, and data from mobile devices, sensors, and machines (the Internet of Things, IoT). The projects in the book show how to process such streams to gain valuable, timely, and actionable insights. Traditional use cases, such as ETL, that currently consume a significant chunk of data engineering resources are also covered. The final chapter shows you future possibilities emerging in the streaming space, and how Apache Apex can contribute to them.

Unbounded data and continuous processing


Datasets can be classified as unbounded or bounded. Bounded data is finite; it has a beginning and an end. Unbounded data is an ever-growing, essentially infinite dataset. The distinction is independent of how the data is processed. Unbounded data is often equated with stream processing and bounded data with batch processing, but this is starting to change. We will see how state-of-the-art stream processors, such as Apache Apex, are fully capable of processing both unbounded and bounded data; there is no need for a separate batch processing system just because a dataset happens to be finite.

Note

For more details on these data processing concepts, you can visit the following link: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101.

Most big datasets (high volume) that are eventually processed by big data systems are unbounded. There is a rapidly increasing volume of such infinite data from sources such as IoT sensors (industrial gauges, automobile data ports, connected homes, quantified-self devices), stock markets and financial transactions, telecommunications towers and satellites, and so on. At the same time, legacy processing and storage systems are either nearing their performance and capacity limits, or their total cost of ownership (TCO) is becoming prohibitive.

Businesses need to convert the available data into meaningful insights and make data-driven, real-time decisions to remain competitive.

Organizations are increasingly relying on very fast processing (high velocity), as the value of data diminishes as it ages.

How were these unbounded datasets processed without a streaming architecture?

To be consumable by a batch processor, the stream had to be divided into bounded chunks, often at intervals of hours. Before processing could begin, the earliest events had to wait a long time for their batch to be complete. By the time of processing, the data was already old and less valuable.

Stream processing

Stream processing means processing event by event, as soon as each event is available. Because there is no waiting for more input after an event arrives, there is no artificially added latency (unlike with batching). This is important for real-time use cases, where information should be processed and results made available with minimum latency or delay. However, stream processing is not limited to real-time data. We will see that there are benefits to applying this continuous processing in a uniform manner to historical data as well.

Consider data that is stored in a file. By reading the file line by line and processing each line as soon as it is read, subsequent processing steps can be performed while the file is still being read, instead of waiting for the entire input before initiating the next stage. Stream processing is a pipeline, and each item can be acted upon immediately. Apart from low latency, this can also lead to even resource consumption (memory, CPU, network) and steady (versus bursty) throughput, as long as the operations performed don't inherently require blocking:

An example of a data pipeline

Data flows through the pipeline as individual events, and all processing steps are active at the same time. In a distributed system, operations are performed on different nodes and data flows through the system, allowing for parallelism and high throughput. Processing is decentralized and without inherent bottlenecks, in contrast to architectures that attempt to move processing to where the data resides.
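As a concrete illustration, here is a minimal sketch in plain Java (not the Apex API) of line-by-line processing, where the process step is a hypothetical stand-in for a real transformation:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class LinePipeline {
        public static void main(String[] args) throws IOException {
            // Each line is handed to the next stage as soon as it is read;
            // downstream work proceeds while the file is still being read.
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    process(line);
                }
            }
        }

        // Hypothetical downstream stage; in a real pipeline this might filter,
        // enrich, or forward the event to the next operator.
        static void process(String line) {
            System.out.println(line.toUpperCase());
        }
    }

Nothing here waits for the end of the input: the first result can appear after the first line, which is the essential property of a streaming pipeline.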

Stream processing is a natural fit for how events occur in the real world. Sources generate data continuously (mobile devices, financial transactions, web traffic, sensors, and so on). It therefore makes sense to also process them that way instead of artificially breaking the processing into batches (or micro-batches).

The meaning of real time, or time for fast decision making, varies significantly between businesses. Some use cases, such as online fraud detection, may require processing to complete within milliseconds, but for others multiple seconds or even minutes might be sufficiently fast. In any case, the underlying platform needs to be equipped for fast and correct low-latency processing.

Streaming applications can process data fast, with low latency. Stream processing has gained popularity along with the growing demand for faster processing of current data, but it is not a synonym for real-time processing. Input data does not need to be real-time: older data can also be processed as a stream (for example, by reading from a file), and results are not always emitted in real time either. Stream processing can also compute operations such as sum, average, or top that span multiple events before the result becomes available.

To perform such operations, the stream needs to be sliced at temporal boundaries. This is called windowing, and it demarcates the finite datasets over which a computation runs. All data belonging to a window needs to be observed before a final result can be emitted, and windowing provides these boundaries. There are different strategies for defining such windows over a data stream; these will be covered in Chapter 3, The Apex Library:

Windowing of a stream

In the preceding diagram we see the sum of incoming readings computed over tumbling (non-overlapping) and sliding (overlapping) windows. At the end of each window, the result is emitted.

With windowing, the final result of an operation for a given window is only known after all its data elements are processed. However, many windowed operations still benefit from event-by-event arrival of data and incremental computation. Windowing doesn't always mean that processing can only start once all input has arrived. In our example, the sum can be updated whenever the next event arrives, instead of storing all individual events and deferring the computation until the end of the window. Sometimes even the intermediate result of a windowed computation is of interest; it can be made available for downstream consumption and subsequently refined with the final result.
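To make this concrete, here is a minimal plain-Java sketch (not the Apex windowing API covered in Chapter 3) of an incrementally computed sum over tumbling windows; the millisecond timestamps and window length are assumptions for the example:

    import java.util.HashMap;
    import java.util.Map;

    public class TumblingSum {
        private final long windowMillis;
        // One running sum per window, updated incrementally on every event,
        // so individual events never need to be buffered.
        private final Map<Long, Double> sums = new HashMap<>();

        public TumblingSum(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        public void onEvent(long timestampMillis, double value) {
            // Tumbling windows don't overlap: each event falls into exactly one.
            long window = timestampMillis / windowMillis;
            sums.merge(window, value, Double::sum);
        }

        // Called once a window's end boundary has passed; the returned sum is final.
        public Double emit(long window) {
            return sums.remove(window);
        }
    }

A sliding window would differ only in that each event contributes to every window overlapping its timestamp, and the intermediate values held in sums show how a partial result can be exposed before a window closes.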

Stream processing systems

The first open source stream processing framework in the big data ecosystem was Apache Storm. Since then, several other Apache projects for stream processing have emerged. Next-generation, streaming-first architectures such as Apache Apex and Apache Flink come with stronger capabilities and are more broadly applicable. They are not only able to process data with low latency, but also provide state management (for data that an operation may require across individual events), strong processing guarantees (correctness), fault tolerance, scalability, and high performance.

Users can now also expect such frameworks to come with comprehensive libraries of connectors, other building blocks and APIs that make development of non-trivial streaming applications productive and allow for predictable project implementation cycles. Equally importantly, next-generation frameworks should cater to aspects such as operability, security, and the ability to run on shared infrastructure (multi-tenancy) to satisfy DevOps requirements for successful production launch and uptime.

Streaming can do it all!

Limitations of early stream processing systems led to the so-called Lambda Architecture: essentially a parallel setup of stream and batch processing paths that obtains fast but potentially unreliable results through the stream processor and, in parallel, correct but slow results through a batch processing system such as Apache Hadoop MapReduce:

The fast processing path in the preceding diagram can potentially produce incorrect results, hence the need to re-compute the same results in an alternate batch processing path. These correctness issues were caused by earlier technical limitations of stream processing, not by the paradigm itself. For example, if events are processed multiple times or lost, the result is double counting or undercounting, which would be a problem for any application that relies on accurate results, for example, in the financial sector.

This setup requires the same functionality to be implemented with two different frameworks, as well as extra infrastructure and operational skills, and therefore results in a longer time to production and a higher total cost of ownership (TCO). With recent advances in stream processing, the Lambda Architecture is no longer necessary. Instead, a unified streaming architecture can be used for reliable processing in a much more cost-effective solution.

This approach based on a single system was outlined in 2014 as Kappa Architecture, and today there are several stream processing technology options, including Apache Apex, that support batch as a special case of streaming.

Note

To know more about the Kappa Architecture, please refer to the following link: https://www.oreilly.com/ideas/questioning-the-lambda-architecture.

These newer systems are fault-tolerant, produce correct results, can achieve low latency as well as high throughput, and provide options for enterprise-grade operability and support. Potential users are no longer confronted with the shortcomings that previously justified a parallel batch processing system. We will later see how Apache Apex ensures correct processing, including its support for exactly-once processing.

What is Apex and why is it important?

Apache Apex (http://apex.apache.org/) is a stream processing platform and framework that can process data in-motion with low latency in a way that is highly scalable, highly performant, fault-tolerant, stateful, secure, distributed, and easily operable. Apex is written in Java, and Java is the primary application development environment.
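To give a first taste of that environment, the following is a minimal sketch of an Apex application assembling a two-operator pipeline. It assumes the com.datatorrent.api DAG API and the Malhar library's ConsoleOutputOperator in the form they took in the Apex 3.x releases; the NumberSource operator is an illustrative stand-in for a real connector:

    import org.apache.hadoop.conf.Configuration;

    import com.datatorrent.api.DAG;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.api.InputOperator;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.api.annotation.ApplicationAnnotation;
    import com.datatorrent.common.util.BaseOperator;
    import com.datatorrent.lib.io.ConsoleOutputOperator;

    @ApplicationAnnotation(name = "MyFirstApplication")
    public class Application implements StreamingApplication {

        // A hypothetical source operator emitting an ever-increasing counter;
        // real applications typically use a connector (Kafka, files, and so on).
        public static class NumberSource extends BaseOperator implements InputOperator {
            public final transient DefaultOutputPort<Long> out = new DefaultOutputPort<>();
            private long count;

            @Override
            public void emitTuples() {
                out.emit(count++);
            }
        }

        @Override
        public void populateDAG(DAG dag, Configuration conf) {
            // Operators are the processing steps of the pipeline ...
            NumberSource source = dag.addOperator("source", new NumberSource());
            ConsoleOutputOperator sink = dag.addOperator("sink", new ConsoleOutputOperator());
            // ... and streams connect an output port to one or more input ports.
            dag.addStream("numbers", source.out, sink.input);
        }
    }

Such an application is packaged and launched on a cluster; the API and the development workflow are covered in detail in later chapters.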

In a typical streaming data pipeline, events from sources are stored and transported through a system such as Apache Kafka. The events are then processed by a stream processor, and the results are delivered to sinks, which are frequently databases, distributed file systems, or message buses that link to downstream pipelines or services.

The following figure illustrates this:

In the end-to-end scenario depicted in this illustration, we see Apex as the processing component. The processing can be complex logic, with operations performed in sequence or in parallel in a distributed environment.

Apex runs on cluster infrastructure and currently supports and depends on Apache Hadoop, for which it was originally written. Support for Apache Mesos and other Docker-based infrastructure is on the roadmap.

Apex supports integration with many external systems out of the box, with connectors that are maintained and released by the project, including but not limited to the systems shown in the preceding diagram. The most frequently used connectors include Kafka and file readers. Frequently used sinks for the computed results are files and databases, though results can also be delivered directly to frontend systems for purposes such as real-time reporting directly from the Apex application, a use case that we will look at later.

Note

Origin of Apex

The development of the Apex project started in 2012, with the original vision of enabling fast, performant, and scalable real-time processing on Hadoop. At that time, batch processing and MapReduce-based frameworks such as Apache Pig, Hive, or Cascading were still the standard options for processing data. Hadoop 2.x with YARN (Yet Another Resource Negotiator) was about to be released, paving the way for a number of new processing frameworks and paradigms to become available as alternatives to MapReduce. Due to its roots in the Hadoop ecosystem, Apex is very well integrated with YARN and has, since its earliest days, offered features such as dynamic resource allocation for scaling and efficient recovery. It also leads in performance (with low latency), scalability, and operability, which were focus areas from the very beginning.

The technology was donated to the Apache Software Foundation (ASF) in 2015, at which time it entered the Apache incubation program and graduated after only eight months to achieve Top Level Project status in April 2016.

Apex had its first production deployments in 2014 and today is used in mission-critical deployments in various industries for processing at scale. Use cases range from very low-latency processing in the real-time category to large-scale batch processing; a few examples will be discussed in the next section. Some of the organizations that use Apex are listed on the Powered by Apache Apex page on the project website at https://apex.apache.org/powered-by-apex.html.