
Use cases and case studies


Apex is a platform and framework on top of which specific applications (or solutions) are built.

As such, Apex is applicable to a wide range of use cases, including real-time machine learning model scoring, real-time ETL (Extract, Transform, and Load), predictive analytics, anomaly detection, real-time recommendations, and systems monitoring.

As organizations realize the financial and competitive importance of making data-driven decisions in real time, the number of industries and use cases will grow.

In the remainder of this section, we will discuss how companies in various industries are using Apex to solve important problems. These companies have presented their particular use cases, implementations and findings at conferences and meetups, and references to this source material are provided with each case study when available.

Real-time insights for Advertising Tech (PubMatic)

Companies in the advertising technology (AdTech) industry must cope with data volumes growing at breakneck speed, along with customers demanding faster insights and analytical reporting.

PubMatic is a leading AdTech company providing marketing automation for publishers and is driven by data at a massive scale. On a daily basis, the company processes over 350 billion bids, serves over 40 billion ad impressions, and processes over 50 terabytes of data. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance. Apex is used for real-time reporting and for the allocation engine.

In PubMatic's legacy batch processing system, there could be a delay of five hours to obtain updated data for their key metrics (revenues, impressions and clicks) and a delay of nine hours to obtain data for auction logs.

PubMatic decided to pursue a real-time streaming solution so that it could provide publishers, demand side platforms (DSPs), and agencies with actionable insights as close to the time of event generation as possible. PubMatic's streaming implementation had to achieve the following:

  • Ingest and analyze a high volume of clicks and views (200,000 events/sec) to help their advertising customers improve revenues
  • Utilize auction and client log data (22 TB/day) to report critical metrics for campaign monetization
  • Handle rapidly increasing network traffic with efficient utilization of resources
  • Provide a feedback loop to the ad server for making efficient ad serving decisions

This high-volume data would need to be processed in real time to derive actionable insights, such as campaign decisions and audience targeting.

PubMatic decided to implement its real-time streaming solution with Apex based on the following factors:

  • Time to value: the solution could be implemented within a short time frame
  • The Apex applications could run on PubMatic's existing Hadoop infrastructure
  • Apex had important connectors (files, Apache Kafka, and so on) available out of the box
  • Apex supported event-time dimensional aggregations with real-time query capability

With the Apex-based solution, deployed to production in 2014, PubMatic's end-to-end latency to obtain updated data and metrics for their two use cases fell from hours to seconds. This enabled real-time visibility into successes and shortcomings of its campaigns and timely tuning of models to maximize successful auctions.
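
To make the shape of such a pipeline concrete, here is a minimal sketch of how an Apex application is composed: operators are added to a DAG and connected by streams. It is not PubMatic's actual code; the class and operator names (AdEventPipeline, EventGenerator, EventCounter) are hypothetical, and a production pipeline would replace the generator with one of the out-of-the-box Malhar connectors mentioned above (for example, the Kafka input operator) and the counter with real dimensional aggregations.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.util.BaseOperator;

@ApplicationAnnotation(name = "AdEventPipeline")
public class AdEventPipeline implements StreamingApplication
{
  /** Stand-in source; a real application would use a Malhar connector such as the Kafka input operator. */
  public static class EventGenerator extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<String> out = new DefaultOutputPort<>();

    @Override
    public void emitTuples()
    {
      // Hypothetical ad event record: type, campaign, bid price
      out.emit("impression,campaign-42,0.002");
    }
  }

  /** Stand-in transformation; a real pipeline would compute dimensional aggregates of impressions, clicks, and revenue. */
  public static class EventCounter extends BaseOperator
  {
    private long count;

    public final transient DefaultInputPort<String> in = new DefaultInputPort<String>()
    {
      @Override
      public void process(String event)
      {
        count++; // count events arriving in each streaming window
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Operators become nodes of the DAG; streams connect their ports
    EventGenerator source = dag.addOperator("source", new EventGenerator());
    EventCounter counter = dag.addOperator("counter", new EventCounter());
    dag.addStream("events", source.out, counter.in);
  }
}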

Industrial IoT applications (GE)

General Electric (GE) is a large, diversified company with business units in energy, power, aviation, transportation, healthcare, finance, and other industries. Many of these business units deal in industrial machinery and devices such as wind turbines, aviation components, locomotive components, healthcare imaging machines, and so on. Such industrial devices continually generate high volumes of real-time data, and GE decided to provide advanced IoT analytics solutions to the thousands of customers using these devices and sensors across its various business units and industries.

The GE Predix platform enables users to develop and execute Industrial IoT applications to gain real-time insights about their devices and their usage, as well as take actions based on these insights. Certain services offered by Predix are powered by Apache Apex. GE selected Apex for these services based on the following features (feature details will be covered later in this book):

  • High performance and distributed computing
  • Dynamic partitioning (see the sketch after this list)
  • Rich library of existing operators
  • Support for at-least-once, at-most-once, and exactly-once processing
  • Hadoop/YARN compatibility
  • Fault tolerance and platform stability
  • Ease of deployment and operability
  • Enterprise grade security
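
As a simplified illustration of how two of the features above, dynamic partitioning and the per-operator processing guarantee, appear in application code, the sketch below sets the corresponding DAG attributes on a hypothetical operator. It is not GE's implementation: the application, operator names, and partition count are illustrative only, and attributes such as these can typically also be supplied through configuration files rather than Java code.

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.api.InputOperator;
import com.datatorrent.api.Operator.ProcessingMode;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.common.partitioner.StatelessPartitioner;
import com.datatorrent.common.util.BaseOperator;

@ApplicationAnnotation(name = "DeviceDataPipeline")
public class DeviceDataPipeline implements StreamingApplication
{
  /** Hypothetical stand-in for a device/sensor feed. */
  public static class DeviceSource extends BaseOperator implements InputOperator
  {
    public final transient DefaultOutputPort<Double> out = new DefaultOutputPort<>();

    @Override
    public void emitTuples()
    {
      out.emit(Math.random()); // placeholder sensor reading
    }
  }

  /** Hypothetical per-reading transformation that benefits from running in parallel partitions. */
  public static class ReadingMonitor extends BaseOperator
  {
    public final transient DefaultInputPort<Double> in = new DefaultInputPort<Double>()
    {
      @Override
      public void process(Double reading)
      {
        if (reading > 0.99) {
          System.out.println("threshold exceeded: " + reading); // placeholder action
        }
      }
    };
  }

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    DeviceSource source = dag.addOperator("source", new DeviceSource());
    ReadingMonitor monitor = dag.addOperator("monitor", new ReadingMonitor());
    dag.addStream("readings", source.out, monitor.in);

    // Run four parallel instances of the monitor; Apex can also
    // repartition dynamically at runtime based on operator statistics.
    dag.setAttribute(monitor, OperatorContext.PARTITIONER,
        new StatelessPartitioner<ReadingMonitor>(4));

    // Choose the processing guarantee per operator; at-least-once is the
    // default, with at-most-once and exactly-once also available.
    dag.setAttribute(monitor, OperatorContext.PROCESSING_MODE,
        ProcessingMode.AT_LEAST_ONCE);
  }
}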

One Predix service that runs on Apex is the Time Series service, which leverages Apex due to its speed, scalability, high performance, and fault tolerance capabilities.

The service provides:

  • Efficient storage of time series data
  • Data indexing for quick retrieval
  • Industrial-focused query modes
  • High availability and horizontal scalability
  • Millisecond data point precision

By running Apex, users of the Time Series service are able to:

  • Ingest and analyze high-volume, high-speed data from thousands of devices and sensors per customer in real time without data loss
  • Run predictive analytics to reduce costly maintenance and improve customer service
  • Conduct unified monitoring of all connected sensors and devices to minimize disruptions
  • Have fast application development cycles
  • Meet changing business and application workloads due to Apex's high scalability

Another Predix service leveraging Apex is the Stream Processing service, which provides predefined flows to support data conversion, manipulation, or processing of large volumes of real-time data before delivering it to the event hub or storage layer. This service provides the following capabilities to users:

  • Raw data ingestion
  • Fault tolerance, allowing data to be processed despite machine or node failures
  • Apex as the runtime engine (Spark and other engines will be supported in future releases)
  • Multi-tenancy support
  • Security (UAA integrated)

Apex's integration into the GE Predix platform, and its use across a broad spectrum of industrial devices and Industrial IoT use cases, speaks volumes about Apex and its capabilities.

Real-time threat detection (Capital One)

Capital One is currently the eighth largest bank in the U.S. One of its core areas of business was facing vast and increasing costs for an existing solution to guard against digital threats. The bank set out to find a new solution that would deliver better performance while also being more cost effective.

At the time, Capital One was processing several thousand transactions every second. The bank's innovation team established that the solution must be able to process data with latency in the low double-digit milliseconds, scale easily, run its internal algorithms with zero data loss, and be highly available. Additionally, the team realized that tackling this challenge would require dynamic and flexible machine learning algorithms in a real-time distributed environment.

The team launched a rigorous process of evaluating numerous streaming technologies including Apache Apex, Apache Flink, Apache Storm, Apache Spark Streaming, IBM Infosphere Streams, Apache Samza, Apache Ignite, and others. The evaluation process involved developing parallel solutions using each of the technologies, and comparing the quantitative results generated by each technology as well as its qualitative characteristics.

At the conclusion of the evaluation, only one technology emerged as being able to meet all of Capital One's requirements. In the team's own words:

"Of all evaluated technologies, Apache Apex is the only technology that is ready to bring the decision making solution to production based on: Maturity, Fault Tolerance, Enterprise-Readiness, and Performance."

With Apache Apex, Capital One was able to:

  • Achieve latency in single-digit milliseconds, significantly lower than the double-digit millisecond target the bank set out to achieve and a hard requirement for use cases such as online transactions
  • Meet the SLA requirements of continuously running the data pipeline applications with 99.999% uptime on a 24x7 basis, with automatic failover
  • Reduce the total cost of ownership, based on Apex's ability to run on Hadoop and scale out with commodity grade hardware
  • Easily add newer applications and features to accurately detect suspicious events without being tied to the vendor roadmap and timeline
  • Focus on core business algorithms and innovation, while the platform took care of fault tolerance, operability, scalability, and performance

Furthermore, Capital One's implementation of Apex enabled the following:

  • Parallel model scoring (sketched in the example after this list)
  • Dynamic scalability based on throughput or latency
  • Live model refresh
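
The following is a purely hypothetical sketch, not Capital One's implementation, of what a scoring operator along these lines could look like in Apex. One input port receives transaction features (Apex can partition the operator so that many instances score in parallel), a second port accepts refreshed model coefficients at runtime, and the scoring logic is a toy linear model.

import java.util.Map;

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class FraudScoringOperator extends BaseOperator
{
  // Current model coefficients, keyed by feature name; replaced wholesale on refresh
  private Map<String, Double> model;

  /** Emits one risk score per transaction. */
  public final transient DefaultOutputPort<Double> scores = new DefaultOutputPort<>();

  /** Transaction features; Apex can partition this operator so many instances score in parallel. */
  public final transient DefaultInputPort<Map<String, Double>> transactions =
      new DefaultInputPort<Map<String, Double>>()
  {
    @Override
    public void process(Map<String, Double> features)
    {
      double score = 0.0;
      if (model != null) {
        for (Map.Entry<String, Double> feature : features.entrySet()) {
          Double weight = model.get(feature.getKey());
          if (weight != null) {
            score += weight * feature.getValue(); // toy linear model, for illustration only
          }
        }
      }
      scores.emit(score);
    }
  };

  /** Live model refresh: new coefficients arriving on this port replace the old model. */
  public final transient DefaultInputPort<Map<String, Double>> modelUpdates =
      new DefaultInputPort<Map<String, Double>>()
  {
    @Override
    public void process(Map<String, Double> newModel)
    {
      model = newModel;
    }
  };
}

In a partitioned deployment, the model-update stream would need to reach every partition so that all instances switch to the refreshed model together.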

Parameter      | Capital One's Goal                                                      | Result with Apex
---------------|-------------------------------------------------------------------------|-------------------
Latency        | < 40 milliseconds                                                       | 0.25 milliseconds
Throughput     | 2,000 events/sec                                                        | 70,000 events/sec
Durability     | No loss; every message gets exactly one response                        | Yes
Availability   | 99.5% uptime, ideally 99.999% uptime                                    | 99.99925% uptime
Scalability    | Can add resources and still meet latency requirements                   | Yes
Integration    | Transparently connected to existing systems: hardware, messaging, HDFS  | Yes
Open Source    | All components licensed as open source                                  | Yes
Extensibility  | Rules can be updated; model is regularly refreshed                      | Yes

A complete set of Capital One's goals, and the results it achieved with Apex

Silver Spring Networks (SSN)

Silver Spring Networks (SSN) helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. It provides smart grid products and also develops software for utilities and customers to improve their energy efficiency. SSN is one of the world's largest IoT companies, receiving data from over 22 million smart meters and connected devices, reading over 200 billion records per year, and conducting over two million remote operations per year.

As SSN's network grew, along with the volume, variety, and velocity of its data, the company began to ask:

  • How to obtain more value out of its network of connected devices
  • How to manage the growing number of devices, access their data, and ensure the safety of their data
  • How to integrate with third party data applications quickly

SSN's answer to these questions would be informed by its needs, which included:

  • A broad variety of incoming data, including sensor data, meter data, interval data, device metadata, threshold events, and traps
  • Multi-tenancy and shared resources to save costs, with centralized management of software and applications
  • Security, including encryption of both data-at-rest and data-in-motion, auditing of data, and no loss of data across tenants
  • Ability to scale to the millions of connected devices in its network, as well as to over eight billion events and more than 500 GB of data per day
  • High availability and disaster recoverability of its cluster, with automated failovers as well as rolling upgrades

SSN chose Apex as its solution due to the following factors:

  • The availability of pre-existing and prebuilt operators as part of the Apex Malhar library
  • The ability to develop applications quickly
  • Apex's operability and auto-scaling capabilities
  • Apex's partitioning capabilities, leading to scalability
  • The ability for Java programmers to learn Apex application development quickly
  • Operations handled by Apex, without requiring hands-on management

In addition to meeting SSN's requirements, Apex made SSN's data accessible to applications without delay, improving customer service, and enabled the capture and analysis of historical data to understand and improve grid operations.
