Example Project – ETL Using SQL | Learning Apache Apex


Learning Apache Apex

By : Ananth Gundabattula, Thomas Weise, Munagala V. Ramanath, David Yan, Kenneth Knowles

Overview of this book

Apache Apex is a next-generation stream processing framework designed to operate on data at large scale, with minimum latency, maximum reliability, and strict correctness guarantees. Half of the book consists of Apex applications, showing you key aspects of data processing pipelines such as connectors for sources and sinks, and common data transformations. The other half of the book is evenly split into explaining the Apex framework, and tuning, testing, and scaling Apex applications. Much of our economic world depends on growing streams of data, such as social media feeds, financial records, data from mobile devices, sensors and machines (the Internet of Things - IoT). The projects in the book show how to process such streams to gain valuable, timely, and actionable insights. Traditional use cases, such as ETL, that currently consume a significant chunk of data engineering resources are also covered. The final chapter shows you future possibilities emerging in the streaming space, and how Apache Apex can contribute to it.

Partitioning

As discussed in Chapter 4, Scalability, Low Latency, and Performance, stateless partitioning of a single operator can be accomplished by setting the PARTITIONER attribute. For the current example, we could partition CSVParser using the following configuration stanza:

<property>
  <name>apex.operator.CSVParser.attr.PARTITIONER</name>
  <value>com.datatorrent.common.partitioner.StatelessPartitioner:2</value>
</property>

If we want some section of the SQL pipeline to be partitioned in parallel, we can set the PARTITION_PARALLEL attribute on the input ports of the downstream operators in that section, as shown in this example:

<property>
  <name>apex.operator.LogicalFilter_1.inputport.input.attr.PARTITION_PARALLEL</name>
  <value>true</value>
</property>

With these changes, the physical DAG of our application would look like this:

Figure: The application's physical DAG
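
To make the combined effect of the two settings concrete, the following is a toy model in plain Java. It is a sketch only and does not use the Apex API. It assumes a simple linear chain in which LogicalFilter_1 sits directly downstream of CSVParser, and it adds a hypothetical final operator named Output that is not taken from the application. The class and method names (PhysicalPlanSketch, physicalCounts) are also invented for illustration. The rule it models: an operator whose input port has PARTITION_PARALLEL set takes the same number of partitions as its upstream operator, while any other operator uses the count given by its own partitioner, defaulting to one.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: NOT the Apex API. All names here are hypothetical.
public class PhysicalPlanSketch {

    /**
     * Computes how many physical instances each logical operator would get
     * in a linear chain under the simplified rule described above.
     *
     * @param chain             logical operators in upstream-to-downstream order
     * @param partitionerCounts operators with an explicit partition count
     *                          (models StatelessPartitioner:N)
     * @param parallelInputs    operators whose input port has
     *                          PARTITION_PARALLEL set to true
     */
    public static Map<String, Integer> physicalCounts(
            List<String> chain,
            Map<String, Integer> partitionerCounts,
            Set<String> parallelInputs) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        int upstream = 1; // partition count of the previous operator in the chain
        for (String op : chain) {
            // Parallel partitioning mirrors the upstream count one-to-one;
            // otherwise the operator's own partitioner decides (default 1).
            int n = parallelInputs.contains(op)
                    ? upstream
                    : partitionerCounts.getOrDefault(op, 1);
            counts.put(op, n);
            upstream = n;
        }
        return counts;
    }

    public static void main(String[] args) {
        // "Output" is a hypothetical operator added for illustration.
        List<String> chain = List.of("CSVParser", "LogicalFilter_1", "Output");
        Map<String, Integer> counts = physicalCounts(
                chain,
                Map.of("CSVParser", 2),       // StatelessPartitioner:2
                Set.of("LogicalFilter_1"));   // PARTITION_PARALLEL = true
        counts.forEach((op, n) -> System.out.println(op + " -> " + n));
    }
}
```

In this model CSVParser and LogicalFilter_1 each get two instances, while the hypothetical Output operator stays at one because it neither has a partitioner nor a parallel-partitioned input port. A real Apex physical plan contains more detail than this sketch captures, such as unifiers that merge partitioned streams.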
