Building Big Data Pipelines with Apache Beam

By: Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)
Section 1 Apache Beam: Essentials
Section 2 Apache Beam: Toward Improving Usability
Section 3 Apache Beam: Advanced Concepts

Further development of Apache Beam SQL

In this section, we will summarize the likely further development of Apache Beam SQL and outline which parts are currently missing or incomplete.

At the end of the previous chapter, we described retract and upsert streams and defined time-varying relations on top of them. Although generic retractions are part of the Apache Beam model, they are not implemented at the moment, and the same is true for SQL. Among other things, this implies that Apache Beam SQL currently does not support full stream-to-stream joins, only windowed joins.
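To make the notion of a retract stream concrete, here is a small standalone Python sketch (not Beam code; all names are illustrative) of a continuously updated, unwindowed aggregate. Because each new element changes a result that has already been published, the stream must first retract the old value before emitting the new one:

```python
from collections import defaultdict

def retract_stream(events):
    """Turn a keyed (key, value) event stream into a retract stream
    for a running per-key sum.

    Whenever a key's aggregate changes, we first emit a retraction
    ('-') of the previously published value and then an addition
    ('+') of the new value, so a downstream consumer can maintain
    an exact, up-to-date view of the time-varying relation.
    """
    totals = defaultdict(int)
    out = []
    for key, value in events:
        if key in totals:
            out.append(('-', key, totals[key]))  # retract the old result
        totals[key] += value
        out.append(('+', key, totals[key]))      # emit the updated result
    return out

# A running sum over an unbounded stream: every update retracts the last one.
print(retract_stream([('a', 1), ('a', 2), ('b', 5)]))
# [('+', 'a', 1), ('-', 'a', 1), ('+', 'a', 3), ('+', 'b', 5)]
```

This is exactly the capability that generic retractions would provide inside the Beam model; without them, an unwindowed stream-to-stream join cannot correct results it has already emitted.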

A windowed join, by itself, does not guarantee that retractions will not be needed, but when using a default trigger without allowed lateness (or a trigger that fires only after the end of the window plus the allowed lateness), no retractions are needed. The reason for this is that all the data is projected onto the timestamp at the end of the window, and the window ends...
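To illustrate why this setting avoids retractions, the following plain Python sketch (again not the Beam API; names are illustrative) models a tumbling-window sum that fires exactly once per window, with each result projected onto the end-of-window timestamp. Because every pane is final, nothing emitted ever has to be taken back:

```python
from collections import defaultdict

def tumbling_window_sum(events, window_size):
    """Sum (key, value, timestamp) events per key per tumbling window.

    This models a default trigger with zero allowed lateness: each
    window fires exactly once when the watermark passes its end, so
    every emitted result is final and no retractions are required.
    """
    windows = defaultdict(int)
    for key, value, ts in events:
        # Project each element onto the end timestamp of its window.
        window_end = (ts // window_size + 1) * window_size
        windows[(key, window_end)] += value
    # Exactly one final pane per (key, window).
    return sorted(
        (end, key, total) for (key, end), total in windows.items()
    )

print(tumbling_window_sum(
    [('a', 1, 0), ('a', 2, 5), ('a', 4, 12), ('b', 3, 7)],
    window_size=10,
))
# [(10, 'a', 3), (10, 'b', 3), (20, 'a', 4)]
```

Contrast this with the previous sketch: here the single firing per window removes the need to retract earlier outputs, which is why windowed joins under these trigger settings work in Apache Beam SQL today while full stream-to-stream joins do not.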