Building Big Data Pipelines with Apache Beam

By: Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)

Section 1 Apache Beam: Essentials
Section 2 Apache Beam: Toward Improving Usability
Section 3 Apache Beam: Advanced Concepts

Introducing and using cross-language pipelines

Cross-language pipelines are a natural concept that comes with Beam's portability. Every executed PTransform in a pipeline has an associated environment. This environment describes how (DOCKER, EXTERNAL, or PROCESS) and what (the Python SDK, the Java SDK, the Go SDK, and so on) should be executed by the Runner so that the pipeline behaves as the pipeline author intended. Most of the time, all the PTransforms in a single pipeline share the same SDK and the same environment. However, this doesn't have to be the rule: from the Runner's perspective, it does not matter whether it executes a Python transform or a Java transform. The Runner code is already written in an (SDK) language-agnostic way, so it makes no difference.
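To make this concrete, here is a minimal sketch (not taken from the book's text) of a Java pipeline that applies a Python transform through the Java SDK's PythonExternalTransform utility. The fully qualified name my_package.MyPythonTransform and the suffix keyword argument are hypothetical; running this requires a Python environment in which the referenced module is importable.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.python.PythonExternalTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class CrossLanguageExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<String> input = p.apply(Create.of("a", "b", "c"));

    // Apply a Python transform from a Java pipeline. The Runner will run this
    // step in a Python environment, while the surrounding steps run in the
    // Java environment. (The transform name and kwarg are hypothetical.)
    PCollection<String> output =
        input.apply(
            PythonExternalTransform
                .<PCollection<String>, PCollection<String>>from(
                    "my_package.MyPythonTransform")
                .withKwarg("suffix", "-processed"));

    p.run().waitUntilFinish();
  }
}

From the Runner's point of view, the Create step and the Python step are just two transforms that carry different environment references; the Runner executes both uniformly.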

The first thing we must understand is how the portable pipeline is represented. When an SDK builds and starts to execute a pipeline, it first compiles it into a portable...
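As a minimal sketch of what such a portable representation looks like, the following code (an illustration, not the book's; it assumes the PipelineTranslation utility from Beam's runners-core-construction module, whose package has moved between Beam versions) compiles a pipeline into its Runner API proto and prints the environment attached to each transform:

import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.runners.core.construction.PipelineTranslation;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;

public class PortableProtoInspection {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply(Create.of(1, 2, 3));

    // Compile the pipeline into its portable (Runner API) representation.
    RunnerApi.Pipeline proto = PipelineTranslation.toProto(p);

    // Every PTransform in the proto references the environment it should
    // run in; composite transforms may leave the reference empty.
    proto.getComponents().getTransformsMap().forEach(
        (id, transform) ->
            System.out.println(id + " -> environment: " + transform.getEnvironmentId()));
  }
}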