Book Image

Building Big Data Pipelines with Apache Beam

By : Jan Lukavský
Book Image

Building Big Data Pipelines with Apache Beam

By: Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)
1
Section 1 Apache Beam: Essentials
5
Section 2 Apache Beam: Toward Improving Usability
9
Section 3 Apache Beam: Advanced Concepts

Introducing the primitive PTransform object – Partition

The GroupByKey transform creates a set of sub-streams based on a dynamic property of the data – the set of keys of a particular window can be modified during the pipeline execution time. New keys can be created and processed at any time. This creates the complexity mentioned in the previous section – we need to store our data in keyed states and flush them on triggers. A question we might have is – would the task be easier if we knew the exact set of keys upfront, during pipeline construction time?

The answer is yes, and that is why we have a PTransform object called Partition.

Important note

A pipeline is generally divided into three phases during its life cycle: pipeline compile time, pipeline construction time, and pipeline execution time. Compile time refers (as usual) to the time we compile the source to bytecode. Construction time is the time when the pipeline's DAG of transformations...