Building Big Data Pipelines with Apache Beam

By : Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)

Section 1: Apache Beam: Essentials
Section 2: Apache Beam: Toward Improving Usability
Section 3: Apache Beam: Advanced Concepts

Chapter 4: Structuring Code for Reusability

We have already walked through a great deal of the Apache Beam programming model, but we haven't yet investigated one of its core primitives – PTransform. We have seen many particular instances of PTransforms, but what if we wanted to implement our own? And should we even do that in the first place? In this chapter, we will explain exactly how Apache Beam builds the Directed Acyclic Graph (DAG) of operations, and we will use this knowledge to build a Domain Specific Language (DSL) for a specific use case that requires less boilerplate code than plain Apache Beam. Then, we will introduce some of Apache Beam's built-in DSLs. Last, but not least, we will learn how to view a stream of data as a time-varying relation – that is, a table that changes over time – which will establish the foundation for one additional DSL, SQL. That will be the topic of Chapter 5, Using SQL for Pipeline Implementation...
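The key idea behind composite PTransforms is that a "bigger" transform expands into smaller ones, so only primitive operations end up as nodes in the resulting DAG. The following is a minimal, purely illustrative sketch of that idea – it is not Apache Beam's actual API or implementation, and all class and function names here (Pipeline, Node, count_words) are made up for the example:

```python
# Illustrative sketch of how a Beam-like SDK might record applied
# transforms as nodes of a DAG. Not Apache Beam's real implementation.

class Node:
    """One primitive operation in the DAG, with edges to its inputs."""
    def __init__(self, name, inputs):
        self.name = name
        self.inputs = list(inputs)

class Pipeline:
    """Collects nodes as transforms are applied, building the DAG."""
    def __init__(self):
        self.nodes = []

    def apply(self, name, inputs=()):
        node = Node(name, inputs)
        self.nodes.append(node)
        return node

# A "composite transform": applying it simply applies its constituent
# primitives, so the composite itself leaves no node in the DAG.
def count_words(pipeline, source):
    tokens = pipeline.apply("Tokenize", [source])
    return pipeline.apply("CountPerElement", [tokens])

p = Pipeline()
lines = p.apply("ReadLines")
counts = count_words(p, lines)
print([n.name for n in p.nodes])
# -> ['ReadLines', 'Tokenize', 'CountPerElement']
```

Note that after expansion only the three primitive nodes remain; the count_words composite exists purely at construction time. This mirrors why a well-designed DSL can hide boilerplate: it packages common sequences of primitives behind a single reusable composite.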