Book Image

Building Big Data Pipelines with Apache Beam

By : Jan Lukavský
Book Image

Building Big Data Pipelines with Apache Beam

By: Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)
1
Section 1 Apache Beam: Essentials
5
Section 2 Apache Beam: Toward Improving Usability
9
Section 3 Apache Beam: Advanced Concepts

Task 7 – Batching queries to an external RPC service

Let's imagine that the RPC service we used in Task 6 supports the batching of RPC queries. Batching is a technique for reducing network overhead by grouping multiple queries into a single one, thus increasing throughput. So, instead of querying our RPC service with each element, we would like to send multiple input elements in a single query.

Defining the problem

Given an RPC service that supports the batching of requests for increasing throughput, use this service to augment the input data of a PCollection object. Be sure to preserve the timestamp of both the timestamp and window assigned to the input element.

Discussing the problem decomposition

The first thing to notice is that unlike in Task 6, where we queried our RPC service with each element separately (and therefore, simply kept the timestamp and the window of the element untouched), in this case, we can have multiple elements with multiple timestamps...