Book Image

Building Big Data Pipelines with Apache Beam

By : Jan Lukavský
Book Image

Building Big Data Pipelines with Apache Beam

By: Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)
1
Section 1 Apache Beam: Essentials
5
Section 2 Apache Beam: Toward Improving Usability
9
Section 3 Apache Beam: Advanced Concepts

Task 14 – Implementing SQLMaxWordLength

In Chapter 2, Implementing, Testing, and Deploying Basic Pipelines, we implemented a pipeline called MaxWordLength. In this task, we will reimplement it by using SQL and schemas. Note that although we already know how to structure code better and use PTransforms rather than using static methods to transform one PCollection into another, we will keep the approach from the original chapter so that we can easily spot the differences and compare both versions more easily.

For clarity, let's restate the problem.

Problem definition

Given a stream of text lines in Apache Kafka, create a stream consisting of the longest word seen in the stream from the beginning to the present. Use triggering to output the result as frequently as possible. Use Apache Beam SQL to implement the task whenever possible.

Problem decomposition discussion

The interesting parts will be centered around several problems that we need to solve:

    ...