Building Big Data Pipelines with Apache Beam

By: Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and how to use various Domain-Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)

  1. Section 1 Apache Beam: Essentials
  5. Section 2 Apache Beam: Toward Improving Usability
  9. Section 3 Apache Beam: Advanced Concepts

Chapter 2: Implementing, Testing, and Deploying Basic Pipelines

Now that we are familiar with the basic concepts of streaming data processing, this chapter takes a deep dive into building something practical with Apache Beam.

The purpose of this chapter is to give you hands-on experience in solving practical problems from start to finish. The chapter is divided into subsections, each following the same structure:

  1. Defining a practical problem
  2. Discussing the problem decomposition (and how to solve the problem using Beam's PTransform)
  3. Implementing a pipeline to solve the defined problem
  4. Testing and validating that we have implemented our pipeline correctly
  5. Deploying the pipeline, both locally and to a running cluster

During this process (mostly at Step 2), we will discuss the various possibilities provided by Beam for addressing the problem, and we will try to highlight any caveats or common issues you might run into...