Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying Building Big Data Pipelines with Apache Beam
  • Table Of Contents Toc
Building Big Data Pipelines with Apache Beam

Building Big Data Pipelines with Apache Beam

By : Jan Lukavský
3.7 (9)
close
close
Building Big Data Pipelines with Apache Beam

Building Big Data Pipelines with Apache Beam

3.7 (9)
By: Jan Lukavský

Overview of this book

Apache Beam is an open source unified programming model for implementing and executing data processing pipelines, including Extract, Transform, and Load (ETL), batch, and stream processing. This book will help you to confidently build data processing pipelines with Apache Beam. You’ll start with an overview of Apache Beam and understand how to use it to implement basic pipelines. You’ll also learn how to test and run the pipelines efficiently. As you progress, you’ll explore how to structure your code for reusability and also use various Domain Specific Languages (DSLs). Later chapters will show you how to use schemas and query your data using (streaming) SQL. Finally, you’ll understand advanced Apache Beam concepts, such as implementing your own I/O connectors. By the end of this book, you’ll have gained a deep understanding of the Apache Beam model and be able to apply it to solve problems.
Table of Contents (13 chapters)
close
close
1
Section 1 Apache Beam: Essentials
5
Section 2 Apache Beam: Toward Improving Usability
9
Section 3 Apache Beam: Advanced Concepts

Why Apache Beam?

There are two basic questions we might ask when considering a new technology to learn and apply in practice:

  • What problem am I struggling with that the new technology can help me solve?
  • What would the costs associated with the technology be?

Every sound technology has a well-defined selling point – that is, something that justifies its existence in the presence of competing technologies. In the case of Beam, this selling point could be reduced to a single word: portability. Beam is portable on several layers:

  • Beam's pipelines are portable between multiple runners (that is, a technology that executes the distributed computation described by a pipeline's author).
  • Beam's data processing model is portable between various programming languages.
  • Beam's data processing logic is portable between bounded and unbounded data.

Each of these points deserves a few words of explanation. By runner portability, we mean the possibility to run existing pipelines written in one of the supported programming languages (for instance, Java, Python, Go, Scala, or even SQL) against a data processing engine that can be chosen at runtime. A typical example of a runner would be Apache Flink, Apache Spark, or Google Cloud Dataflow. However, Beam is by no means limited to these; new runners are created as new technologies arise, and it's very likely that many more will be developed.

When we say Beam's data processing model is portable between various programming languages, we mean it has the ability to provide support for multiple SDKs, regardless of the language or technology used by the runner. This way, we can code Beam pipelines in the Go language, and then run these against the Apache Flink Runner, written in Java.

Last but not least, the core of Apache Beam's model is designed so that it is portable between bounded and unbounded data. Bounded data is what was historically called batch processing, while unbounded data refers to real-time processing (that is, an application crunching live data as it arrives in the system and producing a low-latency output).

Putting these pieces together, we can describe Beam as a tool that lets you deal with your big data architecture with the following vision:

Choose your preferred language, write your data processing pipeline, run this pipeline using a runner of your choice, and do all of this for both batch and real-time data at the same time.

Because everything comes at a price, you should expect to pay for flexibility like this – this price would be a somewhat bigger overhead in terms of CPU and/or memory usage. The Beam community works hard to make this overhead as small as possible, but the chances are that it will never be zero.

If all of this sounds compelling to you, then we are ready to start a journey exploring Apache Beam!

CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
Building Big Data Pipelines with Apache Beam
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist download Download options font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon