Chapter 4: Scaling DLT Pipelines | Building Modern Data Applications Using Databricks Lakehouse

Building Modern Data Applications Using Databricks Lakehouse

By: Will Girten

Overview of this book

Today’s data engineering stack offers so many tools, and carries so much operational complexity, that data engineers are often overwhelmed: they spend less time gleaning value from their data and more time maintaining complex data pipelines. Guided by a lead specialist solutions architect at Databricks with 10+ years of experience in data and AI, this book shows you how the Delta Live Tables framework simplifies data pipeline development by letting you focus on defining input data sources, transformation logic, and output table destinations. The book gives you an overview of the Delta Lake format, the Databricks Data Intelligence Platform, and the Delta Live Tables framework. It teaches you how to apply data transformations by implementing the Databricks medallion architecture and how to continuously monitor the data quality of your pipelines. You’ll learn how to handle incoming data using the Databricks Auto Loader feature, automate real-time data processing using Databricks workflows, and recover from runtime errors automatically. By the end of this book, you’ll be able to build a real-time data pipeline from scratch using Delta Live Tables, use CI/CD tools to deploy data pipeline changes automatically across deployment environments, and monitor, control, and optimize cloud costs.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Part 1: Near-Real-Time Data Pipelines for the Lakehouse

Chapter 1: An Introduction to Delta Live Tables

Technical requirements

The emergence of the lakehouse

The maintenance predicament of a streaming application

What is the DLT framework?

How is DLT related to Delta Lake?

Introducing DLT concepts

A quick Delta Lake primer

A hands-on example – creating your first Delta Live Tables pipeline

Summary

Chapter 2: Applying Data Transformations Using Delta Live Tables

Technical requirements

Ingesting data from input sources

Applying changes to downstream tables

Publishing datasets to Unity Catalog

Data pipeline settings

Hands-on exercise – applying SCD Type 2 changes

Summary

Chapter 3: Managing Data Quality Using Delta Live Tables

Technical requirements

Defining data constraints in Delta Lake

Using temporary datasets to validate data processing

An introduction to expectations

Decoupling expectations from a DLT pipeline

Hands-on exercise – quarantining bad data for correction

Summary

Chapter 4: Scaling DLT Pipelines

Technical requirements

Scaling compute to handle demand

Hands-on example – setting autoscaling properties using the Databricks REST API

Automated table maintenance tasks

Optimizing table layouts for faster table updates

Serverless DLT pipelines

Introducing Enzyme, a performance optimization layer

Summary

Part 2: Securing the Lakehouse Using the Unity Catalog

Chapter 5: Mastering Data Governance in the Lakehouse with Unity Catalog

Technical requirements

Understanding data governance in a lakehouse

Enabling Unity Catalog on an existing Databricks workspace

Identity federation in Unity Catalog

Data discovery and cataloging

Hands-on example – data masking healthcare datasets

Summary

Chapter 6: Managing Data Locations in Unity Catalog

Technical requirements

Creating and managing data catalogs in Unity Catalog

Setting default locations for data within Unity Catalog

Isolating catalogs to specific workspaces

Creating and managing external storage locations in Unity Catalog

Hands-on lab – extracting document text for a generative AI pipeline

Summary

Chapter 7: Viewing Data Lineage Using Unity Catalog

Technical requirements

Introducing data lineage in Unity Catalog

Tracing data origins using the Data Lineage REST API

Visualizing upstream and downstream transformations

Identifying dependencies and impacts

Hands-on lab – documenting data lineage across an organization

Summary

Part 3: Continuous Integration, Continuous Deployment, and Continuous Monitoring

Chapter 8: Deploying, Maintaining, and Administrating DLT Pipelines Using Terraform

Technical requirements

Introducing the Databricks provider for Terraform

Setting up a local Terraform environment

Configuring DLT pipelines using Terraform

Automating DLT pipeline deployment

Hands-on exercise – deploying a DLT pipeline using VS Code

Summary

Chapter 9: Leveraging Databricks Asset Bundles to Streamline Data Pipeline Deployment

Technical requirements

Introduction to Databricks Asset Bundles

Databricks Asset Bundles in action

Hands-on exercise – deploying your first DAB

Hands-on exercise – simplifying cross-team collaboration with GitHub Actions

Versioning and maintenance

Summary

Chapter 10: Monitoring Data Pipelines in Production

Technical requirements

Introduction to data pipeline monitoring

Pipeline health and performance monitoring

Hands-on exercise – querying data quality events for a dataset

Data quality monitoring

Best practices for production failure resolution

Hands-on exercise – setting up a webhook alert when a job runs longer than expected

Summary

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book
