Chapter 4: Data Engineering with Apache Spark | Data Engineering with Azure Databricks

Data Engineering with Azure Databricks

By: Dmitry Foshin, Dmitry Anoshin, Tonya Chernyshova, Sergii Volodarskyi

Overview of this book

"Data Engineering with Azure Databricks" is your essential guide to building scalable, secure, and high-performing data pipelines using the powerful Databricks platform on Azure. Designed for data engineers, architects, and developers, this book demystifies the complexities of Spark-based workloads, Delta Lake, Unity Catalog, and real-time data processing. Beginning with the foundational role of Azure Databricks in modern data engineering, you’ll explore how to set up robust environments, manage data ingestion with Auto Loader, optimize Spark performance, and orchestrate complex workflows using tools like Azure Data Factory and Airflow. The book offers deep dives into structured streaming, Delta Live Tables, and Delta Lake’s ACID features for data reliability and schema evolution. You’ll also learn how to manage security, compliance, and access controls using Unity Catalog, and gain insights into managing CI/CD pipelines with Azure DevOps and Terraform. With a special focus on machine learning and generative AI, the final chapters guide you in automating model workflows, leveraging MLflow, and fine-tuning large language models on Databricks. Whether you're building a modern data lakehouse or operationalizing analytics at scale, this book provides the tools and insights you need.

Preface

Free benefits with your book

Chapter 1: The Role of Azure Databricks in Modern Data Engineering

The evolution of data engineering

Meet Databricks

Databricks and competitors

Reference architectures and use cases

Summary

Chapter 2: Setting Up an End-to-End Azure Databricks Environment

Governance foundation: Unity Catalog

Administrative access and workspace provisioning

Connecting to cloud storage

Bronze, silver, gold tables

Identity management

Summary

Chapter 3: Data Ingestion Strategies for Azure Databricks

Technical requirements

Understanding batch ingestion

Ingesting data from Azure Storage (ADLS and Blob Storage)

Using Azure Data Factory for ingestion

Connecting to relational databases

Performance optimization tips

Azure Synapse Analytics

Ingesting data from REST APIs

Other data sources

Summary

Chapter 4: Data Engineering with Apache Spark

Apache Spark architecture and execution model

Memory configuration and optimization

Optimizing Spark jobs with partitioning and caching

Writing efficient PySpark and Scala code

Handling large-scale data processing with Spark

Summary

Chapter 5: Building Real-Time Data Pipelines

What is Data Streaming?

Types of Streaming: Real-Time vs. Near Real-Time

Azure Databricks and Spark Streaming Capabilities

Building Your First Streaming Pipeline

Advanced Streaming Concepts

Summary

Chapter 6: Working with Delta Lake: ACID Transactions and Schema Evolution

Technical requirements

Introduction to Delta Lake: Why it matters

Enabling ACID transactions for reliable data processing

Schema enforcement and evolution

Why schema management matters

Delta versioning and time travel

Change data capture in Delta Lake

How VACUUM affects change data feed

Summary

Chapter 7: Automating Data Systems with Lakeflow Spark Declarative Pipelines

Technical requirements

Introduction to Lakeflow Spark Declarative Pipelines

The Challenge with Traditional Pipelines

Building your first Lakeflow pipeline

Managing data quality with expectations

Monitoring quality metrics

Best practices for data quality

Optimizing Lakeflow pipeline performance

Summary

Chapter 8: Orchestrating Data Workflows: From Notebooks to Production

Technical Requirements

Using Lakeflow Jobs for Task Scheduling

Integrating Databricks with Azure Data Factory and Airflow

Azure Data Factory Integration Architecture

Best Practices for Modularizing Notebooks

Summary

Chapter 9: CI/CD and DevOps for Azure Databricks

Technical Requirements

DevOps Practices in Databricks

Setting Up Git Integration for Databricks

Infrastructure as Code with Terraform

Declarative Automation Bundles

Summary

Chapter 10: Optimizing Query Performance and Cost Management

Technical Requirements

Performance Tuning for Spark and Delta Lake

Liquid Clustering: The Modern Approach

Adaptive Query Execution and Caching Techniques

Managing Cluster Costs and Autoscaling Strategies

Monitoring and Debugging Performance Bottlenecks

Summary

Chapter 11: Security, Compliance, and Data Governance

Technical requirements

Securing the Databricks environment

Fine-grained access control with Unity Catalog

Auditing and monitoring data access

Ensuring compliance with regulatory standards

Data lineage and impact analysis

Summary

Chapter 12: Machine Learning and AI on Databricks

Technical Requirements

Introduction to Databricks AI/ML innovation and capabilities

ML and AI Capabilities Overview

MLflow: ML Experiment Tracking and Model Management

Business Use Case Ideas

Feature Store: Centralized Feature Management

Genie: Conversational Analytics for Business Users

What is Vector Search?

What is RAG?

AI Gateway: Governance and Cost Control

Summary

Chapter 13: Unlock Access to the Code Bundle and the PDF Version

Unlock this book's free benefits in three easy steps

Index
