Chapter 4: Understanding Data Pipelines | Data Engineering with Apache Spark, Delta Lake, and Lakehouse

Data Engineering with Apache Spark, Delta Lake, and Lakehouse

By: Manoj Kukreja

4.7 (58)

Overview of this book

In the world of ever-changing data and schemas, it is important to build data pipelines that can auto-adjust to changes. This book will help you build scalable data platforms that managers, data scientists, and data analysts can rely on. Starting with an introduction to data engineering, along with its key concepts and architectures, this book will show you how to use Microsoft Azure Cloud services effectively for data engineering. You'll cover data lake design patterns and the different stages through which the data needs to flow in a typical data lake. Once you've explored the main features of Delta Lake to build data lakes with fast performance and governance in mind, you'll advance to implementing the lambda architecture using Delta Lake. Packed with practical examples and code snippets, this book takes you through real-world examples based on production scenarios faced by the author in his 10 years of experience working with big data. Finally, you'll cover data lake deployment strategies that play an important role in provisioning the cloud resources and deploying the data pipelines in a repeatable and continuous way. By the end of this data engineering book, you'll know how to effectively deal with ever-changing data and create scalable data pipelines to streamline data science, ML, and artificial intelligence (AI) tasks.

Preface

Who this book is for

What this book covers

Download the example code files

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Section 1: Modern Data Engineering and Tools

Chapter 1: The Story of Data Engineering and Analytics

The journey of data

Exploring the evolution of data analytics

The monetary power of data

Summary

Chapter 2: Discovering Storage and Compute Data Lakes

Introducing data lakes

Discovering data lake architectures

Summary

Chapter 3: Data Engineering on Microsoft Azure

Introducing data engineering in Azure

Performing data engineering in Microsoft Azure

Opening a free account with Microsoft Azure

Summary

Section 2: Data Pipelines and Stages of Data Engineering

Chapter 4: Understanding Data Pipelines

Exploring data pipelines

Process of creating a data pipeline

Running a data pipeline

Sample lakehouse project

Summary

Chapter 5: Data Collection Stage – The Bronze Layer

Architecting the Electroniz data lake

Understanding the bronze layer

Configuring data sources

Configuring data destinations

Building the ingestion pipelines

Summary

Chapter 6: Understanding Delta Lake

Understanding how Delta Lake enables the lakehouse

Understanding Delta Lake

Creating a Delta Lake table

Changing data in an existing Delta Lake table

Performing time travel

Performing upserts of data

Understanding isolation levels

Understanding concurrency control

Cleaning up Azure resources

Summary

Chapter 7: Data Curation Stage – The Silver Layer

The need for curating raw data

The process of curating raw data

Developing a data curation pipeline

Running the pipeline for the silver layer

Verifying curated data in the silver layer

Cleaning up Azure resources

Summary

Chapter 8: Data Aggregation Stage – The Gold Layer

The need to aggregate data

The process of aggregating data

Developing a data aggregation pipeline

Running the aggregation pipeline

Understanding data consumption

Verifying aggregated data in the gold layer

Meeting customer expectations

Summary

Section 3: Data Engineering Challenges and Effective Deployment Strategies

Chapter 9: Deploying and Monitoring Pipelines in Production

The deployment strategy

Developing the master pipeline

Testing the master pipeline

Scheduling the master pipeline

Monitoring pipelines

Summary

Chapter 10: Solving Data Engineering Challenges

Schema evolution

Sharing data

Data governance

Cleaning up Azure resources

Summary

Chapter 11: Infrastructure Provisioning

Infrastructure as code

Deploying infrastructure using Azure Resource Manager

Deploying multiple environments using IaC

Cleaning up Azure resources

Summary

Chapter 12: Continuous Integration and Deployment (CI/CD) of Data Pipelines

Understanding CI/CD

Designing CI/CD pipelines

Developing CI/CD pipelines

Summary

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts
