Azure Databricks Cookbook

By: Phani Raj, Vinod Jaiswal

Overview of this book

Azure Databricks is a unified, collaborative platform for performing scalable analytics in an interactive environment. The Azure Databricks Cookbook provides recipes to get hands-on with the analytics process, including ingesting data from various batch and streaming sources and building a modern data warehouse. The book starts by teaching you how to create an Azure Databricks instance using the Azure portal, the Azure CLI, and ARM templates. You'll work with clusters in Databricks and explore recipes for ingesting data from sources including files, databases, and streaming sources such as Apache Kafka and Event Hubs. The book will help you explore all the features supported by Azure Databricks for building powerful end-to-end data pipelines. You'll also find out how to build a modern data warehouse by using Delta tables and Azure Synapse Analytics. Later, you'll learn how to write ad hoc queries and extract meaningful insights from the data lake by creating visualizations and dashboards with Databricks SQL. Finally, you'll deploy and productionize a data pipeline, as well as deploy notebooks and the Azure Databricks service, using continuous integration and continuous delivery (CI/CD). By the end of this Azure book, you'll be able to use Azure Databricks to streamline the different processes involved in building data-driven apps.

Understanding the various stages of transforming data

Building a near-real-time warehouse has become a common architectural pattern for organizations that want to avoid the delays typical of on-premises data warehouse systems. Customers want to view data in near real time in their modern warehouse architecture, and they can achieve this by using Azure Databricks Delta Lake with the Spark Structured Streaming APIs. In this recipe, you will learn the various stages involved in building a near-real-time data warehouse in Delta Lake. In Delta Lake, we store the data in a denormalized form, whereas in the Azure Synapse dedicated SQL pool we store it in fact and dimension tables to enhance reporting capabilities.
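
The following is a minimal PySpark sketch, not the recipe's exact code, of the ingestion pattern described above: reading events from an Event Hubs Kafka endpoint with Structured Streaming and appending them to a Delta table. The namespace, topic, connection string, and storage paths are placeholder values you would replace with your own.

```python
from pyspark.sql import SparkSession

# In a Databricks notebook, a SparkSession is already available as `spark`.
spark = SparkSession.builder.getOrCreate()

# Placeholder connection settings for an Event Hubs namespace exposed via its Kafka endpoint.
kafka_options = {
    "kafka.bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
    "subscribe": "<topic>",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="$ConnectionString" password="<event-hubs-connection-string>";'
    ),
    "startingOffsets": "latest",
}

# Read the stream; Kafka records arrive with a binary value column, so cast it to a string.
raw_stream = (
    spark.readStream
    .format("kafka")
    .options(**kafka_options)
    .load()
    .selectExpr("CAST(value AS STRING) AS body", "timestamp")
)

# Append the raw events to a bronze Delta table in near real time (paths are placeholders).
(
    raw_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/datalake/checkpoints/bronze")
    .start("/mnt/datalake/delta/bronze")
)
```

The checkpoint location lets the stream restart from where it left off, which is what makes the pipeline reliable enough to feed a warehouse continuously.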

As part of data processing in Delta Lake, you will create three Delta tables, as follows:

  1. Bronze table: This will hold the data as received from Event Hubs for Kafka.
  2. Silver table: We will implement the required business rules...