Azure Databricks Cookbook

By: Phani Raj, Vinod Jaiswal

Overview of this book

Azure Databricks is a unified, collaborative platform for performing scalable analytics in an interactive environment. The Azure Databricks Cookbook provides recipes to get hands-on with the analytics process, including ingesting data from various batch and streaming sources and building a modern data warehouse. The book starts by teaching you how to create an Azure Databricks instance using the Azure portal, the Azure CLI, and ARM templates. You'll work with clusters in Databricks and explore recipes for ingesting data from sources including files, databases, and streaming sources such as Apache Kafka and Event Hubs. The book will help you explore all the features supported by Azure Databricks for building powerful end-to-end data pipelines. You'll also find out how to build a modern data warehouse by using Delta tables and Azure Synapse Analytics. Later, you'll learn how to write ad hoc queries and extract meaningful insights from the data lake by creating visualizations and dashboards with Databricks SQL. Finally, you'll deploy and productionize a data pipeline, as well as deploy notebooks and the Azure Databricks service, using continuous integration and continuous delivery (CI/CD). By the end of this Azure book, you'll be able to use Azure Databricks to streamline different processes involved in building data-driven apps.

Learning about input partitions 

Partitions are subsets of data held in memory or storage. Spark relies on partitioning more heavily than Hive or traditional SQL databases do: it splits data into partitions so that it can process them in parallel and extract maximum performance.

Spark and Hive partitions are different things: Spark partitions data in memory as it processes it, whereas Hive partitions live in storage. In this recipe, we will cover three types of partitions: input, shuffle, and output partitions.
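As a quick orientation, the following is a minimal PySpark sketch of the standard Spark settings behind each partition type. It assumes nothing beyond a Spark session (on Databricks, `spark` is already created for you); the app name is illustrative only:

from pyspark.sql import SparkSession

# On Azure Databricks the session already exists as `spark`;
# the builder call is included here only so the sketch is self-contained.
spark = SparkSession.builder.appName("partition-types").getOrCreate()

# Input partitions: the maximum number of bytes Spark packs into a single
# partition when reading files (128 MB by default).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Shuffle partitions: the number of partitions produced after a wide
# transformation such as groupBy or join (200 by default).
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Output partitions: there is no single config key; the number of files
# written is driven by the DataFrame's partition count at write time,
# which you adjust with repartition() or coalesce().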

Let's start by looking at input partitions.

Getting ready

Apache Spark has a layered architecture in which the driver node communicates with the worker nodes to get the job done. All data processing happens on the worker nodes. When a job is submitted for processing, each data partition is sent to a specific executor, and each executor processes one partition at a time. Hence, the time it takes each executor to process data is directly proportional to the size and number of the partitions. The more...
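To see the relationship between partition count and parallelism in practice, here is a hedged sketch that reads a file and inspects how many input partitions Spark created. The file path is hypothetical; substitute your own data, and note that `spark.sql.files.maxPartitionBytes` must be set before the read for it to take effect:

# Read a file and check how many input partitions Spark created;
# each partition becomes one task, handled by one executor core at a time.
df = spark.read.csv("/mnt/data/sales.csv", header=True)
print(df.rdd.getNumPartitions())

# Lowering maxPartitionBytes yields more, smaller input partitions,
# and therefore more tasks that can run in parallel.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))
df_small = spark.read.csv("/mnt/data/sales.csv", header=True)
print(df_small.rdd.getNumPartitions())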