Reading and writing data from and to Azure Cosmos DB 

Azure Cosmos DB is Microsoft's globally distributed, multi-model database service. It enables you to manage data spread across data centers around the world and provides mechanisms for scaling both data distribution and computational resources. It supports multiple data models, which means it can be used for storing documents and relational, key-value, and graph models. It is essentially a NoSQL database, as it does not enforce a schema. Azure Cosmos DB provides APIs for the following data models, and their software development kits (SDKs) are available in multiple languages:

  • SQL API
  • MongoDB API
  • Cassandra API
  • Graph (Gremlin) API
  • Table API

The Cosmos DB Spark connector is used to access Azure Cosmos DB from Spark. It can be used for batch and streaming data processing and as a serving layer for the required data, and it supports both the Scala and Python languages. The Cosmos DB Spark connector supports the core (SQL) API of Azure Cosmos DB.

This recipe explains how to read and write data to and from Azure Cosmos DB using Azure Databricks.

Getting ready

You will need to ensure you have the following items before starting to work on this recipe:

  • An Azure Databricks workspace. Refer to Chapter 1, Creating an Azure Databricks Service, to create an Azure Databricks workspace.
  • The Cosmos DB Spark connector JAR file (the download link is provided in the Note later in this section).
  • An Azure Cosmos DB account.

You can follow the steps at the following link to create an Azure Cosmos DB account from the Azure portal:

https://docs.microsoft.com/en-us/azure/cosmos-db/create-cosmosdb-resources-portal

Once the Azure Cosmos DB account is created, create a database named Sales and a container named Customer, using /C_MKTSEGMENT as the partition key when creating the new container, as shown in the following screenshot.

Figure 2.17 – Adding New Container in Cosmos DB Account in Sales Database

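If you prefer to create the database and container from code instead of the portal, the following is a minimal sketch using the azure-cosmos Python SDK (install it with pip install azure-cosmos). The account URI, key, and throughput value are placeholders to replace with your own.

    # Minimal sketch (assumption: azure-cosmos v4 SDK; placeholder URI, key, and throughput)
    from azure.cosmos import CosmosClient, PartitionKey

    client = CosmosClient(
        url="https://<your-account>.documents.azure.com:443/",  # account URI from the Keys blade
        credential="<your-primary-key>"                          # primary key from the Keys blade
    )

    # Create the Sales database and the Customer container partitioned on /C_MKTSEGMENT
    database = client.create_database_if_not_exists(id="Sales")
    container = database.create_container_if_not_exists(
        id="Customer",
        partition_key=PartitionKey(path="/C_MKTSEGMENT"),
        offer_throughput=400  # adjust the provisioned RUs for your workload
    )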

You can follow along by running the steps in the 2_6.Reading and Writing Data from and to Azure Cosmos DB.ipynb notebook, found in the Chapter02 folder of your locally cloned repository.

Upload the csvFiles folder from the Chapter02/Customer folder to the rawdata file system in the ADLS Gen2 account.
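If the rawdata file system is not already mounted in your workspace from an earlier recipe in this chapter, the following is a minimal sketch of mounting it with a service principal; the secret scope, application ID, tenant ID, storage account name, and mount point are placeholders for your own values.

    # Minimal sketch (assumptions: a service principal with access to the storage account,
    # and its secret stored in a Databricks secret scope; all angle-bracket values are placeholders)
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-id>",
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("demo-scope", "sp-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
    }

    dbutils.fs.mount(
        source="abfss://rawdata@<storageaccount>.dfs.core.windows.net/",
        mount_point="/mnt/Gen2Source",
        extra_configs=configs
    )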

Note

At the time of writing this recipe, a Cosmos DB connector for Spark 3.0 was not available.

You can download the latest Cosmos DB Spark uber JAR file from the following link. The latest version at the time of writing this recipe was 3.6.14:

https://search.maven.org/artifact/com.microsoft.azure/azure-cosmosdb-spark_2.4.0_2.11/3.6.14/jar

If you want to work with version 3.6.14, you can also download the JAR file from the following GitHub URL:

https://github.com/PacktPublishing/Azure-Databricks-Cookbook/blob/main/Chapter02/azure-cosmosdb-spark_2.4.0_2.11-3.6.14-uber.jar

You need to get the endpoint and master key for the Azure Cosmos DB account, which will be used to authenticate to it from Azure Databricks. To get them, go to the Azure Cosmos DB account, click Keys under the Settings section, and copy the values for URI and PRIMARY KEY on the Read-write Keys tab.
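Rather than pasting the master key directly into a notebook, you can store it in a Databricks secret scope and read it at runtime. This is a minimal sketch; the scope and key names used here (cosmos-scope and cosmos-master-key) are placeholders for the names you create.

    # Minimal sketch (assumption: a secret scope named "cosmos-scope" holding the
    # Cosmos DB primary key under the key name "cosmos-master-key")
    cosmosEndpoint = "https://testcosmosdb.documents.azure.com:443/"  # URI from the Keys blade
    cosmosMasterKey = dbutils.secrets.get(scope="cosmos-scope", key="cosmos-master-key")

    # These variables can then be used in the writeConfig and readConfig dictionaries
    # shown later in this recipe instead of hard-coded key values.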

How to do it…

Let's get started with this section.

  1. Create a new Spark cluster, ensuring you choose a configuration that is supported by the Cosmos DB Spark connector. Choosing a lower or higher version will give errors while reading data from Azure Cosmos DB, so select the right configuration while creating the cluster, as shown in the following table:
    Table 2.2 – Configuration to create a new cluster

    The following screenshot shows the configuration of the cluster:

    Figure 2.18 – Azure Databricks cluster

  2. After your cluster is created, navigate to the cluster page and select the Libraries tab. Select Install New and upload the Spark connector JAR file to install the library. This is the uber JAR file mentioned in the Getting ready section:
    Figure 2.19 – Cluster library installation

  3. You can verify that the library was installed on the Libraries tab:
    Figure 2.20 – Cluster verifying library installation

  4. Once the library is installed, you are good to connect to Cosmos DB from the Azure Databricks notebook.
  5. We will use the customer data in the ADLS Gen2 storage account to write data to Cosmos DB. Run the following code to list the CSV files in the storage account:
    display(dbutils.fs.ls("/mnt/Gen2Source/Customer/csvFiles/"))
  6. Run the following code, which reads the CSV files from the mount point into a DataFrame:
    customerDF = spark.read.format("csv").option("header",True).option("inferSchema", True).load("dbfs:/mnt/Gen2Source/Customer/csvFiles")
  7. Provide the Cosmos DB configuration by executing the following code. Collection is the container that you created in the Sales database in Cosmos DB:
    writeConfig = {
      "Endpoint" : "https://testcosmosdb.documents.azure.com:443/",
      "Masterkey" : "xxxxx-xxxx-xxx",
      "Database" : "Sales",
      "Collection" : "Customer",
      "preferredRegions" : "East US"
    }
  8. Run the following code to write the CSV files loaded in the customerDF DataFrame to Cosmos DB. We are using the append save mode.
    #Writing the DataFrame to Cosmos DB. If the Cosmos DB RUs provisioned are low, it will take quite some time to write 150K records. We are using the append save mode.
    customerDF.write.format("com.microsoft.azure.cosmosdb.spark") \
    .options(**writeConfig)\
    .mode("append")\
    .save() 
  9. To overwrite the data, we must use the overwrite save mode, as shown in the following code.
    #Writing the DataFrame to Cosmos DB. If the Cosmos DB RUs provisioned are low, it will take quite some time to write 150K records. We are using the overwrite save mode.
    customerDF.write.format("com.microsoft.azure.cosmosdb.spark") \
    .options(**writeConfig)\
    .mode("overwrite")\
    .save() 
  10. Now let's read the data written to Cosmos DB. First, we need to set the config values by running the following code.
    readConfig = {
      "Endpoint" : "https://testcosmosdb.documents.azure.com:443/",
      "Masterkey" : "xxx-xxx-xxx",
      "Database" : "Sales",
      "Collection" : "Customer",
      "preferredRegions" : "Central US;East US2",
      "SamplingRatio" : "1.0",
      "schema_samplesize" : "1000",
      "query_pagesize" : "2147483647",
      "query_custom" : "SELECT * FROM c where c.C_MKTSEGMENT ='AUTOMOBILE'"
    }
  11. After setting the config values, run the following code to read the data from Cosmos DB. In query_custom, we are filtering the data for the AUTOMOBILE market segment:
    df_Customer = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**readConfig).load()
    df_Customer.count() 
  12. You can run the following code to display the contents of the DataFrame (a short Spark SQL example follows these steps):
    display(df_Customer.limit(5))
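As an optional check, you can register the DataFrame as a temporary view and query it with Spark SQL. This is a minimal sketch using the df_Customer DataFrame from the previous steps; the view name customer_cosmos is arbitrary, and because query_custom already filters the data, the result should show only the AUTOMOBILE segment.

    # Register the data read from Cosmos DB as a temporary view and query it with Spark SQL
    df_Customer.createOrReplaceTempView("customer_cosmos")
    display(spark.sql("SELECT C_MKTSEGMENT, COUNT(*) AS customer_count FROM customer_cosmos GROUP BY C_MKTSEGMENT"))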

By the end of this section, you will have learned how to write data to and read data from Cosmos DB using the Azure Cosmos DB connector for Apache Spark.

How it works…

azure-cosmosdb-spark is the official connector for Azure Cosmos DB and Apache Spark. This connector allows you to easily read from and write to Azure Cosmos DB via Apache Spark DataFrames in Python and Scala. It also allows you to easily create a lambda architecture for batch processing, stream processing, and a serving layer, while being globally replicated and minimizing the latency involved in working with big data.

The Azure Cosmos DB connector is a client library that allows Azure Cosmos DB to act as an input source or an output sink for Spark jobs. Fast connectivity between Apache Spark and Azure Cosmos DB means data can be persisted and retrieved quickly, which helps in scenarios such as blazing-fast Internet of Things (IoT) workloads and analytics with push-down predicate filtering and advanced analytics.
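For the streaming and serving-layer scenario, the same connector can also be used as a Structured Streaming sink. The following is a minimal sketch, assuming a streaming DataFrame named streamingDF and the writeConfig dictionary from this recipe; the sink class name and the checkpoint path reflect the connector's documentation at the time of writing and should be verified against the version you install.

    # Minimal sketch: writing a streaming DataFrame to Cosmos DB with the same connector
    # (assumptions: streamingDF already exists and writeConfig is the dictionary from this recipe;
    # the checkpoint location is a placeholder path)
    streamQuery = (streamingDF.writeStream
        .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSinkProvider")
        .outputMode("append")
        .options(**writeConfig)
        .option("checkpointLocation", "/mnt/Gen2Source/checkpoints/cosmos")
        .start())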

We can use query_pagesize as a parameter to control the number of documents that each query page holds. The larger the value of query_pagesize, the fewer network round trips are required to fetch the data, which leads to better throughput.
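For example, you could reuse the earlier readConfig and simply override query_pagesize; this is a minimal sketch, and the page size value shown is purely illustrative.

    # Minimal sketch: copy the earlier readConfig but fetch documents in pages of 1,000
    # (the page size is illustrative; tune it for your workload and RU budget)
    pagedReadConfig = dict(readConfig)
    pagedReadConfig["query_pagesize"] = "1000"

    df_paged = (spark.read
        .format("com.microsoft.azure.cosmosdb.spark")
        .options(**pagedReadConfig)
        .load())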