Azure Databricks Cookbook

By: Phani Raj, Vinod Jaiswal

Overview of this book

Azure Databricks is a unified collaborative platform for performing scalable analytics in an interactive environment. The Azure Databricks Cookbook provides recipes to get hands-on with the analytics process, including ingesting data from various batch and streaming sources and building a modern data warehouse. The book starts by teaching you how to create an Azure Databricks instance using the Azure portal, the Azure CLI, and ARM templates. You'll work through clusters in Databricks and explore recipes for ingesting data from sources including files, databases, and streaming sources such as Apache Kafka and Azure Event Hubs. The book will help you explore all the features supported by Azure Databricks for building powerful end-to-end data pipelines. You'll also find out how to build a modern data warehouse by using Delta tables and Azure Synapse Analytics. Later, you'll learn how to write ad hoc queries and extract meaningful insights from the data lake by creating visualizations and dashboards with Databricks SQL. Finally, you'll deploy and productionize a data pipeline, and deploy notebooks and the Azure Databricks service using continuous integration and continuous delivery (CI/CD). By the end of this book, you'll be able to use Azure Databricks to streamline the different processes involved in building data-driven apps.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the color images

Chapter 1: Creating an Azure Databricks Service

Technical requirements

Creating a Databricks workspace in the Azure portal

Creating a Databricks service using the Azure CLI (command-line interface)

Creating a Databricks service using Azure Resource Manager (ARM) templates

Adding users and groups to the workspace

Creating a cluster from the user interface (UI)

Getting started with notebooks and jobs in Azure Databricks

Authenticating to Databricks using a PAT

Chapter 2: Reading and Writing Data from and to Various Azure Services and File Formats

Technical requirements

Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS

Reading and writing data from and to Azure Blob storage

Reading and writing data from and to ADLS Gen2

Reading and writing data from and to an Azure SQL database using native connectors

Reading and writing data from and to Azure Synapse SQL (dedicated SQL pool) using native connectors

Reading and writing data from and to Azure Cosmos DB

Reading and writing data from and to CSV and Parquet

Reading and writing data from and to JSON, including nested JSON

Chapter 3: Understanding Spark Query Execution

Technical requirements

Introduction to jobs, stages, and tasks

Checking the execution details of all the executed Spark queries via the Spark UI

Deep diving into schema inference

Looking into the query execution plan

How joins work in Spark

Learning about input partitions

Learning about output partitions

Learning about shuffle partitions

Storage benefits of different file types

Chapter 4: Working with Streaming Data

Technical requirements

Reading streaming data from Apache Kafka

Reading streaming data from Azure Event Hubs

Reading data from Event Hubs for Kafka

Streaming data from log files

Understanding trigger options

Understanding window aggregation on streaming data

Understanding offsets and checkpoints

Chapter 5: Integrating with Azure Key Vault, App Configuration, and Log Analytics

Technical requirements

Creating an Azure Key Vault to store secrets using the UI

Creating an Azure Key Vault to store secrets using ARM templates

Using Azure Key Vault secrets in Azure Databricks

Creating an App Configuration resource

Using App Configuration in an Azure Databricks notebook

Creating a Log Analytics workspace

Integrating a Log Analytics workspace with Azure Databricks

Chapter 6: Exploring Delta Lake in Azure Databricks

Technical requirements

Delta table operations – create, read, and write

Streaming reads and writes to Delta tables

Delta table data format

Handling concurrency

Delta table performance optimization

Constraints in Delta tables

Versioning in Delta tables

Chapter 7: Implementing Near-Real-Time Analytics and Building a Modern Data Warehouse

Technical requirements

Understanding the scenario for an end-to-end (E2E) solution

Creating required Azure resources for the E2E demonstration

Simulating a workload for streaming data

Processing streaming and batch data using Structured Streaming

Understanding the various stages of transforming data

Loading the transformed data into Azure Cosmos DB and a Synapse dedicated pool

Creating a visualization and dashboard in a notebook for near-real-time analytics

Creating a visualization in Power BI for near-real-time analytics

Using Azure Data Factory (ADF) to orchestrate the E2E pipeline

Chapter 8: Databricks SQL

Technical requirements

How to create a user in Databricks SQL

Creating SQL endpoints

Granting access to objects to the user

Running SQL queries in Databricks SQL

Using query parameters and filters

Introduction to visualizations in Databricks SQL

Creating dashboards in Databricks SQL

Connecting Power BI to Databricks SQL

Chapter 9: DevOps Integrations and Implementing CI/CD for Azure Databricks

Technical requirements

How to integrate Azure DevOps with an Azure Databricks notebook

Using GitHub for Azure Databricks notebook version control

Understanding the CI/CD process for Azure Databricks

How to set up an Azure DevOps pipeline for deploying notebooks

Deploying notebooks to multiple environments

Enabling CI/CD in an Azure DevOps build and release pipeline

Deploying an Azure Databricks service using an Azure DevOps release pipeline

Chapter 10: Understanding Security and Monitoring in Azure Databricks

Technical requirements

Understanding and creating RBAC in Azure for ADLS Gen-2

Creating ACLs using Storage Explorer and PowerShell

How to configure credential passthrough

How to restrict data access to users using RBAC

How to restrict data access to users using ACLs

Deploying Azure Databricks in a VNet and accessing a secure storage account

Using Ganglia reports for cluster health

Cluster access control

Reading and writing data from and to CSV and Parquet

Azure Databricks supports multiple file formats, including sequence files, Record Columnar (RC) files, and Optimized Row Columnar (ORC) files. It also provides native support for the CSV, JSON, and Parquet file formats.

Parquet is the most widely used file format in the Databricks Cloud for the following reasons:

Columnar storage format: Stores data column-wise, unlike row-based formats such as Avro and CSV.
Open source: Parquet is open source and free to use.
Aggressive compression: Parquet has built-in compression, which plain-text formats such as CSV lack. Because it applies different encoding methods to compress each column, it requires less storage than other file formats.
Performance: The Parquet file format is designed for optimized performance. You can get the relevant data quickly because it saves both data and metadata. The amount of data scanned is comparatively smaller, resulting in less input/output (I/O) usage.
Schema evolution: It supports changes in the column schema as required. Multiple Parquet files with compatible schemas can be merged.
Self-describing: Each Parquet file contains metadata and data, which makes it self-describing.

Parquet files also support predicate push-down, column filtering, and both static and dynamic partition pruning.
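To make the compression point more concrete, here is a minimal pure-Python sketch of two encodings that Parquet can apply to a column: dictionary encoding and run-length encoding. It is a toy illustration under simplifying assumptions, not Parquet's actual on-disk layout; the function names and the sample `country_column` are hypothetical.

```python
# Toy illustration only: this is NOT Parquet's real on-disk format.
# It shows why storing one column's values together makes them easy to compress.

def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


def decode(dictionary, runs):
    """Reverse both encodings to recover the original column."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Hypothetical column from a customer table, stored contiguously (column-wise).
country_column = ["US", "US", "US", "IN", "IN", "US", "UK", "UK", "UK", "UK"]

dictionary, codes = dictionary_encode(country_column)
runs = run_length_encode(codes)

print(dictionary)  # ['US', 'IN', 'UK']
print(runs)        # [[0, 3], [1, 2], [0, 1], [2, 4]]
```

Ten strings shrink to a three-entry dictionary plus four short runs. A row-based layout would interleave these values with other fields, so the repeats would not sit next to each other.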

In this recipe, you will learn how to read from and write to CSV and Parquet files using Azure Databricks.

Getting ready

You can follow along by running the steps in the 2_7.Reading and Writing data from and to CSV, Parquet.ipynb notebook, found in the Chapter02 folder of your local cloned repository.

Upload the csvFiles folder from the Chapter02/Customer folder to the rawdata file system of the ADLS Gen2 storage account, under the Customer/csvFiles folder.

How to do it…

Here are the steps and code samples for reading from and writing to CSV and Parquet files using Azure Databricks. Each file format is covered in its own section.

Working with the CSV file format

Go through the following steps for reading CSV files and saving data in CSV format.

Ensure that you have mounted the ADLS Gen2 storage location. If not, refer to the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe in this chapter for the steps to mount a storage account.
Run the following code to list the CSV data files from the mounted ADLS Gen2 storage account:
```
#Listing CSV Files
dbutils.fs.ls("/mnt/Gen2Source/Customer/csvFiles")
```

Read the customer data stored in CSV files in the ADLS Gen2 storage account by running the following code:
```
customerDF = spark.read.format("csv").option("header",True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/csvFiles")
```

You can display the contents of a DataFrame by running the following code:
```
customerDF.show()
```
Run the following code to write the customerDF DataFrame to the location /mnt/Gen2Source/Customer/WriteCsvFiles in CSV format:
```
customerDF.write.mode("overwrite").option("header", "true").csv("/mnt/Gen2Source/Customer/WriteCsvFiles")
```

To confirm that the data was written to the target folder in CSV format, read the CSV files back from the target folder by running the following code:
```
targetDF = spark.read.format("csv").option("header",True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/WriteCsvFiles")
targetDF.show()
```
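The `inferSchema` option matters because CSV files carry no type information. The following sketch uses only Python's standard `csv` module (not Spark) and a hypothetical two-row customer sample to show that every value comes back as a string unless the reader infers the types or is told them.

```python
import csv
import io

# Hypothetical sample rows; the column names are illustrative, not the recipe's dataset.
rows = [
    {"customer_id": 1, "name": "Asha", "balance": 120.5},
    {"customer_id": 2, "name": "Liam", "balance": 99.0},
]

# Write the rows as CSV with a header line into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["customer_id", "name", "balance"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back: the header supplies column names, but not types.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

print(read_back[0])
# Every value is now a str, for example '1' and '120.5'.

# A reader has to convert types itself, which is what schema inference automates.
typed = [
    {"customer_id": int(r["customer_id"]), "name": r["name"], "balance": float(r["balance"])}
    for r in read_back
]
```

In Spark, the alternative to inference is to pass an explicit schema to the reader with `.schema(...)`, which avoids the extra pass over the data that inference requires.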

In the following section, we will learn how to read data from and write data to Parquet files.

Working with the Parquet file format

Let's get started.

You can use the same customer dataset for reading from the CSV files and writing into the Parquet file format.
We will take the targetDF DataFrame created in step 6 of the CSV section and save it in Parquet format by running the following code. The code uses the overwrite save mode, which replaces any existing data in the target folder:
```
#Writing targetDF (the data read back from the CSV files) as Parquet files using overwrite mode
targetDF.write.mode("overwrite").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
```

In the following code, we read the data from the csvasParquetFiles folder to confirm that it is in Parquet format:
```
df_parquetfiles=spark.read.format("parquet").load("/mnt/Gen2Source/Customer/csvasParquetFiles/")
display(df_parquetfiles.limit(5))
```

Let's change the save mode from overwrite to append by running the following code. With the append save mode, new data is added and existing data is preserved in the target folder:
```
#Using append as the save mode
targetDF.write.mode("append").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/")
```
Run the following code to check the count of records in the Parquet folder. The number should increase because we have appended data to the same folder:
```
df_parquetfiles=spark.read.format("parquet").load("/mnt/Gen2Source/Customer/csvasParquetFiles/")
df_parquetfiles.count()
```
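The difference between the two save modes can be modelled without Spark. The sketch below is a toy model that treats the target as a folder of part files; `write_rows` and `count_rows` are hypothetical helpers, and the file naming is an assumption made for illustration rather than Spark's actual layout.

```python
import tempfile
from pathlib import Path


def write_rows(folder, rows, mode):
    """Toy model of the 'overwrite' and 'append' save modes on a folder of part files."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    if mode == "overwrite":
        # Overwrite: discard whatever the target folder already holds.
        for old_file in folder.glob("part-*.txt"):
            old_file.unlink()
    elif mode != "append":
        raise ValueError(f"unsupported mode: {mode}")
    # Both modes then add one new part file next to any surviving files.
    part_number = len(list(folder.glob("part-*.txt")))
    new_file = folder / f"part-{part_number:05d}.txt"
    new_file.write_text("\n".join(rows) + "\n")


def count_rows(folder):
    """Count rows across every part file, like reading the whole folder."""
    return sum(len(f.read_text().splitlines()) for f in Path(folder).glob("part-*.txt"))


rows = ["1,Asha", "2,Liam", "3,Mei"]  # hypothetical records

with tempfile.TemporaryDirectory() as target:
    write_rows(target, rows, "overwrite")
    after_overwrite = count_rows(target)

    write_rows(target, rows, "append")
    after_append = count_rows(target)

    write_rows(target, rows, "overwrite")
    after_second_overwrite = count_rows(target)

print(after_overwrite, after_append, after_second_overwrite)  # 3 6 3
```

Appending the same rows twice doubles the count, which is the increase the previous step checks for. Running the append step repeatedly in the notebook keeps adding duplicates until the folder is overwritten again.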

By the end of this recipe, you have learnt how to read from and write to CSV and Parquet files.

How it works…

The CSV file format is widely used by many tools and is often the default format for processing data. However, it compares poorly with the Parquet file format in terms of cost, query processing time, and the size of the data files. Also, it doesn't support partition pruning, which directly impacts the cost of storing and processing data in CSV format.

Conversely, Parquet is a columnar format that supports compression and partition pruning. It is widely used for processing data in big data projects for both reading and writing data. A Parquet file stores data and metadata, which makes it self-describing.
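One way the stored metadata pays off is predicate push-down. Parquet files record statistics, such as the minimum and maximum of each column, for every row group, so a reader can skip row groups that cannot match a filter. The sketch below is a toy model of that idea, not a real Parquet reader; `RowGroup` and `scan_greater_than` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group holding one integer column."""
    values: List[int]

    # In a real file these statistics are stored in the metadata;
    # here they are computed on demand for brevity.
    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return values > threshold and how many row groups were actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # Metadata check: if even the largest value fails the filter, skip the group.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Hypothetical "order amount" column split across three row groups.
row_groups = [
    RowGroup([5, 12, 9]),
    RowGroup([40, 55, 61]),
    RowGroup([18, 75, 22]),
]

matches, groups_read = scan_greater_than(row_groups, threshold=50)
print(matches)      # [55, 61, 75]
print(groups_read)  # 2
```

The first row group is never scanned because its maximum (12) rules it out. A CSV file has no such metadata, so a reader must parse every row.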

Parquet also supports schema evolution, which means you can change the schema of the data as required. This helps in developing systems that can accommodate changes in the schema as it matures. In such cases, you may end up with multiple Parquet files that have different schemas but are compatible.
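As a rough illustration of what "different but compatible" means, the following sketch merges two schemas represented as plain dictionaries. It is a simplified model under stated assumptions: `merge_schemas` is a hypothetical helper, and the type names are plain strings rather than Spark types.

```python
def merge_schemas(*schemas):
    """Union the columns of several schemas; fail if a column's type conflicts."""
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            if column in merged and merged[column] != dtype:
                raise ValueError(
                    f"incompatible types for column '{column}': {merged[column]} vs {dtype}"
                )
            merged.setdefault(column, dtype)
    return merged


# Hypothetical schemas of two Parquet files written at different times.
schema_v1 = {"customer_id": "int", "name": "string"}
schema_v2 = {"customer_id": "int", "name": "string", "email": "string"}

merged = merge_schemas(schema_v1, schema_v2)
print(merged)
# {'customer_id': 'int', 'name': 'string', 'email': 'string'}

# Rows from the older file simply have no value for the newer column.
old_row = {"customer_id": 1, "name": "Asha"}
aligned_row = {column: old_row.get(column) for column in merged}
print(aligned_row)
# {'customer_id': 1, 'name': 'Asha', 'email': None}
```

In Spark, the Parquet reader exposes this behavior through the `mergeSchema` option, for example `spark.read.option("mergeSchema", "true").parquet(path)`.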
