Azure Data Engineering Cookbook - Second Edition

By: Nagaraj Venkatesan, Ahmad Osama
Overview of this book

The famous quote 'Data is the new oil' seems truer every day, as the key to most organizations' long-term success lies in extracting insights from raw data. One of the major challenges organizations face in extracting value from data is building performant data engineering pipelines for data visualization, ingestion, storage, and processing. This second edition of the immensely successful book by Ahmad Osama brings you several recent enhancements in Azure data engineering and shares approximately 80 useful recipes covering common scenarios in building data engineering pipelines in Microsoft Azure. You’ll explore recipes from Azure Synapse Analytics workspaces Gen 2 and get to grips with Synapse Spark pools, SQL Serverless pools, Synapse integration pipelines, and Synapse data flows. You’ll also understand Synapse SQL pool optimization techniques in this second edition. Besides the Synapse enhancements, you’ll discover helpful tips on managing Azure SQL Database and learn about security, high availability, and performance monitoring. Finally, the book takes you through overall data engineering pipeline management, focusing on monitoring with Log Analytics and tracking data lineage with Azure Purview. By the end of this book, you’ll be able to build superior data engineering pipelines and have an invaluable go-to guide to refer back to.

Accessing Blob storage accounts using managed identities

In this recipe, we will grant a managed identity permissions on a storage account and show how managed identities can be used to connect to Azure Data Lake.

Managed identities are password-less service accounts used by Azure services such as Data Factory and Azure VMs to access other Azure services, such as Blob storage. In this recipe, we will show you how Azure Data Factory's managed identity can be granted permission on an Azure Blob storage account.
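
Under the hood, a managed identity is simply an Azure Active Directory principal tied to the resource that owns it. As a minimal illustration (assuming the ADFPacktADE2 data factory created later in this recipe already exists in the packtadestorage resource group), the following PowerShell sketch retrieves the principal ID of a data factory's system-assigned managed identity:

    # Fetch the data factory and inspect its system-assigned managed identity
    $dataFactory = Get-AzDataFactoryV2 -ResourceGroupName "packtadestorage" -Name "ADFPacktADE2"
    # The principal ID is the Azure AD object that role assignments are granted to
    $dataFactory.Identity.PrincipalId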

Getting ready

Before you start, perform the following steps:

  1. Open a web browser and go to the Azure portal at https://portal.azure.com.
  2. Make sure you have an existing storage account. If not, create one using the Provisioning an Azure storage account using the Azure portal recipe in Chapter 1, Creating and Managing Data in Azure Data Lake, or use the PowerShell sketch after this list.
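
If you prefer to script the prerequisite rather than use the portal, the following is a minimal PowerShell sketch that provisions a Data Lake Storage Gen2-capable storage account; the resource group, account name, location, and SKU shown here are assumptions and should be adjusted to your environment:

    # Provision a storage account with the hierarchical namespace enabled (Data Lake Storage Gen2)
    New-AzStorageAccount -ResourceGroupName "packtadestorage" `
        -Name "packtadestoragev2" `
        -Location "eastus" `
        -SkuName "Standard_LRS" `
        -Kind "StorageV2" `
        -EnableHierarchicalNamespace $true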

How to do it…

We will test access to a Data Lake account using a managed identity. To do so, we will create a Data Factory instance and use its managed identity to access the Data Lake account. Perform the following steps:

  1. Create an Azure Data Factory by using the following PowerShell command:
    # Create the data factory; a system-assigned managed identity is created automatically with it
    $resourceGroupName = "packtadestorage"
    $location = "east us"
    $dataFactoryName = "ADFPacktADE2"
    $DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName -Location $location -Name $dataFactoryName
  2. Go to the storage account in the Azure portal. Click on Access Control (IAM) and then Add, as shown in the following screenshot:
Figure 2.22 – Adding a role to a managed identity

  3. Select Add role assignment and search for the Storage Blob Data Contributor role. Select the role and click Next. Select Managed identity in Assign access to and click on + Select members, as shown in the following screenshot:
Figure 2.23 – Selecting the Data Factory managed identity

  4. Your subscription should be selected by default. From the Managed identity dropdown, select Data Factory (V2) (1). Select the recently created ADFPacktADE2 Data Factory and click on the Select button:
Figure 2.24 – Assigning a role to a managed identity

  5. Click on Review + Assign to complete the assignment. To test whether it's working, open the ADFPacktADE2 Data Factory that was created in step 1. Click on Open Azure Data Factory Studio, as shown in the next screenshot:
Figure 2.25 – Opening Azure Data Factory Studio

  6. Click on the Manage button on the left and then Linked services. Click on + New, as shown in the following screenshot:
Figure 2.26 – Creating a linked service in Data Factory

  7. Search for Data Lake and select Azure Data Lake Storage Gen2 as the data store. Select Managed Identity for Authentication method. Select the storage account (packtadestoragev2) for Storage account name. Click on Test connection:
Figure 2.27 – Testing a managed identity connection in Data Factory

A successful test connection confirms that we can connect to the storage account using a managed identity.
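
If you prefer to script the role assignment performed in steps 2 to 5 instead of using the portal, the following is a minimal PowerShell sketch, assuming the data factory and storage account names used in this recipe:

    # Get the principal ID of the data factory's system-assigned managed identity
    $principalId = (Get-AzDataFactoryV2 -ResourceGroupName "packtadestorage" -Name "ADFPacktADE2").Identity.PrincipalId
    # Get the storage account so the role assignment can be scoped to it
    $storageAccount = Get-AzStorageAccount -ResourceGroupName "packtadestorage" -Name "packtadestoragev2"
    # Grant the managed identity the Storage Blob Data Contributor role on the storage account
    New-AzRoleAssignment -ObjectId $principalId `
        -RoleDefinitionName "Storage Blob Data Contributor" `
        -Scope $storageAccount.Id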

How it works…

A managed identity for Data Factory was created automatically when the Data Factory instance was created. We granted this managed identity the Storage Blob Data Contributor role on the Azure Data Lake storage account. Hence, Data Factory was able to connect to the storage account securely, without using a key or password.
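
For reference, a linked service that authenticates with a managed identity needs no credential properties in its definition; only the storage endpoint URL is specified, and Data Factory then authenticates with its own identity. The following PowerShell sketch creates such a linked service without using Azure Data Factory Studio; the linked service name and file path are illustrative assumptions:

    # Build the linked service definition; with no credential specified, Data Factory
    # authenticates with its system-assigned managed identity
    $definition = @{
        name = "ADLSGen2ManagedIdentity"
        properties = @{
            type = "AzureBlobFS"
            typeProperties = @{
                url = "https://packtadestoragev2.dfs.core.windows.net"
            }
        }
    }
    $definition | ConvertTo-Json -Depth 5 | Out-File -FilePath ".\ADLSGen2ManagedIdentity.json"
    # Register the linked service with the data factory
    Set-AzDataFactoryV2LinkedService -ResourceGroupName "packtadestorage" `
        -DataFactoryName "ADFPacktADE2" `
        -Name "ADLSGen2ManagedIdentity" `
        -DefinitionFile ".\ADLSGen2ManagedIdentity.json"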