Azure Data Engineer Associate Certification Guide

By: Newton Alex

Overview of this book

Azure is one of the leading cloud providers in the world, providing numerous services for data hosting and data processing. Most companies today are either cloud-native or migrating to the cloud faster than ever. This has led to an explosion of data engineering jobs, with aspiring and experienced data engineers trying to outshine each other. Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure Data Engineer. This book will help you prepare for the DP-203 examination in a structured way, covering all the topics specified in the syllabus with detailed explanations and exam tips. The book starts by covering the fundamentals of Azure, and then takes the example of a hypothetical company and walks you through the various stages of building data engineering solutions. Throughout the chapters, you'll learn about the various Azure components involved in building data systems and will explore them using a wide range of real-world use cases. Finally, you'll work on sample questions and answers to familiarize yourself with the pattern of the exam. By the end of this Azure book, you'll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.
Table of Contents (23 chapters)

Part 1: Azure Basics
Part 2: Data Storage
Part 3: Design and Develop Data Processing (25-30%)
Part 4: Design and Implement Data Security (10-15%)
Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
Part 6: Practice Exercises

What this book covers

The chapters in this book are designed around the skill sets listed by Microsoft for the coursework:

Exam DP-203: Data Engineering on Microsoft Azure – Skills Measured

https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4MbYT

Chapter 1, Introducing Azure Basics, introduces the audience to Azure and explains its general capabilities. This is a refresher chapter designed to refresh our understanding of some of the core Azure concepts, including VMs, data storage, compute options, the Azure portal, accounts, and subscriptions. We will be building on top of these technologies in future chapters.

Chapter 2, Designing a Data Storage Structure, focuses on the various storage solutions available in Azure. We will cover topics such as Azure Data Lake Storage, Blob storage, and SQL- and NoSQL-based storage. We will also get into the details of when to choose what storage and how to optimize this storage using techniques such as data pruning, data distribution, and data archiving.

Chapter 3, Designing a Partition Strategy, explores the different partition strategies available. We will focus on how to efficiently split and store the data for different types of workloads and will see some recommendations on when and how to partition the data for different use cases, including analytics and batch processing.
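To give a flavor of the partitioning ideas Chapter 3 explores, the sketch below derives a date-based folder layout in plain Python. The `year=/month=/day=` layout and the column names are illustrative assumptions, not the book's exact examples; the point is that date-filtered queries can skip every folder outside the requested range.

```python
from datetime import date

def partition_path(base: str, event_date: date) -> str:
    """Build a year/month/day folder path, a common layout for
    date-partitioned analytics data on a data lake (illustrative sketch)."""
    return (f"{base}/year={event_date.year}"
            f"/month={event_date.month:02d}"
            f"/day={event_date.day:02d}")

# Records landing on the same day share a partition folder, so a query
# filtered to one day can prune all other folders entirely.
print(partition_path("sales", date(2024, 3, 7)))  # → sales/year=2024/month=03/day=07
```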

Chapter 4, Designing the Serving Layer, is dedicated to the design of the different types of schemas, such as the Star and Snowflake schemas. We will focus on designing slowly-changing dimensions, building a dimensional hierarchy, temporal solutions, and other such advanced topics. We will also focus on sharing data between the different compute technologies, including Azure Databricks and Azure Synapse, using metastores.
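As a preview of the slowly-changing dimension material in Chapter 4, here is a minimal Type 2 SCD sketch in plain Python: instead of overwriting a changed dimension row, we expire the current version and append a new one with validity dates. The `customer_id`, `is_current`, `valid_from`, and `valid_to` column names are assumptions chosen for illustration.

```python
from datetime import date

def apply_scd2(dimension: list, new_record: dict, today: date) -> list:
    """Type 2 slowly-changing dimension sketch: expire the current
    version of the changed row, then append the new version with
    validity dates (column names are illustrative)."""
    out = []
    for row in dimension:
        if row["customer_id"] == new_record["customer_id"] and row["is_current"]:
            # Close out the old version instead of overwriting it.
            out.append({**row, "is_current": False, "valid_to": today})
        else:
            out.append(row)
    # Append the new version as the current row.
    out.append({**new_record, "is_current": True,
                "valid_from": today, "valid_to": None})
    return out
```

This preserves full history, so a fact row can always be joined to the dimension version that was valid when the fact occurred.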

Chapter 5, Implementing Physical Data Storage Structures, focuses on the implementation of lower-level aspects of data storage, including compression, sharding, data distribution, indexing, data redundancy, archiving, storage tiers, and replication, with the help of examples.

Chapter 6, Implementing Logical Data Structures, focuses on implementing temporal data structures and slowly-changing dimensions using Azure Data Factory (ADF), and on building folder structures for analytics and streaming data that improve query performance and assist with data pruning.

Chapter 7, Implementing the Serving Layer, focuses on implementing a relational star schema, storing files in different formats, such as Parquet and ORC, and building and using a metastore between Synapse and Azure Databricks.

Chapter 8, Ingesting and Transforming Data, introduces the various Azure data processing technologies, including Synapse Analytics, ADF, Azure Databricks, and Stream Analytics. We will focus on the various data transformations that can be performed using T-SQL, Spark, and ADF. We will also look into aspects of data pipelines, such as cleansing the data, parsing data, encoding and decoding data, normalizing and denormalizing values, error handling, and basic data exploration techniques.
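The cleansing and normalization steps mentioned above can be sketched in a few lines of plain Python; the field names and fill-in defaults here are assumptions for illustration, and in the book such logic would typically run inside a T-SQL, Spark, or ADF transformation.

```python
def normalize_record(raw: dict) -> dict:
    """Basic cleansing sketch: trim whitespace, standardize case,
    and fill missing values before loading (illustrative fields)."""
    return {
        "name": (raw.get("name") or "unknown").strip().title(),
        "country": (raw.get("country") or "N/A").strip().upper(),
    }

print(normalize_record({"name": "  alice smith ", "country": "us"}))
# → {'name': 'Alice Smith', 'country': 'US'}
```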

Chapter 9, Designing and Developing a Batch Processing Solution, focuses on building an end-to-end batch processing system. We will cover techniques for handling incremental data, slowly-changing dimensions, missing data, late-arriving data, duplicate data, and more. We will also cover security and compliance aspects, along with techniques to debug issues in data pipelines.

Chapter 10, Designing and Developing a Stream Processing Solution, is dedicated to stream processing. We will build end-to-end streaming systems using Stream Analytics, Event Hubs, and Azure Databricks. We will explore the various windowed aggregation options available and learn how to handle schema drifts, along with time series data, partitions, checkpointing, replaying data, and so on. We will also cover techniques to handle interruptions, scale the resources, error handling, and so on.
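Of the windowed aggregation options Chapter 10 covers, the simplest is the tumbling window, where each event falls into exactly one fixed, non-overlapping window. The stdlib Python sketch below illustrates the idea with integer-second timestamps; a real pipeline would express this in Stream Analytics or Spark rather than hand-rolled code.

```python
from collections import defaultdict

def tumbling_window_counts(event_times: list, window_seconds: int) -> dict:
    """Tumbling-window aggregation sketch: bucket each event into the
    fixed, non-overlapping window that contains its timestamp, then
    count events per window (timestamps in seconds, for illustration)."""
    counts = defaultdict(int)
    for ts in event_times:
        # Floor the timestamp to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Events at t=1, 4, and 12 s with 10 s windows land in [0,10) and [10,20).
print(tumbling_window_counts([1, 4, 12], 10))  # → {0: 2, 10: 1}
```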

Chapter 11, Managing Batches and Pipelines, is dedicated to managing and debugging batch and streaming pipelines. We will look into techniques to configure and trigger jobs, and to debug failed jobs. We will dive deeper into the features available in ADF and Synapse pipelines for scheduling pipelines. We will also look at implementing version control in ADF.

Chapter 12, Designing Security for Data Policies and Standards, focuses on how to design and implement data encryption, both at rest and in transit, data auditing, data masking, data retention, data purging, and so on. We will also learn about the RBAC features of ADLS Gen2 storage and explore row- and column-level security in Azure SQL and Synapse Analytics. We will dive deep into techniques for handling managed identities, keys, secrets, and resource tokens, and learn how to handle sensitive information.
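To illustrate the kind of masking behavior Chapter 12 discusses, the sketch below mimics what a dynamic data masking rule does for an email column: only the first character of the local part and the full domain are exposed. This is a hand-rolled illustration, not the Azure SQL masking feature itself, which is configured declaratively on the column.

```python
def mask_email(email: str) -> str:
    """Email-masking sketch in the spirit of dynamic data masking:
    expose only the first character of the local part and the domain."""
    local, _, domain = email.partition("@")
    return (local[:1] + "***@" + domain) if domain else "***"

print(mask_email("newton@example.com"))  # → n***@example.com
```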

Chapter 13, Monitoring Data Storage and Data Processing, focuses on logging, configuring monitoring services, measuring performance, integrating with CI/CD systems, custom logging and monitoring options, querying using Kusto, and finally, tips on debugging Spark jobs.

Chapter 14, Optimizing and Troubleshooting Data Storage and Data Processing, focuses on tuning and debugging Spark or Synapse queries. We will dive deeper into query-level debugging, including how to handle shuffles, UDFs, data skews, indexing, and cache management. We will also spend some time troubleshooting Spark and Synapse pipelines.

Chapter 15, Sample Questions with Solutions, is where we put everything we have learned into practice. We will explore a range of real-world problems and learn how to use the information in this book to answer the certification questions. This will help you prepare for both the exam and real-world problems.

Note

All the information provided in this book is based on public Azure documents. The author is neither associated with the Azure Certification team nor has access to any of the Azure Certification questions, other than what is publicly made available by Microsoft.