Azure Data Engineer Associate Certification Guide

By: Newton Alex

Overview of this book

Azure is one of the leading cloud providers in the world, providing numerous services for data hosting and data processing. Most companies today are either cloud-native or migrating to the cloud faster than ever. This has led to an explosion of data engineering jobs, with aspiring and experienced data engineers trying to outshine each other. Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure Data Engineer. This book will help you prepare for the DP-203 examination in a structured way, covering all the topics specified in the syllabus with detailed explanations and exam tips. The book starts by covering the fundamentals of Azure, and then takes the example of a hypothetical company and walks you through the various stages of building data engineering solutions. Throughout the chapters, you'll learn about the various Azure components involved in building data systems and will explore them using a wide range of real-world use cases. Finally, you'll work on sample questions and answers to familiarize yourself with the pattern of the exam. By the end of this Azure book, you'll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.
Table of Contents (23 chapters)

Part 1: Azure Basics
Part 2: Data Storage
Part 3: Design and Develop Data Processing (25-30%)
Part 4: Design and Implement Data Security (10-15%)
Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
Part 6: Practice Exercises

What this book covers

The chapters in this book are designed around the skill sets listed by Microsoft for the coursework:

Exam DP-203: Data Engineering on Microsoft Azure – Skills Measured

https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4MbYT

Chapter 1, Introducing Azure Basics, introduces the audience to Azure and explains its general capabilities. This is a refresher chapter designed to refresh our understanding of some of the core Azure concepts, including VMs, data storage, compute options, the Azure portal, accounts, and subscriptions. We will be building on top of these technologies in future chapters.

Chapter 2, Designing a Data Storage Structure, focuses on the various storage solutions available in Azure. We will cover topics such as Azure Data Lake Storage, Blob storage, and SQL- and NoSQL-based storage. We will also get into the details of when to choose what storage and how to optimize this storage using techniques such as data pruning, data distribution, and data archiving.

Chapter 3, Designing a Partition Strategy, explores the different partition strategies available. We will focus on how to efficiently split and store the data for different types of workloads and will see some recommendations on when and how to partition the data for different use cases, including analytics and batch processing.
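To give a flavor of the partitioning ideas Chapter 3 explores, the sketch below derives a date-based folder layout in plain Python. The `year=/month=/day=` layout and the column names are illustrative assumptions, not the book's exact examples; the point is that date-filtered queries can skip every folder outside the requested range.

```python
from datetime import date

def partition_path(base: str, event_date: date) -> str:
    """Build a year/month/day folder path, a common layout for
    date-partitioned analytics data on a data lake (illustrative sketch)."""
    return (f"{base}/year={event_date.year}"
            f"/month={event_date.month:02d}"
            f"/day={event_date.day:02d}")

# Records landing on the same day share a partition folder, so a query
# filtered to one day can prune all other folders entirely.
print(partition_path("sales", date(2024, 3, 7)))  # → sales/year=2024/month=03/day=07
```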

Chapter 4, Designing the Serving Layer, is dedicated to the design of the different types of schemas, such as the Star and Snowflake schemas. We will focus on designing slowly-changing dimensions, building a dimensional hierarchy, temporal solutions, and other such advanced topics. We will also focus on sharing data between the different compute technologies, including Azure Databricks and Azure Synapse, using metastores.
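As a preview of the slowly-changing dimension material in Chapter 4, here is a minimal Type 2 SCD sketch in plain Python: instead of overwriting a changed dimension row, we expire the current version and append a new one with validity dates. The `customer_id`, `is_current`, `valid_from`, and `valid_to` column names are assumptions chosen for illustration.

```python
from datetime import date

def apply_scd2(dimension: list, new_record: dict, today: date) -> list:
    """Type 2 slowly-changing dimension sketch: expire the current
    version of the changed row, then append the new version with
    validity dates (column names are illustrative)."""
    out = []
    for row in dimension:
        if row["customer_id"] == new_record["customer_id"] and row["is_current"]:
            # Close out the old version instead of overwriting it.
            out.append({**row, "is_current": False, "valid_to": today})
        else:
            out.append(row)
    # Append the new version as the current row.
    out.append({**new_record, "is_current": True,
                "valid_from": today, "valid_to": None})
    return out
```

This preserves full history, so a fact row can always be joined to the dimension version that was valid when the fact occurred.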

Chapter 5, Implementing Physical Data Storage Structures, focuses on the implementation of lower-level aspects of data storage, including compression, sharding, data distribution, indexing, data redundancy, archiving, storage tiers, and replication, with the help of examples.

Chapter 6, Implementing Logical Data Structures, focuses on implementing temporal data structures and slowly-changing dimensions using Azure Data Factory (ADF), and on building folder structures for analytics and streaming data that improve query performance and assist with data pruning.

Chapter 7, Implementing the Serving Layer, focuses on implementing a relational star schema, storing files in different formats, such as Parquet and ORC, and building and using a metastore between Synapse and Azure Databricks.

Chapter 8, Ingesting and Transforming Data, introduces the various Azure data processing technologies, including Synapse Analytics, ADF, Azure Databricks, and Stream Analytics. We will focus on the various data transformations that can be performed using T-SQL, Spark, and ADF. We will also look into aspects of data pipelines, such as cleansing the data, parsing data, encoding and decoding data, normalizing and denormalizing values, error handling, and basic data exploration techniques.
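The cleansing and normalization steps mentioned above can be sketched in a few lines of plain Python; the field names and fill-in defaults here are assumptions for illustration, and in the book such logic would typically run inside a T-SQL, Spark, or ADF transformation.

```python
def normalize_record(raw: dict) -> dict:
    """Basic cleansing sketch: trim whitespace, standardize case,
    and fill missing values before loading (illustrative fields)."""
    return {
        "name": (raw.get("name") or "unknown").strip().title(),
        "country": (raw.get("country") or "N/A").strip().upper(),
    }

print(normalize_record({"name": "  alice smith ", "country": "us"}))
# → {'name': 'Alice Smith', 'country': 'US'}
```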

Chapter 9, Designing and Developing a Batch Processing Solution, focuses on building an end-to-end batch processing system. We will cover techniques for handling incremental data, slowly-changing dimensions, missing data, late-arriving data, duplicate data, and more. We will also cover security and compliance aspects, along with techniques to debug issues in data pipelines.

Chapter 10, Designing and Developing a Stream Processing Solution, is dedicated to stream processing. We will build end-to-end streaming systems using Stream Analytics, Event Hubs, and Azure Databricks. We will explore the various windowed aggregation options available and learn how to handle schema drifts, along with time series data, partitions, checkpointing, replaying data, and so on. We will also cover techniques to handle interruptions, scale the resources, error handling, and so on.
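Of the windowed aggregation options Chapter 10 covers, the simplest is the tumbling window, where each event falls into exactly one fixed, non-overlapping window. The stdlib Python sketch below illustrates the idea with integer-second timestamps; a real pipeline would express this in Stream Analytics or Spark rather than hand-rolled code.

```python
from collections import defaultdict

def tumbling_window_counts(event_times: list, window_seconds: int) -> dict:
    """Tumbling-window aggregation sketch: bucket each event into the
    fixed, non-overlapping window that contains its timestamp, then
    count events per window (timestamps in seconds, for illustration)."""
    counts = defaultdict(int)
    for ts in event_times:
        # Floor the timestamp to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Events at t=1, 4, and 12 s with 10 s windows land in [0,10) and [10,20).
print(tumbling_window_counts([1, 4, 12], 10))  # → {0: 2, 10: 1}
```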

Chapter 11, Managing Batches and Pipelines, is dedicated to managing and debugging batch and streaming pipelines. We will look into techniques to configure and trigger jobs, and to debug failed jobs. We will dive deeper into the features available in ADF and Synapse pipelines for scheduling pipelines. We will also look at implementing version control in ADF.

Chapter 12, Designing Security for Data Policies and Standards, focuses on how to design and implement data encryption, both at rest and in transit, data auditing, data masking, data retention, data purging, and so on. We will also learn about the RBAC features of ADLS Gen2 storage and explore row- and column-level security in Azure SQL and Synapse Analytics. We will dive deep into techniques for handling managed identities, keys, secrets, and resource tokens, and learn how to handle sensitive information.
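To illustrate the kind of masking behavior Chapter 12 discusses, the sketch below mimics what a dynamic data masking rule does for an email column: only the first character of the local part and the full domain are exposed. This is a hand-rolled illustration, not the Azure SQL masking feature itself, which is configured declaratively on the column.

```python
def mask_email(email: str) -> str:
    """Email-masking sketch in the spirit of dynamic data masking:
    expose only the first character of the local part and the domain."""
    local, _, domain = email.partition("@")
    return (local[:1] + "***@" + domain) if domain else "***"

print(mask_email("newton@example.com"))  # → n***@example.com
```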

Chapter 13, Monitoring Data Storage and Data Processing, focuses on logging, configuring monitoring services, measuring performance, integrating with CI/CD systems, custom logging and monitoring options, querying using Kusto, and finally, tips on debugging Spark jobs.

Chapter 14, Optimizing and Troubleshooting Data Storage and Data Processing, focuses on tuning and debugging Spark or Synapse queries. We will dive deeper into query-level debugging, including how to handle shuffles, UDFs, data skews, indexing, and cache management. We will also spend some time troubleshooting Spark and Synapse pipelines.

Chapter 15, Sample Questions with Solutions, is where we put everything we have learned into practice. We will explore a range of real-world problems and learn how to use the information in this book to answer the certification questions. This will help you prepare for both the exam and real-world problems.

Note

All the information provided in this book is based on public Azure documents. The author is neither associated with the Azure Certification team nor has access to any of the Azure Certification questions, other than what is publicly made available by Microsoft.