Azure Data Engineer Associate Certification Guide

By: Newton Alex

Overview of this book

Azure is one of the leading cloud providers in the world, providing numerous services for data hosting and data processing. Most companies today are either cloud-native or migrating to the cloud faster than ever. This has led to an explosion of data engineering jobs, with aspiring and experienced data engineers trying to outshine each other. Gaining the DP-203: Azure Data Engineer Associate certification is a sure-fire way of showing future employers that you have what it takes to become an Azure Data Engineer. This book will help you prepare for the DP-203 examination in a structured way, covering all the topics specified in the syllabus with detailed explanations and exam tips. The book starts by covering the fundamentals of Azure, and then takes the example of a hypothetical company and walks you through the various stages of building data engineering solutions. Throughout the chapters, you'll learn about the various Azure components involved in building data systems and explore them using a wide range of real-world use cases. Finally, you'll work on sample questions and answers to familiarize yourself with the pattern of the exam. By the end of this Azure book, you'll have gained the confidence you need to pass the DP-203 exam with ease and land your dream job in data engineering.

Tuning shuffle partitions

Spark uses a technique called shuffle to move data between its executors or nodes while performing operations such as join, groupBy, and reduceByKey. The shuffle operation is expensive because it moves data across the network between nodes, so it is usually preferable to reduce the amount of shuffling in a Spark query. The number of partitions that Spark creates when shuffling data for joins and aggregations is determined by the following configuration:

spark.conf.set("spark.sql.shuffle.partitions",200)

200 is the default value, and you can tune it to a number that suits your query best. If you have a lot of data and too few partitions, each task has to process a large partition and takes longer. On the other hand, if you have little data and too many shuffle partitions, the overhead of running many small shuffle tasks will degrade performance. You will therefore have to run your query multiple times with different shuffle partition values to arrive at an optimum number.
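
The trial-and-error approach described above can be sketched as a small helper that sets the configuration, runs the query, and records the elapsed time for each candidate value. This is only an illustrative sketch, not a method prescribed by the book. The function names time_shuffle_partition_settings and fastest_setting are hypothetical, and spark is assumed to be an existing SparkSession (or any object exposing conf.set):

```python
import time
from typing import Callable, Dict, Iterable


def time_shuffle_partition_settings(
    spark,
    run_query: Callable[[], object],
    candidates: Iterable[int],
) -> Dict[int, float]:
    """Run the same query once per candidate value and record wall-clock seconds.

    `spark` is assumed to be a SparkSession; only `spark.conf.set` is used here.
    """
    timings: Dict[int, float] = {}
    for num_partitions in candidates:
        # Same configuration key as shown in the text above.
        spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
        start = time.perf_counter()
        # run_query must trigger a Spark action (for example a count or a write);
        # otherwise Spark's lazy evaluation means nothing is actually executed.
        run_query()
        timings[num_partitions] = time.perf_counter() - start
    return timings


def fastest_setting(timings: Dict[int, float]) -> int:
    """Return the candidate value with the lowest recorded time."""
    return min(timings, key=timings.get)


# Hypothetical usage, assuming `spark` and a DataFrame `df` with a column "key":
# timings = time_shuffle_partition_settings(
#     spark,
#     lambda: df.groupBy("key").count().collect(),
#     [50, 100, 200, 400],
# )
# print(fastest_setting(timings))
```

A single run per value can be noisy because of caching and cluster load, so repeating each measurement and comparing averages is likely to give a more reliable answer. Also note that on recent Spark versions, Adaptive Query Execution may coalesce shuffle partitions at runtime, which can reduce the visible effect of this setting.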

You can...