Book Image

Microsoft Certified Azure Data Fundamentals (Exam DP-900) Certification Guide

By : Marcelo Leite

5 (1)

Book Image

Microsoft Certified Azure Data Fundamentals (Exam DP-900) Certification Guide

5 (1)

By: Marcelo Leite

Overview of this book

Passing the DP-900 Microsoft Azure Data Fundamentals exam opens the door to a myriad of opportunities for working with data services in the cloud. But it is not an easy exam and you'll need a guide to set you up for success and prepare you for a career in Microsoft Azure. Absolutely everything you need to pass the DP-900 exam is covered in this concise handbook. After an introductory chapter covering the core terms and concepts, you'll go through the various roles related to working with data in the cloud and learn the similarities and differences between relational and non-relational databases. This foundational knowledge is crucial, as you'll learn how to provision and deploy Azure's relational and non-relational services in detail later in the book. You'll also gain an understanding of how to glean insights with data analytics at both small and large scales, and how to visualize your insights with Power BI. Once you reach the end of the book, you'll be able to test your knowledge with practice tests with detailed explanations of the correct answers. By the end of this book, you will be armed with the knowledge and confidence to not only pass the DP-900 exam but also have a solid foundation from which to embark on a career in Azure data services.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Share your thoughts

Download a free PDF copy of this book

Part 1: Core Data Concepts

Part 1: Core Data Concepts

Free Chapter

Chapter 1: Understanding the Core Data Terminologies

Chapter 1: Understanding the Core Data Terminologies

Understanding the core data concepts

Describing a data solution

Defining the data type and proper storage

Understanding data ingestion

Sample questions and answers

Chapter 2: Exploring the Roles and Responsibilities in Data Domain

Chapter 2: Exploring the Roles and Responsibilities in Data Domain

Different workforces in a data domain

Tasks and tools for database administration profiles

Tasks and tools for data engineer profiles

Tasks and tools for the data analyst

Sample questions and answers

Chapter 3: Working with Relational Data

Chapter 3: Working with Relational Data

Exploring the characteristics of relational data

Exploring relational data structures

Introducing SQL

Describing the database components

Sample questions and answers

Chapter 4: Working with Non-Relational Data

Chapter 4: Working with Non-Relational Data

Exploring the characteristics of non-relational data

Understanding the types of non-relational data

Exploring NoSQL databases

Identifying non-relational database use cases

Sample questions and answers

Chapter 5: Exploring Data Analytics Concepts

Chapter 5: Exploring Data Analytics Concepts

Exploring data ingestion and processing

Exploring the analytical data store

Exploring an analytical data model

Exploring data visualization

Sample questions and answers

Part 2: Relational Data in Azure

Part 2: Relational Data in Azure

Chapter 6: Integrating Relational Data on Azure

Chapter 6: Integrating Relational Data on Azure

Exploring relational Azure data services

Sample questions and answers

Chapter 7: Provisioning and Configuring Relational Database Services in Azure

Chapter 7: Provisioning and Configuring Relational Database Services in Azure

Technical requirements

Provisioning relational Azure data services

Configuring relational databases on Azure

Sample questions and answers

Chapter 8: Querying Relational Data in Azure

Chapter 8: Querying Relational Data in Azure

Technical requirements

Introducing SQL on Azure

Querying relational data in Azure SQL Database

Querying relational data in Azure Database for PostgreSQL

Sample questions and answers

Part 3: Non-Relational Data in Azure

Part 3: Non-Relational Data in Azure

Chapter 9: Exploring Non-Relational Data Offerings in Azure

Chapter 9: Exploring Non-Relational Data Offerings in Azure

Exploring Azure non-relational data stores

Exploring Azure NoSQL databases

Sample questions and answers

Chapter 10: Provisioning and Configuring Non-Relational Data Services in Azure

Chapter 10: Provisioning and Configuring Non-Relational Data Services in Azure

Technical requirements

Provisioning non-relational data services

Creating a sample Azure Cosmos DB database

Provisioning an Azure storage account and Data Lake Storage

Sample questions and answers

Part 4: Analytics Workload on Azure

Part 4: Analytics Workload on Azure

Chapter 11: Components of a Modern Data Warehouse

Chapter 11: Components of a Modern Data Warehouse

Describing modern data warehousing

Exploring Azure data services for modern data warehouses

Real-time data analytics – Azure Stream Analytics, Azure Synapse Data Explorer, and Spark streaming

Sample questions and answers

Chapter 12: Provisioning and Configuring Large-Scale Data Analytics in Azure

Chapter 12: Provisioning and Configuring Large-Scale Data Analytics in Azure

Technical requirements

Understanding common practices for data loading

Sample questions and answers

Chapter 13: Working with Power BI

Chapter 13: Working with Power BI

Technical requirements

Introducing Power BI

The building blocks of Power BI

Exploring Power BI Desktop

Exploring Power BI Service

Power BI mobile app

Sample questions and answers

Chapter 14: DP-900 Mock Exam

Chapter 14: DP-900 Mock Exam

Practice test – questions

Practice test – answers and explanations

Index

Other Books You May Enjoy

Other Books You May Enjoy

Packt is searching for authors like you

Share your thoughts

Download a free PDF copy of this book

Customer Reviews

5 (1)

5 star

100%

4 star

0

3 star

0

2 star

0

1 star

0

Defining the data type and proper storage

Categorizing the data to identify its types and best solutions for your storage is an important process for a data solution, and not just for evaluating whether it is structured, unstructured, or semi-structured. In this section, you will learn about the characteristics of different types of data.

Characteristics of relational and non-relational databases

Relational databases are the most traditional and used database format, as they have an easy-to-understand design and a simple tabular data model like other simple platforms such as Excel spreadsheets. Relational databases have predefined schemas, which are the structures of their tables, containing columns, the data type of each column, and other parameters such as primary and secondary keys used in relationships.

However, relational databases with these rigid schemas can pose challenges, as presented in the following example.

Your CRM system has a database structure with a CUSTOMER table, where you intend to store customer data: CUSTOMER_ID, CUSTOMER_NAME, ADDRESS, MOBILE_PHONE, and ZIP_CODE. To do this, you start by creating a CUSTOMER table with five fields:

Figure 1.7 – Example of a CUSTOMER table in a relational database

Figure 1.7 – Example of a CUSTOMER table in a relational database

However, after setting up this table, you realize that you have clients that have more than one address and zip code, and even more than one mobile phone number. How can you solve this issue?

To face problems like this one, we can use normalization one more time. Normalization is done when there is a need to split a table (CUSTOMER, in this example) into more child tables that are correlated to the initial table.

Therefore, we can change the CUSTOMER table as follows:

Figure 1.8 – A relationship model in a transactional database

Figure 1.8 – A relationship model in a transactional database

Non-relational databases allow you to store data in its original format without having a predefined schema as in relational databases. The most common non-relational storage format is document storage, where each record in the database is an independent file. The benefit is that each file can have different and unique attributes.

On the other hand, the files being independent can present a challenge: data duplication.

Going back to our CUSTOMER entity example in a relational database, when two or more customers live at one address, the database records that relationship, and the normalized database only keeps one address record. But in a non-relational database, if two customers live at the same address, this address will be presented in the records of the first customer and the second customer as well, independently.

Let’s now analyze how this storage could be structured in a relational database, using the concept of normalization:

Figure 1.9 – Example of data structured into tables

Figure 1.9 – Example of data structured into tables

The preceding figure exemplifies the data stored within the relational model tables with the CUSTOMER, CUSTOMER_ADDRESS, and ADDRESS entities to understand the structure of a normalized table.

Now let’s analyze the same data in a CUSTOMER table, but in the format of a non-relational database:

## JSON FILE - CUSTOMER ##
{
  "CUSTOMER_ID": "0001",
  " CUSTOMER_NAME": 
  { 
    "FIRST_NAME": " MARK", 
    "LAST_NAME": " HUGGS" 
  },
  "ADDRESS": 
  {
    "STREET": "1200, Harper Str" 
   }
}
## JSON FILE – CUSTOMER2 ##
{
  "CUSTOMER_ID": "0002",
  " CUSTOMER_NAME": 
  { 
    "FIRST_NAME": " KRISTI", 
    "LAST_NAME": " LAMP" 
  },
  "ADDRESS": 
  {
    "STREET": "1200, Harper Str" 
   }
}

In the preceding example, we can see two records in a CUSTOMER table, with each record being a JSON document structured with the attributes of each customer.

Thus, we can observe that the same data can be stored in relational and non-relational structures.

Therefore, to decide between a relational or non-relational data storage solution, you must evaluate the behavior of the application or the user that will use that database, the relationships between the entities, and possible normalization processes.

Both relational and non-relational databases should be used primarily for transactional workloads. In the upcoming sections, we will understand the differences between these transactional workloads and analytical workloads.

A transactional workload

Relational and non-relational databases can be used as solutions for transactional workloads, which are the databases used to perform basic data storage operations: create, read, update, and delete (CRUD). Transactional operations must be done in sequence, with a transaction control that only confirms the conclusion of this transaction (a process called a commit) when the entire operation is successfully executed. If this does not occur, the transaction is canceled, and all processes are not performed, thus generating a process called rollback.

An important idea to help understand the difference between relational and non-relational databases is ACID, present in most database technologies. These properties are as follows:

Atomicity: This is the property that controls the transaction and defines whether it was successfully performed completely to commit or must be canceled by performing a rollback. Database technology should ensure atomicity.
Consistency: For a running transaction, it is important to evaluate consistency between the database state before receiving the data and the database state after receiving the data. For example, in a bank transfer, when funds are added to an account, those funds must have a source. Therefore, it is important to know this source and whether the fund’s source exit process has already been performed before confirming the inclusion in this new account.
Isolation: This property evaluates whether there are multiple executions of transactions similar to the current one and if so, it keeps the database in the same state. It then evaluates whether the execution of transactions was sequential. In the bank transfer example, if multiple transactions are sent simultaneously, it checks whether the amounts have already left the source for all transactions, or you need to review one by one, transaction per transaction.
Durability: This is responsible for evaluating whether a transaction remains in the committed database even if there is a failure during the process, such as a power outage or latency at the time of recording the record.

ACID properties are not unique to transactional databases; they are also found in analytic databases. At this point, the most important thing is to understand that these settings exist, and you can adjust them as per the requirements of your data solution use case.

Since we are talking about databases, let’s understand an acronym that is widely used to represent database software: DBMS.

Database management systems

Database management systems (DBMSs), which are database software, have ACID properties within their architecture, and in addition to performing these controls, they need to manage several complex situations. For example, if multiple users or systems try to access or modify database records, the database systems need to isolate transactions, perform all necessary validations quickly, and maintain the consistency of the data stored after the transaction is committed. For this, some DBMS technologies work with temporary transaction locks, so that actions are done sequentially. This lock is done during the process of an action executing in that record; for example, in an edit of a field in a table, the lock ends as soon as the commit is executed, confirming that transaction.

Some DBMSs are called distributed databases. These databases have their architecture distributed in different storage and processing locations, which can be on-premises in the company’s data center or a different data center in the cloud. Distributed database solutions are widely used to maintain consistency in databases that will serve applications in different geographic locations, but this consistency doesn’t need to be synchronous. For example, a mobile game can be played in the United States and Brazil, and the database of this game has some entities (categories, game modes, and so on) that must be shared among all players. But the transactions from the United States player do not necessarily need to appear to the player in Brazil in a real-time way; this transactional data will be synchronized from the United States to Brazil, but in an asynchronous process. Let’s understand this process next.

Eventual consistency

All transactions in distributed databases take longer to process than in undistributed databases because it is necessary to replicate the data across all nodes in this distributed system. So, to maintain an adequate replication speed, the distributed databases only synchronize the data that is needed. This is the concept of eventual consistency, which configures ACID to perform replication between the distributed nodes asynchronously, after the confirmation of the transaction on the main node of the database is created. This technique can lead to temporary inconsistencies between database nodes. Ideally, the application connected to a distributed database does not require a guarantee of data ordering. It means that the data relating to this eventual consistency may appear to users with an eventual delay as well. Distributed databases are widely used by social media platforms, for news feeds, likes, and shares, among other features.

Let’s use the following figure to understand the behavior of a database with eventual consistency:

Figure 1.10 – Diagram of an eventual consistency database

Figure 1.10 – Diagram of an eventual consistency database

The preceding diagram shows behavior that we can observe when querying information in a database with eventual consistency. Instead of fetching the ball in a sequential way, the hero who retrieved it made the query of the ball in a future frame, generating a momentary duplication of the ball. In the end, only one ball was retrieved, after the sync was done.

This is an analogy for an eventual consistency database, where queries do not need to be made on entities that are already synchronized between all replicas of the database, and sometimes, this momentary duplication happens until the asynchronous process data update is complete.

In addition to transactional, relational, or non-relational databases, we also have another data workload, the analytical workload, which we will address in the next section.

An analytical workload

The second category of data solutions is the analytical workloads. These analytical solutions are based on high-volume data processing platforms, optimized for querying and exploring, and not for CRUD transactions or with ACID properties. In analytical databases, we aggregate various data sources, such as more than one transactional database, as well as logs, files, images, videos, and everything that can generate information for a business analyst.

This raw data is processed and aggregated, thus generating summaries, trends, and predictions that can support decision-making.

An analytical workload can be based on a specific time or a sequence of dated events. In these workloads, it’s common to evaluate only the data that is relevant to the analysis. For example, if you have a sales system with a transactional database (source) with several tables recording all sales, products, categories, and customers, among others, it is important to evaluate which of these tables can be used for the analytical database (destination) and then perform the data connections.

To create an analytical database, it is necessary to perform data ingestion, a process of copying data from sources to the analytical base. For this, a technique called extract, transform, and load (ETL), or the more recent extract, load, and transform (ELT), is used. The following figure demonstrates this process with an example of a transactional database as the data source and the analytical database as the destination:

Figure 1.11 – Data flow between a transactional database and an analytical database

Figure 1.11 – Data flow between a transactional database and an analytical database

In the preceding diagram, we can see that transactional databases are storages of information systems that automate business processes. Analytical databases act on simple and advanced data analysis, using, for example, statistical models with the application of machine learning, a branch of artificial intelligence. The data ingestion process is an important process for assembling an analytical database that meets the data solution. In the next section, we will understand what data ingestion is and the different types of this ingestion.