Book Image

Data Modeling for Azure Data Services

By : Peter ter Braake
Book Image

Data Modeling for Azure Data Services

By: Peter ter Braake

Overview of this book

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design becomes paramount to successful implementation. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse and explore dimensional modeling, data vault modeling, along with designing and implementing a Data Lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services are the best fit for your model and how to implement the best design for your solution.
Table of Contents (16 chapters)
1
Section 1 – Operational/OLTP Databases
8
Section 2 – Analytics with a Data Lake and Data Warehouse
13
Section 3 – ETL with Azure Data Factory

Understanding relational theory

Let's look in a bit more depth into relational databases. They form a big part of everyday processes working with data. They also constitute the majority of this book, and what you will learn about relational theory will also help in using non-relational databases.

A relational database stores data in tables according to Codd's rules of the relational model.

Relational theory has two important pillars:

  • Elements of a set are not ordered.
  • All elements in a set are unique.

Pillar 1 – Elements of a set are not ordered

The first pillar from set-based theory is both neat and, on occasion, a bit troublesome.

I once had to visit a customer of the company I worked for because he had an issue. Whenever he added a new order to his database manually, he checked to see whether everything was OK by opening the Orders table and scrolling down to the last row. Whenever he failed to find his new order, he would enter it again. At the end of the month, he always had too many orders in his database!

In this example, my customer assumed that the row entered last would automatically be the last row in the table or the row showed last on his screen. He assumed opening a table in his application would retrieve the rows (orders) in the same order as in which they were entered. Both assumptions are incorrect (although, in theory, the latter depends on the application in which you open a table and not on the database).

A database doesn't store rows in any specific order. This provides both flexibility and write performance. However, a database administrator or developer could create indexes on tables. By using indexes, we can speed up data retrieval (query) times, in the same way an index in a book makes it quicker to find a specific topic. Certain types of indexes store the data ordered, giving the database the option to use the sorting order to quickly find rows that we are looking for. That index might be based on the customer's last name and not on the date the order was inserted in the database. Furthermore, a Database Administrator (DBA) might decide to replace the index with another one. In that case, the data is stored in a different order than it was previously. You will learn about indexes in the various chapters that zoom in on specific Azure services, such as Azure SQL Database.

Without indexes, rows are stored without a specific sorting order. With indexes, that might change depending on the index you use. But if you query the database, it is not guaranteed that the database will actually use the index. When writing a query, you can explicitly specify the sorting orders for all the rows you want returned. That sorting order is now guaranteed. If you don't ask for a specific sorting order, you will get the rows back in a random order. That order is determined by whether the database decided to use an index and additional factors, such as fragmentation in your database.

When using a relational database, the lack of order is sometimes confusing. People almost always think with some sort of ordering in mind, such as the last, best, and biggest. As long as you remember to explicitly specify a sorting order when querying a database, you will be fine. If you don't need your rows to be in a specific order, don't request it. The sorting of (large) datasets is always expensive.

Tip

When you require rows to be in a specific order, you need to specify that sorting order explicitly in your query.

Pillar 2 – All elements in a set are unique

In mathematics, a set is a collection of unique elements. For example, the set of uneven numbers consists of all uneven numbers. The number 3 is part of that set. It does not appear twice in the set. Each uneven number is in the set only once. This is called the axiom of unicity.

When applying mathematical set-based theory to everyday live databases, unicity comes in very handy. A table storing customers is the set of all customers. In a business-to-business scenario, a business is either a customer or it is not a customer. When it is a customer, we store the details in the Customer table, otherwise, we don't. We do not store every potential customer and we preferably do not store the same customer twice.

Unicity demands that each element of a set is unique.

Suppose a company is stored twice in your database. That will lead to a myriad of problems. To start with, simply counting the number of rows to determine how many customers you have will lead to the wrong result. A simple report showing sales by customer will also no longer be correct, nor will average sales by customer. The list goes on. You want customers to be unique within your database. As a different example, have a look at Figure 1.3. Can you tell how many patients live in Amsterdam?

Figure 1.3 – Patients with duplicate rows

Figure 1.3 – Patients with duplicate rows

Apart from functional reasons for keeping rows unique in a table, there are technical reasons as well. Suppose you have a table like the one in Figure 1.3. You decide that the first two rows describe the same patient, so you want to remove one of the rows from the table. You cannot do that with SQL. You can delete the row(s) where PatientID equals 1. The database will delete both rows. However, removing Peter or removing rows with Amsterdam also leads to both rows being removed. To delete a single row, you must be able to uniquely identify that row. In the example shown, there is no characteristic (column) separating row 1 from row 2. You cannot delete one without deleting the other at the same time.

Some RDBMS systems keep track of rows by adding a unique identifier under the covers. In that case, you can retrieve that value and delete one row using this value. Microsoft SQL Server is not one of those systems. You need to make sure rows are unique (or live with the consequences).

To make sure each row in a table is unique, you (the database administrator) should check for each newly inserted row that does not already exist. The simplest way you can do that is by comparing the new row column by column to see whether you already have a row with identical values for each column. This is easy enough, but really time-consuming. With 10 million rows in a table and 40 columns, this means you need to perform 400 million comparisons. You need to do that each time you enter a new row or change an existing row. You could utilize more optimized search algorithms but they will still not make this a fast operation.

Note

To enforce the uniqueness of rows, we use keys. A key is a column, or a couple of columns together, in which we store unique values.

A key is a column, or a combination of columns, that stores unique values by the nature of the data stored in the column. For instance, each car has a unique license plate. When you create a table that stores all the vehicles within your company, each vehicle has its own unique license plate. You can identify a specific car by its license plate, independent of the values of other columns within the table.

The context in which we store data can be important when it comes to columns storing unique values. Consider a table where we store a row of information each time a driver has damage to their car. Some people will have damage more often than others. Each time we store the details about the damage, we keep track of which vehicle has damage by storing the license plate. In the Vehicle table, the license plate will be unique, while in the Damage table, it will not be unique. You can see this in Figure 1.4. The Vehicle table holds one row per car. Each row describes a unique car with its unique license plate. The Damage table shows the claims of damage done to the car. Some cars have never had any claims made in relation to them, while others can be found in the table multiple times. This means that the same license plate is no longer unique:

Figure 1.4 – Vehicles and damage

Figure 1.4 – Vehicles and damage

Let's look closer now into how defining keys helps with keeping rows unique in your database.