Book Image

Data Modeling for Azure Data Services

By : Peter ter Braake
Book Image

Data Modeling for Azure Data Services

By: Peter ter Braake

Overview of this book

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design becomes paramount to successful implementation. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse and explore dimensional modeling, data vault modeling, along with designing and implementing a Data Lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services are the best fit for your model and how to implement the best design for your solution.
Table of Contents (16 chapters)
1
Section 1 – Operational/OLTP Databases
8
Section 2 – Analytics with a Data Lake and Data Warehouse
13
Section 3 – ETL with Azure Data Factory

Keys

Databases do not check rows for uniqueness automatically. Checking rows column by column is too expensive. The database developer needs to define a key. After you define a key, the database will use this to check for uniqueness. You need to define a key per table. You need to decide for each table whether or not you require (want) a key. You then need to decide which column to use as the key.

Types of keys

There are different types of keys. Let's have a look at some definitions for keys.

Candidate keys

Candidate keys are all columns within a table that logically store unique values. A table may have zero, one, or more columns that, by their nature, should store unique values in the context of the table they are in. Each column that adheres to the definition of a key is a candidate key. That makes the license plate in the Vehicle table of Figure 1.4 a candidate key. Other examples of candidate keys can be a Social Security Number (SSN) in a Persons table or an ISBN number in a Book table. An SSN or an ISBN uniquely identifies a person or a book, just like a license plate uniquely identifies a vehicle.

A key can be made up of several columns. In the Netherlands, the combination of postal code and house number is unique for an address. Websites can use this by automatically filling in a street name and city after you enter just your postal code and house number.

We can distinguish between two types of candidate keys – logical, or business, keys and technical, or surrogate, keys.

Logical keys

A logical key or business key is a key that stores values (information) with real meaning that you need to store, regardless of whether it can function as a key. No matter how you design your database, irrespective of whether you plan to use keys, the business key will be part of the data you are going to store. For example, all companies are obliged to store the SSN of their employees. Since the SSN column is part of your table, you can use it as a key.

In Data Vault theory (Chapter 9, Data Vault Modeling), business keys play a central role. Knowing your data (and the processes that use the data) becomes crucial. The business key is the characteristic that people use in their everyday lives to refer to objects. A course about data modeling is referred to (searched and found by) its name. The name is the business key. Names can be unique, but very often are not. That makes them dangerous to use as keys. Sometimes, however, they work OK.

Technical/surrogate keys

In some cases, there is not a single candidate key to be found. Adding an extra column and using unique values for it then becomes an option. That way, you are adding (creating) a candidate key. This is also an option when you do have candidate keys to start with. This type of key is called a technical key or surrogate key. Its contents are (normally) meaningless but unique.

Most database systems have features to easily create surrogate keys. For instance, a sequence is an object in the database that allows you to generate unique numbers. Each time you ask the sequence for a value, it provides you with one that is greater than the last time you asked. Each value is guaranteed to be used a maximum of one time. With a sequence, you can create efficient surrogate keys. Within SQL Server, you can create a column in a table with an extra property, identity. An identity column in SQL Server gets an automatic unique value each time a row is inserted in that table. The result is the same as with a sequence. Identity is easier in terms of its use. Both methods provide you with a candidate key.

A good surrogate key is meaningless. That means that you cannot infer anything from its value. In France, an automobile's license plate shows you where the car comes from. The code has some intrinsic meaning. In the Netherlands, a license plate has no meaning other than the fact that it is a unique code. If you keep track of how license plates are generated, you will be able to tell how old a car is. But that is basic information from another table. You cannot tell where in the Netherlands the car was registered or what type of car it is from its license plate.

Meaningless values are more flexible in terms of their use. You don't get into any sort of trouble by misinterpreting key values. You also don't encounter issues if, for some reason, the meaning of the values changes over time.

A surrogate key should be meaningless (most of the time).

Everybody knows surrogate keys. Most of the products you buy have a product number. You have an account number for your bank account. You probably have an insurance policy number, a customer number, and so on. All are surrogate keys in a database to make rows unique in that particular database.

In databases, there are two kinds of keys that you will define that play a crucial role in the database: primary keys and foreign keys. Let's have a look at them, starting with the primary key.

Primary keys

All that has been said hitherto in relation to keys is an introduction to primary keys. Since checking the uniqueness of rows in tables is too expensive when you do it on a per-column basis, we choose to check uniqueness based on one column only. We choose this column and make it the primary key. As soon as you make a column the table's primary key, the database will start using this column to check uniqueness. As long as this column holds unique values, no value is stored twice or more. The rows are said to be unique.

The primary key is a candidate key chosen by the database developer to be used by the database for checking the uniqueness of entire rows.

An important choice when designing databases is the selection of primary keys. This choice impacts the performance of the database. Besides performance, it also has an impact on how well uniqueness is actually enforced, not from a technical perspective, but from a business perspective.

Foreign keys

Before we go into the details of choosing a proper primary key, we need to introduce the foreign key. As stated earlier, a database will most likely have multiple tables. According to the definition of a database stated earlier, the data in the database is related. This means that different tables will have relationships with each other. This relationship is created by storing the primary key of one table in another table. In this other table, it is called the foreign key. It is not a key in this table, however, as it is for the other table. This foreign key references the table where the foreign key is the primary key. Because the value is unique in the table where the column is the primary key, each value in the foreign key references a specific, unique row in the table where the same column is the primary key.

To make this clear, have a look at Figure 1.5. Two employees are being paid monthly salaries. For the employee Peter, there are three payment records, one for each month for which he received a salary, while Janneke has been paid a salary twice:

Figure 1.5 – Salary payment

Figure 1.5 – Salary payment

Instead of storing everything in one table, we could store the same information in two separate tables as well. All the details about a person that have nothing to do with the salary they receive are stored in one table, while everything to do with actual payments is stored in a separate table. That leads to the design of Figure 1.6:

Figure 1.6 – Normalized salary payment

Figure 1.6 – Normalized salary payment

In Figure 1.6, the second table has a column called EmployeeID. This column does not store unique values because we enter a new row in this table for every month that a person receives a salary. For each month, we have to store the same value for EmployeeID. In the first table, however, it uniquely identifies each employee. Each value is used only once and can be a perfect fit for the primary key. In the second table, it is used as a reference to a row from the first table, making it a foreign key.

A foreign key is a column that references the primary key in another table, enabling us to combine rows from two tables. This enables us to divide our data over multiple tables without losing the relationships that exists between data elements. Splitting tables is the basis of normalization.

Choosing the primary key

The primary key plays a crucial role in databases. Logically, it is the characteristic we use to make each row unique within a table. It enables us to always select individual rows to work with. This means that we can work with unique customers, unique products, and so on. This is a hard necessity when you implement a webshop, for example. Consider a webshop that cannot distinguish one product from another, meaning they just send you a random product. Would you shop at that webshop?

As a database developer, you choose the primary key from all the candidate keys (including a surrogate key that you may add). For years, people debated whether a logical key or a surrogate key was the better choice. In Data Vault modeling, you preferably use business keys, which is a logical option. Almost all other modeling techniques prefer surrogate keys. Surrogate keys have a couple of advantages over business keys:

  • They provide better stability.
  • They never need to change.
  • They provide better performance.

Surrogate key advantage 1 – Better stability

Imagine you had to set up a Dutch database before the year 2007 and you needed to design an Employee table. Prior to 2007, the Netherlands used SOFI numbers, which are equivalent to SSNs. It was obligatory to store this information regarding your employees. Since every person in the Netherlands had a unique SOFI number, this is a candidate key and you could have used it as the primary key.

Each table that has a relationship with the Employee table, for example, the Salary payment table, uses the SOFI number to keep track regarding which payment belongs to which employee. The SOFI number is now used in two tables. A primary key in general has a big chance of being used in several other tables as a foreign key.

In 2007, the Dutch government decided to switch from SOFI numbers to BSN numbers. To make this change in your database, you now need to change all columns' SOFI numbers to columns storing BSN numbers. However, this is far more complex than changing "ordinary" columns. You need to not only change the column in the Employee table, but also the columns acting as foreign keys in other tables. This takes time and you run the risk of something going wrong, making your data invalid or inconsistent. You do not want to end up in a situation like this, but you cannot stop the government from changing the rules.

With a surrogate key, you will never have this kind of problem. No government can force you to change your own meaningless values that you use as keys. Even new legislation such as the GDPR cannot hurt you because your key values are meaningless and, by themselves, can never provide you with any privacy-related information. If you use initials as primary keys, however, you will have problems. Initials may lead you to the actual person. There is meaning in them. Simple numerical values are the best choice.

Surrogate key advantage 2 – No need to change

Another type of stability is the fact that surrogate keys never need to change. You cannot guarantee the same for business keys. What if you use SSNs as primary keys and, after a year, you find out that you made a typo when entering a person's SSN. You will have to change the value to the correct SSN, again both in the primary key as well as in all the tables where the SSN is now used as a foreign key.

Apart from having to change the same value multiple times (which takes up precious time), the database will have to check the uniqueness of the new value. So, changing a column that should hold unique values is more expensive than changing an "ordinary" column. In Azure, expensive translates into both time (it will take longer, which is bad for performance, but also potentially for concurrency) and money (you may need to assign more compute power to your recourses).

Whenever a value is meaningless, it is also meaningless to change the meaningless value into another meaningless value (I love this sentence). Using surrogate keys avoids these problems.

Surrogate key advantage 3 – Better performance

The most important argument for using surrogate keys is the overall performance of the database. A database with perfect data but slow performance will certainly be a failure when used in everyday applications. A database needs to perform well, especially when it is an operational database, such as, for instance, the database of a webshop. Apart from the performance of individual queries, we also need to worry about scalability. Will the database still perform to an acceptable level when, in a year or so from now, we have a lot more data in it and/or we have many more concurrent users? Bad performance will always hurt scalability as well.

When using a surrogate key, the database developer gets to choose the data type of the column used as the primary key. Computers work fastest with numbers. Numerical values are small in memory and easily stored as binary values. Business keys are often bigger in terms of how many bytes they need, which makes surrogate keys more efficient.

This argument is important on three different levels:

  • Performance when loading data or entering new data
  • Performance when joining data
  • The actual size of tables

Performance when loading data or entering new data

For each new row that is entered in the database, the key needs to be checked in terms of uniqueness. That check is faster (takes less time and compute) with a small efficient key. This means that the maximum possible insert rate on adding data to a table increases. Besides this, entering new rows will be faster when the key values are stored in ascending order and new rows have higher key values than already existing rows. This happens to be exactly the case with surrogate keys when you let the database generate values to be used as primary keys.

Note

Loading new data in a table is faster when using a small primary key with always increasing values.

Performance when joining data

As already mentioned, a primary key is used in other tables as a foreign key. Very often, you will have to combine tables into a single result set when using the data. You might, for instance, need to combine name and address information from an Employee table with salary payment information to create an annual statement. You need to combine both tables using the keys. The database will compare all the values in one table with all the values in the other table and combine those rows with matching values.

Comparing two values is not a problem for a database. However, if your tables have millions of rows, this may lead to two values being compared millions of times. And when multiple users concurrently query the database using joins, this becomes even more problematic. When your key is 128 bits in size instead of 64 bits, these comparisons take twice as many CPU cycles. That either means that the performance is going down, or it means that you have to spend more money on a service tier that provides you with more CPU capacity. With business keys, you are dependent on the size of the actual data, while with surrogate keys, you always choose integer data types that can fit within 64 bits. Multiples of 64 bits are important since we work with 64-bit computers.

If you work with relatively small databases with only a couple of users, making "bad" choices will most likely not harm you (too much). Making proper choices at design time is no more difficult or expensive than not caring. Changing an already operational database that already stores data is always expensive. It might mean that you have to take the database offline for some time. It may even mean that some code might break. And databases tend to grow. Make sure your database will scale with that growth. A lot of minor performance enhancements may have a huge impact when combined.

The actual size of tables

Some tables store millions, or even billions, of rows. Suppose a table of that size holds a foreign key referencing another table. A supermarket, for instance, might have a table with a row for each individual product scanned at the register. Each row contains information about the product being scanned, the price of the product, the date and time of the transaction, and the person operating the register. How many new rows would you get if this is a supermarket that operates nationwide?

In this example, you could store the product name of the product being sold in each row or just a product number. So, you store something like "Peanut butter," or just the number 1, assuming that in the Product table, the product with the name "Peanut butter" has 1 as its product number. The name in this case is 13 characters long. That would normally take 26 bytes to store in SQL Server. If you choose int to be the data type for the ProductNumber column, you need only 4 bytes to store the product numbers. When you use ProductNumber as the key instead of using ProductName, you would save 22 bytes per row. With a billion rows, that equates to 22 billion bytes or 22 GB.

Inefficient keys make databases (much) larger than they need be. In Azure, extra storage immediately leads to additional costs. It also means that your database server will most likely need more memory. You could run into memory-related performance issues a lot sooner. And like I said, changing the database later is far more difficult and costly than doing it right the first time.

With Data Vault as an exception, most people tend to agree that using surrogate keys is far better than using business keys. One downside of using surrogate keys is that you add extra meaningless data to your database. This extra column makes the table a bit larger and a bit less efficient. However, as explained above, the effect on the tables, where the key acts as a foreign key, is much greater, far outweighing this argument.

A surrogate key has one real disadvantage. Each time you enter a new row, the database automatically generates a new unique value. However, that means that there is no longer any real check on whether your actual row values are still unique. All columns except the surrogate key could have the same value, which essentially means you now have duplicate rows. Technically, you have unique rows; from a business perspective, you have duplicates.

All arguments considered, using surrogate keys is better than using business keys in almost all scenarios.

Surrogate keys are a better choice than using business keys for most tables in most databases. Data Vault-modeled databases are the exception to this rule.

Integrity

Now that we know how to choose primary keys, let's look at what foreign keys are for. They provide integrity for our data. Good data is critical to businesses, so let's look at how foreign keys can help to improve data quality.

Primary keys and foreign keys make working with data in databases straightforward when the data is stored in more than one table. By using primary keys, we ensure that we can read, change, or delete individual rows from tables. You don't necessarily need primary keys to perform these kinds of operations on data, but as explained in the section regarding the uniqueness of rows, you need the guarantee of being able to reference a unique row by a characteristic because SQL operates on sets of rows that have the same characteristic. SQL is a set-based language.

Without foreign keys, you wouldn't be able to merge data that is related but stored in different tables in a meaningful way. But when you know which column acts as a foreign key, you can always combine the data from separate tables in a single result set. All we require is that the person writing SQL queries needs to know the primary and foreign keys. It is not necessary to add that information to the database's metadata, although that is certainly an option. Have a look at the following code snippet:

CREATE TABLE dbo.Invoice
(
     InvoiceNr        int not null
     , InvoiceDate    date not null
     , CustomerID     int not null
     , SalesAmount    money not null
);
ALTER TABLE dbo.Invoice
ADD CONSTRAINT PK_Invoice PRIMARY KEY (InvoiceNr);
ALTER TABLE dbo.Invoice
ADD CONSTRAINT FK_Invoice_Customer FOREIGN KEY (CustomerID)
     REFERENCES dbo.Customer (CustomerID); 

Lines 1 to 7 create a new table called dbo.Invoice. This table has four columns: InvoiceNr, InvoiceDate, CustomerID, and SalesAmount. For each column, you can see the data type of the column – int, date, int, and money, respectively. The addition not null means that a column should have a value. It cannot be left empty when adding a new row. This part of the code would create a table without a primary key and without foreign key(s) when executed. Creating a table is perfectly OK in SQL Server; however, it is not a best practice.

Lines 9 and 10 use the ALTER TABLE statement to add a primary key to the dbo.Invoice table. This states that the InvoiceNr column should, from now on, be used as the primary key of this table. When there is already data in the table, the ALTER TABLE statement will fail if the actual values being used so far are not unique. Once the statement executes successfully, the table has a primary key, and the database will start enforcing the uniqueness of values within the InvoiceNr column. Uniqueness is now guaranteed.

Lines 12 to 14 of the code in the code snippet create a foreign key that references the dbo.Customer table using the CustomerID column in both tables. When executing these lines, the database will check for each value of CustomerID in the dbo.Invoice table, irrespective of whether a matching value exists in the dbo.Customer table. The code fails if the check fails. Failing the check would mean that you have invoices of non-existing customers.

The real value of a foreign key is referential integrity. Referential integrity is the guarantee that all values in the foreign key reference existing values in the primary key. Upon entering a new row in the dbo.Invoice table, the database checks whether the CustomerID value entered exists in the dbo.Customer table. The same check is performed when you change the CustomerID value of an existing invoice. Whenever a customer is deleted from the database, the database will first check whether that results in invoices without customers. Whenever a check fails, the operation is not performed by the database. An error is thrown instead.

Referential integrity

Referential integrity is the guarantee that references to other tables using foreign keys always reference existing rows.

You could change the default behavior of throwing errors. You can add extra properties to a foreign key:

  • ON DELETE CASCADE | NULL | DEFAULT
  • ON UPDATE CASCADE | NULL | DEFAULT

With the addition of an ON DELETE clause to a foreign key, you specify either CASCADE, NULL, or DEFAULT. Cascade means that, in the example of the code snippet, every time you delete a customer, all their invoices are automatically deleted as well. That, of course, guarantees that you do not have invoices of non-existing customers following deletion, but you also lose the invoice information. That doesn't seem a good strategy in this particular example. Be careful with using the CASCADE option.

Specifying NULL instead of CASCADE means that the CustomerID column in the dbo.Invoice table will be made empty (NULL in database terms) instead of deleting the invoice. You retain the invoice, but you lose the information regarding which customer the invoice belonged to. In our example, this option is not possible because our CustomerID column is defined as NOT NULL. It cannot be empty.

The DEFAULT option is only possible when the CustomerID column has a so-called DEFAULT constraint defined on it. This means that the column will automatically get a value when you enter a new row without specifying a value for the column explicitly. The DEFAULT option specified on the foreign key will assign the default value to the CustomerID column in the dbo.Invoices table when you delete a customer.

The same options apply to the ON UPDATE clause that you can add to a foreign key. They apply when you change the value in the CustomerID column in the dbo.Customer table.

Both options have little to no value when using surrogate keys. Changing surrogate keys is meaningless, which means you do not need the ON UPDATE clause. And we almost never delete data from databases.

The real value of foreign keys is the referential integrity they provide. Through foreign keys, you get more consistency in your data. When your data is incorrect or inconsistent, the information you derive from it becomes incorrect. And people base decisions on the information derived from the data. Good data is critical to businesses.

A point to note is that extra checks in the database always cost some extra time and processing power. You have to balance increased data quality against (a bit of) performance loss. Most of the time, quality should win over the other arguments. But if you need the highest throughput and scalability you can get, you may choose to leave out some quality checks to achieve this goal. As a database designer, you decide where the priorities lie. You always adjust your design to the intended use and you always have to prioritize different options that you need to choose from.

Recap

From all the available candidate keys in a table, you choose one to become the primary key. When using normalization or dimensional modeling techniques, the primary key is preferably a surrogate key. The database checks the uniqueness of entire rows in a table only in terms of the uniqueness of the values of the primary key.

Foreign keys provide you with a means to combine different but related tables. This is crucial in working with relational databases. Similar techniques are used when designing NoSQL databases as well but without tables. No matter which modeling technique you use, a relational database will always consist of multiple tables that are related using foreign keys.

The Check and Unique constraints

Both the primary key and the foreign key are called constraints, as you can see in the code snippet of the previous section. They limit the values that you can store in a column. For a primary key, the limitation (constraint) is that you cannot store a value that is already used in another row. For a foreign key, the constraint is that you can only use values that exist in the related table. There are two more constraints that we should mention.

The Unique constraint is really comparable to the primary key. A Unique constraint is applied to a column and ensures that the column can only store a value once. That allows you to use a surrogate key as a primary key and, at the same time, have the database check business keys for uniqueness.

With a Check constraint, you can limit the values a column can store by applying extra (but simple) rules. For instance, a Price column may have a money data type, making it a numerical column with precision to four decimal places. With a Check constraint, you can further specify that only positive values are allowed. Negative product prices do not make sense.

Look at the following code snippet:

ALTER TABLE dbo.Invoice
ADD CONSTRAINT CK_SalesAmount CHECK (SalesAmount > 0);

Once the code snippet here is executed, the SalesAmount column will no longer accept negative values.

Note

Constraints help to improve data quality in databases.

Now that we have learned about relational theory and how keys and constraints help us to improve the performance of the database and improve the quality of the data it contains, it is time to look at how we use the database. A different use case warrants a different design.