Data Modeling for Azure Data Services

By Peter ter Braake
About this book
Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design becomes paramount to successful implementation. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse and explore dimensional modeling, data vault modeling, along with designing and implementing a Data Lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services are the best fit for your model and how to implement the best design for your solution.
Publication date: July 2021
Publisher: Packt
Pages: 428
ISBN: 9781801077347

 

Chapter 1: Introduction to Databases

Data has become increasingly important over the last few years. Almost all applications use data, whether the application is a Customer Relationship Management (CRM) system at work or a social media app on your phone. All that data is stored in databases. Since the 1980s, almost all those databases have been relational databases. Nowadays, with the advent of big data, there are different ways to store and process huge amounts of data. Some of these can be classified as so-called NoSQL databases. NoSQL stands for "not only SQL", meaning that other types of databases are emerging and being used alongside relational databases. NoSQL databases are important in the area of big data. The "SQL" in NoSQL stands for Structured Query Language, the programming language of relational databases, which has become almost synonymous with them.

In this chapter, you will learn the basics of databases. A lot of the theory discussed in this chapter stems from relational databases, although the majority is applicable to other database systems as well.

We will discuss the following topics in this chapter:

  • Overview of relational databases
  • Introduction to Structured Query Language
  • Impact of intended usage patterns on database design
  • Understanding relational theory
  • Keys
  • Types of workload
 

Overview of relational databases

Databases hadn't yet been invented when we first started programming computer applications. All data had to be stored in files. Oftentimes, those files were simple comma-separated value files (CSV files). An example of a CSV file can be seen in the following screenshot:

Figure 1.1 – Person.csv

As you can see, it is just some data without anything else.

Files

Using files to store data for use in applications entailed a number of issues. After trying file formats other than CSV files, developers started using databases instead of plain files. Plain files or flat files are files with just data stored in them. Figure 1.1 is an example of a flat file. Let's look into the issues that using flat files posed.

From the header of the screenshot in Figure 1.1, it is clear that the file is called Person.csv. We may infer that the data in the file represents persons. However, it is not clear whether those people are patients, customers, employees, or someone else entirely. Nor can that extra information be derived from the file or its contents.

Drawbacks

The use of these types of flat files to store data comes with three drawbacks:

  • You cannot infer from the file itself what the data is about.
  • It is not flexible from a programming perspective and is bad for performance when working with the data.
  • It is (almost) impossible for multiple persons to work with flat files simultaneously.

We will now examine each of these drawbacks in turn.

Drawback 1 – You cannot infer from the file itself what the data is about

It is clear from looking at the screenshot that each line has two commas, meaning that there are three columns per row. The second column very likely holds a first name. This is a reasonable assumption based on our knowledge of names, although you may need some familiarity with Dutch names to make it. The third column is more difficult to guess. It could be the year of birth of the person in question, but it could also be a postal code or perhaps a monthly salary.

The file only stores the actual data and not the metadata. You may be able to guess what the values mean, but you cannot infer it from the file itself. Metadata is the data describing the data. Column names are an example of metadata. Column names such as PatientID, Patient_FirstName, and PostalCode would already make the file more readable. That is why we often add those column names as a first row in flat files.

Note

Metadata is data that describes the "actual" data.

There is even more to learn about this data. You cannot perform calculations with postal codes, such as adding up two postal codes (technically you could, but it wouldn't make any sense). A postal code is an alphanumeric code that you cannot perform computations with. If the last column in Figure 1.1 were a salary rather than a postal code, you would want (and need) to be able to perform calculations on that column, for instance, to calculate an annual salary from the monthly salaries. In that case, the column would be numerical. In other words, it would be beneficial to know a column's data type. Generally speaking, data can be numerical, alphanumerical (text), or dates. Nowadays, of course, there are many variations, such as binary data for pictures.

With data stored in flat files, the data itself and the metadata are stored separately.

Today, we have overcome some of these issues by not using flat files but storing data as XML or as JSON files. Both file types allow you to store metadata with the actual data in the file itself. In the (recent) past, this was too expensive to do. Only recently has storage become cheap enough and compute power plentiful enough to work with text files by storing data and metadata in the way that JSON does.

Drawback 2 – It is not flexible from a programming perspective and is bad for performance when working with the data

It gets nastier when we start using (old-fashioned) program code to work with the data. Suppose you need to know the postal code of the person called Janneke. Your code would now look something like this:

  1. Read a line.
  2. Read the second column.
  3. If the value you read equals Janneke, then return the third column.
  4. Repeat lines 1 to 3 until there are no more lines in the file.

With only two lines in the file, this is pretty fast. This code becomes problematic, however, when a file contains a great many rows: it will become really slow.

It gets even worse when someone changes the file structure. Suppose we add a new column, storing the patient's family name between the second and third columns. The code we just wrote will break because it assumes that the postal code is the third column. However, following the change, it is the fourth column. Retrieving the postal code should be independent of which column it actually is.

Drawback 3 – It is (almost) impossible for multiple persons to work with flat files simultaneously

In most applications, there will be multiple users working with the data simultaneously. What if your webshop could only have one visitor at a time? A database should make it easy for multiple people or processes to work with the same data at the same time. In the case of relational databases, this is part of their core. A relational database has what are known as the ACID properties to cater to multi-user workloads. You will learn more about the ACID properties in Chapter 5, Designing a NoSQL Database. Without a database system, whether relational or not, it would not be impossible for multiple users to work with the same data, but you will get consistency issues if you don't implement complex logic to prevent them.

If you always process all the data in a flat file as a whole, and you do that, for instance, during the night, flat files are fine to work with, as we will see in Chapter 10, Designing and Implementing a Data Lake Using Azure Storage. However, if you need to work with individual pieces of information from within a flat file in real time, you will not be able to do that in an acceptable manner.

At first, smart workarounds were invented to make working with flat files easier and more efficient, resulting in file formats such as ISAM and VSAM files. It is beyond the scope of this book to go into these file types. More interesting for us is the fact that the problems described in this section led to the introduction of database management systems (DBMSes).

Relational databases

A database is a self-describing collection of related data with the aim of providing information to people and applications.

The first databases appeared in the 1960s. These were hierarchical databases. A little later, network databases were introduced, but neither type offered the flexibility to work with (large amounts of) data in more complex organizations with multiple users.

In the early 1970s, E.F. Codd, an English mathematician working for IBM, came up with a theory of how to create relational databases. He based his theory on mathematical set theory, which describes potentially very large sets of elements with a few simple rules (covered later in this chapter). Codd realized that set theory could be applied not only to something abstract, such as all even numbers, but also to real-life collections, such as all our customers. This made set-based theory usable for the data we work with and need to store in databases.

The name relational database stems from the fact that data is stored in tables. For example, take a set of numbers {1, 2, 3, 4}. Then, imagine a second set, for instance, a set of names {Peter, Janneke, Jari, Mats}. We could combine these two sets in a table, as shown in Figure 1.2:

Figure 1.2 – A table of patients

We started with independent sets of values. We created a relation between the two sets by combining them into a table. Making a table with rows is like saying the values 1 and Peter belong together, just as 2 and Janneke do. This makes the sets no longer independent. A relationship now exists between the values in one column and the values in another column. In other words, the table is the relation between the PatientID set and the PatientName set. Relation here is another word for Table.

Note

Relational databases store data in tables.

We see something more in Figure 1.2. The first column is called PatientID. The column header also specifies that the data type of the values in this column is int. This means that this column can only store whole numbers. The second column is of the nvarchar(50) type, specifying that it stores alphanumeric values (text) with a maximum length of 50 characters. This metadata is part of the table itself. The data and the metadata are now a whole instead of separately stored pieces of information.

Note

In a database, data and metadata are combined in a single structure.
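As a minimal sketch of what Figure 1.2 could look like as a table definition (the T-SQL below is illustrative; the table and schema names are assumptions), the column names and data types are declared as part of the table itself:

-- The column definitions (names and data types) are the metadata;
-- they are stored as part of the table, together with the data
CREATE TABLE dbo.Patient
(
    PatientID       int not null
    , PatientName   nvarchar(50) not null
);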

A relational database is normally more than just one table. In real life, a relational database can consist of thousands of tables. According to the definition of a database, it is a collection of related data. This means that the tables have relationships with one another. Since relationship sounds a lot more like relational than table does, a lot of people came to believe that a relational database got its name from related tables. However, as stated previously, storing data in tables is what makes a database relational.

Relational Database Management System

A Relational Database Management System (RDBMS) is a piece of software that allows you to create and manage databases that adhere to Codd's theory. In other words, an RDBMS is an application that allows you to create tables and then store data in those tables. The "management" part covers all the extra services you get from an RDBMS, such as securing your data so that only authorized people can work with it. An RDBMS allows you to work with data, from creating tables to storing and managing the data and its accessibility. Examples of well-known RDBMSes include Microsoft SQL Server, Oracle, IBM Db2, MySQL, and MariaDB.

Note

An RDBMS is a database product that follows Codd's rules of the relational model, allowing you to work with and manage all your data.

We previously referred to a couple of problems that we encountered in the past when using CSV files. One was the lack of metadata. Relational databases rectify that problem. The question that remains to be answered is how relational databases offer the flexibility and performance needed that CSV files couldn't offer. To do that, we first need to introduce the SQL language.

 

Introduction to Structured Query Language

Structured Query Language (SQL) is the language of all relational databases. You use SQL to read data from the database and to manipulate existing data. Creating a database or creating tables within the database and securing your data is also done using SQL. Database developers might use a tool to graphically create databases or tables, so you don't need to write SQL code yourself. The tool will generate the SQL code for you because that is all the actual database engine understands.

Different categories of SQL

SQL consists of three main categories:

  • DCL – Data Control Language
  • DDL – Data Definition Language
  • DML – Data Manipulation Language

Data Control Language

DCL is the part of SQL that helps to secure the data. DCL comprises three statements:

  • GRANT
  • DENY
  • REVOKE

Even though securing data is very important in any data solution that you build, DCL is outside the scope of this book. You should check out the TutorialRide website to learn more about this: https://www.tutorialride.com/dbms/sql-data-control-language-dcl.htm.

Data Definition Language

With DDL statements, you create databases themselves and objects within databases. As with DCL, there are three statements:

  • CREATE
  • ALTER
  • DROP

With CREATE TABLE, you can make (create) a new table. Whenever the table structure needs to change, for instance, when you want to add a column to an existing table, you use ALTER TABLE. With DROP TABLE, you completely remove a table and all its content. You will learn about these statements in Chapter 4, Provisioning and Implementing an Azure SQL DB.
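A minimal sketch of the other two DDL statements, reusing the hypothetical dbo.Patient table from the previous section:

-- Change the structure of an existing table (ALTER TABLE):
-- add a column to store the postal code
ALTER TABLE dbo.Patient
ADD PostalCode nvarchar(10) null;

-- Completely remove the table and all its content (DROP TABLE)
DROP TABLE dbo.Patient;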

Data Manipulation Language

DML is the part of SQL that is used for working with the actual data stored in the tables in the database. DML has four statements:

  • SELECT
  • INSERT
  • UPDATE
  • DELETE

The SELECT statement lets you read data from tables. Some people believe that SELECT should be a category of its own: DQL or Data Query Language. The other three statements are self-explanatory: INSERT adds new rows to an existing table, UPDATE changes the values of already existing rows in a table, and DELETE removes rows from a table.
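A minimal sketch of the modifying statements, assuming a hypothetical Persons table with Name and PostalCode columns (the same table that the SELECT example later in this chapter queries); the values are invented for illustration:

-- Add a new row (INSERT)
INSERT INTO Persons (Name, PostalCode)
VALUES ('Jari', '3511 AB');

-- Change an existing row (UPDATE)
UPDATE Persons
SET PostalCode = '3512 CD'
WHERE Name = 'Jari';

-- Remove the row again (DELETE)
DELETE FROM Persons
WHERE Name = 'Jari';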

This book is not about SQL. There are a lot of tutorials on SQL on the internet. If you need to familiarize yourself with T-SQL (the dialect of SQL Server that we will use throughout this book), I strongly recommend the books of Itzik Ben-Gan. SQL is also used a lot with NoSQL databases and remains a fundamental skill for every data professional.

Understanding the database schema

With relational databases, the first step is to create a table using CREATE TABLE. While creating the table, you specify the name of the new table. You also specify all the columns of that table by adding a column name and the data type of the column. This means you start by creating the metadata. All the metadata combined is referred to as the database schema.

Often, when people use the term schema, they merely mean the table structure: the tables and their relationships. In SQL Server, there is even an object called a schema that helps in establishing a good security strategy for your database.

Note

The schema of a database refers to all tables and their relationships.

Once you have created tables, you can start loading data into the table using the INSERT statement. With each row you enter, you provide values for all the columns in the table. The values you enter should follow the rules defined in the table definition. With what we have learned so far, this means that the values should be of the correct data type. You cannot enter text in a numerical column. You will see shortly that there are further constraints. The process of creating a table first and then entering data means that whatever data we add (for example, INSERT) has to adhere to that structure.

After you have entered data into the database, you can start working with the data. With the SELECT statement, you can read data from the database. The database uses the existing metadata while retrieving data from the database. For instance, you could write the following SELECT statement:

SELECT
     PostalCode
FROM
     Persons
WHERE
     Name = 'Janneke';

Notice that this statement is the same example as described in the section about files. In the preceding snippet, we read the PostalCode column from the Persons table for the row that has 'Janneke' as the value in the Name column. The database uses the metadata, in this case, the table name and column names, to retrieve the data and, where possible, to optimize the query. By using metadata, it doesn't matter whether the PostalCode field is the second or the third column. Using metadata makes querying the data more flexible.

In addition to the flexibility we gained by using a table over a flat file, there is no step such as repeat this for each row, as we saw in the section on files. A table in a relational database implicitly works with all rows. That is called working set-based. In relational databases, you don't work with individual rows but always with sets. Depending on the filters you provide (the WHERE clause of the SQL statement), your set might contain just one row or it might contain multiple rows, or it can even be an empty set with no rows at all. The database engine can optimize the query for fast query response times. The database engine uses its knowledge of the metadata. It can also take advantage of extra structures that you may define in the database, such as indexes. You will learn about indexes in Chapter 4, Provisioning and Implementing an Azure SQL DB.
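As a small illustration of this set-based behavior (again using the hypothetical Persons table), a single statement works on every row that satisfies the filter, whether that is zero, one, or millions of rows:

-- One statement, no loop: every person named 'Janneke' is updated at once.
-- If no row matches the filter, the statement simply affects zero rows.
UPDATE Persons
SET PostalCode = '1017 XS'
WHERE Name = 'Janneke';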

Note

Database systems take advantage of all the metadata defined in the schema to optimize queries in order to obtain good query performance even in the case of large datasets.

Now that we have learned why (relational) databases work better for storing data than CSV files, let's look at the different use cases of databases and the impact this has on database design.

 

Impact of intended usage patterns on database design

The simplest database consists of just one table. However, as mentioned earlier, most (if not all) databases have multiple tables, up to tens of thousands of tables, and sometimes even more. The most important step in creating a new database is to decide which tables you need in order to store your data efficiently for its intended use and what relationships those tables have with one another. Data modeling is the process of deciding on that table structure and thereby deciding on the schema.

Note

Data modeling is the process of deciding which data structures to use for storing data. With regard to relational databases, this translates into deciding on the table structure (schema) to use.

There are different concepts for modeling data structures. We can, for example, normalize a database, a technique invented by E.F. Codd, who translated mathematical set theory into formal steps that lead to an optimal schema. Normalizing a database means following these formal steps. It is often said that a database is in the third normal form, meaning the steps have been applied to the data up to and including the third step. You will learn how to normalize a database in Chapter 3, Normalizing Data.

In Chapter 7, Dimensional Modeling, and Chapter 9, Data Vault Modeling, you will learn about alternative data modeling techniques. These alternative techniques lead to different database schemas. Query performance and query complexity depend heavily on the schema you choose. In Azure, this may translate directly into how expensive your data solution is going to be. If you need more processing power to execute a query, you might have to provision a higher performance tier of the service you use, which will cost more money. With a different intended use, meaning you use different queries, a different schema may be beneficial.

A database supporting a webshop should be able to quickly retrieve all information about a specific product. In addition, it should be easy and quick to create a new shopping basket as well as new invoices. This type of database is best designed using the technique of normalization. A data warehouse should be able to quickly retrieve vast amounts of data and aggregate that data. A report based on a data warehouse should, for instance, show all the sales data of an entire product group over the last 12 months. Where a webshop often retrieves single rows from the database, a report often retrieves millions of rows. A dimensionally modeled database will be better at retrieving millions of rows than a normalized database.

In Chapter 5, Designing a NoSQL Database, we will look at NoSQL databases. NoSQL databases have their own rules depending on the type of database you use.

Whatever database you use, from a modeling perspective the similarities will be greater than the differences. This is especially true for all types of relational databases. A database is always a collection of data, a set of data, that can be described by set theory. The next section will describe how the theory behind databases has its foundations in mathematical set theory.

 

Understanding relational theory

Let's look in a bit more depth at relational databases. They form a big part of everyday processes that work with data, and they constitute the majority of this book. What you learn about relational theory will also help when using non-relational databases.

A relational database stores data in tables according to Codd's rules of the relational model.

Relational theory has two important pillars:

  • Elements of a set are not ordered.
  • All elements in a set are unique.

Pillar 1 – Elements of a set are not ordered

The first pillar from set-based theory is both neat and, on occasion, a bit troublesome.

I once had to visit a customer of the company I worked for because he had an issue. Whenever he added a new order to his database manually, he checked to see whether everything was OK by opening the Orders table and scrolling down to the last row. Whenever he failed to find his new order, he would enter it again. At the end of the month, he always had too many orders in his database!

In this example, my customer assumed that the row entered last would automatically be the last row in the table, or the row shown last on his screen. He assumed that opening a table in his application would retrieve the rows (orders) in the same order in which they were entered. Both assumptions are incorrect (although, in theory, the latter depends on the application in which you open a table and not on the database).

A database doesn't store rows in any specific order. This provides both flexibility and write performance. However, a database administrator or developer could create indexes on tables. By using indexes, we can speed up data retrieval (query) times, in the same way an index in a book makes it quicker to find a specific topic. Certain types of indexes store the data ordered, giving the database the option to use the sorting order to quickly find rows that we are looking for. That index might be based on the customer's last name and not on the date the order was inserted in the database. Furthermore, a Database Administrator (DBA) might decide to replace the index with another one. In that case, the data is stored in a different order than it was previously. You will learn about indexes in the various chapters that zoom in on specific Azure services, such as Azure SQL Database.

Without indexes, rows are stored without a specific sorting order. With indexes, that might change depending on the index you use. But if you query the database, it is not guaranteed that the database will actually use the index. When writing a query, you can explicitly specify the sorting orders for all the rows you want returned. That sorting order is now guaranteed. If you don't ask for a specific sorting order, you will get the rows back in a random order. That order is determined by whether the database decided to use an index and additional factors, such as fragmentation in your database.

When using a relational database, the lack of order is sometimes confusing. People almost always think with some sort of ordering in mind, such as the last, best, and biggest. As long as you remember to explicitly specify a sorting order when querying a database, you will be fine. If you don't need your rows to be in a specific order, don't request it. The sorting of (large) datasets is always expensive.

Tip

When you require rows to be in a specific order, you need to specify that sorting order explicitly in your query.
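A brief sketch of such a query, assuming the patient table from Figure 1.2 is called Patients (the table name is an assumption):

-- Without the ORDER BY clause, the order of the returned rows is not guaranteed
SELECT
     PatientID
     , PatientName
FROM
     Patients
ORDER BY
     PatientName;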

Pillar 2 – All elements in a set are unique

In mathematics, a set is a collection of unique elements. For example, the set of odd numbers consists of all odd numbers. The number 3 is part of that set. It does not appear twice in the set. Each odd number is in the set only once. This is called the axiom of unicity.

When applying mathematical set theory to real-life databases, unicity comes in very handy. A table storing customers is the set of all customers. In a business-to-business scenario, a business either is a customer or it is not. When it is a customer, we store its details in the Customer table; otherwise, we don't. We do not store every potential customer, and we preferably do not store the same customer twice.

Unicity demands that each element of a set is unique.

Suppose a company is stored twice in your database. That will lead to a myriad of problems. To start with, simply counting the number of rows to determine how many customers you have will lead to the wrong result. A simple report showing sales by customer will also no longer be correct, nor will average sales by customer. The list goes on. You want customers to be unique within your database. As a different example, have a look at Figure 1.3. Can you tell how many patients live in Amsterdam?

Figure 1.3 – Patients with duplicate rows

Apart from functional reasons for keeping rows unique in a table, there are technical reasons as well. Suppose you have a table like the one in Figure 1.3. You decide that the first two rows describe the same patient, so you want to remove one of the rows from the table. You cannot do that with SQL. You can delete the row(s) where PatientID equals 1. The database will delete both rows. However, removing Peter or removing rows with Amsterdam also leads to both rows being removed. To delete a single row, you must be able to uniquely identify that row. In the example shown, there is no characteristic (column) separating row 1 from row 2. You cannot delete one without deleting the other at the same time.
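A minimal sketch of that situation, assuming the table in Figure 1.3 is called Patients:

-- The first two rows are identical in every column, so any filter that
-- matches one of them also matches the other: both rows are deleted
DELETE FROM Patients
WHERE PatientID = 1;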

Some RDBMSes keep track of rows by adding a unique identifier under the covers. In that case, you can retrieve that value and use it to delete a single row. Microsoft SQL Server is not one of those systems. You need to make sure rows are unique (or live with the consequences).

To make sure each row in a table is unique, you (the database administrator) should check for each newly inserted row that it does not already exist. The simplest way to do that is by comparing the new row column by column to see whether you already have a row with identical values in each column. This is easy enough, but really time-consuming. With 10 million rows in a table and 40 columns, this means you need to perform 400 million comparisons. You need to do that each time you enter a new row or change an existing row. You could use more optimized search algorithms, but they will still not make this a fast operation.

Note

To enforce the uniqueness of rows, we use keys. A key is a column, or a couple of columns together, in which we store unique values.

A key is a column, or a combination of columns, that stores unique values by the nature of the data stored in the column. For instance, each car has a unique license plate. When you create a table that stores all the vehicles within your company, each vehicle has its own unique license plate. You can identify a specific car by its license plate, independent of the values of other columns within the table.

The context in which we store data can be important when it comes to columns storing unique values. Consider a table where we store a row of information each time a driver has damage to their car. Some people will have damage more often than others. Each time we store the details about the damage, we keep track of which vehicle has damage by storing the license plate. In the Vehicle table, the license plate will be unique, while in the Damage table, it will not be unique. You can see this in Figure 1.4. The Vehicle table holds one row per car. Each row describes a unique car with its unique license plate. The Damage table shows the claims of damage done to the car. Some cars have never had any claims made in relation to them, while others can be found in the table multiple times. This means that the same license plate is no longer unique:

Figure 1.4 – Vehicles and damage

Let's look closer now into how defining keys helps with keeping rows unique in your database.

 

Keys

Databases do not check rows for uniqueness automatically. Checking rows column by column is too expensive. The database developer needs to define a key. Once you define a key, the database will use it to check for uniqueness. You define keys per table: decide for each table whether or not you require (want) a key, and then decide which column to use as the key.

Types of keys

There are different types of keys. Let's have a look at some definitions for keys.

Candidate keys

Candidate keys are all columns within a table that logically store unique values. A table may have zero, one, or more columns that, by their nature, should store unique values in the context of the table they are in. Each column that adheres to the definition of a key is a candidate key. That makes the license plate in the Vehicle table of Figure 1.4 a candidate key. Other examples of candidate keys are a Social Security Number (SSN) in a Persons table or an ISBN in a Book table. An SSN or an ISBN uniquely identifies a person or a book, just like a license plate uniquely identifies a vehicle.

A key can be made up of several columns. In the Netherlands, the combination of postal code and house number is unique for an address. Websites can use this by automatically filling in a street name and city after you enter just your postal code and house number.
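As a hedged illustration (the table and column names here are invented for the example), a key made up of several columns can be declared by listing all of its columns in the key definition:

-- No single column is unique on its own, but the combination of
-- postal code and house number identifies exactly one address
CREATE TABLE dbo.Address
(
    PostalCode      nvarchar(7) not null
    , HouseNumber   int not null
    , Street        nvarchar(100) not null
    , City          nvarchar(100) not null
    , CONSTRAINT PK_Address PRIMARY KEY (PostalCode, HouseNumber)
);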

We can distinguish between two types of candidate keys – logical, or business, keys and technical, or surrogate, keys.

Logical keys

A logical key or business key is a key that stores values (information) with real meaning that you need to store, regardless of whether it can function as a key. No matter how you design your database, irrespective of whether you plan to use keys, the business key will be part of the data you are going to store. For example, all companies are obliged to store the SSN of their employees. Since the SSN column is part of your table, you can use it as a key.

In Data Vault theory (Chapter 9, Data Vault Modeling), business keys play a central role. Knowing your data (and the processes that use the data) becomes crucial. The business key is the characteristic that people use in their everyday lives to refer to objects. A course about data modeling is referred to (searched for and found) by its name. The name is the business key. Names can be unique, but very often are not. That makes them dangerous to use as keys. Sometimes, however, they work OK.

Technical/surrogate keys

In some cases, there is not a single candidate key to be found. Adding an extra column and using unique values for it then becomes an option. That way, you are adding (creating) a candidate key. This is also an option when you do have candidate keys to start with. This type of key is called a technical key or surrogate key. Its contents are (normally) meaningless but unique.

Most database systems have features to easily create surrogate keys. For instance, a sequence is an object in the database that allows you to generate unique numbers. Each time you ask the sequence for a value, it provides you with one that is greater than the last time you asked. Each value is guaranteed to be used a maximum of one time. With a sequence, you can create efficient surrogate keys. Within SQL Server, you can create a column in a table with an extra property, identity. An identity column in SQL Server gets an automatic unique value each time a row is inserted in that table. The result is the same as with a sequence. Identity is easier in terms of its use. Both methods provide you with a candidate key.
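A brief sketch of both options in T-SQL (the object names are illustrative):

-- A sequence hands out a new unique number each time it is asked
CREATE SEQUENCE dbo.PatientSeq AS int START WITH 1 INCREMENT BY 1;

SELECT NEXT VALUE FOR dbo.PatientSeq;  -- returns 1; the next call returns 2, and so on

-- An identity column generates the surrogate key automatically on every INSERT
CREATE TABLE dbo.PatientWithIdentity
(
    PatientID       int identity(1,1) not null
    , PatientName   nvarchar(50) not null
);

INSERT INTO dbo.PatientWithIdentity (PatientName)
VALUES ('Peter');  -- PatientID gets the value 1 automatically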

A good surrogate key is meaningless. That means that you cannot infer anything from its value. In France, an automobile's license plate shows you where the car comes from. The code has some intrinsic meaning. In the Netherlands, a license plate has no meaning other than the fact that it is a unique code. If you keep track of how license plates are generated, you will be able to tell how old a car is. But that is basic information from another table. You cannot tell where in the Netherlands the car was registered or what type of car it is from its license plate.

Meaningless values are more flexible in terms of their use. You don't get into any sort of trouble by misinterpreting key values. You also don't encounter issues if, for some reason, the meaning of the values changes over time.

A surrogate key should be meaningless (most of the time).

Everybody knows surrogate keys. Most of the products you buy have a product number. You have an account number for your bank account. You probably have an insurance policy number, a customer number, and so on. All are surrogate keys in a database to make rows unique in that particular database.

In databases, there are two kinds of keys that you will define that play a crucial role in the database: primary keys and foreign keys. Let's have a look at them, starting with the primary key.

Primary keys

All that has been said hitherto in relation to keys is an introduction to primary keys. Since checking the uniqueness of rows in tables is too expensive when you do it on a per-column basis, we choose to check uniqueness based on one column only. We choose this column and make it the primary key. As soon as you make a column the table's primary key, the database will start using this column to check uniqueness. As long as this column holds unique values, no value is stored twice or more. The rows are said to be unique.

The primary key is a candidate key chosen by the database developer to be used by the database for checking the uniqueness of entire rows.

An important choice when designing databases is the selection of primary keys. This choice impacts the performance of the database. Besides performance, it also has an impact on how well uniqueness is actually enforced, not from a technical perspective, but from a business perspective.

Foreign keys

Before we go into the details of choosing a proper primary key, we need to introduce the foreign key. As stated earlier, a database will most likely have multiple tables, and according to the definition of a database stated earlier, the data in those tables is related. This means that different tables will have relationships with each other. Such a relationship is created by storing the primary key of one table in another table, where it is called a foreign key. In that other table it is not a key; it references the table in which the same column is the primary key. Because each value is unique in the table where the column is the primary key, each value in the foreign key references a specific, unique row in that table.

To make this clear, have a look at Figure 1.5. Two employees are being paid monthly salaries. For the employee Peter, there are three payment records, one for each month for which he received a salary, while Janneke has been paid a salary twice:

Figure 1.5 – Salary payment

Instead of storing everything in one table, we could store the same information in two separate tables as well. All the details about a person that have nothing to do with the salary they receive are stored in one table, while everything to do with actual payments is stored in a separate table. That leads to the design of Figure 1.6:

Figure 1.6 – Normalized salary payment

In Figure 1.6, the second table has a column called EmployeeID. This column does not store unique values because we enter a new row in this table for every month that a person receives a salary. For each month, we have to store the same value for EmployeeID. In the first table, however, it uniquely identifies each employee. Each value is used only once and can be a perfect fit for the primary key. In the second table, it is used as a reference to a row from the first table, making it a foreign key.

A foreign key is a column that references the primary key in another table, enabling us to combine rows from two tables. This enables us to divide our data over multiple tables without losing the relationships that exist between data elements. Splitting tables is the basis of normalization.

Choosing the primary key

The primary key plays a crucial role in databases. Logically, it is the characteristic we use to make each row unique within a table. It enables us to always select individual rows to work with. This means that we can work with unique customers, unique products, and so on. This is a hard necessity when you implement a webshop, for example. Consider a webshop that cannot distinguish one product from another, meaning they just send you a random product. Would you shop at that webshop?

As a database developer, you choose the primary key from all the candidate keys (including a surrogate key that you may add). For years, people debated whether a logical key or a surrogate key was the better choice. In Data Vault modeling, you preferably use business keys, which is a logical option. Almost all other modeling techniques prefer surrogate keys. Surrogate keys have a couple of advantages over business keys:

  • They provide better stability.
  • They never need to change.
  • They provide better performance.

Surrogate key advantage 1 – Better stability

Imagine you had to set up a Dutch database before the year 2007 and you needed to design an Employee table. Prior to 2007, the Netherlands used SOFI numbers, which are equivalent to SSNs. It was obligatory to store this information regarding your employees. Since every person in the Netherlands had a unique SOFI number, this is a candidate key and you could have used it as the primary key.

Each table that has a relationship with the Employee table, for example, the Salary payment table, uses the SOFI number to keep track of which payment belongs to which employee. The SOFI number is now used in two tables. In general, a primary key is very likely to be used in several other tables as a foreign key.

In 2007, the Dutch government decided to switch from SOFI numbers to BSN numbers. To make this change in your database, you need to change all columns storing SOFI numbers into columns storing BSN numbers. However, this is far more complex than changing "ordinary" columns. You need to change not only the column in the Employee table, but also the columns acting as foreign keys in other tables. This takes time, and you run the risk of something going wrong, making your data invalid or inconsistent. You do not want to end up in a situation like this, but you cannot stop the government from changing the rules.

With a surrogate key, you will never have this kind of problem. No government can force you to change your own meaningless values that you use as keys. Even new legislation such as the GDPR cannot hurt you because your key values are meaningless and, by themselves, can never provide you with any privacy-related information. If you use initials as primary keys, however, you will have problems. Initials may lead you to the actual person. There is meaning in them. Simple numerical values are the best choice.

Surrogate key advantage 2 – No need to change

Another type of stability is the fact that surrogate keys never need to change. You cannot guarantee the same for business keys. What if you use SSNs as primary keys and, after a year, you find out that you made a typo when entering a person's SSN? You will have to change the value to the correct SSN, both in the primary key and in all the tables where the SSN is used as a foreign key.

Apart from having to change the same value multiple times (which takes up precious time), the database will have to check the uniqueness of the new value. So, changing a column that should hold unique values is more expensive than changing an "ordinary" column. In Azure, expensive translates into both time (it will take longer, which is bad for performance, but also potentially for concurrency) and money (you may need to assign more compute power to your resources).

Whenever a value is meaningless, it is also meaningless to change the meaningless value into another meaningless value (I love this sentence). Using surrogate keys avoids these problems.

Surrogate key advantage 3 – Better performance

The most important argument for using surrogate keys is the overall performance of the database. A database with perfect data but slow performance will certainly be a failure when used in everyday applications. A database needs to perform well, especially when it is an operational database, such as, for instance, the database of a webshop. Apart from the performance of individual queries, we also need to worry about scalability. Will the database still perform to an acceptable level when, in a year or so from now, we have a lot more data in it and/or we have many more concurrent users? Bad performance will always hurt scalability as well.

When using a surrogate key, the database developer gets to choose the data type of the column used as the primary key. Computers work fastest with numbers. Numerical values are small in memory and easily stored as binary values. Business keys are often bigger in terms of how many bytes they need, which makes surrogate keys more efficient.

This argument is important on three different levels:

  • Performance when loading data or entering new data
  • Performance when joining data
  • The actual size of tables

Performance when loading data or entering new data

For each new row that is entered in the database, the key needs to be checked in terms of uniqueness. That check is faster (takes less time and compute) with a small efficient key. This means that the maximum possible insert rate on adding data to a table increases. Besides this, entering new rows will be faster when the key values are stored in ascending order and new rows have higher key values than already existing rows. This happens to be exactly the case with surrogate keys when you let the database generate values to be used as primary keys.

Note

Loading new data in a table is faster when using a small primary key with always increasing values.

Performance when joining data

As already mentioned, a primary key is used in other tables as a foreign key. Very often, you will have to combine tables into a single result set when using the data. You might, for instance, need to combine name and address information from an Employee table with salary payment information to create an annual statement. You need to combine both tables using the keys. The database will compare all the values in one table with all the values in the other table and combine those rows with matching values.
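A minimal sketch of such a join, loosely following Figure 1.6 (the table and column names other than EmployeeID are assumptions):

-- Combine employee details with salary payments by matching key values;
-- EmployeeID is the primary key in Employee and a foreign key in SalaryPayment
SELECT
     e.EmployeeName
     , p.PaymentMonth
     , p.Amount
FROM
     dbo.Employee AS e
INNER JOIN
     dbo.SalaryPayment AS p ON p.EmployeeID = e.EmployeeID;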

Comparing two values is not a problem for a database. However, if your tables have millions of rows, this may lead to two values being compared millions of times. And when multiple users concurrently query the database using joins, this becomes even more problematic. When your key is 128 bits in size instead of 64 bits, these comparisons take twice as many CPU cycles. That either means that the performance is going down, or it means that you have to spend more money on a service tier that provides you with more CPU capacity. With business keys, you are dependent on the size of the actual data, while with surrogate keys, you always choose integer data types that can fit within 64 bits. Multiples of 64 bits are important since we work with 64-bit computers.

If you work with relatively small databases with only a couple of users, making "bad" choices will most likely not harm you (too much). Making proper choices at design time is no more difficult or expensive than not caring. Changing an already operational database that already stores data is always expensive. It might mean that you have to take the database offline for some time. It may even mean that some code might break. And databases tend to grow. Make sure your database will scale with that growth. A lot of minor performance enhancements may have a huge impact when combined.

The actual size of tables

Some tables store millions, or even billions, of rows. Suppose a table of that size holds a foreign key referencing another table. A supermarket, for instance, might have a table with a row for each individual product scanned at the register. Each row contains information about the product being scanned, the price of the product, the date and time of the transaction, and the person operating the register. How many new rows would you get if this is a supermarket that operates nationwide?

In this example, you could store the product name of the product being sold in each row or just a product number. So, you store something like "Peanut butter," or just the number 1, assuming that in the Product table, the product with the name "Peanut butter" has 1 as its product number. The name in this case is 13 characters long. That would normally take 26 bytes to store in SQL Server. If you choose int to be the data type for the ProductNumber column, you need only 4 bytes to store the product numbers. When you use ProductNumber as the key instead of using ProductName, you would save 22 bytes per row. With a billion rows, that equates to 22 billion bytes or 22 GB.

Inefficient keys make databases (much) larger than they need to be. In Azure, extra storage immediately leads to additional costs. It also means that your database server will most likely need more memory. You could run into memory-related performance issues a lot sooner. And as I said, changing the database later is far more difficult and costly than doing it right the first time.

With Data Vault as an exception, most people tend to agree that using surrogate keys is far better than using business keys. One downside of using surrogate keys is that you add extra meaningless data to your database. This extra column makes the table a bit larger and a bit less efficient. However, as explained above, the effect on the tables, where the key acts as a foreign key, is much greater, far outweighing this argument.

A surrogate key has one real disadvantage. Each time you enter a new row, the database automatically generates a new unique value. However, that means that there is no longer any real check on whether your actual row values are still unique. All columns except the surrogate key could have the same value, which essentially means you now have duplicate rows. Technically, you have unique rows; from a business perspective, you have duplicates.

All arguments considered, using surrogate keys is better than using business keys in almost all scenarios.

Surrogate keys are a better choice than using business keys for most tables in most databases. Data Vault-modeled databases are the exception to this rule.

Integrity

Now that we know how to choose primary keys, let's look at what foreign keys are for. They provide integrity for our data. Good data is critical to businesses, so let's look at how foreign keys can help to improve data quality.

Primary keys and foreign keys make working with data in databases straightforward when the data is stored in more than one table. By using primary keys, we ensure that we can read, change, or delete individual rows from tables. You don't necessarily need primary keys to perform these kinds of operations on data, but as explained in the section regarding the uniqueness of rows, you need the guarantee of being able to reference a unique row by a characteristic because SQL operates on sets of rows that have the same characteristic. SQL is a set-based language.

Without foreign keys, you wouldn't be able to merge related data that is stored in different tables in a meaningful way. But when you know which column acts as a foreign key, you can always combine the data from separate tables into a single result set. All that is required is that the person writing the SQL queries knows which columns are the primary and foreign keys. It is not necessary to add that information to the database's metadata, although that is certainly an option. Have a look at the following code snippet:

CREATE TABLE dbo.Invoice
(
     InvoiceNr        int not null
     , InvoiceDate    date not null
     , CustomerID     int not null
     , SalesAmount    money not null
);

ALTER TABLE dbo.Invoice
ADD CONSTRAINT PK_Invoice PRIMARY KEY (InvoiceNr);

ALTER TABLE dbo.Invoice
ADD CONSTRAINT FK_Invoice_Customer FOREIGN KEY (CustomerID)
     REFERENCES dbo.Customer (CustomerID);

Lines 1 to 7 create a new table called dbo.Invoice. This table has four columns: InvoiceNr, InvoiceDate, CustomerID, and SalesAmount. For each column, you can see the data type of the column – int, date, int, and money, respectively. The addition not null means that a column must have a value; it cannot be left empty when adding a new row. When executed, this part of the code would create a table without a primary key and without foreign key(s). Creating a table without keys is perfectly OK in SQL Server; however, it is not a best practice.

Lines 9 and 10 use the ALTER TABLE statement to add a primary key to the dbo.Invoice table. This states that the InvoiceNr column should, from now on, be used as the primary key of this table. When there is already data in the table, the ALTER TABLE statement will fail if the actual values being used so far are not unique. Once the statement executes successfully, the table has a primary key, and the database will start enforcing the uniqueness of values within the InvoiceNr column. Uniqueness is now guaranteed.

Lines 12 to 14 of the code snippet create a foreign key that references the dbo.Customer table using the CustomerID column in both tables. When executing these lines, the database will check, for each value of CustomerID in the dbo.Invoice table, whether a matching value exists in the dbo.Customer table. The code fails if the check fails. Failing the check would mean that you have invoices of non-existing customers.

The real value of a foreign key is referential integrity. Referential integrity is the guarantee that all values in the foreign key reference existing values in the primary key. Upon entering a new row in the dbo.Invoice table, the database checks whether the CustomerID value entered exists in the dbo.Customer table. The same check is performed when you change the CustomerID value of an existing invoice. Whenever a customer is deleted from the database, the database will first check whether that results in invoices without customers. Whenever a check fails, the operation is not performed by the database. An error is thrown instead.
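As an illustration using the dbo.Invoice table from the code snippet (the invoice and customer numbers are made up):

-- Succeeds, provided a customer with CustomerID 1 exists in dbo.Customer
INSERT INTO dbo.Invoice (InvoiceNr, InvoiceDate, CustomerID, SalesAmount)
VALUES (1001, '2021-06-15', 1, 250.00);

-- Fails with a foreign key violation when no customer 999 exists
INSERT INTO dbo.Invoice (InvoiceNr, InvoiceDate, CustomerID, SalesAmount)
VALUES (1002, '2021-06-15', 999, 99.95);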

Referential integrity

Referential integrity is the guarantee that references to other tables using foreign keys always reference existing rows.

You can change this default behavior of throwing errors by adding extra clauses to the foreign key:

  • ON DELETE CASCADE | SET NULL | SET DEFAULT
  • ON UPDATE CASCADE | SET NULL | SET DEFAULT

With the addition of an ON DELETE clause to a foreign key, you specify CASCADE, SET NULL, or SET DEFAULT. CASCADE means that, in the example of the code snippet, every time you delete a customer, all their invoices are automatically deleted as well. That, of course, guarantees that you do not have invoices of non-existing customers following deletion, but you also lose the invoice information. That doesn't seem like a good strategy in this particular example. Be careful with using the CASCADE option.

Specifying SET NULL instead of CASCADE means that the CustomerID column in the dbo.Invoice table will be made empty (NULL in database terms) instead of the invoice being deleted. You retain the invoice, but you lose the information regarding which customer the invoice belonged to. In our example, this option is not possible because our CustomerID column is defined as NOT NULL. It cannot be empty.

The SET DEFAULT option is only possible when the CustomerID column has a so-called DEFAULT constraint defined on it. This means that the column automatically gets a value when you enter a new row without specifying a value for the column explicitly. With SET DEFAULT specified on the foreign key, the database assigns that default value to the CustomerID column in the dbo.Invoice table when you delete a customer.

The same options apply to the ON UPDATE clause that you can add to a foreign key. They apply when you change the value in the CustomerID column in the dbo.Customer table.
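As a sketch of the syntax, the foreign key from the earlier snippet could be recreated with an ON DELETE clause as follows (an existing constraint has to be dropped before it can be redefined):

ALTER TABLE dbo.Invoice
DROP CONSTRAINT FK_Invoice_Customer;

ALTER TABLE dbo.Invoice
ADD CONSTRAINT FK_Invoice_Customer FOREIGN KEY (CustomerID)
     REFERENCES dbo.Customer (CustomerID)
     ON DELETE CASCADE;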

Both clauses have little to no value when using surrogate keys. Surrogate key values never change, which means you do not need the ON UPDATE clause, and we almost never delete data from databases, so the ON DELETE clause is rarely needed either.

The real value of foreign keys is the referential integrity they provide. Through foreign keys, you get more consistency in your data. When your data is incorrect or inconsistent, the information you derive from it becomes incorrect. And people base decisions on the information derived from the data. Good data is critical to businesses.

A point to note is that extra checks in the database always cost some extra time and processing power. You have to balance increased data quality against (a bit of) performance loss. Most of the time, quality should win over the other arguments. But if you need the highest throughput and scalability you can get, you may choose to leave out some quality checks to achieve this goal. As a database designer, you decide where the priorities lie. You always adjust your design to the intended use, and you always have to weigh the different options against one another.

Recap

From all the available candidate keys in a table, you choose one to become the primary key. When using normalization or dimensional modeling techniques, the primary key is preferably a surrogate key. The database only checks the uniqueness of entire rows in terms of the uniqueness of the primary key values.

Foreign keys provide you with a means to combine different but related tables. This is crucial in working with relational databases. Similar techniques are used when designing NoSQL databases as well but without tables. No matter which modeling technique you use, a relational database will always consist of multiple tables that are related using foreign keys.

The Check and Unique constraints

Both the primary key and the foreign key are called constraints, as you can see in the code snippet of the previous section. They limit the values that you can store in a column. For a primary key, the limitation (constraint) is that you cannot store a value that is already used in another row. For a foreign key, the constraint is that you can only use values that exist in the related table. There are two more constraints that we should mention.

The Unique constraint is really comparable to the primary key. A Unique constraint is applied to a column and ensures that the column can only store a value once. That allows you to use a surrogate key as a primary key and, at the same time, have the database check business keys for uniqueness.
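Continuing the illustrative dbo.Customer sketch from earlier, a Unique constraint on the business data could look like this:

-- Prevent the duplicate rows from the earlier sketch: the combination
-- of CustomerName and City can now only be stored once
ALTER TABLE dbo.Customer
ADD CONSTRAINT UQ_Customer_Name_City UNIQUE (CustomerName, City);

Note that, just like adding a primary key, this statement fails if the table already contains duplicate values in these columns.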

With a Check constraint, you can limit the values a column can store by applying extra (but simple) rules. For instance, a Price column may have a money data type, making it a numerical column with precision to four decimal places. With a Check constraint, you can further specify that only positive values are allowed. Negative product prices do not make sense.

Look at the following code snippet:

ALTER TABLE dbo.Invoice
ADD CONSTRAINT CK_SalesAmount CHECK (SalesAmount > 0);

Once the code snippet here is executed, the SalesAmount column will no longer accept negative values.

Note

Constraints help to improve data quality in databases.

Now that we have learned about relational theory and how keys and constraints help us to improve the performance of the database and improve the quality of the data it contains, it is time to look at how we use the database. A different use case warrants a different design.

 

Types of workload

The performance and quality of a database depend not only on keys and constraints but also, to a large extent, on the design of the database: its schema. The chosen schema has a huge impact on the functioning of the database. In Figure 1.7, you can see a performance pyramid showing the different aspects of a database that are important to its performance. The base of the pyramid is the schema, which is the main reason for writing this book:

Figure 1.7 – Performance pyramid

When the schema design is good, you can further optimize a database by applying query best practices. The schema has a direct influence on how complex queries become. Complex queries are error-prone (bad for the quality of the information derived from the data). Furthermore, complex queries are expensive for the database engine to execute, resulting in poor performance.

Once your schema is correct and the queries are written according to best practices, you can further optimize the database by creating indexes. We will discuss indexes in Chapter 4, Provisioning and Implementing an Azure SQL DB. Writing good queries and creating indexes might be an iterative process since each has an impact on the other.

When queries are executed, you can change isolation levels. Isolation levels determine how locking and blocking work in the database engine. This balances query consistency against performance. Isolation levels are beyond the scope of this book, however.

Server tuning covers many things. On-premises, it might mean choosing proper hardware to run SQL Server on. In Azure, it means choosing the right service tier. Higher service tiers provide more resources (memory, storage, compute power) at a higher cost. Solving all your performance issues in this layer of the pyramid is costly. Besides, solving problems in this layer means fixing the symptoms instead of fixing the actual underlying problem.

Issues in lower levels of the pyramid lead to issues in higher levels. Issues in the performance pyramid result in the following:

  • High I/O
  • High memory consumption
  • High CPU usage
  • Lots of locks

In the end, you either have a slow database or you are spending way too much money. Again, this is the purpose behind writing this book.

Because the schema determines what your queries look like and how they perform, the (type of) queries you need the most must be key in designing the schema. This is very true for all types of database, relational and NoSQL alike. You need to know the expected workload when designing a database.

Note

The expected/intended workload of a database is key when designing a database schema and choosing the proper modeling technique.

Databases are created for different types of use case. This means that different databases are used in different ways. Data Vault is a modeling technique that optimizes the database for flexible, long-term storage of historical data. You will learn more about Data Vault in Chapter 9, Data Vault Modeling. The focus is on storing data. Most modeling techniques focus on using the data and optimizing that usage, thereby optimizing the intended workload. There are two types of workloads (although in real life, a lot of databases are a mix of the two):

  • OLTP
  • OLAP

OLTP

OLTP stands for Online Transaction Processing. The word online does not refer to the internet. The term refers back to old-fashioned mainframes where you didn't interact with the system directly. You kept track of changes during the day and processed all changes during the night in batches of work. With relational databases, people started to interact with the database directly. You could "ask" (write) a query from your console and execute that query in real time.

The words Transaction Processing in OLTP refer to the type of workload. A webshop creates an OLTP workload on the database. Most Line of Business (LOB) applications do. Most applications that are part of the primary processes in a company generate OLTP workloads for their database. Besides the aforementioned webshop, you can think of Customer Relationship Management (CRM) databases, Enterprise Resource Planning (ERP) databases, and databases for financial applications or human resource applications.

Primary process databases are, most of the time, databases where new data comes into existence. New customer accounts are created, new products are entered, and (hopefully) new orders and invoices are created.

OLTP workloads have two main characteristics:

  • A lot of small queries are executed.
  • A lot of new rows are added to the database and existing rows need to be updated on a regular basis.

A lot of small queries are executed

Let's take the webshop as an example again. You search for a specific product. You get a list of options. You then select one of the options that looks interesting. The screen now shows a lot of descriptive properties of that product. Most of the information is about that one product. You might click on an alternative to see all the properties of that alternative. Every time you click on a product, you get all the columns of a single row. The entire Product table of a webshop might contain thousands of rows, but you are only interested in a single one at a time, or, at most, a couple to compare a number of alternatives.

If it is a larger webshop, there will be multiple potential customers browsing through the products in the same manner. This leads to a lot of queries that each retrieve a single row, or a couple of rows at most – in other words, lots of small queries. You need the database to be fast for this type of query in order to get a responsive website.
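A typical query of this kind might look like the following sketch (the dbo.Product table and its columns are illustrative):

-- Retrieve all descriptive properties of one specific product
SELECT ProductNumber, ProductName, Description, Price
FROM dbo.Product
WHERE ProductNumber = 1;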

A lot of new rows are added to the database and existing rows need to be updated on a regular basis

In the end, the webshop needs you to place orders. Maybe you need to create an account first. This means that a new row has to be entered in the Customer table. In addition, an invoice address and perhaps a different shipping address may have to be entered in the system as well, and all this to place a new order in the system that might consist of multiple order lines.

So, in addition to lots of small queries, the database needs to be able to handle lots of writes. Most of the writes insert single rows or, at most, a couple of rows at a time. Again, you need your database to be able to handle the writes quickly. Perhaps performance is less important for the writes, but scalability is an issue as well. Will the database be fast enough to handle lots of customers at the same time as your webshop grows in popularity?
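Placing a single order could, for instance, translate into a handful of small inserts such as the following (the table and column names are again illustrative):

-- One new order...
INSERT INTO dbo.SalesOrder (OrderID, CustomerID, OrderDate)
VALUES (5001, 1, '2021-06-15');

-- ...and its order lines, written as part of the same transaction
INSERT INTO dbo.SalesOrderLine (OrderID, LineNr, ProductNumber, Quantity)
VALUES (5001, 1, 1, 2),
       (5001, 2, 42, 1);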

Databases that focus on OLTP workloads should be normalized. You will learn how to normalize a database in Chapter 3, Normalizing Data.

OLAP

OLAP stands for Online Analytical Processing. Analytical processing means that we use the data for data analysis or analytics. Perhaps we are trying to answer questions such as: what type of customers buy what type of products? What colors are most popular for my products? What else might a customer buy when they buy a particular product? We use the answer to that last question to create recommender systems. Most webshops use recommender systems nowadays, showing suggestions such as "People who bought this were also interested in…".

Another example might be straightforward sales reports that show how well we are doing from a sales perspective or an analysis of sick leave among employees.

All these types of queries are considered to be OLAP queries.

OLAP workloads also have two main characteristics:

  • The workload is (almost) read-only.
  • Most queries use large datasets.

Read-only

OLAP is about reporting and analysis. Most of the time, it is not part of the primary process. With OLAP, we don't create new data. We use the existing data that comes from the primary processes to analyze. That means we read the data; we do not change it or add to it. In practice, a process loads new data into the OLAP database, after which users only read from that data.

Large datasets

Especially for something such as creating a recommender system, you need a lot of data. You analyze all orders from the last couple of months (or even years) to find products that are often purchased together. You are not interested in single individual orders. You are not interested in single products. You need a lot of orders to ascertain whether certain products are often purchased together. You read lots of rows from the database to do this.

A simple sales report showing sales by month and the growth in sales over time also requires lots of rows from the database. A sales report with year-to-date figures and growth compared to last year requires at least 2 years' worth of orders. That is a large dataset that you read from the database.

The report should be quick to display, especially when salespeople look at the report multiple times per day. They don't like to wait for a minute to see their report. This means that the database should be fast in retrieving these large datasets.
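Such a report query reads many rows but returns only a small, aggregated result. Using the dbo.Invoice table from earlier as an example (the date filter is just an illustration):

-- Total sales per month over (at least) the last two years
SELECT YEAR(InvoiceDate)  AS SalesYear
     , MONTH(InvoiceDate) AS SalesMonth
     , SUM(SalesAmount)   AS TotalSales
FROM dbo.Invoice
WHERE InvoiceDate >= '2019-01-01'
GROUP BY YEAR(InvoiceDate), MONTH(InvoiceDate)
ORDER BY SalesYear, SalesMonth;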

Databases that are OLAP-oriented often benefit from dimensional modeling. You will learn about dimensional modeling in Chapter 7, Dimensional Modeling.

 

Summary

Our current society thrives on data. People sometimes say, "Data is the new gold." Even though most of the digital data we create is not stored in databases (think about movies, music, pictures…), databases play a crucial role in our information-driven world.

It is important to consider what you want to do with your data and what the nature of your data is. Choose the right system for the job, whether that is a NoSQL database or a relational database. Choose the proper modeling technique to set up your database.

In this chapter, you learned mostly about relational databases, although much of what has been covered applies to databases in general. A relational database is a database based on set theory, where data is stored as rows in tables. Rows should be unique in tables. We use a primary key to enforce uniqueness, and we use foreign keys to relate tables to one another.

The best way to design a database depends on its intended usage. We will cover this in detail in Chapter 3, Normalizing Data, and in Chapter 7, Dimensional Modeling.

The next chapter focuses on entity analysis. Entity analysis is about what to store and how to store it.

About the Author
  • Peter ter Braake

    Peter Ter Braake started working as a developer in 1996 after studying physics in Utrecht, the Netherlands. Databases and business intelligence piqued his interest the most, leading to him specializing in SQL Server and its business intelligence components. He has worked with Power BI from the tool's very beginnings. Peter started working as an independent contractor in 2008. This has enabled him to divide his time between teaching data-related classes, consulting with customers, and writing articles and books. Peter has also authored Data Modeling for Azure Data Services, Packt.
