Book Image

Data Modeling for Azure Data Services

By : Peter ter Braake
Book Image

Data Modeling for Azure Data Services

By: Peter ter Braake

Overview of this book

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design becomes paramount to successful implementation. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse and explore dimensional modeling, data vault modeling, along with designing and implementing a Data Lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services are the best fit for your model and how to implement the best design for your solution.
Table of Contents (16 chapters)
1
Section 1 – Operational/OLTP Databases
8
Section 2 – Analytics with a Data Lake and Data Warehouse
13
Section 3 – ETL with Azure Data Factory

Overview of relational databases

Databases hadn't yet been invented when we first started programming computer applications. All data had to be stored in files. Oftentimes, those files were simple comma-separated value files (CSV files). An example of a CSV file can be seen in the following screenshot:

Figure 1.1 – Person.csv

Figure 1.1 – Person.csv

As you can see, it is just some data without anything else.

Files

Using files to store data for use in applications entailed a number of issues. After trying file formats other than CSV files, developers started using databases instead of plain files. Plain files or flat files are files with just data stored in them. Figure 1.1 is an example of a flat file. Let's look into the issues that using flat files posed.

From the header of the screenshot in Figure 1.1, it is clear that the file is called Person.csv. We may infer that the data in the files represents persons. However, it is not clear whether those people are patients, customers, employees, or even someone completely different. Furthermore, you cannot ascertain that extra information from the file or its content.

Drawbacks

The use of these types of flat files to store data comes with three drawbacks:

  • You cannot infer from the file itself what the data is about.
  • It is not flexible from a programming perspective and is bad for performance when working with the data.
  • It is (almost) impossible for multiple persons to work with flat files simultaneously.

We will now examine each of these drawbacks in turn.

Drawback 1 – You cannot infer from the file itself what the data is about

It is clear from looking at the screenshot that each line has two commas, meaning that there are three columns per row. The second column very likely holds a first name. This is a reasonable assumption based on our knowledge of names, although you may require a knowledge of Dutch names to make this assumption. The third column is more difficult to guess. It could be the year of birth of the person in question, but it could also be a postal code or perhaps a monthly salary.

The file only stores the actual data and not the metadata. It may be that you can guess what the values mean, but you cannot infer it from the file itself. Metadata is the data describing the data. Column names are an example of metadata. Column names such as PatientID, Patient_FirstName, and PostalCode would already render it more readable. That is why we often add those column names as a first row in flat files.

Note

Metadata is data that describes the "actual" data.

There is even more to ascertain regarding this data. You cannot perform calculations with postal codes, such as adding up two postal codes (it may be that you can, but it doesn't make any sense). A postal code is an alphanumeric code that you cannot perform computations with. When the last column in Figure 1.1 is a salary and not a postal code, you do want (and need) to be able to perform calculations on this column, for instance, to calculate an annual salary from the monthly salaries. In this case, the column would have been numerical. In other words, it would be beneficial to know a column's data type. Generally speaking, data can be numerical, alphanumerical (text), or dates. Nowadays, of course, there are a lot of variations, such as binary data for pictures.

With data stored in flat files, the data itself and the metadata are stored separately.

Today, we have overcome some of these issues by not using flat files but storing data as XML or as JSON files. Both file types allow you to store metadata with the actual data in the file itself. In the (recent) past, this was too expensive to do. Only recently has storage become cheap enough and compute power plentiful enough to work with text files by storing data and metadata in the way that JSON does.

Drawback 2 – It is not flexible from a programming perspective and is bad for performance when working with the data

It gets nastier when we start using (old-fashioned) program code to work with the data. Suppose you need to know the postal code of the person called Janneke. Your code would now look something like this:

  1. Read a line.
  2. Read the second column.
  3. If the value you read equals Janneke, then return the third column.
  4. Repeat lines 1 to 3 until there are no more lines in the file.

With only two lines in the file, this is pretty fast. This code will become more problematic, however, when a file contains many, many rows. It will become really slow.

It gets even worse when someone changes the file structure. Suppose we add a new column, storing the patient's family name between the second and third columns. The code we just wrote will break because it assumes that the postal code is the third column. However, following the change, it is the fourth column. Retrieving the postal code should be independent of which column it actually is.

Drawback 3 – It is (almost) impossible for multiple persons to work with flat files simultaneously

In most applications, there will be multiple users working with the data simultaneously. What if your webshop could only have one visitor at a time? A database should make it easy for multiple people or processes to work with the same data at the same time. In the case of relational databases, this is part of their core. A relational database has what is known as the ACID properties to cater to multi-user workloads. You will learn more about the ACID properties in Chapter 5, Designing a NoSQL Database. Without a database system, whether relational or not, multiple users working with the same data would not be impossible, but you will get consistency issues if you don't implement complex logic to prevent inconsistencies.

If you always process all the data in a flat file as a whole, and you do that, for instance, during the night, flat files are fine to work with, as we will see in Chapter 10, Designing and Implementing a Data Lake Using Azure Storage. However, if you need to work with individual pieces of information from within a flat file in real time, you will not be able to do that in an acceptable manner.

At first, smart workarounds were invented to make working with flat files easier and more efficient. There were files such as ISAM files and VSAM files. It is beyond the scope of this book to go into these different file types. More interesting for us is the fact that the problems described in this paragraph led to the introduction of database management systems (DBMSes).

Relational databases

A database is a self-describing collection of related data with the aim of providing information to people and applications.

The first database appeared in the 1960s. These databases were hierarchical databases. A little later, network databases were introduced, but neither type of database offered the flexibility to work with (large amounts of) data in more complex organizations with multiple users.

In the early 1970s, E.F. Codd, an English mathematician working for IBM, came up with a theory of how to create relational databases. He based his theory on mathematical set theory. This theory describes sets of elements that are potentially really large in a few simple rules (that will be covered later in the chapter). Codd realized that mathematical set theory could not only be applied to something abstract such as all even numbers, but also to real live collections such as all our customers. This rendered set-based theory useable in relation to the data we were working with and the data we needed to store in databases.

The name relational database stems from the fact that data is stored in tables. For example, take a set of numbers {1, 2, 3, 4}. Then, imagine a second set, for instance, a set of names {Peter, Janneke, Jari, Mats}. We could combine these two sets in a table, as shown in Figure 1.2:

Figure 1.2 – A table of patients

Figure 1.2 – A table of patients

We started with independent sets of values. We created a relation between the two sets by combining them into a table. Making a table with rows is like saying the values 1 and Peter belong together, just as 2 and Janneke do. This makes the sets no longer independent. A relationship now exists between the values in one column and the values in another column. In other words, the table is the relation between the PatientID set and the PatientName set. Relation here is another word for Table.

Note

Relational databases store data in tables.

We see something more in Figure 1.2. The first column is called PatientID. The column header also specifies that the data type of the values in this column is int. This means that this column can only store whole numbers. The second column is of the nvarchar(50) type, specifying that it stores alphanumeric values (text) with a maximum length of 50 characters. This metadata is part of the table itself. The data and the metadata are now a whole instead of separately stored pieces of information.

Note

In a database, data and metadata are combined in a single structure.

A relational database is normally more than just one table. In real life, a relational database can consist of thousands of tables. According to the definition of a database, it is a collection of related data. This means that the tables have relationships with one another. Since relationship sounds a lot more like relational than table does, a lot of people came to believe that a relational database got its name from related tables. However, as stated previously, storing data in tables is what makes a database relational.

Relational Database Management System

A Relational Database Management System (RDBMS) is a piece of software that allows you to create and manage databases that adhere to Codd's theory. That turns an RDBMS into an application that allows you to create tables and then store data in those tables. The "management" part is all the extra "services" you get from an RDBMS, such as securing your data so that only authorized people can work with the data. An RDBMS allows you to work with data, from creating tables to storing and managing the data and its accessibility. Examples of well-known RDBMS systems include Microsoft SQL Server, Oracle, IBM DB2, MySQL, and MariaDB.

Note

An RDBMS is a database product that follows Codd's rules of the relational model, allowing you to work with and manage all your data.

We previously referred to a couple of problems that we encountered in the past when using CSV files. One was the lack of metadata. Relational databases rectify that problem. The question that remains to be answered is how relational databases offer the flexibility and performance needed that CSV files couldn't offer. To do that, we first need to introduce the SQL language.