Data Modeling for Azure Data Services

By: Peter ter Braake

Overview of this book

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design becomes paramount to successful implementation. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse and explore dimensional modeling, data vault modeling, along with designing and implementing a Data Lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services are the best fit for your model and how to implement the best design for your solution.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Section 1 – Operational/OLTP Databases

Chapter 1: Introduction to Databases

Overview of relational databases

Introduction to Structured Query Language

Impact of intended usage patterns on database design

Understanding relational theory

Keys

Types of workload

Summary

Chapter 2: Entity Analysis

Scope

Understanding entity relationship diagrams

Entities

Relationships

Creating your first ERD

Context of an ERD

Summary

Exercises

Chapter 3: Normalizing Data

When to use normalization as a design strategy

Preventing redundancy

The normalization steps

An alternative approach to normalizing data

Integrating separate results

Entity relationship diagram

Summary

Exercises

Chapter 4: Provisioning and Implementing an Azure SQL DB

Technical requirements

Understanding SQL Server data types

Quantifying the data model

Provisioning an Azure SQL database

Connecting to the database

Data definition language

Inserting data

Indexing

Summary

Chapter 5: Designing a NoSQL Database

Understanding big data

Understanding big data clusters

Getting to know Cosmos DB

Key-value databases

Other NoSQL databases

Extra considerations

Summary

Exercise

Chapter 6: Provisioning and Implementing an Azure Cosmos DB Database

Technical requirements

Provisioning a Cosmos DB database

Creating a container

Uploading documents to a container

Cosmos DB container settings

Importing data using the Azure Cosmos DB Data Migration tool

Summary

Section 2 – Analytics with a Data Lake and Data Warehouse

Chapter 7: Dimensional Modeling

Background to dimensional modeling

Understanding dimensional modeling

Steps in dimensional modeling

Designing dimensions

Designing fact tables

Using a Kimball data warehouse versus data marts

Summary

Exercise

Chapter 8: Provisioning and Implementing an Azure Synapse SQL Pool

Overview of Synapse Analytics

Provisioning a Synapse Analytics workspace

Creating a dedicated SQL pool

Implementing tables in Synapse SQL pools

Understanding workload management

Using PolyBase to load data

Connecting to and using a dedicated SQL pool

Summary

Chapter 9: Data Vault Modeling

Background to Data Vault modeling

Designing Hub tables

Designing Link tables

Designing Satellite tables

Using hash keys

Designing a Data Vault structure

Designing business vaults

Implementing a Data Vault

Summary

Exercise

Chapter 10: Designing and Implementing a Data Lake Using Azure Storage

Technical requirements

Background of data lakes

Modeling a data lake

Using different file formats

Choosing the proper file size

Provisioning an Azure storage account

Creating a data lake filesystem

Creating multiple storage accounts

Summary

Section 3 – ETL with Azure Data Factory

Chapter 11: Implementing ETL Using Azure Data Factory

Technical requirements

Introducing Azure Data Factory

Introducing the main components of Azure Data Factory

Using the copy activity

Implementing a data flow

Executing SQL code from Data Factory

Summary

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Overview of relational databases

Databases hadn't yet been invented when we first started programming computer applications. All data had to be stored in files. Oftentimes, those files were simple comma-separated value files (CSV files). An example of a CSV file can be seen in the following screenshot:

Figure 1.1 – Person.csv

As you can see, it is just some data without anything else.

Files

Using files to store data for use in applications entailed a number of issues. After trying file formats other than CSV files, developers started using databases instead of plain files. Plain files or flat files are files with just data stored in them. Figure 1.1 is an example of a flat file. Let's look into the issues that using flat files posed.

From the header of the screenshot in Figure 1.1, it is clear that the file is called Person.csv. We may infer that the data in the file represents people. However, it is not clear whether those people are patients, customers, employees, or some other group entirely. Furthermore, you cannot ascertain that extra information from the file or its content.

Drawbacks

The use of these types of flat files to store data comes with three drawbacks:

You cannot infer from the file itself what the data is about.
It is not flexible from a programming perspective and is bad for performance when working with the data.
It is (almost) impossible for multiple persons to work with flat files simultaneously.

We will now examine each of these drawbacks in turn.

Drawback 1 – You cannot infer from the file itself what the data is about

It is clear from looking at the screenshot that each line has two commas, meaning that there are three columns per row. The second column very likely holds a first name. This is a reasonable assumption based on our knowledge of names, although you may require a knowledge of Dutch names to make this assumption. The third column is more difficult to guess. It could be the year of birth of the person in question, but it could also be a postal code or perhaps a monthly salary.

The file only stores the actual data and not the metadata. Metadata is the data describing the data, and column names are an example of it. You may be able to guess what the values mean, but you cannot infer it from the file itself. Column names such as PatientID, Patient_FirstName, and PostalCode would already make the file more readable. That is why we often add those column names as a first row in flat files.

Note

Metadata is data that describes the "actual" data.

There is even more to ascertain regarding this data. You cannot meaningfully perform calculations with postal codes: adding up two postal codes may be technically possible, but it doesn't make any sense. A postal code is an alphanumeric code, not a quantity. If the last column in Figure 1.1 is a salary and not a postal code, you do want (and need) to be able to perform calculations on this column, for instance, to calculate an annual salary from the monthly salaries. In that case, the column would be numerical. In other words, it would be beneficial to know a column's data type. Generally speaking, data can be numerical, alphanumerical (text), or dates. Nowadays, of course, there are a lot of variations, such as binary data for pictures.
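The need for data types can be illustrated with a short Python sketch. It assumes, purely for illustration, that the third column holds a monthly salary of 3000; that value is made up and not taken from Figure 1.1.

```python
import csv
import io

# Hypothetical line from a flat file; the third value is assumed to be a salary.
row = next(csv.reader(io.StringIO("2,Janneke,3000\n")))
raw_value = row[2]

# Everything read from a CSV file arrives as text, whatever it represents.
print(type(raw_value).__name__)

# Treated as text, "multiplying" merely repeats the characters.
as_text = raw_value * 2

# Only after an explicit conversion to a number does arithmetic make sense.
annual_salary = int(raw_value) * 12

print(as_text, annual_salary)
```

The file cannot tell the program which of the two interpretations is intended; the programmer has to know the data type and apply the conversion by hand.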

With data stored in flat files, the data itself and the metadata are stored separately.

Today, we have overcome some of these issues by not using flat files but storing data as XML or as JSON files. Both file types allow you to store metadata with the actual data in the file itself. In the (recent) past, this was too expensive to do. Only recently has storage become cheap enough and compute power plentiful enough to work with text files by storing data and metadata in the way that JSON does.
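As a small illustration of the difference, the sketch below holds the same record once as a CSV line and once as JSON. The field names are the column names suggested earlier; the values are made up for the example and are not taken from Figure 1.1.

```python
import json

# One record as a flat CSV line: values only, nothing says what they mean.
csv_line = "2,Janneke,1234"  # hypothetical values

# The same record as JSON: every value travels with its field name.
json_text = '{"PatientID": 2, "Patient_FirstName": "Janneke", "PostalCode": "1234"}'

record = json.loads(json_text)

# The field names (metadata) can be read from the content itself.
field_names = list(record.keys())
print(field_names)

# JSON also distinguishes numbers from text, a first hint at data types.
print(type(record["PatientID"]).__name__, type(record["PostalCode"]).__name__)
```

The price is size: the JSON version repeats the field names in every record, which is the storage cost the paragraph above refers to.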

Drawback 2 – It is not flexible from a programming perspective and is bad for performance when working with the data

It gets nastier when we start using (old-fashioned) program code to work with the data. Suppose you need to know the postal code of the person called Janneke. Your code would now look something like this:

1. Read a line.
2. Read the second column.
3. If the value you read equals Janneke, then return the third column.
4. Repeat steps 1 to 3 until there are no more lines in the file.

With only two lines in the file, this is pretty fast. However, the code has to read the file from the top for every lookup, so it becomes slower and slower as the file grows to many thousands or millions of rows.
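The steps above can be sketched in Python as follows. The file content and the function name find_third_column are assumptions made for this example; the real Person.csv from Figure 1.1 is not reproduced here.

```python
import csv
import io

# Hypothetical file content with two lines, standing in for Person.csv.
PERSON_CSV = "1,Peter,1234\n2,Janneke,5678\n"

def find_third_column(lines, first_name):
    """Scan line by line and return the third column of the first match."""
    for row in csv.reader(lines):
        # Index 1 is the second column, index 2 the third column.
        if len(row) >= 3 and row[1] == first_name:
            return row[2]
    return None  # reached the end of the file without a match

print(find_third_column(io.StringIO(PERSON_CSV), "Janneke"))
```

In the worst case, the loop visits every line before it finds the name or gives up, so the time per lookup grows with the size of the file.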

It gets even worse when someone changes the file structure. Suppose we add a new column, storing the patient's family name between the second and third columns. The code we just wrote will break because it assumes that the postal code is the third column. However, following the change, it is the fourth column. Retrieving the postal code should be independent of which column it actually is.
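The following sketch shows how position-based code breaks after such a change, and how a lookup by column name avoids the problem. It assumes the file carries the column names as a first row; the column names, the family name Jansen, and both function names are hypothetical.

```python
import csv
import io

# Hypothetical data with a header row, before and after adding FamilyName.
OLD_LAYOUT = "PatientID,FirstName,PostalCode\n2,Janneke,5678\n"
NEW_LAYOUT = "PatientID,FirstName,FamilyName,PostalCode\n2,Janneke,Jansen,5678\n"

def postal_code_by_position(text, first_name):
    """Assumes the postal code is always the third column."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header row
    for row in rows:
        if row[1] == first_name:
            return row[2]
    return None

def postal_code_by_name(text, first_name):
    """Looks columns up by name, so their order does not matter."""
    for row in csv.DictReader(io.StringIO(text)):
        if row["FirstName"] == first_name:
            return row["PostalCode"]
    return None

print(postal_code_by_position(NEW_LAYOUT, "Janneke"))  # silently returns the family name
print(postal_code_by_name(NEW_LAYOUT, "Janneke"))      # still returns the postal code
```

Note that the position-based version does not even fail loudly: it returns the wrong column without any error, which is arguably worse than crashing.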

Drawback 3 – It is (almost) impossible for multiple persons to work with flat files simultaneously

In most applications, there will be multiple users working with the data simultaneously. What if your webshop could only have one visitor at a time? A database should make it easy for multiple people or processes to work with the same data at the same time. In relational databases, this is a core capability: they provide what are known as the ACID properties to support multi-user workloads. You will learn more about the ACID properties in Chapter 5, Designing a NoSQL Database. Without a database system, whether relational or not, having multiple users work with the same data is not impossible, but you will run into consistency issues unless you implement complex logic yourself to prevent them.

If you always process all the data in a flat file as a whole, and you do that, for instance, during the night, flat files are fine to work with, as we will see in Chapter 10, Designing and Implementing a Data Lake Using Azure Storage. However, if you need to work with individual pieces of information from within a flat file in real time, you will not be able to do that in an acceptable manner.

At first, smart workarounds were invented to make working with flat files easier and more efficient. There were files such as ISAM files and VSAM files. It is beyond the scope of this book to go into these different file types. More interesting for us is the fact that the problems described in this section led to the introduction of database management systems (DBMSes).

Relational databases

A database is a self-describing collection of related data with the aim of providing information to people and applications.

The first databases appeared in the 1960s. These were hierarchical databases. A little later, network databases were introduced, but neither type offered the flexibility to work with (large amounts of) data in more complex organizations with multiple users.

In the early 1970s, E.F. Codd, an English mathematician working for IBM, came up with a theory of how to create relational databases. He based his theory on mathematical set theory, which describes sets of elements that are potentially very large using a few simple rules (these will be covered later in the chapter). Codd realized that set theory could be applied not only to something abstract, such as all even numbers, but also to real-life collections, such as all our customers. This made set theory usable for the data we were working with and needed to store in databases.

The name relational database stems from the fact that data is stored in tables. For example, take a set of numbers {1, 2, 3, 4}. Then, imagine a second set, for instance, a set of names {Peter, Janneke, Jari, Mats}. We could combine these two sets in a table, as shown in Figure 1.2:

Figure 1.2 – A table of patients

We started with independent sets of values. We created a relation between the two sets by combining them into a table. Making a table with rows is like saying the values 1 and Peter belong together, just as 2 and Janneke do. The sets are no longer independent: a relationship now exists between the values in one column and the values in another column. In other words, the table is the relation between the PatientID set and the PatientName set. Relation here is another word for table.
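The idea can be sketched in a few lines of Python. The values are those used in the text; lists are used instead of sets only so that the row-by-row pairing is well defined, and the variable names are chosen for this example.

```python
from itertools import product

patient_ids = [1, 2, 3, 4]
patient_names = ["Peter", "Janneke", "Jari", "Mats"]

# Pairing the values row by row produces the relation: a set of rows (tuples).
patients = set(zip(patient_ids, patient_names))

# The relation states which values belong together, and which do not.
print((2, "Janneke") in patients)
print((2, "Peter") in patients)

# In set-theory terms, the relation is a subset of all possible combinations.
all_pairs = set(product(patient_ids, patient_names))
print(len(patients), "of", len(all_pairs), "possible pairs")
print(patients <= all_pairs)
```

Of the 16 pairs that could be formed from the two sets, the table keeps only the 4 that are true statements about patients.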

Note

Relational databases store data in tables.

We see something more in Figure 1.2. The first column is called PatientID. The column header also specifies that the data type of the values in this column is int. This means that this column can only store whole numbers. The second column is of the nvarchar(50) type, specifying that it stores alphanumeric values (text) with a maximum length of 50 characters. This metadata is part of the table itself. The data and the metadata are now a whole instead of separately stored pieces of information.

Note

In a database, data and metadata are combined in a single structure.
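A minimal sketch of this idea, using Python's built-in sqlite3 module as a stand-in for a full RDBMS. The table name Patient is an assumption; the column names and types follow Figure 1.2. SQLite records the declared types but, unlike SQL Server, does not enforce the length of 50.

```python
import sqlite3

# An in-memory database, used here only for illustration.
conn = sqlite3.connect(":memory:")

# Column names and data types are declared as part of the table itself.
conn.execute("CREATE TABLE Patient (PatientID int, PatientName nvarchar(50))")
conn.execute("INSERT INTO Patient VALUES (?, ?)", (1, "Peter"))

# The metadata can be asked of the database; no guessing is needed.
# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
columns = [
    (name, col_type)
    for _cid, name, col_type, *_rest in conn.execute("PRAGMA table_info(Patient)")
]
print(columns)

# The data itself is retrieved from the same structure.
rows = conn.execute("SELECT PatientID, PatientName FROM Patient").fetchall()
print(rows)

conn.close()
```

Both the description of the table and its contents come from the same place, which is what the flat file in Figure 1.1 could not offer.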

A relational database is normally more than just one table. In real life, a relational database can consist of thousands of tables. According to the definition of a database, it is a collection of related data. This means that the tables have relationships with one another. Since relationship sounds a lot more like relational than table does, a lot of people came to believe that a relational database got its name from related tables. However, as stated previously, storing data in tables is what makes a database relational.

Relational Database Management System

A Relational Database Management System (RDBMS) is a piece of software that allows you to create and manage databases that adhere to Codd's theory. In practice, an RDBMS is an application that allows you to create tables and then store data in those tables. The "management" part is all the extra "services" you get from an RDBMS, such as securing your data so that only authorized people can work with it. An RDBMS allows you to work with data, from creating tables to storing and managing the data and its accessibility. Examples of well-known RDBMSes include Microsoft SQL Server, Oracle, IBM DB2, MySQL, and MariaDB.

Note

An RDBMS is a database product that follows Codd's rules of the relational model, allowing you to work with and manage all your data.

We previously referred to a couple of problems that we encountered in the past when using CSV files. One was the lack of metadata. Relational databases rectify that problem. The question that remains is how relational databases offer the flexibility and performance that CSV files couldn't. To answer it, we first need to introduce SQL.
