Data Modeling for Azure Data Services

By : Peter ter Braake

Data Modeling for Azure Data Services

By: Peter ter Braake

Overview of this book

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design becomes paramount to successful implementation. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse and explore dimensional modeling, data vault modeling, along with designing and implementing a Data Lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services are the best fit for your model and how to implement the best design for your solution.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Share Your Thoughts

Section 1 – Operational/OLTP Databases

Free Chapter

Chapter 1: Introduction to Databases

Overview of relational databases

Introduction to Structured Query Language

Impact of intended usage patterns on database design

Understanding relational theory

Keys

Types of workload

Summary

Chapter 2: Entity Analysis

Scope

Understanding entity relationship diagrams

Entities

Relationships

Creating your first ERD

Context of an ERD

Summary

Exercises

Chapter 3: Normalizing Data

When to use normalization as a design strategy

Preventing redundancy

The normalization steps

An alternative approach to normalizing data

Integrating separate results

Entity relationship diagram

Summary

Exercises

Chapter 4: Provisioning and Implementing an Azure SQL DB

Technical requirements

Understanding SQL Server data types

Quantifying the data model

Provisioning an Azure SQL database

Connecting to the database

Data definition language

Inserting data

Indexing

Summary

Chapter 5: Designing a NoSQL Database

Understanding big data

Understanding big data clusters

Getting to know Cosmos DB

Key-value databases

Other NoSQL databases

Extra considerations

Summary

Exercise

Chapter 6: Provisioning and Implementing an Azure Cosmos DB Database

Technical requirements

Provisioning a Cosmos DB database

Creating a container

Uploading documents to a container

Cosmos DB container settings

Importing data using the Azure Cosmos DB Data Migration tool

Summary

Section 2 – Analytics with a Data Lake and Data Warehouse

Chapter 7: Dimensional Modeling

Background to dimensional modeling

Understanding dimensional modeling

Steps in dimensional modeling

Designing dimensions

Designing fact tables

Using a Kimball data warehouse versus data marts

Summary

Exercise

Chapter 8: Provisioning and Implementing an Azure Synapse SQL Pool

Overview of Synapse Analytics

Provisioning a Synapse Analytics workspace

Creating a dedicated SQL pool

Implementing tables in Synapse SQL pools

Understanding workload management

Using PolyBase to load data

Connecting to and using a dedicated SQL pool

Summary

Chapter 9: Data Vault Modeling

Background to Data Vault modeling

Designing Hub tables

Designing Link tables

Designing Satellite tables

Using hash keys

Designing a Data Vault structure

Designing business vaults

Implementing a Data Vault

Summary

Exercise

Chapter 10: Designing and Implementing a Data Lake Using Azure Storage

Technical requirements

Background of data lakes

Modeling a data lake

Using different file formats

Choosing the proper file size

Provisioning an Azure storage account

Creating a data lake filesystem

Creating multiple storage accounts

Summary

Section 3 – ETL with Azure Data Factory

Chapter 11: Implementing ETL Using Azure Data Factory

Technical requirements

Introducing Azure Data Factory

Introducing the main components of Azure Data Factory

Using the copy activity

Implementing a data flow

Executing SQL code from Data Factory

Summary

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Customer Reviews

5 star

4 star

3 star

2 star

1 star

Introduction to Structured Query Language

Structured Query Language (SQL) is the language of all relational databases. You use SQL to read data from the database and to manipulate existing data. Creating a database or creating tables within the database and securing your data is also done using SQL. Database developers might use a tool to graphically create databases or tables, so you don't need to write SQL code yourself. The tool will generate the SQL code for you because that is all the actual database engine understands.

Different categories of SQL

SQL consists of three main categories:

DCL – Data Control Language
DDL – Data Definition Language
DML – Data Manipulation Language

Data Control Language

DCL is the part of SQL that helps to secure the data. DCL comprises three statements:

GRANT
DENY
REVOKE

Even though securing data is very important in any data solution that you build, DCL is outside the scope of this book. You should check out the TutorialRide website to learn more about this: https://www.tutorialride.com/dbms/sql-data-control-language-dcl.htm.

Data Definition Language

With DDL statements, you create databases themselves and objects within databases. As with DCL, there are three statements:

CREATE
ALTER
DROP

With CREATE TABLE, you can make (create) a new table. Whenever the table structure needs to change, for instance, you want to add a column to an existing table, you use UPDATE TABLE. With DROP TABLE, you completely remove a table and all its content. You will learn about these statements in Chapter 4, Provisioning and Implementing an Azure SQL DB.

Data Manipulation Language

DML is the part of SQL that is used for working with the actual data stored in the tables in the database. DML has four statements:

SELECT
INSERT
UPDATE
DELETE

The SELECT statement lets you read data from tables. Some people believe that SELECT should be a category of its own: DQL or Data Query Language. The other three statements are self-explanatory: INSERT adds new rows to an existing table, UPDATE changes the values of already existing rows in a table, and DELETE removes rows from a table.

This book is not about SQL. There are a lot of tutorials on the internet on SQL. If you need to familiarize yourself with T-SQL (the dialect of SQL Server that we will use throughout this book), I strongly recommend the books of Itzik Ben-Gan. SQL is also used a lot in NoSQL databases and is still the basis for every data professional.

Understanding the database schema

With relational databases, the first step is to create a table using CREATE TABLE. While creating the table, you specify the name of the new table. You also specify all the columns of that table by adding a column name and the data type of the column. This means you start by creating the metadata. All the metadata combined is referred to as the database schema.

Often, people merely mean the table structure, the tables, and their relationships when they use the term schema. In SQL Server, there is even an object called a schema that helps in establishing a good security strategy for your database.

Note

The schema of a database refers to all tables and their relationships.

Once you have created tables, you can start loading data into the table using the INSERT statement. With each row you enter, you provide values for all the columns in the table. The values you enter should follow the rules defined in the table definition. With what we have learned so far, this means that the values should be of the correct data type. You cannot enter text in a numerical column. You will see shortly that there are further constraints. The process of creating a table first and then entering data means that whatever data we add (for example, INSERT) has to adhere to that structure.

After you have entered data into the database, you can start working with the data. With the SELECT statement, you can read data from the database. The database uses the existing metadata while retrieving data from the database. For instance, you could write the following SELECT statement:

SELECT
     PostalCode
FROM
     Persons
WHERE
     Name = 'Janneke';

Notice that this statement is the same example as described in the section about files. In the preceding snippet, we read the PostalCode column from the Persons table for the row that has 'Janneke' as the value in the Name column. The database uses the metadata, in this case, the table name and column names, to retrieve the data and, where possible, to optimize the query. By using metadata, it doesn't matter whether the PostalCode field is the second or the third column. Using metadata makes querying the data more flexible.

In addition to the flexibility we gained by using a table over a flat file, there is no step such as repeat this for each row, as we saw in the section on files. A table in a relational database implicitly works with all rows. That is called working set-based. In relational databases, you don't work with individual rows but always with sets. Depending on the filters you provide (the WHERE clause of the SQL statement), your set might contain just one row or it might contain multiple rows, or it can even be an empty set with no rows at all. The database engine can optimize the query for fast query response times. The database engine uses its knowledge of the metadata. It can also take advantage of extra structures that you may define in the database, such as indexes. You will learn about indexes in Chapter 4, Provisioning and Implementing an Azure SQL DB.

Note

Database systems take advantage of all the metadata defined in the schema to optimize queries in order to obtain good query performance even in the case of large datasets.

Now that we have learned why (relational) databases work better for storing data than CSV files, let's look at the different use cases of databases and the impact this has on database design.

Data Modeling for Azure Data Services

By : Peter ter Braake

Data Modeling for Azure Data Services

By: Peter ter Braake

Overview of this book

Related Content you might be interested in

Current Title:

Data Modeling for Azure Data Services

Cloud Scale Analytics with Azure Data Services

Limitless Analytics with Azure Synapse

Azure Synapse Analytics Cookbook

Introduction to Structured Query Language

Different categories of SQL

Data Control Language

Data Definition Language

Data Manipulation Language

Understanding the database schema