Data Modeling for Azure Data Services

By Peter ter Braake

Overview of this book

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design become paramount to a successful solution. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability, and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse, explore dimensional modeling and data vault modeling, and design and implement a data lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services best fit your model and how to implement the best design for your solution.
Table of Contents (16 chapters)

Section 1 – Operational/OLTP Databases
Section 2 – Analytics with a Data Lake and Data Warehouse
Section 3 – ETL with Azure Data Factory

Types of workload

The performance and quality of a database depend not only on keys and constraints but, to a large extent, on the design of the database: its schema. The chosen schema has a huge impact on how the database functions. Figure 1.7 shows a performance pyramid with the different aspects of a database that matter for its performance. The base of the pyramid is the schema, and that is the main reason for this book:

Figure 1.7 – Performance pyramid

When the schema design is good, you can further optimize a database by applying query best practices. The schema directly influences how complex queries have to be. Complex queries are error-prone, which is bad for the quality of the information derived from the data. They are also expensive for the database engine to execute, resulting in poor performance.
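
As a small illustration of how the schema drives query complexity, compare the two queries below. The table and column names are made up: the first schema crams all phone numbers into one comma-separated column, while the second stores them in a separate, normalized table:

    -- Poor schema: phone numbers crammed into one column force string parsing.
    SELECT CustomerName
    FROM Customer
    WHERE ',' + PhoneNumbers + ',' LIKE '%,555-0100,%';

    -- Normalized schema: a separate PhoneNumber table keeps the query trivial.
    SELECT c.CustomerName
    FROM Customer AS c
    JOIN PhoneNumber AS p ON p.CustomerID = c.CustomerID
    WHERE p.Number = '555-0100';

The first query cannot use an index on the phone number and is easy to get wrong; the second is both simpler and faster.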

Once your schema is correct and your queries follow best practices, you can further optimize the database by creating indexes. We will discuss indexes in Chapter 4, Provisioning and Implementing an Azure SQL DB. Writing good queries and creating indexes can be an iterative process, since each has an impact on the other.
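
As a minimal sketch of such an index (the table and column names are illustrative, not from a real schema), assume an Order table that is frequently filtered on CustomerID:

    -- A nonclustered index to support frequent lookups of a customer's orders.
    -- The INCLUDE columns let the query be answered from the index alone.
    CREATE NONCLUSTERED INDEX IX_Order_CustomerID
    ON [Order] (CustomerID)
    INCLUDE (OrderDate, TotalAmount);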

When queries are executed, you can change isolation levels. Isolation levels determine how locking and blocking work in the database engine, balancing query consistency against performance. Isolation levels are, however, beyond the scope of this book.
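
For reference, T-SQL lets you change the isolation level per session. The following sketch trades consistency for fewer locks; the query itself is purely illustrative:

    -- Allow dirty reads for this session: fewer blocking locks,
    -- at the cost of possibly reading uncommitted data.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    SELECT COUNT(*) FROM [Order];

    -- Return to the SQL Server default.
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;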

Server tuning covers many things. On-premises, it might mean choosing the proper hardware to run SQL Server on. In Azure, it means choosing the right service tier. Higher service tiers provide more resources (memory, storage, compute power) at a higher cost. Solving all your performance issues in this layer of the pyramid is costly. Moreover, solving problems in this layer means fixing the symptoms instead of the actual underlying problem.
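
In Azure SQL Database, changing the service tier can be done with plain T-SQL. A sketch, where the database name and target tier are placeholders:

    -- Scale the database up to the Standard S3 service objective:
    -- more resources at a higher cost.
    ALTER DATABASE WebshopDB MODIFY (SERVICE_OBJECTIVE = 'S3');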

Issues in the lower levels of the pyramid lead to issues in the higher levels. Issues anywhere in the performance pyramid result in the following:

  • High I/O
  • High memory consumption
  • High CPU usage
  • Lots of locks

In the end, you either have a slow database or you are spending far too much money. Preventing that is, again, the purpose behind writing this book.

Because the schema determines what your queries look like and how they perform, the types of queries you need most should drive the design of the schema. This holds for all types of databases, relational and NoSQL alike. You need to know the expected workload when designing a database.

Note

The expected/intended workload of a database is key when designing a database schema and choosing the proper modeling technique.

Databases are created for different types of use cases, which means that different databases are used in different ways. Data Vault, for example, is a modeling technique that optimizes a database for flexible, long-term storage of historical data; you will learn more about it in Chapter 9, Data Vault Modeling. Its focus is on storing data. Most modeling techniques, however, focus on using the data and optimizing that usage, thereby optimizing for the intended workload. There are two types of workload (although, in real life, a lot of databases are a mix of the two):

  • OLTP
  • OLAP

OLTP

OLTP stands for Online Transaction Processing. The word online does not refer to the internet. It refers back to old-fashioned mainframes, where you didn't interact with the system directly: you kept track of changes during the day and processed them all at night in batches of work. With relational databases, people started to interact with the database directly. You could "ask" (write) a query from your console and execute it in real time.

The words Transaction Processing in OLTP refer to the workload. A webshop generates an OLTP workload on its database, as do most Line of Business (LOB) applications. Most applications that are part of a company's primary processes generate OLTP workloads for their database. Besides the aforementioned webshop, think of Customer Relationship Management (CRM) databases, Enterprise Resource Planning (ERP) databases, and databases for financial or human resources applications.

Primary process databases are, most of the time, databases where new data comes into existence. New customer accounts are created, new products are entered, and (hopefully) new orders and invoices are created.

OLTP workloads have two main characteristics:

  • A lot of small queries are executed.
  • A lot of new rows are added to the database and existing rows need to be updated on a regular basis.

A lot of small queries are executed

Let's take the webshop as an example again. You search for a specific product and get a list of options. You then select one of the options that looks interesting. The screen now shows a lot of descriptive properties of that product; most of the information is about that one product. You might click on an alternative to see all of its properties. Every time you click on a product, you get all the columns of a single row. The entire Product table of a webshop might contain thousands of rows, but you are only interested in a single one at a time, or, at most, a couple when comparing alternatives.

If it is a large webshop, multiple potential customers will be browsing through the products in the same manner. This leads to a lot of queries that each retrieve a single row, or a couple of rows at most; in other words, lots of small queries. The database needs to be fast for this type of query in order to keep the website responsive.
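
A typical OLTP point query of this kind might look as follows; the schema is hypothetical:

    -- Fetch all the columns of a single product. A (primary key) index on
    -- ProductID keeps this fast no matter how large the table grows.
    SELECT *
    FROM Product
    WHERE ProductID = 4711;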

A lot of new rows are added to the database and existing rows need to be updated on a regular basis

In the end, the webshop needs you to place orders. Maybe you need to create an account first, which means a new row has to be entered in the Customer table. An invoice address, and perhaps a different shipping address, may have to be entered as well, and all this just to place a single new order that might itself consist of multiple order lines.

So, in addition to lots of small queries, the database needs to handle lots of writes. Most writes insert a single row or, at most, a couple of rows at a time. Again, you need your database to handle these writes quickly. Performance may matter slightly less for writes, but scalability is an issue too: will the database be fast enough to handle lots of concurrent customers as your webshop grows in popularity?
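
Placing an order might translate into a handful of single-row inserts wrapped in one transaction, sketched here with hypothetical tables and key values:

    -- Create the order and its order lines atomically.
    BEGIN TRANSACTION;

    INSERT INTO [Order] (OrderID, CustomerID, OrderDate)
    VALUES (1001, 42, SYSDATETIME());

    INSERT INTO OrderLine (OrderID, ProductID, Quantity)
    VALUES (1001, 4711, 2),
           (1001, 4712, 1);

    COMMIT TRANSACTION;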

Databases that focus on OLTP workloads should be normalized. You will learn how to normalize a database in Chapter 3, Normalizing Data.
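
As a preview of what normalization produces, a fragment of a normalized webshop schema might look like this (illustrative names and data types):

    -- Each entity gets its own table; orders reference customers.
    CREATE TABLE Customer (
        CustomerID int PRIMARY KEY,
        Name       nvarchar(100) NOT NULL
    );

    CREATE TABLE [Order] (
        OrderID    int PRIMARY KEY,
        CustomerID int NOT NULL REFERENCES Customer (CustomerID),
        OrderDate  datetime2 NOT NULL
    );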

OLAP

OLAP stands for Online Analytical Processing. Analytical processing means that we use the data for data analysis or analytics. Perhaps we are trying to answer questions such as: what type of customers buy what type of products? Which colors are most popular for my products? What else might a customer buy when they buy a certain product? We use the answer to that last question to build recommender systems. Most webshops use recommender systems nowadays, displaying suggestions such as "People who bought this were also interested in…".

Another example might be straightforward sales reports that show how well we are doing from a sales perspective or an analysis of sick leave among employees.

All these types of queries are considered to be OLAP queries.

OLAP workloads also have two main characteristics:

  • The workload is (almost) read-only.
  • Most queries use large datasets.

Read-only

OLAP is about reporting and analysis and is, most of the time, not part of the primary process. With OLAP, we don't create new data; we analyze the existing data that comes from the primary processes. That means we read the data; we do not change it or add to it. More precisely, a process writes new data to the OLAP database, and users then only read from it.

Large datasets

Especially for something such as a recommender system, you need a lot of data. You analyze all orders from the last couple of months (or even years) to find products that are often purchased together. You are not interested in individual orders or individual products; you need a lot of orders to ascertain whether certain products are often purchased together. To do this, you read lots of rows from the database.
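
A hedged sketch of such a basket analysis, counting how often two products appear on the same order (again using a hypothetical OrderLine table):

    -- Count pairs of products that occur on the same order.
    SELECT a.ProductID AS ProductA,
           b.ProductID AS ProductB,
           COUNT(*)    AS TimesBoughtTogether
    FROM OrderLine AS a
    JOIN OrderLine AS b
      ON a.OrderID = b.OrderID
     AND a.ProductID < b.ProductID   -- avoid self-pairs and duplicates
    GROUP BY a.ProductID, b.ProductID
    ORDER BY TimesBoughtTogether DESC;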

A simple sales report showing sales by month and the growth in sales over time also requires lots of rows from the database. A sales report with year-to-date figures and growth compared to last year requires at least 2 years' worth of orders. That is a large dataset to read from the database.
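
Such a report could be driven by an aggregate query like the following sketch (hypothetical schema; UnitPrice is assumed to live on the order line):

    -- Aggregate two years of orders into monthly revenue totals.
    SELECT YEAR(o.OrderDate)  AS SalesYear,
           MONTH(o.OrderDate) AS SalesMonth,
           SUM(ol.Quantity * ol.UnitPrice) AS Revenue
    FROM [Order] AS o
    JOIN OrderLine AS ol ON ol.OrderID = o.OrderID
    WHERE o.OrderDate >= DATEADD(year, -2, SYSDATETIME())
    GROUP BY YEAR(o.OrderDate), MONTH(o.OrderDate)
    ORDER BY SalesYear, SalesMonth;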

The report should be quick to display, especially when salespeople look at the report multiple times per day. They don't like to wait for a minute to see their report. This means that the database should be fast in retrieving these large datasets.

Databases that are OLAP-oriented often benefit from dimensional modeling. You will learn about dimensional modeling in Chapter 7, Dimensional Modeling.