Data Modeling for Azure Data Services

By Peter ter Braake

Overview of this book

Data is at the heart of all applications and forms the foundation of modern data-driven businesses. With the multitude of data-related use cases and the availability of different data services, choosing the right service and implementing the right design become paramount to a successful solution. Data Modeling for Azure Data Services starts with an introduction to databases, entity analysis, and normalizing data. The book then shows you how to design a NoSQL database for optimal performance and scalability, and covers how to provision and implement Azure SQL DB, Azure Cosmos DB, and Azure Synapse SQL Pool. As you progress through the chapters, you'll learn about data analytics, Azure Data Lake, and Azure SQL Data Warehouse, explore dimensional modeling and data vault modeling, and design and implement a data lake using Azure Storage. You'll also learn how to implement ETL with Azure Data Factory. By the end of this book, you'll have a solid understanding of which Azure data services best fit your model and how to implement the best design for your solution.
Table of Contents (16 chapters)

Section 1 – Operational/OLTP Databases
Section 2 – Analytics with a Data Lake and Data Warehouse
Section 3 – ETL with Azure Data Factory

Types of workload

The performance and quality of a database depend not only on keys and constraints but, to a large extent, on the design of the database: its schema. The chosen schema has a huge impact on how the database functions. Figure 1.7 shows a performance pyramid with the different aspects of a database that matter for its performance. The base of the pyramid is the schema, and that is the main reason for this book:

Figure 1.7 – Performance pyramid

When the schema design is good, you can further optimize a database by applying query best practices. The schema directly influences how complex queries have to be. Complex queries are error-prone, which is bad for the quality of the information derived from the data. They are also expensive for the database engine to execute, resulting in poor performance.
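
As a small illustration of how the schema drives query complexity, compare the two queries below. The table and column names are made up: the first schema crams all phone numbers into one comma-separated column, while the second stores them in a separate, normalized table:

    -- Poor schema: phone numbers crammed into one column force string parsing.
    SELECT CustomerName
    FROM Customer
    WHERE ',' + PhoneNumbers + ',' LIKE '%,555-0100,%';

    -- Normalized schema: a separate PhoneNumber table keeps the query trivial.
    SELECT c.CustomerName
    FROM Customer AS c
    JOIN PhoneNumber AS p ON p.CustomerID = c.CustomerID
    WHERE p.Number = '555-0100';

The first query cannot use an index on the phone number and is easy to get wrong; the second is both simpler and faster.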

Once your schema is correct and your queries follow best practices, you can further optimize the database by creating indexes. We will discuss indexes in Chapter 4, Provisioning and Implementing an Azure SQL DB. Writing good queries and creating indexes can be an iterative process, since each has an impact on the other.
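
As a minimal sketch of such an index (the table and column names are illustrative, not from a real schema), assume an Order table that is frequently filtered on CustomerID:

    -- A nonclustered index to support frequent lookups of a customer's orders.
    -- The INCLUDE columns let the query be answered from the index alone.
    CREATE NONCLUSTERED INDEX IX_Order_CustomerID
    ON [Order] (CustomerID)
    INCLUDE (OrderDate, TotalAmount);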

When queries are executed, you can change isolation levels. Isolation levels determine how locking and blocking work in the database engine, balancing query consistency against performance. Isolation levels are, however, beyond the scope of this book.
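
For reference, T-SQL lets you change the isolation level per session. The following sketch trades consistency for fewer locks; the query itself is purely illustrative:

    -- Allow dirty reads for this session: fewer blocking locks,
    -- at the cost of possibly reading uncommitted data.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    SELECT COUNT(*) FROM [Order];

    -- Return to the SQL Server default.
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;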

Server tuning covers many things. On-premises, it might mean choosing the proper hardware to run SQL Server on. In Azure, it means choosing the right service tier. Higher service tiers provide more resources (memory, storage, compute power) at a higher cost. Solving all your performance issues in this layer of the pyramid is costly. Moreover, solving problems in this layer means fixing the symptoms instead of the actual underlying problem.
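
In Azure SQL Database, changing the service tier can be done with plain T-SQL. A sketch, where the database name and target tier are placeholders:

    -- Scale the database up to the Standard S3 service objective:
    -- more resources at a higher cost.
    ALTER DATABASE WebshopDB MODIFY (SERVICE_OBJECTIVE = 'S3');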

Issues in the lower levels of the pyramid lead to issues in the higher levels. Issues anywhere in the performance pyramid result in the following:

  • High I/O
  • High memory consumption
  • High CPU usage
  • Lots of locks

In the end, you either have a slow database or you are spending far too much money. Preventing that is, again, the purpose behind writing this book.

Because the schema determines what your queries look like and how they perform, the types of queries you need most should drive the design of the schema. This holds for all types of databases, relational and NoSQL alike. You need to know the expected workload when designing a database.

Note

The expected/intended workload of a database is key when designing a database schema and choosing the proper modeling technique.

Databases are created for different types of use cases, which means that different databases are used in different ways. Data Vault, for example, is a modeling technique that optimizes a database for flexible, long-term storage of historical data; you will learn more about it in Chapter 9, Data Vault Modeling. Its focus is on storing data. Most modeling techniques, however, focus on using the data and optimizing that usage, thereby optimizing for the intended workload. There are two types of workload (although, in real life, a lot of databases are a mix of the two):

  • OLTP
  • OLAP

OLTP

OLTP stands for Online Transaction Processing. The word online does not refer to the internet. It refers back to old-fashioned mainframes, where you didn't interact with the system directly: you kept track of changes during the day and processed them all at night in batches of work. With relational databases, people started to interact with the database directly. You could "ask" (write) a query from your console and execute it in real time.

The words Transaction Processing in OLTP refer to the workload. A webshop generates an OLTP workload on its database, as do most Line of Business (LOB) applications. Most applications that are part of a company's primary processes generate OLTP workloads for their database. Besides the aforementioned webshop, think of Customer Relationship Management (CRM) databases, Enterprise Resource Planning (ERP) databases, and databases for financial or human resources applications.

Primary process databases are, most of the time, databases where new data comes into existence. New customer accounts are created, new products are entered, and (hopefully) new orders and invoices are created.

OLTP workloads have two main characteristics:

  • A lot of small queries are executed.
  • A lot of new rows are added to the database and existing rows need to be updated on a regular basis.

A lot of small queries are executed

Let's take the webshop as an example again. You search for a specific product and get a list of options. You then select one of the options that looks interesting. The screen now shows a lot of descriptive properties of that product; most of the information is about that one product. You might click on an alternative to see all of its properties. Every time you click on a product, you get all the columns of a single row. The entire Product table of a webshop might contain thousands of rows, but you are only interested in a single one at a time, or, at most, a couple when comparing alternatives.

If it is a large webshop, multiple potential customers will be browsing through the products in the same manner. This leads to a lot of queries that each retrieve a single row, or a couple of rows at most; in other words, lots of small queries. The database needs to be fast for this type of query in order to keep the website responsive.
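
A typical OLTP point query of this kind might look as follows; the schema is hypothetical:

    -- Fetch all the columns of a single product. A (primary key) index on
    -- ProductID keeps this fast no matter how large the table grows.
    SELECT *
    FROM Product
    WHERE ProductID = 4711;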

A lot of new rows are added to the database and existing rows need to be updated on a regular basis

In the end, the webshop needs you to place orders. Maybe you need to create an account first, which means a new row has to be entered in the Customer table. An invoice address, and perhaps a different shipping address, may have to be entered as well, and all this just to place a single new order that might itself consist of multiple order lines.

So, in addition to lots of small queries, the database needs to handle lots of writes. Most writes insert a single row or, at most, a couple of rows at a time. Again, you need your database to handle these writes quickly. Performance may matter slightly less for writes, but scalability is an issue too: will the database be fast enough to handle lots of concurrent customers as your webshop grows in popularity?
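
Placing an order might translate into a handful of single-row inserts wrapped in one transaction, sketched here with hypothetical tables and key values:

    -- Create the order and its order lines atomically.
    BEGIN TRANSACTION;

    INSERT INTO [Order] (OrderID, CustomerID, OrderDate)
    VALUES (1001, 42, SYSDATETIME());

    INSERT INTO OrderLine (OrderID, ProductID, Quantity)
    VALUES (1001, 4711, 2),
           (1001, 4712, 1);

    COMMIT TRANSACTION;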

Databases that focus on OLTP workloads should be normalized. You will learn how to normalize a database in Chapter 3, Normalizing Data.
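
As a preview of what normalization produces, a fragment of a normalized webshop schema might look like this (illustrative names and data types):

    -- Each entity gets its own table; orders reference customers.
    CREATE TABLE Customer (
        CustomerID int PRIMARY KEY,
        Name       nvarchar(100) NOT NULL
    );

    CREATE TABLE [Order] (
        OrderID    int PRIMARY KEY,
        CustomerID int NOT NULL REFERENCES Customer (CustomerID),
        OrderDate  datetime2 NOT NULL
    );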

OLAP

OLAP stands for Online Analytical Processing. Analytical processing means that we use the data for data analysis or analytics. Perhaps we are trying to answer questions such as: what type of customers buy what type of products? Which colors are most popular for my products? What else might a customer buy when they buy a certain product? We use the answer to that last question to build recommender systems. Most webshops use recommender systems nowadays, displaying suggestions such as "People who bought this were also interested in…".

Another example might be straightforward sales reports that show how well we are doing from a sales perspective or an analysis of sick leave among employees.

All these types of queries are considered to be OLAP queries.

OLAP workloads also have two main characteristics:

  • The workload is (almost) read-only.
  • Most queries use large datasets.

Read-only

OLAP is about reporting and analysis and is, most of the time, not part of the primary process. With OLAP, we don't create new data; we analyze the existing data that comes from the primary processes. That means we read the data; we do not change it or add to it. More precisely, a process writes new data to the OLAP database, and users then only read from it.

Large datasets

Especially for something such as a recommender system, you need a lot of data. You analyze all orders from the last couple of months (or even years) to find products that are often purchased together. You are not interested in individual orders or individual products; you need a lot of orders to ascertain whether certain products are often purchased together. To do this, you read lots of rows from the database.
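
A hedged sketch of such a basket analysis, counting how often two products appear on the same order (again using a hypothetical OrderLine table):

    -- Count pairs of products that occur on the same order.
    SELECT a.ProductID AS ProductA,
           b.ProductID AS ProductB,
           COUNT(*)    AS TimesBoughtTogether
    FROM OrderLine AS a
    JOIN OrderLine AS b
      ON a.OrderID = b.OrderID
     AND a.ProductID < b.ProductID   -- avoid self-pairs and duplicates
    GROUP BY a.ProductID, b.ProductID
    ORDER BY TimesBoughtTogether DESC;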

A simple sales report showing sales by month and the growth in sales over time also requires lots of rows from the database. A sales report with year-to-date figures and growth compared to last year requires at least 2 years' worth of orders. That is a large dataset to read from the database.
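
Such a report could be driven by an aggregate query like the following sketch (hypothetical schema; UnitPrice is assumed to live on the order line):

    -- Aggregate two years of orders into monthly revenue totals.
    SELECT YEAR(o.OrderDate)  AS SalesYear,
           MONTH(o.OrderDate) AS SalesMonth,
           SUM(ol.Quantity * ol.UnitPrice) AS Revenue
    FROM [Order] AS o
    JOIN OrderLine AS ol ON ol.OrderID = o.OrderID
    WHERE o.OrderDate >= DATEADD(year, -2, SYSDATETIME())
    GROUP BY YEAR(o.OrderDate), MONTH(o.OrderDate)
    ORDER BY SalesYear, SalesMonth;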

The report should be quick to display, especially when salespeople look at the report multiple times per day. They don't like to wait for a minute to see their report. This means that the database should be fast in retrieving these large datasets.

Databases that are OLAP-oriented often benefit from dimensional modeling. You will learn about dimensional modeling in Chapter 7, Dimensional Modeling.