Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying Data Engineering with dbt
  • Table Of Contents Toc
  • Feedback & Rating feedback
Data Engineering with dbt

Data Engineering with dbt

By : Zagni
4.6 (9)
close
close
Data Engineering with dbt

Data Engineering with dbt

4.6 (9)
By: Zagni

Overview of this book

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps. This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work. By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.
Table of Contents (21 chapters)
close
close
1
Part 1: The Foundations of Data Engineering
7
Part 2: Agile Data Engineering with dbt
14
Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms

Advanced – introducing window functions

A window function is a SQL function that operates on a set of rows, called a window of rows.

These functions are also called analytic functions, as they are mostly used in analytic queries.

A window is a set of correlated rows, containing any amount from 0 to all the rows resulting from a query.

Generally, the windows divide the resulting rows into disjointed partitions.

Some window functions can also operate on a window frame, which is a dynamic subset of the rows defined by the window, based on the current row.

You can define a cumulative window frame, taking the rows of the window before or after the current row, or a sliding window frame, going from some rows before the current row to some rows after the current row.

Window functions have a lot in common with aggregated functions, the ones that use GROUP BY, as they both divide the results of a query into groups, that is, partitions, and then calculate the function on the rows in each group.

The big difference is that with GROUP BY, you get only one row per group, while with window functions, all the rows are returned, with the function calculated on the desired subset of rows.

Let’s show some examples to make the concept more real.

Assume we have a table with orders, and we would like to know how much each order contributes to the total by day or month. We can calculate the percentage with the following formula:

monthly_pct = order_total / monthly_total * 100

Using group by, you could calculate the order total for each day or month, in separate queries, and then you would need to join those results with the original table to be able to apply the formula.

Window functions are here to help, as you can calculate the daily or monthly measures in a single step by defining a window that puts the rows together in different ways, as in the following query:

SELECT
  O_ORDERKEY,
  O_CUSTKEY,
  O_ORDERDATE,
  O_TOTALPRICE,
  avg(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_avg,
  sum(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_total,
  sum(O_TOTALPRICE) over(partition by
          DATE_TRUNC(MONTH, O_ORDERDATE)) as monthly_total,
  O_TOTALPRICE / daily_avg * 100 as avg_pct,
  O_TOTALPRICE / daily_total * 100 as day_pct,
  O_TOTALPRICE / monthly_total * 100 as month_pct
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
QUALIFY row_number() over(partition by O_ORDERDATE
                          order by O_TOTALPRICE DESC) <= 5
order by O_ORDERDATE, O_TOTALPRICE desc;

This example uses window functions to calculate totals and averages over different periods and also to limit the returned rows to the five biggest orders for each day.

This query is pretty efficient, executing in a few seconds even after processing million of rows. By changing the schema from TPCH_SF1 to TPCH_SF10, TPCH_SF100, or TPCH_SF1000, you can try out bigger tables in the Snowflake sample data, to see how it works with bigger datasets.

Window definition

A window is defined using the OVER clause, which specifies how to partition and order the rows of the incoming data to apply the window function.

To use a window function, you use this syntax:

<function_name> ([<arguments>])
    OVER ( [PARTITION BY <part_expr>] [ORDER BY <order_expr>])

The presence of the OVER clause specifies that the function is applied to the windows defined inside the OVER clause. Of course, the function can also take its own arguments.

The PARTITION BY clause defines how the rows are partitioned and <part_expr> is a comma-separated list of columns or expressions calculated on the query data, in a similar way to what we have seen for the group by expressions and shown in the previous example.

The ORDER BY clause specifies the ordering of the rows to be used to calculate the function and it is not visible outside of the window function. The order by clause and <order_expr> work pretty much as they do in the SELECT statement to order the results of the query.

At the syntax level, both the PARTITION BY and ORDER BY clauses are optional, even if some functions require or forbid one of the two. Look at each function’s documentation for more details.

The ORDER BY clause has an effect on the order-sensitive window function, which comes in two types:

  • Rank-related functions: These functions calculate values based on the rank of the row, which is the order of the row in the window. They are used to make explicit, that is, crystallized into a value that can be used later, one of many possible orderings of a row in a window.

One example is the ordering of teams in a championship based on points or medals in a sport: you can have different styles of ordering based on what is decided when you have the same value.

As an example, if two teams (A and B) have the same score, after another team (C) with an higher score, you can say that A and B are both second or you can say that one is second and the other one third.

  • Window frame functions: These functions work on a window frame, which is a dynamic subset of rows of a window. As we saw previously, they can use cumulative or sliding window frames. The next section will delve into these.

Window frame definition

Some functions can use a dynamic window frame instead of a fixed window.

You specify the window frame inside the OVER clause, after the ORDER BY clause of the window:

<function_name> ([<arguments>])
    OVER ( [PARTITION BY <part_expr>]
           ORDER BY <order_expr>
           <cumulative_frame_def> | <sliding_frame_def>
         )

A cumulative window frame is a subset of the window that is either growing or shrinking based on the current row inside the window. The frame contains the rows of the window either before or after the current row. The syntax of the cumulative frame definition is the two following lines:

  {ROWS | RANGE} BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
| {ROWS | RANGE} BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

The first expression takes all the rows in the partition or all the rows in the same order range up to the current one, while the second takes from the current row to the end of the partition or range. A better description of ROWS and RANGE follows.

A sliding window frame is a subset of rows that extends from a number of N rows before the current row to a number of M rows after the current row. One or both sides can be fixed to the start or end of the partition.

The following are the three possible forms of the syntax:

  ROWS BETWEEN <N> { PRECEDING | FOLLOWING }
       AND <M> { PRECEDING | FOLLOWING }
| ROWS BETWEEN UNBOUNDED PRECEDING AND <M> { PRECEDING | FOLLOWING }
| ROWS BETWEEN <N> { PRECEDING | FOLLOWING } AND UNBOUNDED FOLLOWING

The first syntax is when you want to create a frame that extends from N rows before to M rows after the current rows, while the other two are the cases when one of the two extremities is fixed to be the start or end of the window.

ROWS/RANGE in a cumulative window definition

When specifying a cumulative window frame, we can use either ROWS or RANGE, but what is the difference?

  • ROWS: Indicates that the fixed side of the frame extends to the limit of the partition: from the start of the partition to the current row or from the current row to the end of the partition.
  • RANGE: Indicates that the fixed side of the frame extends to the previous/following rows that have the same value for the ORDER BY value. The frame operates on the preceding/following rows that are in the same position in the ordering. As an example, you can think of stepping through many teams that have the same number of points in a championship list.

Important note

Providing extensive information on window functions is beyond the scope of this introduction to SQL used in analytical operations. It is also one area where SQL is not always the same in all databases. We suggest searching for “window functions” in the documentation of your database.

Visually different images
CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
Data Engineering with dbt
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist download Download options font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon