-
Book Overview & Buying
-
Table Of Contents
-
Feedback & Rating
Data Engineering with dbt
By :
The SQL language is certainly big and complex, taking time to be fully mastered, but surprisingly, you can get productive with the limited set of features that we will introduce in this book, and you will pick up the nuances when you’ll need them while working on real use cases.
The goal of this chapter is not to make you a SQL expert but to get you started with it, getting deeper into the few commands that we will use often to make you quickly productive with dbt.
In this intro, we start with the most basic concepts of SQL and then provide an overview of the SQL commands by their categories of use. In the rest of this chapter, we will dig deeper into the commands that are mostly used when working with dbt: SELECT, JOIN, and analytical and window functions.
The main goal of SQL, as its name implies, is to allow users to query data that is contained in a database; SQL also provides all the commands to fully manage the database, allowing you to add, transform, delete, organize, and manage data and other database objects, such as users and roles.
The core concepts in SQL come from how an RDB is organized:
You can see a database with schemata and tables in the following screenshot, which shows part of the sample database available in any Snowflake account:
Figure 1.1: Example database with schemata and tables
Let’s go through the core concepts of SQL, starting with the table.
The table is the most central concept in SQL, as it is the object that contains data.
A table in SQL is very close to the layman’s concept of a table, with data organized in columns and rows.
Columns represent the attributes that can be stored in a table, such as a customer’s name or the currency of an order, with each column defining the type of data it can store, such as text, a number, or a date.
In a table, you can have as many rows as you want, and you can keep adding more whenever you need. Each row stores in the table one instance of the concept that the table represents, such as a specific order in the following example of a table for orders:
|
Order_ID |
Customer_CODE |
Total_amount |
Currency |
|
123 |
ABC |
100 |
EUR |
|
166 |
C099 |
125,15 |
USD |
Table 1.1: Example order table
Looking at the previous example order table, we can see that the table has four columns that allow storing the four attributes for each row (order ID, customer code, total amount, and currency of the order). The table, as represented, has two rows of data, representing one order each.
The first row, in bold, is not data but is just a header to represent the column names and make the table easier to read when printed out.
In SQL, a table is both the definition of its content (columns and their types) and the content itself (the data, organized by rows):
null if no value is provided.null is a special value that corresponds to the absence of data and is compatible with all data types.When creating a table, we must provide a name for the table and its definition, which consists of at least column names and a type for each column; the data can be added at a different time.
In the following code block, we have a sample definition for an order table:
CREATE TABLE ORDERS ( ORDER_ID NUMBER, CUSTOMER_CODE TEXT, TOTAL_AMOUNT FLOAT, ORDER_DATE DATE, CURRENCY TEXT DEFAULT 'EUR' );
Tip
When we write some SQL, we will use Snowflake’s SQL syntax and commands. We will guide you on how to create a free Snowflake account where you can run the code shown here.
In this example, we see that the command to create a table reads pretty much like English and provides the name of the table and a list of columns with the type of data each column is going to contain.
You can see that for the CURRENCY column, of type text, this code also provides the default value EUR. Single quotes are used in SQL to delimit a piece of text, aka a string.
If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view like a table, but with the column definition and data both coming from a query, reading from one or more tables.
As an example, if you would like to have a shortlist of orders with amount greater than 1,000 you could write the following query to create a BIG_ORDERS view:
CREATE VIEW BIG_ORDERS AS SELECT * FROM ORDERS WHERE TOTAL_AMOUNT > 1000;
In this example, we see that this simple create view statement provides the name for the view and uses a query, which is a SELECT statement, to define what data is made available by the view.
The query provides both the data, all the orders with a total amount greater than 1,000, and the column definitions. The * character – called star – is a shortcut for all columns in the tables read by the SELECT statement.
This is, of course, a naïve example, but throughout this book, you will see that combining tables that store data and views that filter and transform the data coming from tables and views is the bread and butter of working with dbt. Building one object on top of the previous allows us to take raw data as input and provide as output refined information that our users can easily access and understand.
Tip
When working with dbt, you will not need to write create table or create view statements, as dbt will create them for us. It is nevertheless good to get familiar with these basic SQL commands as these are the commands executed in the database and you will see them if you look in the logs.
We have seen that a database is a container for tables and views.
A DB can be further divided and organized using schema objects.
In real-life DBs, the number of tables can range from a few units to many thousands. Schemata act pretty much like folders, but cannot be nested, so you can always identify a table by its database, schema, and name.
In the following screenshot, we see part of the contents of the SNOWFLAKE_SAMPLE_DATA database, which is available in all Snowflake accounts:
Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database
The database and schemata, as table containers, are the main ways information is organized, but they are also used to apply security, access limitations, and other features in a hierarchical way, simplifying the management of big systems.
To create the TEST database and the SOME_DATA schema in it, we can use the following commands:
CREATE DATABASE TEST; CREATE SCHEMA TEST.SOME_DATA;
The database.schema notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and after its creation, uniquely identifies the schema.
Tip
While working with dbt, you will create a database or use an existing one for your project; dbt will create the required schema objects for you if you have not created them already.
A best practice in Snowflake is to have one database for each project or set of data that you want to keep separate for administration purposes. Databases and schemata in Snowflake are soft boundaries, as all the data from all the databases and schemata can be accessed if the user has the appropriate privileges.
In some other database systems, such as PostgreSQL, a database is a stronger boundary.
To control access to your data in SQL, you GRANT access and other privileges to both users and roles.
A user represent one individual user or service that can access the database, while a role represents a named entity that can be granted privileges.
A role can be granted to a user, providing them all the privileges associated with the role.
A role can also be granted to another role, building hierarchies that use simple basic roles to build more complex roles that are assigned to a user.
Using roles instead of granting privileges directly to users allows you to manage even a large number of users simply and consistently. In this case, roles are labels for a set of privileges and become a way to manage groups of users that we want to grant the same privileges. Changing the privileges granted to a role at any moment will change the privileges that the users receive from that role.
A typical pattern when working with dbt is to create a role for the dbt users and then assign it to the developers and the service user that the dbt program will use.
The following is an example of a simple setup with one role and a couple of users:
CREATE ROLE DBT_SAMPLE_ROLE; CREATE USER MY_NAME; -- Personal user CREATE USER SAMPLE_SERVICE; -- Service user GRANT ROLE DBT_SAMPLE_ROLE TO USER MY_NAME; GRANT ROLE DBT_SAMPLE_ROLE TO USER SAMPLE_SERVICE;
A more complex setup could have one role to read and one to write for each source system (represented by a schema with the data from the system), for the data warehouse (one or more schemata where the data is processed), and for each data mart (one schema for each data mart).
You could then control in much more detail who can read and write what, at the cost of more effort.
SQL commands can be organized into categories according to their usage in the language:
INSERT, DELETE, and UPDATESELECT command and is the central part of SQL that allows querying and transforming the data stored in a databaseGRANT and REVOKE commands, which are used to manage the privileges that control the access to database resources and objectsIn the upcoming sections, we provide more details about these by looking at Snowflake-specific commands, but the ideas and names are of general use in all database systems, with little or no change.
DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored.
They operate on the following objects:
The main commands are as follows:
CREATE: Used to create the database itself and other objectsDROP: Used to delete the specified objectALTER: Used to modify some attribute of the specified objectDESC: Used to describe the details of the specified objectSHOW: Used to list the existing objects of the specified object type, with metadataUSE: Used to select the database, schema, and object to use when fully specified names are not usedTip
When working with dbt, we use the DDL and DML commands only in macros.
We do not use the DDL and DML commands in models because dbt will generate the required commands for our models based on the metadata attached to the model.
DML provides the commands to manipulate data in a database and carry out bulk data loading.
Snowflake also provides specific commands to stage files, such as loading files in a Snowflake-managed location, called a stage.
The main commands are as follows:
INSERT: Inserts rows into a tableDELETE: Removes specified rows from a tableUPDATE: Updates some values of specified rows in a tableMERGE: Inserts, updates, or deletes rows in a tableTRUNCATE TABLE: Empties a table, preserving the definition and privilegesFor bulk data loading, Snowflake provides the following command:
COPY INTO: Loads data from files in a stage into a table or unloads data from a table into one or more files in a stageTo manage files in stages, Snowflake provides these file-staging commands:
PUT: Uploads a local file to a Snowflake stageGET: Downloads a file from a stage to the local machineLIST: Lists the files existing in a Snowflake stageREMOVE: Removes a file from a Snowflake stageImportant note
In dbt, we can use macros with the COPY INTO and file-staging commands to manage the data-loading part of a data pipeline, when source data is in a file storage service such as AWS S3, Google Cloud Storage, or Microsoft Azure Data Lake file storage.
DQL is the reason why SQL exists: to query and transform data.
The command that is used to query data is SELECT, which is without any doubt the most important and versatile command in all of SQL.
For the moment, consider that a SELECT statement, aka a query, can do all these things:
Important note
We have dedicated the Query syntax and operators section later in this chapter to analyzing the SELECT command in Snowflake in detail, as you will use the SELECT command in every dbt model.
DCL contains the GRANT and REVOKE commands, which are used to manage privileges and roles that control access to or use database resources and objects.
Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security:
GRANT: Assigns a privilege or a role to a role (or user)REVOKE: Removes a privilege or a role from a role (or user)SHOW GRANTS: Lists access privileges granted to a role or objectThe TCL commands are used to manage transactions in a database.
A transaction groups a set of SQL commands into a single execution unit and guarantees that either all the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied if the transaction fails. This can also be described with the ACID acronym, which stands for atomic, consistent, isolated, and durable.
A transaction succeeds and ends only when all the commands it groups have finished with success; in any other case, the transaction fails, and its effects are rolled back like they never happened.
The TCL commands are as follows:
BEGIN: Starts a transaction in the current sessionCOMMIT: Commits an open transaction in the current session and makes the transaction effects visible to all sessionsROLLBACK: Rolls back an open transaction in the current session and undoes all the effects applied since the BEGIN statementNow that we have covered the basic concepts and commands in SQL, it is time to set up a database to run them. The next section will provide you with access to a Snowflake DB.
Change the font size
Change margin width
Change background colour