Book Image

Learning PostgreSQL 11 - Third Edition

By : Salahaldin Juba, Andrey Volkov
Book Image

Learning PostgreSQL 11 - Third Edition

By: Salahaldin Juba, Andrey Volkov

Overview of this book

PostgreSQL is one of the most popular open source database management systems in the world, and it supports advanced features included in SQL standards. This book will familiarize you with the latest features in PostgreSQL 11, and get you up and running with building efficient PostgreSQL database solutions from scratch. Learning PostgreSQL, 11 begins by covering the concepts of relational databases and their core principles. You’ll explore the Data Definition Language (DDL) and commonly used DDL commands supported by ANSI SQL. You’ll also learn how to create tables, define integrity constraints, build indexes, and set up views and other schema objects. As you advance, you’ll come to understand Data Manipulation Language (DML) and server-side programming capabilities using PL/pgSQL, giving you a robust background to develop, tune, test, and troubleshoot your database application. The book will guide you in exploring NoSQL capabilities and connecting to your database to manipulate data objects. You’ll get to grips with using data warehousing in analytical solutions and reports, and scaling the database for high availability and performance. By the end of this book, you’ll have gained a thorough understanding of PostgreSQL 11 and developed the necessary skills to build efficient database solutions.
Table of Contents (18 chapters)

Relational and object-relational databases

Relational DBMS are one of the most widely used DBMSes in the world. It is highly unlikely that any organization, institution, or personal computer today does not have or use a piece of software that relies on RDBMS.

Software applications can use relational databases via dedicated database servers or via lightweight RDBMS engines, embedded in the software applications as shared libraries.

The capabilities of a relational database management system vary from one vendor to another, but most of them adhere to the American National Standards Institute (ANSI) SQL standards. A relational database is formally described by relational algebra, and is based on the relational model. Object-relational databases (ORDs) are similar to relational databases. They support the following object-oriented model concepts:

  • User-defined and complex data types
  • Inheritance

ACID properties

In a relational database, a single logical operation is called a transaction. The technical translation of a transaction is a set of database operations, which are create, read, update, and delete (CRUD). An example of explaining a transaction is a budget assignment to several projects in the company assuming we have a fixed amount of money. If we increase a certain project budget, we need to deduct this amount of increase from another project. The ACID properties in this context could be described as follows:

  • Atomicity: All or nothing, which means that if a part of a transaction fails, then the transaction fails as a whole.
  • Consistency: Any transaction gets the database from one valid state to another valid state. Database consistency is governed normally by data constraints and the relation between data and any combination thereof. For example, imagine if someone would like to completely purge his account on a shopping service. In order to purge his account, his account details, such as a list of addresses, will also need to be purged. This is governed by foreign key constraints, which will be explained in detail in the coming chapter.
  • Isolation: Concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially.
  • Durability: The transactions that are committed—that is, executed successfully—are persistent even with power loss or some server crashes. In PostgreSQL, this is done normally by a technique called Write-Ahead Logging (WAL). Another database refers to this as a transaction log such as in Oracle.

The SQL language

Relational databases are often linked to Structured Query Language (SQL). SQL is a declarative programming language and is the standard relational database language. The ANSI and the International Organization for Standardization (ISO) published the SQL standard for the first time in 1986, followed by many versions such as SQL:1999, SQL:2003, SQL:2006, SQL:2008, SQL:2011, and SQL:2016.

The SQL language has several parts:

  • Data definition language (DDL): It defines and amends the relational structure
  • Data manipulation language (DML): It retrieves and extracts information from the relations
  • Data control language (DCL): It controls the access rights to relations

Relational model concepts

A relational model is a first-order predicate logic that was first introduced by Edgar F. Codd in 1970 in his paper A Relational Model of Data for Large Shared Data Banks. A database is represented as a collection of relations. The state of the whole database is defined by the state of all the relations in the database. Different information can be extracted from the relations by joining and aggregating data from different relations and by applying filters on the data. In this section, the basic concepts of the relational model are introduced using the top-down approach by first describing the relation, tuple, attribute, and domain.

The terms relation, tuple, attribute, and unknown, which are used in the formal relational model, are equivalent to table, row, column, and null in the SQL language.

Relation

Think of a relation as a table with a header, columns, and rows. The table name and the header help in interpreting the data in the rows. Each row represents a group of related data, which points to a certain object.

A relation is represented by a set of tuples. Tuples should have the same set of ordered attributes. Attributes have a domain, that is, a type and a name:

customer_id

first_name

last_name

email

Tuple →

1

thomas

sieh

[email protected]

Tuple →

2

wang

kim

[email protected]

Attribute ↑

Attribute ↑

Attribute ↑

Attribute ↑

The relation schema is denoted by the relation name and the relation attributes. For example, customer (customer_id, first_name, last_name, and email) is the relation schema for the customer relation. Relation state is defined by the set of relation tuples; thus, adding, deleting, and amending a tuple will change the relation to another state.

Tuple order or position in the relation is not important, and the relation is not sensitive to tuple order. The tuples in the relation could be ordered by a single attribute or a set of attributes. Also, a relation cannot have duplicate tuples.

A relation can represent entities in the real world, such as a customer, or can be used to represent an association between relations. For example, the customer could have several services and a service can be offered to several customers. This could be modeled by three relations: customer, service, and customer_service. The customer_service relation associates the customer and the service relations. Separating the data in different relations is a key concept in relational database modeling, and is called normalization. Normalization is the process of organizing relation columns and relations to reduce data redundancy. For example, assume that a collection of services is stored in the customer relation. If a service is assigned to multiple customers, this would result in data redundancy. Also, updating a certain service would require updating all its copies in the customer table.

Tuple

A tuple is a set of ordered attributes. They are written by listing the elements within parentheses () and separated by commas, such as (john, smith, 1971). Tuple elements are identified via the attribute name. Tuples have the following properties:

  • (a1,a2, a3,…,an) = (b1, b2,b3,…,bn ) if and only if a1= b1, a2=b2, …,an= bn
  • A tuple is not a set; the order of attributes matters as well as duplicate members:
    • (a1, a2) ≠(a2, a1)
    • (a1, a1) ≠(a1)
  • A tuple has a finite set of attributes

In the formal relational model, multi-valued attributes, as well as composite attributes, are not allowed. This is important to reduce data redundancy and increase data consistency. This isn't strictly true in modern relational database systems because of the utilization of complex data types such as JSON and key-value stores.

There is a lot of debate regarding the application of normalization; the rule of thumb is to apply normalization unless there is a good reason not to do so.

The null value

Predicates in relational databases use three-valued logic (3VL), where there are three truth values: true, false, and null,. In a relational database, the third value, null, can be interpreted in many ways, such as unknown data, missing data, not applicable, or will be loaded later. The 3VL is used to remove ambiguity. For example, no two null values are equal.

In the next chapter, you will learn how to connect to the database and run queries. Now, I would like to show how a logical OR/AND truth table can be generated by the SQL language:

Logical AND and OR operators are commutative, that is, A AND B = B AND A.
\pset null null
WITH data (v) as (VALUES (true), (false),(null))
SELECT DISTINCT
first.v::TEXT as a,
second.v::TEXT as b,
(first.v AND second.v)::TEXT AS "a and b",
(first.v OR second.v)::TEXT as "a or b"
FROM
data as first cross join
data as second
ORDER BY a DESC nulls last, b DESC nulls last;
a | b | a and b | a or b
-------+-------+---------+--------
true | true | true | true
true | false | false | true
true | null | null | true
false | true | false | true
false | false | false | false
false | null | false | null
null | true | null | true
null | false | false | null
null | null | null | null
(9 rows)

The following table, which is generated by SQL, shows the NOT truth operator:

WITH data (v) as (VALUES (true), (false),(null)) SELECT v::text as a, (NOT v)::text as "NOT a" FROM data order by a desc nulls last;
a | NOT a
-------+-------
true | false
false | true
null | null
(3 rows)

Attribute

Each attribute has a name and a domain, and the name should be distinct within the relation. The domain defines the possible set of values that the attribute can have. One way to define the domain is to define the data type and a constraint on this data type. For example, the hourly wage should be a positive real number and bigger than five if we assume that the minimum hourly wage is five dollars. The domain could be continuous, such as salary, which is any positive real number, or discrete, such as gender.

The formal relational model puts a constraint on the domain: the value should be atomic. Atomicity means that each value in the domain is indivisible. For instance, the name attribute domain is not atomic because it can be divided into first name and last name. Some examples of domains are as follows:

  • Phone number: Numeric text with a certain length.
  • Country code: Defined by ISO 3166 as a list of two-letter codes (ISO alpha-2) and three-letter codes (ISO alpha-3). The country codes for Germany are DE and DEU for alpha-2 and alpha-3 respectively.
In real-life applications, it is better to use ISO and international standards for lookup tables such as country and currency. This enables you to expose your data much more easy to third-party software and increases your data quality.

Constraint

The relational model defines many constraints in order to control data integrity, redundancy, and validity. Here are some examples of checking for data:

  • Redundancy: Duplicate tuples are not allowed in the relation.
  • Validity: Check constraints and domain constraints are used to validate the data input, for example, the date of birth should be a date that occurred in the past.
  • Integrity: The relations within a single database are linked to each other. An action on a relation such as updating or deleting a tuple might leave the other relations in an invalid state.

We could classify the constraints in a relational database roughly into two categories:

  • Inherited constraints from the relational model: Domain integrity, entity integrity, and referential integrity constraints.
  • Semantic constraint, business rules, and application-specific constraints: These constraints cannot be expressed explicitly by the relational model. However, with the introduction of procedural SQL languages such as PL/pgSQL for PostgreSQL, relational databases can also be used to model these constraints.

Domain integrity constraint

The domain integrity constraint ensures data validity. The first step in defining the domain integrity constraint is to determine the appropriate data type. The domain data types could be an integer, real, boolean, character, text, inet, and so on. For example, the data type of the first name and email address is text. After specifying the data type, check constraints, such as the mail address pattern, need to be defined:

  • Check constraint: A check constraint can be applied to a single attribute or a combination of many attributes in a tuple. Let's assume that the customer_service schema is defined as customer_id, service_id, start_date, end_date, and order_date. For this relation, we can have a check constraint to make sure that start_date and end_date are entered correctly by applying the following check: start_date is less than end_date.
  • Default constraint: The attribute can have a default value. The default value could be a fixed value such as the default hourly wage of the employees, for example, $10. It may also have a dynamic value based on a function such as random, current time, and date. For example, in the customer_service relation, order_date can have a default value, which is the current date.
  • Unique constraint: A unique constraint guarantees that the attribute has a distinct value in each tuple. It allows null values. For example, let's assume that we have a relation player defined as a player (player_id, player_nickname). The player uses his ID to play with others; he can also pick up a nickname, which is also unique to identify himself.
  • Not null constraint: By default, the attribute value can be null. The not null constraint prevents an attribute from having a null value. For example, each person in the birth registry record should have a name.

Entity integrity constraint

In the relational model, a relation is defined as a set of tuples. This means that all the tuples in a relation must be distinct. The entity integrity constraint is enforced by having a primary key, which is an attribute/set of attributes with the following characteristics:

  • The attribute should be unique
  • The attributes should be not null

Each relation must have only one primary key, but can have many unique keys. A candidate key is a minimal set of attributes that can identify a tuple. All unique, not null attributes can be candidate keys. The set of all attributes form a super key. In practice, we often pick up a single attribute to be a primary key instead of a compound key (a key that consists of two or more attributes that uniquely identify a tuple) to simplify the joining of the relations with each other.

If the primary key is generated by the DBMS, then it is called a surrogate key or synthetic key. Otherwise, it is called a natural key. The surrogate key candidates can be sequences and universal unique identifiers (UUIDs). A surrogate key has many advantages such as performance, requirement change tolerance, agility, and compatibility with object-relational mappers. The chief disadvantage of surrogate keys is that it makes redundant tuples possible.

A sequence is a number generator that is used to generate a series of numbers based on the current number's value. This term is used mainly in PostgreSQL and Oracle databases. PostgreSQL also has an identity column, which is mainly used to generate series of numbers. More about this topic is explained in Chapter 4, PostgreSQL Advanced Building Blocks.

Referential integrity constraints

Relations are associated with each other via common attributes. Referential integrity constraints govern the association between two relations and ensure data consistency between tuples. If a tuple in one relation references a tuple in another relation, then the referenced tuple must exist. In the customer service example, if a service is assigned to a customer, then the service and the customer must exist, as shown in the following example:

For instance, in the customer_service relation, we cannot have a tuple with values (5, 1,01-01-2014, NULL), because we do not have a customer with customer_id equal to 5.

The lack of referential integrity constraints can lead to many problems:

  • Invalid data in the common attributes
  • Invalid information during joining of data from different relations
  • Performance degradation either due to bad execution plans generated by the PostgreSQL planner or by a third-party tool
Foreign keys can increase performance in reading data from multiple tables. The query execution planner will have a better estimation of the number of rows that need to be processed. Temporarily disabling foreign keys in special cases such as bulk uploading will lead to a performance boost, since integrity checks are not performed.

Referential integrity constraints are achieved via foreign keys. A foreign key is an attribute or a set of attributes that can identify a tuple in the referenced relation. As the purpose of a foreign key is to identify a tuple in the referenced relation, foreign keys are generally primary keys in the referenced relation. Unlike a primary key, a foreign key can have a null value. It can also reference a unique attribute in the referenced relation. Allowing a foreign key to have a null value enables us to model different cardinality constraints. Cardinality constraints define the participation between two different relations. For example, a parent can have more than one child; this relation is called a one-to-many relationship because one tuple in the referenced relation is associated with many tuples in the referencing relation. Also, a relation could reference itself. This foreign key is called a self-referencing or recursive foreign key.

For example, a company acquired by another company:

To ensure data integrity, foreign keys can be used to define several behaviors when a tuple in the referenced relation is updated or deleted. The following behaviors are called referential actions:

  • Cascade: When a tuple is deleted or updated in the referenced relation, the tuples in the referencing relation are also updated or deleted
  • Restrict: The tuple cannot be deleted or the referenced attribute cannot be updated if it is referenced by another relation
  • No action: Similar to restrict, but it is deferred to the end of the transaction
  • Set default: When a tuple in the referenced relation is deleted or the referenced attribute is updated, then the foreign key value is assigned the default value
  • Set null: The foreign key attribute value is set to null when the referenced tuple is deleted

Semantic constraints

Integrity constraints or business logic constraints describe the database application constraints in general. These constraints are either enforced by the business logic tier of the application program or by SQL procedural languages. Trigger and rule systems can also be used for this purpose. For example, the customer should have at most one active service at a time. Based on the nature of the application, one could favor using an SQL procedural language or a high-level programming language to meet the semantic constraints, or mix the two approaches.

The advantages of using the SQL programming language are as follows:

  • Performance: RDBMSes often have complex analyzers to generate efficient execution plans. Also, in some cases such as data mining, the amount of data that needs to be manipulated is very large. Manipulating the data using procedural SQL languages eliminates the network data transfer. Finally, some procedural SQL languages utilize clever caching algorithms.
  • Last-minute change: For SQL procedural languages, one could deploy bug fixes without service disruption.
Implementing business logic in the database tier has a lot of pros and cons and it is a highly contentious topic. For example, some disadvantages of implementing business logic in the database are visibility, developer efficiency in writing code due to a lack of proper tools and IDEs, and code reuse.