Book Image

Learning PostgreSQL 10 - Second Edition

Book Image

Learning PostgreSQL 10 - Second Edition

Overview of this book

PostgreSQL is one of the most popular open source databases in the world, supporting the most advanced features included in SQL standards. This book will familiarize you with the latest features released in PostgreSQL 10. We’ll start with a thorough introduction to PostgreSQL and the new features introduced in PostgreSQL 10. We’ll cover the Data Definition Language (DDL) with an emphasis on PostgreSQL, and the common DDL commands supported by ANSI SQL. You’ll learn to create tables, define integrity constraints, build indexes, and set up views and other schema objects. Moving on, we’ll cover the concepts of Data Manipulation Language (DML) and PostgreSQL server-side programming capabilities using PL/pgSQL. We’ll also explore the NoSQL capabilities of PostgreSQL and connect to your PostgreSQL database to manipulate data objects. By the end of this book, you’ll have a thorough understanding of the basics of PostgreSQL 10 and will have the necessary skills to build efficient database solutions.
Table of Contents (23 chapters)
Title Page
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Customer Feedback
Preface

Relational and object relational databases


Relational database management systems are one of the most widely-used DBMSs in the world. It is highly unlikely that any organization, institution, or personal computer today does not have or use a piece of software that rely on RBDMS. Software applications can use relational databases via dedicated database servers or via lightweight RDBMS engines, embedded in the software applications as shared libraries. The capabilities of a relational database management system vary from one vendor to another, but most of them adhere to the ANSI SQL standards. A relational database is formally described by relational algebra, and is based on the relational model. Object-relational database (ORD) are similar to relational databases. They support the following object-oriented model concepts:

  • User-defined and complex data types
  • Inheritance

ACID properties

In a relational database, a single logical operation is called a transaction. The technical translation of a transaction is a set of database operations, which are createread, update, and delete (CRUD). An example of explaining a transaction is budget assignment to several projects in the company assuming we have a fixed amount of money. If we increase a certain project budget, we need to deduct this amount of increase from another project. The ACID properties in this context could be described as follows:

  • Atomicity: All or nothing, which means that if a part of a transaction fails, then the transaction fails as a whole.
  • Consistency: Any transaction gets the database from one valid state to another valid state. Database consistency is governed normally by data constraints and the relation between data and any combination thereof. For example, imagine if one would like to completely purge his account on a shopping service. In order to purge his account, his account details, such as a list of addresses, will also need to be purged. This is governed by foreign key constraints, which will be explained in detail in the coming chapter.
  • Isolation: Concurrent execution of transactions results in a system state that would be obtained if the transactions were executed serially.
  • Durability: The transactions that are committed--that is, executed successfully--are persistent even with power loss or some server crashes. In PostgreSQL, this is done normally by a technique called write-ahead log (WAL). Other database refers to this as a transaction log such as in Oracle.

The SQL language

Relational databases are often linked to the structured query language (SQL). SQL is a declarative programming language and is the standard relational database language. The American National Standard Institute (ANSI) and the International Standard Organization (ISO) published the SQL standard for the first time in 1986, followed by many versions such as SQL:1999, SQL:2003, SQL:2006, SQL:2008, SQL:2011, and SQL:2016. The SQL language has several parts:

  • Data definition language (DDL): It defines and amends the relational structure
  • Data manipulation language (DML): It retrieves and extracts information from the relations
  • Data control language (DCL): It controls the access rights to relations

Relational model concepts

A relational model is a first-order predicate logic, which was first introduced by Edgar F. Codd in 1970 in his paper A relational model of data for large shared data banks. A database is represented as a collection of relations. The state of the whole database is defined by the state of all the relations in the database. Different information can be extracted from the relations by joining and aggregating data from different relations and by applying filters on the data. In this section, the basic concepts of the relational model are introduced using the top-down approach by first describing the relation, tuple, attribute, and domain.

Note

The terms relation, tuple, attribute, and unknown, which are used in the formal relational model, are equivalent to table, row, column, and null in the SQL language.

Relation

Think of a relation as a table with a header, columns, and rows. The table name and the header help in interpreting the data in the rows. Each row represents a group of related data, which points to a certain object.

A relation is represented by a set of tuples. Tuples should have the same set of ordered attributes. Attributes have a domain, that is, a type and a name:

customer_id

first_name

last_name

email

Tuple →

1

thomas

sieh

[email protected]

Tuple →

2

wang

kim

[email protected]

Attribute ↑

Attribute ↑

Attribute ↑

Attribute ↑

 

The relation schema is denoted by the relation name and the relation attributes. For example, customer (customer_id, first_name, last_name, and email) is the relation schema for the customer relation. Relation state is defined by the set of relation tuples; thus, adding, deleting, and amending a tuple will change the relation to another state.

Tuple order or position in the relation is not important, and the relation is not sensitive to tuple order. The tuples in the relation could be ordered by a single attribute or a set of attributes. Also, a relation cannot have duplicate tuples.

A relation can represent entities in the real world, such as a customer, or can be used to represent an association between relations. For example, the customer could have several services and a service can be offered to several customers. This could be modeled by three relations: customer, service, and customer_service. The customer_service relation associates the customer and the service relations. Separating the data in different relations is a key concept in relational database modeling, which is called normalization. Normalization is the process of organizing relation columns and relations to reduce data redundancy. For example, assume that a collection of services is stored in the customer relation. If a service is assigned to multiple customers, this would result in data redundancy. Also, updating a certain service would require updating all its copies in the customer table.

Tuple

A tuple is a set of ordered attributes. They are written by listing the elements within parentheses () and separated by commas, such as (john, smith, 1971). Tuple elements are identified via the attribute name. Tuples have the following properties:

  • (a1,a2, a3,…,an) = (b1, b2,b3,…,bn ) if and only if a1= b1, a2=b2, …,an= bn
  • A tuple is not a set; the order of attributes matters as well as duplicate members 
    • (a1, a2) ≠(a2, a1)
    • (a1, a1) ≠(a1)
  • A tuple has a finite set of attributes

In the formal relational model, multi-valued attributes as well as composite attributes are not allowed. This is important to reduce data redundancy and increase data consistency. This isn't strictly true in modern relational database systems because of the utilization of complex data types such as JSON and key-value stores.

Note

There is a lot of debate regarding the application of normalization; the rule of thumb is to apply normalization unless there is a good reason not to do so.

NULL value

Predicates in relational databases use three-valued logic (3VL), where there are three truth values: true, false, and unknown aka NULL. In a relational database, the third value, NULL, can be interpreted in many ways, such as unknown data, missing data, not applicable, or will be loaded later. The three-valued logic is used to remove ambiguity. For example, no two NULL values are equal.

The following table shows logical OR / AND truth operator; note that these operators are commutative, that is, A AND B = B AND A:

A

B

A AND B

A OR B

TRUE

TRUE

TRUE

TRUE

FALSE

TRUE

FALSE

TRUE

NULL

TRUE

NULL

TRUE

FALSE

FALSE

FALSE

FALSE

NULL

FALSE

FALSE

NULL

NULL

NULL

NULL

NULL

 

The following table shows the NOT truth operator:

A

NOT A

TRUE

FALSE

FALSE

TRUE

NULL

NULL

Attribute

Each attribute has a name and a domain, and the name should be distinct within the relation. The domain defines the possible set of values that the attribute can have. One way to define the domain is to define the data type and a constraint on this data type. For example, hourly wage should be a positive real number and bigger than five if we assume that the minimum hourly wage is five dollars. The domain could be continuous, such as salary, which is any positive real number, or discrete, such as gender.

The formal relational model puts a constraint on the domain: the value should be atomic. Atomic means that each value in the domain is indivisible. For instance, the name attribute domain is not atomic because it can be divided into first name and last name. Some examples of domains are as follows:

  • Phone number: Numeric text with a certain length.
  • Country code: Defined by ISO 3166 as a list of two letter codes (ISO alpha-2) and three letter codes (ISO alpha-3). The country codes for Germany are DE and DEU for alpha-2 and alpha-3 respectively.

Note

In real-life applications, it is better to use ISO and international standards for lookup tables such as country and currency. This enables you to expose your data much easily for third-party software and increases your data quality.

Constraint

The relational model defines many constraints in order to control data integrity, redundancy, and validity:

  • Redundancy: Duplicate tuples are not allowed in the relation.
  • Validity: Domain constraints control data validity.
  • Integrity: The relations within a single database are linked to each other. An action on a relation such as updating or deleting a tuple might leave the other relations in an invalid state.

We could classify the constraints in a relational database roughly into two categories:

  • Inherited constraints from the relational model: Domain integrity, entity integrity, and referential integrity constraints.
  • Semantic constraint, business rules, and application specific constraints: These constraints cannot be expressed explicitly by the relational model. However, with the introduction of procedural SQL languages such as PL/pgsql for PostgreSQL, relational databases can also be used to model these constraints.

Domain integrity constraint

The domain integrity constraint ensures data validity. The first step in defining the domain integrity constraint is to determine the appropriate data type. The domain data types could be integer, real, boolean, character, text, inet, and so on. For example, the data type of first name and email address is text. After specifying the data type, check constraints, such as the mail address pattern, need to be defined.

  • Check constraint: A check constraint can be applied to a single attribute or a combination of many attributes in a tuple. Let's assume that the customer_service schema is defined as customer_id, service_id, start_date, end_date, order_date. For this relation, we can have a check constraint to make sure that start_date and end_date are entered correctly by applying the following check start_date<end_date.
  • Default constraint: The attribute can have a default value. The default value could be a fixed value such as the default hourly wage of the employees, for example, $10. It may also have a dynamic value based on a function such as random, current time, and date. For example, in the customer_service relation, order_date can have a default value, which is the current date.
  • Unique constraint: A unique constraint guarantees that the attribute has a distinct value in each tuple. It allows null values. For example, let's assume that we have a relation player defined as player (player_id, player_nickname). The player uses his ID to play with others; he can also pick up a nickname which is also unique to identify himself.
  • Not null constraint: By default, the attribute value can be null. The not null constraint restricts an attribute from having a null value. For example, each person in the birth registry record should have a name.

Entity integrity constraint

In the relational model, a relation is defined as a set of tuples. This means that all the tuples in a relation must be distinct. The entity integrity constraint is enforced by having a primary key which is an attribute/set of attributes having the following characteristics:

  • The attribute should be unique
  • The attributes should be not null

Each relation must have only one primary key, but can have many unique keys. A candidate key is a minimal set of attributes that can identify a tuple. All unique, not null attributes can be candidate keys. The set of all attributes form a super key. In practice, we often pick up a single attribute to be a primary key instead of a compound key ( a key that consists of two or more attributes that uniquely identify a tuple) to ease the joining of the relations with each other.

If the primary key is generated by the DBMS, then it is called a surrogate key or synthetic key . Otherwise, it is called a natural key. The surrogate key candidates can be sequences and universal unique identifiers (UUID). A surrogate key has many advantages such as performance, requirement change tolerance, agility, and compatibility with object relational mappers. The chief disadvantage of surrogate keys is that , it makes redundant tuples possible. 

Referential integrity constraints

Relations are associated with each other via common attributes. Referential integrity constraints govern the association between two relations and ensure data consistency between tuples. If a tuple in one relation references a tuple in another relation, then the referenced tuple must exist. In the customer service example, if a service is assigned to a customer, then the service and the customer must exist, as shown in the following example. For instance, in the customer_service relation, we cannot have a tuple with values (5, 1,01-01-2014, NULL), because we do not have a customer with customer_id equal to 5.

The lack of referential integrity constraints can lead to many problems:

  • Invalid data in the common attributes
  • Invalid information during joining of data from different relations
  • Performance degradation either due to bad execution plans generated by the PostgreSQL planner or by a third-party tool.

Note

Foreign keys can increase performance in reading data from multiple tables. The query execution planner will have a better estimation of the number of rows that need to be processed. Disabling foreign keys when doing a bulk insert will lead to a performance boost

Referential integrity constraints are achieved via foreign keys. A foreign key is an attribute or a set of attributes that can identify a tuple in the referenced relation. As the purpose of a foreign key is to identify a tuple in the referenced relation, foreign keys are generally primary keys in the referenced relation. Unlike a primary key, a foreign key can have a null value. It can also reference a unique attribute in the referenced relation. Allowing a foreign key to have a null value enables us to model different cardinality constraints. Cardinality constraints define the participation between two different relations. For example, a parent can have more than one child; this relation is called one-to-many relationship, because one tuple in the referenced relation is associated with many tuples in the referencing relation. Also, a relation could reference itself. This foreign key is called a self-referencing or recursive foreign key.

For example, a company acquired by another company:

To ensure data integrity, foreign keys can be used to define several behaviors when a tuple in the referenced relation is updated or deleted. The following behaviors are called referential actions:

  • Cascade: When a tuple is deleted or updated in the referenced relation, the tuples in the referencing relation are also updated or deleted
  • Restrict: The tuple cannot be deleted or the referenced attribute cannot be updated if it is referenced by another relation
  • No action: Similar to restrict, but it is deferred to the end of the transaction
  • Set default: When a tuple in the referenced relation is deleted or the referenced attribute is updated, then the foreign key value is assigned the default value
  • Set null: The foreign key attribute value is set to null when the referenced tuple is deleted

Semantic constraints

Integrity constraints or business logic constraints describe the database application constraints in general. These constraints are either enforced by the business logic tier of the application program or by SQL procedural languages. Trigger and rule systems can also be used for this purpose. For example, the customer should have at most one active service at a time. Based on the nature of the application, one could favor using an SQL procedural language or a high-level programming language to meet the semantic constraints, or mix the two approaches.

The advantages of using the SQL programming language are as follows:

  • Performance: RDBMSs often have complex analyzers to generate efficient execution plans. Also, in some cases such as data mining, the amount of data that needs to be manipulated is very large. Manipulating the data using procedural SQL language eliminates the network data transfer. Finally, some procedural SQL languages utilize clever caching algorithms.
  • Last minute change: For the SQL procedural languages, one could deploy bug fixes without service disruption.

Note

Implementing business logic in database tier has a lot of pros and cons and it is a highly arguable topic. For example, some disadvantages of implementing business logic in the database is visibility, developers efficiency in writing code due to a lack of proper tools and IDEs, and code reuse.