Mastering PostgreSQL 11 - Second Edition

By: Hans-Jürgen Schönig

Overview of this book

This second edition of Mastering PostgreSQL 11 helps you build dynamic database solutions for enterprise applications using the latest release of PostgreSQL, which enables database analysts to design both the physical and technical aspects of the system architecture with ease. The book begins with an introduction to the newly released features of PostgreSQL 11 to help you build efficient and fault-tolerant PostgreSQL applications. You'll examine all of the advanced aspects of PostgreSQL in detail, including logical replication, database clusters, performance tuning, monitoring, and user management. You will also work with the PostgreSQL optimizer, configure PostgreSQL for high speed, and learn how to move from Oracle to PostgreSQL. As you progress through the chapters, you will cover transactions, locking, indexes, and query optimization to improve performance. Additionally, you'll learn to manage network security and explore backups and replication, while getting to know the useful extensions of PostgreSQL so that you can optimize the speed and performance of large databases. By the end of this book, you will be able to use your database to its utmost capacity by implementing advanced administrative tasks with ease.
PostgreSQL Overview

What's new in PostgreSQL 11.0?

PostgreSQL 11 was released in the fall of 2018 and provides users with a couple of modern features. These are useful to professionals and beginners alike. PostgreSQL 11 is the second major release following the new numbering scheme that was introduced by the PostgreSQL community. The next major release of PostgreSQL after version 11 will be 12. The service releases will be called PostgreSQL 11.1, 11.2, 11.3, and so on. Compared to the pre-10 world, this is a major change, which should be pointed out.

Which version should you use? The recommendation is to always use the most recent release. There is no point in getting started with, say, PostgreSQL 9.6 anymore. If you are new to PostgreSQL, begin with version 11. There is no need to be afraid of a fresh release: the community always provides you with stable, working code, so there is no reason to shy away from PostgreSQL 10 or PostgreSQL 11. It just works.

Understanding the new database administration functions

PostgreSQL 11 has many new features that can help the administrator reduce work and run the system more reliably and in a more robust way.

One of the features that is supposed to help people run even more efficient databases is the ability to configure the size of the transaction log files, which are commonly known as WAL segments.

Using configurable WAL-segment sizes

Since PostgreSQL was introduced 20 years ago, the size of a single WAL file has always been 16 MB. In the beginning, this was even a hardcoded limit, which was later turned into a compile-time option. Starting with PostgreSQL 11, the size of those WAL segments can be changed at instance creation, which gives administrators an additional knob to configure and optimize PostgreSQL. The following example shows how to configure the WAL-segment size when running initdb:

initdb -D /pgdata --wal-segsize=32

The initdb command is the tool that is called to create a database instance. Usually, it is the call you see, although it is sometimes hidden behind operating system scripts provided by your favorite Linux distribution, Windows, or whatever you like to use. In any case, initdb now has an option to pass the desired WAL-segment size directly to the program.
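
To verify the segment size of a running instance, you can simply ask the server, since wal_segment_size is exposed as a read-only parameter. Assuming an instance created with the initdb call above, the check looks like this:

test=# SHOW wal_segment_size;
 wal_segment_size
------------------
 32MB
(1 row)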

As I have already mentioned, the default size is 16 MB; hence, in most cases, it makes sense to use larger segments to improve performance. There is no point in using smaller ones unless you are running a really, really small database instance on an embedded system.

What is the real performance impact going to be? As always, this really depends on what you are doing. If you are running a database system facing 99% reads, the impact of larger WAL segments will be zero. Yes, you heard that right: ZERO. If you are facing writes while your system is 95% idle and not under severe load, the impact will still be zero or close to zero. You will only be able to witness gains if you are running a heavy, write-intense workload. Then, and only then, might a change be worth it. If you are only running a couple of online forms that are visited by an occasional customer, why bother? This new feature will only show its strength when there are many changes leading to a lot of WAL.

Larger queryid in pg_stat_statements

If you really want to dig into PostgreSQL performance, pg_stat_statements is the tool to look at. Personally, I consider it to be the gold standard if you really want to figure out what is going on in the system. The pg_stat_statements module is loaded via shared_preload_libraries as soon as PostgreSQL starts and aggregates statistics about the queries running on your server. It will instantly show you if something goes wrong.
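
Here is a minimal setup sketch; the view and the column names match the PostgreSQL 11 version of the extension, while the ORDER BY and LIMIT are just one reasonable way to surface the most expensive queries:

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'

test=# CREATE EXTENSION pg_stat_statements;
CREATE EXTENSION
test=# SELECT queryid, calls, total_time
       FROM pg_stat_statements
       ORDER BY total_time DESC
       LIMIT 5;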

The pg_stat_statements view provides a field called queryid, which used to be a 32-bit identifier. In some cases, this led to problems because keys can collide. Magnus Hagander calculated in one of his papers that, after running 3 billion different queries, around 50,000 collisions could be expected. By introducing a 64-bit queryid, this number is expected to drop to around 0.25 collisions after 3 billion different types of queries, which is a substantial improvement.

Keep in mind that if you are moving to PostgreSQL 11 and using pg_stat_statements to track down performance problems, you might have to update your scripts.

Improved indexing and better optimization

PostgreSQL 11 offers more than just a couple of improved functions to handle administration. There is also improved functionality around indexes, and one of the most important additions relates to index statistics.

Expression index statistics

If you are running a simple query, PostgreSQL will optimize it by looking at internal statistics. Here is an example:

SELECT * FROM person WHERE gender = 'female';

In this case, PostgreSQL will consult the internal statistics and estimate the number of women in the table. If that number is low, PostgreSQL will consider an index. If the majority of entries are female, PostgreSQL will consider a sequential scan instead of an index scan. Statistics are available per column. In addition, it is also possible to keep track of cross-column statistics, which were introduced in PostgreSQL 10 (check out the CREATE STATISTICS command to find out more). The good news is that PostgreSQL will also keep track of statistics for functional indexes:

CREATE INDEX idx_cos ON t_data (cos(data));

What was not possible before PostgreSQL 11 was the use of more sophisticated statistics on functional indexes.

Consider the following example of an index covering various columns:

CREATE INDEX coord_idx ON measured (x, y, (z + t)); 
ALTER INDEX coord_idx ALTER COLUMN 3 SET STATISTICS 1000;

In this case, there is an index on two columns, which also provides a virtual third column that's represented by the expression. This new feature allows us to create more statistics explicitly for the third column, which would otherwise be suboptimally covered. In my example, we will tell PostgreSQL explicitly that we want the third column to have 1,000 entries in the system statistics. This will allow the optimizer to come up with better estimates and therefore potentially create better plans. It will be a highly valuable contribution to the efficiency of some specialized applications.
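
Once the table has been analyzed, you can inspect what the optimizer actually knows about the expression. PostgreSQL stores the statistics of expression indexes under the name of the index in the pg_stats view; here is a quick sketch, assuming the measured table from above has been populated:

test=# ANALYZE measured;
ANALYZE
test=# SELECT attname, n_distinct
       FROM pg_stats
       WHERE tablename = 'coord_idx';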

INCLUDE indexes or covering indexes

Many other database systems have long provided a feature known as covering indexes. What does this mean? Consider the following example, which simply selects two columns from a table:

SELECT id, name FROM person WHERE id = 10;

Suppose we have an index on id only. In this case, PostgreSQL will look up the index and then perform a lookup in the table to fetch the additional fields. This is generally known as an index scan: it consists of checking the index and the underlying table to compose a row. The solution here used to be to create an index consisting of both columns. The idea is to allow PostgreSQL to perform an index-only scan instead of an index scan: if an index has all the columns that are needed, there is (in most cases) no need to do additional lookups in the table.

Only select the columns you really need; otherwise, you might trigger pointless table lookups. The following type of query is therefore generally assumed to be quite bad for performance: SELECT * FROM person WHERE id = 10;.

The problem arises if you need a primary key constraint on id but still want to trigger an index-only scan when an additional column is read as well. This is where the new feature steps in to save the day:

CREATE UNIQUE INDEX some_name ON person USING btree (id) INCLUDE (name);

PostgreSQL will ensure that id is unique, but will still store the additional fields in the index, triggering an index-only scan if both columns are asked for. In a high-volume OLTP environment, this can increase performance dramatically. Of course, it is always hard to provide hard numbers because every table and every type of query is different. However, the gain can be absolutely substantial. In short, PostgreSQL 11 gives us even more options to trigger index-only scans.
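
To see the feature in action, you can check the execution plan. The following is a hypothetical sketch; the table contents and the index name are made up for illustration, and the table has to be vacuumed so that index-only scans can kick in:

test=# CREATE TABLE person (id int, name text);
CREATE TABLE
test=# CREATE UNIQUE INDEX idx_person_incl ON person
       USING btree (id) INCLUDE (name);
CREATE INDEX
test=# INSERT INTO person
       SELECT x, 'name_' || x FROM generate_series(1, 100000) AS x;
INSERT 0 100000
test=# VACUUM ANALYZE person;
VACUUM
test=# EXPLAIN SELECT id, name FROM person WHERE id = 10;
-- the plan should show an Index Only Scan using idx_person_incl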

Parallel index creation

Traditionally, PostgreSQL used a single core to build an index. In many cases, this was not an issue. However, PostgreSQL is used for ever-growing systems, and index creation is therefore starting to become an issue in many cases. At the moment, the community is trying to improve sorting as well. The first step is the parallel creation of btrees, which has made it into PostgreSQL 11. Future versions of PostgreSQL will also allow for parallel sorts in normal operations, which is unfortunately not supported by PostgreSQL 11 yet.
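
In PostgreSQL 11, the degree of parallelism used for index builds is governed by the max_parallel_maintenance_workers parameter. A minimal sketch (the table and index names are hypothetical):

test=# SET max_parallel_maintenance_workers TO 4;
SET
test=# CREATE INDEX idx_big ON t_big (somecol);
CREATE INDEX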

Parallel index creation can speed up indexing dramatically, and we are eager to see future improvements in this area (maybe support for other index types, and so on).

Better cache management

PostgreSQL 11 will also provide you with better ways to manage the I/O cache (the shared buffers). The pg_prewarm module is especially noteworthy.

Improving pg_prewarm

The pg_prewarm module allows you to restore the content of the PostgreSQL I/O cache after a restart. It has already been around for quite some time and is widely used by the PostgreSQL user base. In PostgreSQL 11, pg_prewarm has been extended and allows for the automatic dumping of the buffer list at regular intervals.

It is also possible to automatically preload the old cache contents so that users will have better database performance after a restart. In particular, systems with a lot of RAM can benefit from these new improvements.
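
For this to work, pg_prewarm has to be loaded at server start. The following configuration sketch shows the relevant PostgreSQL 11 settings; the interval is just an example value:

# postgresql.conf
shared_preload_libraries = 'pg_prewarm'
pg_prewarm.autoprewarm = on               # reload the dumped buffer list at startup
pg_prewarm.autoprewarm_interval = 300s    # dump the buffer list every five minutes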

Enhancing windowing functions

Windowing functions and analytics are a cornerstone of any modern SQL implementation and are therefore widely used by professionals. PostgreSQL has provided support for windowing functions for quite some time now. However, some small features proposed by the SQL standard were still missing. PostgreSQL 11 now fully supports what SQL:2011 proposes.

The following features have been added:

  • RANGE BETWEEN:
    • Previously, only ROWS was supported
    • Now, frames can also be defined over values
  • Exclusion clauses:
    • EXCLUDE CURRENT ROW
    • EXCLUDE TIES

To demonstrate how these new features work, I have decided to include an example containing two windowing functions:

  • The first one uses what was already available in PostgreSQL 10 and earlier.
  • The second array_agg excludes the current row, which is a new feature provided by PostgreSQL 11.

The following code generates five rows and applies both windowing functions:

test=# SELECT *,
         array_agg(x) OVER (ORDER BY x ROWS BETWEEN
                            1 PRECEDING AND 1 FOLLOWING),
         array_agg(x) OVER (ORDER BY x ROWS BETWEEN
                            1 PRECEDING AND 1 FOLLOWING EXCLUDE CURRENT ROW)
       FROM generate_series(1, 5) AS x;
 x | array_agg | array_agg
---+-----------+-----------
 1 | {1,2}     | {2}
 2 | {1,2,3}   | {1,3}
 3 | {2,3,4}   | {2,4}
 4 | {3,4,5}   | {3,5}
 5 | {4,5}     | {4}
(5 rows)

Excluding the current row is a pretty common requirement and should therefore not be underestimated.
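
The new RANGE support deserves a quick look as well. In the following sketch, the frame is defined over the values of x rather than over row positions, so rows whose values differ by more than 1 fall out of the frame; the data is made up for illustration:

test=# SELECT *,
         array_agg(x) OVER (ORDER BY x RANGE BETWEEN
                            1 PRECEDING AND 1 FOLLOWING)
       FROM (VALUES (1), (2), (5), (6)) AS t(x);
 x | array_agg
---+-----------
 1 | {1,2}
 2 | {1,2}
 5 | {5,6}
 6 | {5,6}
(4 rows)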

Introducing just-in-time compilation

Just-in-time (JIT) compilation is really one of the highlights of PostgreSQL 11. A lot of infrastructure has been added to support even more JIT compilation in the future, and PostgreSQL 11 is the first release that makes full use of this modern technique. Before we dig into the details, what is JIT compilation all about? When running a query, a lot of stuff is actually only known at runtime and not at compile time (when PostgreSQL is compiled). Therefore, a traditional compiler is always at a disadvantage because it doesn't know what will happen at runtime. A JIT compiler already knows a lot more and can react accordingly.

Starting with PostgreSQL 11, you can make use of JIT compilation, which is especially useful for big queries. We will dig into the finer details in the later chapters of this book.
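
If you want to experiment right away, note that the server has to be built with LLVM support and that JIT is off by default in PostgreSQL 11. Here is a minimal sketch, with jit_above_cost lowered artificially so that even a cheap query gets compiled:

test=# SET jit = on;
SET
test=# SET jit_above_cost = 0;  -- demo only: compile regardless of cost
SET
test=# EXPLAIN (ANALYZE) SELECT sum(x::numeric)
       FROM generate_series(1, 1000000) AS x;
-- the plan output ends with a JIT section listing compile timings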

Enhanced partitioning

PostgreSQL 10 introduced the first version of partitioning in PostgreSQL. Of course, we had inheritance before that, but PostgreSQL 10 was the first version that provided a modern way of doing things. PostgreSQL 11 adds new functionality to this already powerful feature by introducing a couple of new highlights, such as the ability to create a default partition for rows that match none of the existing partitions.

Here is how it works:

postgres=# CREATE TABLE default_part PARTITION OF some_table DEFAULT; 
CREATE TABLE

In this case, all the rows that simply don't match anywhere will end up in the default partition.

But there's more. In PostgreSQL 10, a row couldn't (easily) be moved from one partition to another. Suppose you had one partition per country. If a person moved, say, from France to Estonia, you could not do that with a single UPDATE statement; you had to delete the old row and insert a new one. In PostgreSQL 11, this problem has been solved. Rows can now be moved from one partition to another in a totally transparent way.
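
Here is a small sketch of what this looks like; the table layout is hypothetical and only serves to illustrate the row movement:

test=# CREATE TABLE people (name text, country text) PARTITION BY LIST (country);
CREATE TABLE
test=# CREATE TABLE people_fr PARTITION OF people FOR VALUES IN ('France');
CREATE TABLE
test=# CREATE TABLE people_ee PARTITION OF people FOR VALUES IN ('Estonia');
CREATE TABLE
test=# INSERT INTO people VALUES ('Alice', 'France');
INSERT 0 1
test=# -- in PostgreSQL 11, this moves the row from people_fr to people_ee
test=# UPDATE people SET country = 'Estonia' WHERE name = 'Alice';
UPDATE 1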

Partitioning had more shortcomings. In older versions, all partitions had to be indexed separately; there was no way to create a single index for all partitions. In PostgreSQL 11, an index that's added to the parent table will automatically make sure that all child tables are indexed too. This is really beneficial, as it becomes less likely that indexes are simply forgotten. Also, in PostgreSQL 11, you can add a unique index to a partitioned table, provided it includes the partition key. A partitioned table can therefore enforce a unique constraint.

Up until PostgreSQL 10, we had range partitioning and list partitioning. PostgreSQL 11 adds the ability to do hash partitioning. Here is an example showing how hash partitioning works:

test=# CREATE TABLE tab(i int, t text) PARTITION BY HASH (i);
CREATE TABLE
test=# CREATE TABLE tab_1 PARTITION OF tab
       FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE
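
Note that with MODULUS 4, the table needs partitions for the remaining remainders before it can absorb arbitrary rows. A quick sketch that completes the set:

test=# CREATE TABLE tab_2 PARTITION OF tab FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE
test=# CREATE TABLE tab_3 PARTITION OF tab FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE
test=# CREATE TABLE tab_4 PARTITION OF tab FOR VALUES WITH (MODULUS 4, REMAINDER 3);
CREATE TABLE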

There's not just more functionality. There's also a lot of new stuff to improve performance. Partition pruning is now a lot faster and PostgreSQL has the ability to consider partition-wise joins, as well as partition-wise aggregates, which is exactly what's needed for analytics and data warehousing.

Adding support for stored procedures

PostgreSQL has always supported functions, which were often referred to as stored procedures. However, there is a distinction between a stored procedure and a function. As I pointed out previously, up until PostgreSQL 10, we only had functions and no procedures.

The point is that a function runs as part of a larger structure, namely a transaction, whereas a procedure can contain more than one transaction. Consequently, a procedure cannot be called from within a larger transaction; it is a standalone thing.

Here is the syntax of CREATE PROCEDURE:

test=# \h CREATE PROCEDURE
Command: CREATE PROCEDURE
Description: define a new procedure
Syntax:
CREATE [ OR REPLACE ] PROCEDURE
    name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
  { LANGUAGE lang_name
    | TRANSFORM { FOR TYPE type_name } [, ... ]
    | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
    | SET configuration_parameter { TO value | = value | FROM CURRENT }
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
  } ...

The following procedure shows how two transactions can be executed within the very same procedure:

test=# CREATE PROCEDURE test_proc()
       LANGUAGE plpgsql
AS $$
  BEGIN
    CREATE TABLE a (aid int);
    CREATE TABLE b (bid int);
    COMMIT;
    CREATE TABLE c (cid int);
    ROLLBACK;
  END;
$$;
CREATE PROCEDURE

Note that the statements in the first transaction have been committed, while the second transaction has been rolled back. You will see the effect of this shortly.

To run the procedure, we can use CALL:

test=# CALL test_proc();
CALL

The first two tables were committed; the third table has not been created because of the rollback inside the procedure:

test=# \d
List of relations
 Schema | Name | Type  | Owner
--------+------+-------+-------
 public | a    | table | hs
 public | b    | table | hs
(2 rows)

Procedures are an important step toward a complete and fully featured database system.

Improving ALTER TABLE

The ALTER TABLE command can be used to change the definition of a table. In PostgreSQL 11, the behavior of ALTER TABLE ... ADD COLUMN has been improved substantially. Let's take a look at the details. The following example shows how columns can be added to a table and how PostgreSQL will handle those new columns:

ALTER TABLE x ADD COLUMN y int;
ALTER TABLE x ADD COLUMN z int DEFAULT 57;

The first command in the listing has always been fast, the reason being that, in PostgreSQL, the default value of a column is NULL. So, what PostgreSQL does is add a column to the system catalog without actually touching storage. The column will be added to the end of the table so that if the row is too short on disk, we know that it will be NULL anyway. In other words, even if you add a column to a 10 TB table, the operation will be really fast because the system doesn't have to change rows on disk.

The situation used to be quite different in the second case. DEFAULT 57 actually adds real data to the row, and in PostgreSQL 10 and older, this meant that the database had to rewrite the entire table to add the new default value. If you have a small table, that isn't a big deal. However, if your table contains billions of rows, you can't just lock it and rewrite it; in a professional online transaction processing (OLTP) system, downtime is out of the question.

Starting with PostgreSQL 11, it is possible to add immutable default values to a table without rewriting the entire table, which greatly reduces the burden of a changing data structure.
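
Under the hood, PostgreSQL 11 simply records the default value in the system catalog instead of rewriting every row. If you are curious, you can observe this via the columns that were added to pg_attribute for this feature; a quick sketch, assuming the table x from the listing above:

test=# SELECT attname, atthasmissing, attmissingval
       FROM pg_attribute
       WHERE attrelid = 'x'::regclass AND attnum > 0;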