PostgreSQL 12 High Availability Cookbook - Third Edition

By: Shaun Thomas

Overview of this book

Databases are nothing without the data they store. In the event of an outage or technical catastrophe, immediate recovery is essential. This updated edition ensures that you will learn the important concepts related to node architecture design, as well as techniques such as using repmgr for failover automation. From cluster layout and hardware selection to software stacks and horizontal scalability, this PostgreSQL cookbook will help you build a PostgreSQL cluster that will survive crashes, resist data corruption, and grow smoothly with customer demand. You'll start by understanding how to plan a PostgreSQL database architecture that is resistant to outages and scalable, as it is the scaffolding on which everything else rests. With the bedrock established, you'll cover the topics that PostgreSQL database administrators need to know to manage a highly available cluster: configuration, troubleshooting, monitoring and alerting, backups, proxies, failover automation, and other considerations that are essential for a healthy PostgreSQL cluster. You'll then learn to use multi-master replication to maximize server availability, and later chapters will guide you through managing major version upgrades without downtime. By the end of this book, you'll have learned how to build an efficient and adaptive PostgreSQL 12 database cluster.

Preventing split brain

Split brain occurs when more than one primary node is active in a PostgreSQL cluster simultaneously. In these circumstances, if applications wrote data to both nodes, the divergence becomes extremely difficult to rectify. Certainly, no cluster with such data corruption can be considered highly available!
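Since split brain manifests as more than one node reporting itself as a primary, a basic detection check is to count how many nodes answer `pg_is_in_recovery()` with `f` (false). The following is a minimal sketch: the counting helper is our own illustration, not part of PostgreSQL, and in a live cluster the role strings would come from `psql` queries against each host.

```shell
# In a real cluster, each node's role would be gathered with something like:
#   psql -h "$host" -At -c 'SELECT pg_is_in_recovery();'
# A primary answers "f"; standbys answer "t".

# Count how many reported roles are "f" (i.e., how many nodes act as primary).
count_primaries() {
    n=0
    for role in "$@"; do
        [ "$role" = "f" ] && n=$((n + 1))
    done
    echo "$n"
}

# A healthy cluster has exactly one primary; two or more means split brain.
primaries=$(count_primaries f t t)
if [ "$primaries" -gt 1 ]; then
    echo "split brain detected: $primaries primaries"
else
    echo "cluster healthy: $primaries primary"
fi
```

Such a check only detects split brain after the fact; the recipe below focuses on preventing it in the first place.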

In this recipe, we will explore the concept further and see how we might mitigate this problem.

Getting ready

An important concept for preventing split-brain scenarios is fencing: isolating a node from the application and database stack. This is often accomplished through STONITH, which stands for Shoot The Other Node In The Head. In situations where that is not possible, the old primary must invoke SMITH instead: Shoot Myself In The Head. While the names may sound extreme, for servers this is merely a temporary measure that prevents a far more worrying complication.

Keep these terms in mind while we explore how they may affect our architecture.

How to do it...

Apply these steps when designing a cluster to help minimize or eliminate the risk of split brain:

  1. If available, allocate STONITH hardware for nodes that may take the role of the Primary.
  2. Consider situations, such as network interruptions, where SMITH must be used instead.
  3. Ensure PostgreSQL does not start automatically following a system reboot.
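As a sketch of step 3, on a systemd-based Linux distribution the service can be removed from the boot sequence while remaining startable on demand by the HA software. Treat the unit name as an assumption to adapt: it varies by distribution and packaging.

```shell
# Keep PostgreSQL installed and startable, but never launched at boot.
# The unit name differs across distributions (e.g. postgresql@12-main
# on Debian/Ubuntu, postgresql-12 on RHEL-family systems).
sudo systemctl disable postgresql

# Verify: should now print "disabled" rather than "enabled".
systemctl is-enabled postgresql
```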

How it works...

Modern servers are often equipped with hardware that enables remote administration, and these components often allow network access to the boot process itself. Where PostgreSQL is installed on virtual servers, the hypervisor serves this role. Many companies also install Power Distribution Units (PDUs) that can be remotely instructed to cut power to a server.
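For physical servers with a baseboard management controller, the same effect as a PDU can be achieved through IPMI. As a hedged example (the BMC hostname, credentials, and interface are placeholders), a standby could cut power to the old primary with `ipmitool`:

```shell
# Hypothetical BMC address and credentials; requires ipmitool to be
# installed and network access to the management interface.
ipmitool -I lanplus -H primary-bmc.example.com -U admin -P secret \
    chassis power off
```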

Whatever the approach, you will likely need to work with infrastructure or systems operations teams to gain access to these devices. As an example, imagine we have a two-node cluster consisting of a Primary and a Standby, with a PDU available for each. We could do something like this:

In this case, upon failover and subsequent promotion, the Standby could instruct the PDU to cut power to the Primary, ensuring that no applications could still be connected to it. But what about architectures where network interruptions are far more common, such as between two data centers, as seen here:

In this scenario, the Standby would be unable to shut down the Primary node during a failover, which is why it's important to consider SMITH approaches as well. Here, the Primary should monitor the state of the Standby and Witness; if both disconnect for a sufficient period, it should assume the Standby will be promoted and power itself down to prevent applications from interacting with it.
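The SMITH decision just described can be sketched as a small shell function. Everything here is an assumption for illustration: the grace period, the flag inputs, and how reachability would actually be measured (in practice, repeated connection checks against the Standby and Witness).

```shell
# Decide whether the primary should fence itself (SMITH).
# Arguments: standby_down witness_down down_seconds
#   standby_down / witness_down: 1 if that node is unreachable, else 0
#   down_seconds: how long both have been unreachable
# Returns success (0) when self-fencing is warranted.
should_self_fence() {
    standby_down=$1
    witness_down=$2
    down_seconds=$3
    grace=30    # assumed grace period in seconds before acting

    [ "$standby_down" -eq 1 ] && [ "$witness_down" -eq 1 ] \
        && [ "$down_seconds" -ge "$grace" ]
}

# Both peers unreachable for 45 seconds: assume the standby was promoted
# and power off rather than risk a second active primary.
if should_self_fence 1 1 45; then
    echo "self-fencing: a real script would run 'systemctl poweroff' here"
fi
```

The grace period matters: fencing on the first missed heartbeat would turn every transient network blip into an outage, while waiting too long risks two active primaries.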

Finally, always disable the PostgreSQL startup mechanism on clusters equipped with high availability management software. That software should manage starting and stopping the service, and disabling autostart also prevents unintended events, such as two primary nodes becoming active simply because a reboot restarted a previously failed PostgreSQL node.

Advanced recipes later in this book will adhere to the rule of disabling PostgreSQL on startup and provide exact instructions for doing so.

There's more...

Split brain is notoriously hard to prevent, which is why dedicated software for managing high availability exists. While these tools are not immune to the issue, they greatly reduce the chance of it occurring.

Pacemaker has components specifically for interacting with STONITH hardware. repmgr implements the concept of hook scripts for event notifications, and accounts for scenarios where the Primary is isolated from the remainder of the cluster as seen in the previous diagram. Patroni uses a sophisticated locking mechanism that only allows one primary node to be registered at once.
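As an illustration of repmgr's event hooks, a fragment like the following in `repmgr.conf` runs a script whenever selected events fire. The script path is a placeholder you would supply yourself; the placeholders `%n`, `%e`, and `%s` expand to the node ID, event type, and success flag.

```ini
# repmgr.conf fragment (hedged example; /usr/local/bin/ha_notify.sh is a
# placeholder for your own notification script)
event_notification_command='/usr/local/bin/ha_notify.sh %n %e %s'
event_notifications=standby_promote,repmgrd_failover_promote
```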

Don't try to reinvent a solution to an already solved problem when most of the work has been done by companies dedicated to the cause.