Instant PostgreSQL Backup and Restore How-to
Overview of this book

Backing up a database is something that all system admins must eventually face. Larger warehouses and critical enterprise data require advanced techniques and tools to handle this complex task, and PostgreSQL provides you with the basic tools for accomplishing it. "Instant PostgreSQL Backup and Restore How-to" is a practical series of proven recipes showing you how to preserve critical business data and teaching you some advanced methods for restoring that data. It is a perfect manual for managing your critical PostgreSQL data. Instant PostgreSQL Backup and Restore How-to is your practical guide to the unique features that PostgreSQL offers for creating quick backups and efficient restores. Starting by backing up a simple database and learning how to restore it to working order, we then move on to more complex backup techniques, including creating and restoring binary backups. Covering advanced topics such as warm and hot standby restore and streaming replication, Instant PostgreSQL Backup and Restore How-to gives you the power to make complete backups to guarantee that you will always be able to restore your database to full working order!

Stepping into TAR backups (Intermediate)


For a very long time, the Unix tar command was one of the few methods for obtaining a full binary backup of a PostgreSQL database. This is still the case for more advanced installations, which may make use of filesystem snapshots, extensively utilize tablespaces, or otherwise disrupt the included management tools. For these advanced scenarios and more, tar is indispensable for circumventing or augmenting the provided tools.

Getting ready

Please refer to the Getting a basic export (Simple) recipe for help preparing a sample database.

For the purposes of this example, we will assume that the database directory is /db, and the archived files will go to /archive. Based on this, we need to modify the postgresql.conf file to archive transaction logs during the backup. Change the following configuration variables:

# %p is the full path of a completed transaction log file; %f is its file name only.
archive_mode = on
archive_command = 'test -f /archive/%f || cp %p /archive/%f'

After PostgreSQL is restarted, the database will be ready for a tar backup.
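
Note that archive_mode cannot be enabled with a simple configuration reload; a full restart is required. As a rough sketch, assuming pg_ctl is in our path and /db is the data directory, the change could be applied and then verified like this:

$> pg_ctl -D /db restart
$> psql -c "SHOW archive_mode;" postgres
$> psql -c "SHOW archive_command;" postgres

If both settings come back as configured, every completed transaction log will be copied into /archive from now on.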

How to do it...

Creating a tar backup is done with the following three basic steps, plus a fourth set of commands that is considered good practice:

  1. First, tell PostgreSQL to enter backup mode:

    $> psql -c "SELECT pg_start_backup('label');" postgres
    
  2. Next, we produce the actual backup with tar, instead of pg_dump or pg_basebackup:

    $> tar -c -z -f backup.tar.gz /db
    
  3. Finally, we tell the database to end backup mode:

    $> psql -c "SELECT pg_stop_backup();" postgres
    
  4. We also need the transaction logs archived during the backup. Type these commands as shown:

    $> recent=$(ls -r /archive/*.backup | head -1)
    $> bstart=$(grep 'START WAL' $recent | sed 's/.* //; s/)//;')
    $> echo $bstart > /tmp/MANIFEST
    $> echo $recent >> /tmp/MANIFEST
    $> find /archive -newer /archive/$bstart \
      ! -newer $recent >> /tmp/MANIFEST
    $> sed 's%/.*/%%' /tmp/MANIFEST | sort | uniq \
      > /archive/MANIFEST
    $> tar -C /archive -cz -f archive.tar.gz \
      --files-from=/archive/MANIFEST
    

Obviously, much of this can (and should) be scripted, as shown in the sketch that follows. These commands were designed for a standard Linux system. If you are using BSD or another variant, you may need to convert them before doing this yourself.
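
As a minimal sketch of such a script, assuming the same /db and /archive paths plus a /backup destination directory that exists only for this example, the whole sequence might look like the following:

#!/bin/bash
# Rough sketch of the manual steps above; all paths and the
# 'nightly' label are assumptions for this example.

DB=/db            # database directory
ARC=/archive      # transaction log archive directory
OUT=/backup       # destination for the tar files

# 1. Tell PostgreSQL to enter backup mode.
psql -c "SELECT pg_start_backup('nightly');" postgres

# 2. Archive the data directory. tar may warn that files changed
#    while being read; that is expected on a live database.
tar -c -z -f $OUT/backup.tar.gz $DB

# 3. End backup mode; this also archives the .backup history file.
psql -c "SELECT pg_stop_backup();" postgres

# 4. Bundle every transaction log produced during the backup.
recent=$(ls -r $ARC/*.backup | head -1)
bstart=$(grep 'START WAL' $recent | sed 's/.* //; s/)//;')
echo $bstart > /tmp/MANIFEST
echo $recent >> /tmp/MANIFEST
find $ARC -newer $ARC/$bstart ! -newer $recent >> /tmp/MANIFEST
sed 's%/.*/%%' /tmp/MANIFEST | sort | uniq > $ARC/MANIFEST
tar -C $ARC -cz -f $OUT/archive.tar.gz --files-from=$ARC/MANIFEST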

How it works...

The tar command for creating the backup itself is fairly simple: it creates (-c) a gzip-compressed (-z) file (-f) named backup.tar.gz from the contents of /db, wherever our database lives. Of course, these data files are likely to be changing while they are being backed up, because the process itself can take minutes or hours depending on the size of the database.

Because of this, we call pg_start_backup to start the backup process. To begin with, it commits pending writes to the database files by performing a checkpoint. Afterwards, the database continues normal operation, but also keeps track of which transaction log files were produced during the backup. This is important for future restores.

Next we invoke pg_stop_backup to complete the backup. This command not only finishes the backup, but also creates a file with a .backup extension that identifies the first and last archive logs necessary to restore the database to full working order. We need the first, last, and every transaction log in between to restore, which is what the last set of commands is for.
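
For reference, the backup history file is plain text. Its exact fields vary between PostgreSQL versions, and the locations, timestamps, and label shown here are purely illustrative, but it resembles the following:

START WAL LOCATION: 0/9000020 (file 000000010000000000000009)
STOP WAL LOCATION: 0/9000158 (file 000000010000000000000009)
CHECKPOINT LOCATION: 0/9000058
BACKUP METHOD: pg_start_backup
START TIME: 2013-02-14 00:00:01 CST
LABEL: label
STOP TIME: 2013-02-14 00:12:47 CST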

Knowing that the most recent .backup file archived by the database contains this information, we parse it using various Unix commands to identify every file between the first marked archive log and the end of the backup itself; no required file is newer than the .backup file. All of these files are required to fully restore the database, and the process itself is fairly complicated.

We highly recommend implementing a more robust and tested version of the outlined steps, or using a preexisting backup library or third-party tool. For example, OmniPITR is often recommended. Our quick and dirty method works, but it should be fairly obvious why pg_basebackup automates and abstracts away most of the complexity in our example. We gain flexibility here, not ease of use.

There's more...

Now we should discuss exactly what kind of flexibility we may gain.

Parallel compression

Compressing files is very CPU intensive, which is why parallel compression utilities such as pigz and pbzip2 are very handy, and tar works very well with external programs. We can alter the archival command for the /db directory from the previous commands with the -I flag to choose our own compression program, as shown in the following command:

$> tar -c -I pigz -f backup.tar.gz /db

Alternatively, since pigz can take parameters for choosing the number of threads, or because certain versions of tar don't support the -I flag, we can send the output of tar to pigz instead by using the following command:

$> tar -c -f - /db | pigz -p 4 > backup.tar.gz

Unlike pg_basebackup, these tar commands work with complex databases that make extensive use of tablespaces. Each tablespace can be handled separately and compressed in parallel, drastically reducing compression time.

Some may argue that pg_basebackup does support tablespaces, and it does create .tar.gz files for every user-created tablespace in the database along with base.tar.gz. However, the tar output format of this tool will not stream to standard output if there are user-created tablespaces. This means that our trick of capturing the stream with pigz would not work in such advanced systems. Hence, we used tar in this example.
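
To make the limitation concrete, here is a rough sketch of the streaming pg_basebackup equivalent of our pigz trick; the connection defaults are assumed here, and the command is refused outright once user-created tablespaces exist:

$> pg_basebackup -Ft -D - | pigz -p 4 > base.tar.gz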

Making a tar backup standby-ready

With PostgreSQL, a database in standby or streaming mode will not have its own transaction logs while recovery is in progress, since it uses some other source of archived transaction logs to apply changes to the database. This means that backing these files up is often excessive. Remember, we mentioned that pg_basebackup omits them by default for similar reasons. Thankfully, tar can also exclude them, or any other paths. Again, we will modify the /db backup command as follows:

$> tar -c -z -f backup.tar.gz --exclude=pg_xlog /db

Now, when the backup.tar.gz file is extracted, it will contain no transaction logs of its own, so it is only suitable for standby or streaming replication, where those logs are supplied from elsewhere.
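
To illustrate how such a backup is typically consumed, here is a minimal recovery.conf sketch for a standby built from it; the host, port, and user settings are assumptions for this example:

standby_mode = 'on'
restore_command = 'cp /archive/%f %p'
primary_conninfo = 'host=primary-db port=5432 user=replication'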

Backing up tablespaces

We keep talking about tablespaces, but how do we handle them? PostgreSQL tablespaces reside in the pg_tblspc subdirectory of the database instance. To back these up separately, we want to tell tar to ignore them using the following command:

$> tar -c -z -f backup.tar.gz --exclude=pg_tblspc /db

Now we can back up all of the tablespaces that live in that directory. Something like the following bash snippet could do the job nicely:

# Each entry in pg_tblspc is a symlink named after the tablespace OID.
for f in $(find /db/pg_tblspc -mindepth 1 -maxdepth 1); do
    d=$(readlink -f $f)       # real directory where the tablespace lives
    n=$(basename $f)          # tablespace OID, used as the file name
    tar -czf /backup/$n.tar.gz -C $d .
done

Each tablespace will be compressed in a separate file named after its database object identifier. Incidentally, this is exactly what pg_basebackup would do. But we can alter any of these commands as much as we desire. For instance, to introduce parallel compression we can use the following shell code:

for f in $(find /db/pg_tblspc -mindepth 1 -maxdepth 1); do
    d=$(readlink -f $f)
    n=$(basename $f)
    tar -c -f - -C $d . | pigz -p 8 > /backup/$n.tar.gz
done

Now we can do something pg_basebackup can't, and use parallel compression on all tablespaces in the database instead of just the default compression. That is just one example of what we can modify within the backup process itself. It is a bit more work, but scripts can automate most or all of these extra steps.
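
One small addition worth scripting is a record of which object identifier maps to which tablespace, since the backup files are named only by OID. A simple catalog query covers it; the /backup/tablespaces.txt destination is just a suggestion:

$> psql -c "SELECT oid, spcname FROM pg_tablespace;" postgres > /backup/tablespaces.txt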

Backing up archived transaction logs

Since our database is in archive mode, it may be producing archived transaction logs at a prodigious rate. If we want to perform Point In Time Recovery (PITR), or set up certain types of database standby, it is generally a good idea to preserve these to long-term storage as well. For example, if executed at midnight, the following snippet of bash would compress the previous day's archived logs:

# List yesterday's archived logs by name only, in a fresh manifest.
find /archive/ -type f -name '0000*' -mtime +0 \
       -printf '%f\n' > /archive/MANIFEST
# Compress them in parallel into a dated archive.
tar -C /archive/ --files-from=/archive/MANIFEST -c -I pigz \
      -f /backup/axlog.$(date -d '-1 day' +"%Y-%m-%d").tar.gz
# Remove the logs we just archived.
find /archive/ -type f -name '0000*' -mtime +0 -delete

Now these backup files can be stored elsewhere for PITR restores. For very important databases, being able to use old backup files and transaction logs means that the database can be restored to any previous time since its creation. Always be sure to have access to old archive logs if this is a desired option.
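
As a rough sketch of how these pieces fit together, a PITR restore would extract a binary backup, unpack the relevant axlog tarballs back into /archive, and then point a recovery.conf at that directory; the target timestamp below is purely an example:

restore_command = 'cp /archive/%f %p'
recovery_target_time = '2013-02-14 08:00:00'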