PostgreSQL Disaster Recovery#

A DR plan for PostgreSQL has three layers: streaming replication for fast failover, WAL archiving for point-in-time recovery, and a backup tool like pgBackRest for managing retention. Each layer covers a different failure mode – replication for server crashes, WAL archiving for data corruption that replicates, full backups for when everything goes wrong.

Streaming Replication for DR#

Synchronous vs Asynchronous – The Core Tradeoff#

Asynchronous replication is the default. The primary streams WAL to the standby but does not wait for confirmation before acknowledging a commit. The primary stays fast, but the standby can be seconds behind. If the primary dies, any transactions committed on the primary but not yet received by the standby are lost.

Synchronous replication makes the primary wait until the standby confirms it has written the WAL to disk. Zero data loss, but every write now includes the round-trip latency to the standby.

Same-region sync replication adds 1-3ms per commit. Cross-region sync replication adds 30-80ms per commit, which caps a single session at roughly 12-33 commits per second. For a workload doing 500 commits/second, cross-region sync replication is usually not viable.
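
The per-session ceiling falls straight out of the round-trip time. A quick back-of-the-envelope sketch (the 60 ms figure is an illustrative cross-region RTT, not a measurement):

```shell
# A single session can commit at most once per round trip.
rtt_ms=60                                   # assumed cross-region RTT
max_commits_per_session=$((1000 / rtt_ms))  # commits/sec per session
echo "single-session ceiling: ${max_commits_per_session} commits/sec"

# To sustain 500 commits/sec, writes must be spread across at least
# this many concurrently committing sessions (ceiling division):
target_tps=500
min_sessions=$(( (target_tps + max_commits_per_session - 1) / max_commits_per_session ))
echo "sessions needed for ${target_tps} tps: ${min_sessions}"
```

With a 60 ms round trip, one session tops out around 16 commits/second, so the 500/second workload only survives if it is spread across dozens of concurrently committing connections.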

# postgresql.conf on primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = on

A practical middle ground: use synchronous_commit = remote_write instead of on. The primary waits for the standby to receive the WAL and write it to the OS cache, but not to flush it to disk. This protects against primary failure but not against simultaneous primary and standby power loss.
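
As a config sketch, this is the earlier block with one line changed:

```ini
# postgresql.conf on primary
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
synchronous_commit = remote_write  # wait for standby receive + OS write, not flush
```

synchronous_commit can also be changed per session or per transaction (`SET synchronous_commit = on`), so only the writes that need the full guarantee pay the latency.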

Initial Seeding with pg_basebackup#

The standby needs a full copy of the primary’s data directory before it can start receiving WAL.

pg_basebackup -h primary.db.internal -U replicator \
  -D /var/lib/postgresql/16/main \
  --checkpoint=fast --wal-method=stream -R -P

The -R flag writes standby.signal and connection info into postgresql.auto.conf so the standby connects to the primary on startup. For a 500 GB database, expect this to take 30-60 minutes over a 1 Gbps link.
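
pg_basebackup connects as a replication client, so the primary needs a role with the REPLICATION attribute and a matching pg_hba.conf entry. A sketch (the 10.0.0.0/8 range mirrors the Patroni example further down; substitute your own network and a real password):

```sql
-- On the primary: create the dedicated replication role
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'change-me';
```

```
# pg_hba.conf on the primary
host  replication  replicator  10.0.0.0/8  scram-sha-256
```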

WAL Archiving to Object Storage#

Streaming replication is only as durable as your replicas. If a bug corrupts data, that corruption replicates. WAL archiving to S3 or GCS gives you an independent timeline you can recover to any point within your retention window.

Configure Archive Command#

# postgresql.conf
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
archive_timeout = 60

Using pgBackRest’s archive-push instead of a raw aws s3 cp gives you parallel asynchronous archiving, compression, checksums, and optional encryption.
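
The push has a matching pull: any node that replays from the archive (a standby catching up, or a PITR restore) needs the corresponding restore_command:

```ini
# postgresql.conf on a standby or recovery node
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
```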

pgBackRest Configuration#

# /etc/pgbackrest/pgbackrest.conf
[main]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432

[global]
repo1-type=s3
repo1-s3-bucket=mycompany-pg-backups
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com
repo1-path=/pgbackrest
repo1-retention-full=4
repo1-retention-diff=14
repo1-cipher-type=aes-256-cbc
repo1-cipher-pass=<encryption-passphrase>
compress-type=zst
compress-level=3

[global:archive-push]
compress-level=3
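
Before the first backup, the repository has to be initialized for the stanza, and it is worth verifying that archiving actually round-trips. Both are standard pgBackRest subcommands:

```shell
# One-time: create the stanza in the repository
pgbackrest --stanza=main stanza-create

# Verify archive_command, the archive, and the repo are wired up correctly
pgbackrest --stanza=main check
```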

Backup Schedule#

# Full backup weekly (Sunday 2 AM)
0 2 * * 0  pgbackrest --stanza=main --type=full backup
# Differential backup daily (2 AM other days)
0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup

A 500 GB database typically produces a 100-120 GB compressed full backup. Differentials are usually 5-15% of full size depending on churn.
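
Those numbers plus the retention settings above give a rough repository footprint. A sketch using midpoint figures (110 GB full, 10% diffs; your churn will vary, and the WAL archive adds more on top):

```shell
full_gb=110                 # assumed compressed full backup size
diff_gb=$((full_gb / 10))   # assumed ~10% differential
fulls_retained=4            # repo1-retention-full
diffs_retained=14           # repo1-retention-diff
repo_gb=$((fulls_retained * full_gb + diffs_retained * diff_gb))
echo "approximate repository size: ${repo_gb} GB (excluding WAL archive)"
```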

Point-in-Time Recovery (PITR)#

This is the scenario: someone ran DELETE FROM orders without a WHERE clause at 14:32:17 UTC on Tuesday. You need to recover the database to 14:32:16 UTC.

Step 1 – Identify the Target Time#

-- Check the current WAL position before starting
SELECT pg_current_wal_lsn(), now();

Talk to the team. Confirm the exact timestamp. Get it wrong and you either lose more data or include the bad transaction.
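
If nobody can pin down the exact second, pg_waldump can: it decodes WAL segments so you can find the offending transaction's commit record and its timestamp. A sketch (the segment name and path are illustrative; point it at your own pg_wal directory):

```shell
# Look for commit records around the incident window
pg_waldump /var/lib/postgresql/16/main/pg_wal/00000001000000A10000003C \
  | grep -i 'Transaction.*COMMIT'
```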

Step 2 – Restore with pgBackRest#

# Stop PostgreSQL
sudo systemctl stop postgresql-16

# Restore to a new data directory (safer than overwriting)
pgbackrest --stanza=main --type=time \
  --target="2026-02-18 14:32:16+00" \
  --target-action=promote \
  --set=20260216-020005F \
  --pg1-path=/var/lib/postgresql/16/recovery \
  restore

The --set flag picks which full backup to restore from. pgBackRest then replays WAL segments from the archive up to the target time.
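
The label passed to --set comes from the repository listing. pgbackrest info prints each backup with its label, type, size, and the WAL range it can replay from:

```shell
pgbackrest --stanza=main info
```

Pick the most recent full (or full plus differential chain) that completed before the target time.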

Step 3 – Verify Before Promoting#

Start PostgreSQL against the recovery directory in single-user mode or on a different port. Verify the data looks correct:

pg_ctl -D /var/lib/postgresql/16/recovery start -o "-p 5433"
psql -p 5433 -d myapp -c "SELECT count(*) FROM orders WHERE status = 'pending';"

Once verified, promote it or swap data directories.

Recovery Timing#

For a 500 GB database with 48 hours of WAL: restoring the base backup takes 20-40 minutes from S3, replaying WAL takes 10-30 minutes depending on write volume. Total recovery: 30-70 minutes. This is your actual RTO for PITR scenarios.
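
The base-restore portion is dominated by transfer and decompression speed. A sanity check of the 20-40 minute figure, assuming an effective 250-500 MB/s from S3 with parallel restore processes (both endpoints are assumptions, not measurements):

```shell
size_mb=$((500 * 1024))            # 500 GB database
slow=$(( size_mb / 250 / 60 ))     # minutes at 250 MB/s effective
fast=$(( size_mb / 500 / 60 ))     # minutes at 500 MB/s effective
echo "base restore: ~${fast}-${slow} minutes"
```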

Cross-Region DR with Patroni#

Patroni manages automated failover within a region. For cross-region DR, run a Patroni standby cluster in the DR region that replicates from the primary cluster.

# patroni.yml for DR cluster
bootstrap:
  dcs:
    standby_cluster:
      host: primary-pgbouncer.us-east-1.internal
      port: 5432
      create_replica_methods:
        - basebackup
  pg_hba:
    - host replication replicator 10.0.0.0/8 scram-sha-256
    - host all all 10.0.0.0/8 scram-sha-256

postgresql:
  parameters:
    hot_standby: 'on'
    wal_level: replica

The DR cluster runs as a cascading standby. It cannot accept writes. To fail over to the DR region:

# Remove the standby_cluster configuration from DCS
patronictl -c /etc/patroni/patroni.yml edit-config
# Delete the standby_cluster block, save

# The DR cluster will automatically promote a member to primary

This gives you a cross-region RTO of 2-5 minutes (detect failure + operator decision + Patroni promotion). The RPO depends on replication lag, typically 1-10 seconds for async cross-region.
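
To watch the promotion happen, patronictl list shows each member's role and state; after the standby_cluster block is removed, one DR member should flip from Standby Leader to Leader:

```shell
patronictl -c /etc/patroni/patroni.yml list
```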

Monitoring Replication Lag#

-- On the primary: check each standby's lag
SELECT client_addr, state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
  replay_lag
FROM pg_stat_replication;

Alert if replay lag exceeds your RPO. If your RPO is 30 seconds, alert at 15 seconds so you have time to investigate before it matters.
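
A toy version of that alert rule as a shell check. In production the lag value would come from the pg_stat_replication query above (e.g. via psql -At); here it is hardcoded for illustration:

```shell
rpo_seconds=30
alert_threshold=$((rpo_seconds / 2))   # alert at half the RPO
replay_lag_seconds=18                  # example measurement, normally from psql

if [ "$replay_lag_seconds" -gt "$alert_threshold" ]; then
  echo "ALERT: replay lag ${replay_lag_seconds}s exceeds ${alert_threshold}s"
fi
```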

Testing Failover#

Untested failover is not failover. Run this quarterly.

# On the primary: simulate a crash
sudo systemctl stop postgresql-16

# On the standby: promote
pg_ctl promote -D /var/lib/postgresql/16/main

# Verify the standby is accepting writes
psql -c "CREATE TABLE failover_test (id serial); DROP TABLE failover_test;"

Time every step. Document the actual RTO. The first time, something will go wrong – misconfigured pg_hba.conf, stale DNS, an application connection string that does not support failover. Finding these during a drill is the entire point.

Recovery Scenarios Summary#

| Scenario | Method | Typical RTO | RPO |
| --- | --- | --- | --- |
| Primary crash, same region | Patroni auto-failover | 10-30 seconds | 0 (sync) or seconds (async) |
| Primary crash, cross-region | Patroni standby cluster promote | 2-5 minutes | 1-10 seconds (async) |
| Data corruption (bad DELETE) | PITR from pgBackRest | 30-70 minutes | To the second |
| Complete region loss | pgBackRest restore from S3 | 30-90 minutes | Minutes (last archived WAL) |
| Backup verification | pgBackRest restore --type=immediate | 20-40 minutes | N/A |