8 Steps to Proactively Handle PostgreSQL Database Disaster Recovery
When disaster strikes, whether a natural disaster or a technical event, the impact on your network, database, and end users can mean costly downtime and corrupted data. Data corruption, whether sparked by hardware failures like dying disks or faulty RAM, software glitches such as operating system bugs, or human missteps like accidental overwrites, is a terrifying prospect for any administrator. Yet, it's not a death sentence.
Your PostgreSQL database is typically a dependable cornerstone of your operations. Still, when disaster strikes, it can swiftly morph into an inaccessible liability, bringing applications to a grinding halt and leaving critical data at risk. PostgreSQL 17 arms you with an enhanced arsenal to tackle this challenge head-on, offering built-in tools like pg_amcheck for pinpointing corruption, improved failover slot synchronization to keep replication intact during recovery, and finer-grained Write-Ahead Logging (WAL) control for precise restoration. In this blog, we'll dive deep into the disaster management process, equipping you with real-world commands and expected outputs to diagnose corruption accurately and recover effectively, whether you're restoring from a robust backup or salvaging scraps from a crippled cluster with no safety net. With the right approach, you can turn panic into a plan and restore order to your database.
Step 1: Detecting Corruption in PostgreSQL
Corruption usually doesn’t introduce itself politely; it sneaks in through failed queries, panicked logs, or startup errors. Identifying corruption is the first step towards resolving it.
Check the PostgreSQL Log Files
Start by inspecting the log files. Typically, you'll find them in /var/log/postgresql/ or $PGDATA/log. Within the log files, entry headers indicate the severity level of each entry; for example:
ERROR: could not read block 128 in file "base/16384/12571": read only 0 of 8192 bytes
PANIC: invalid page in block 42 of relation base/16384/16728
LOG: startup process (PID 1234) was terminated by signal 6: Aborted
FATAL: the database system is in recovery mode
Severity levels indicate:
● ERROR: Read failure, possibly a disk-level issue.
● PANIC: Serious corruption; PostgreSQL crashed to prevent further damage.
● FATAL: The server is trying to recover from an unclean shutdown.
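As a quick triage step, you can scan the most recent log for these severities; the log file name below is only an example and will vary by distribution and configuration:
grep -E 'ERROR|PANIC|FATAL' /var/log/postgresql/postgresql-17-main.log | tail -n 20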
Run pg_amcheck
Next, use pg_amcheck to scan the database's tables and indexes for corruption:
pg_amcheck -d mydb --all
If your database is healthy, pg_amcheck returns:
No failures detected in database "mydb"
If pg_amcheck detects corruption, you'll see output like:
heap table "public.orders": block 128 is not valid
btree index "public.orders_idx": block 22 contains invalid tuple
Best Practice: Include the --heapallindexed or --rootdescend flags when you invoke pg_amcheck for deeper validation.
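For example, a deeper scan of the same database could look like this:
pg_amcheck -d mydb --all --heapallindexed --rootdescend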
Optional: Verify your checksums
If you specified the --data-checksums flag when you initialized your PostgreSQL cluster (when running initdb), you can use the pg_checksums tool to detect low-level, file-based corruption across data blocks. pg_checksums provides an integrity safeguard, allowing PostgreSQL to verify whether the data on the disk has been altered unexpectedly—due to bit rot, disk failures, or faulty hardware.
pg_checksums must be run while the PostgreSQL server is stopped and will only work if checksums were enabled during cluster initialization. Running pg_checksums is especially important after unclean shutdowns, system crashes, or if you suspect disk issues.
A clean report indicates that your data blocks are intact, while checksum failures identify specific block-level corruption in table or index files. SQL queries can map these file identifiers (like base/16384/12571) to table names.
The tool doesn't fix anything—it simply reports which blocks are damaged, allowing you to take the appropriate steps to recover (e.g., restore from backup, isolate affected tables, or investigate hardware issues). Always consider enabling checksums in production environments for better observability and earlier corruption detection.
To run a check, stop the server and point pg_checksums at the data directory:
sudo systemctl stop postgresql
pg_checksums -c -D /var/lib/pgsql/data
Checksum verification failed in file "base/16384/12571", block 128
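Once the server is running again, you can map a damaged file back to a relation. For example, for the base/16384/12571 file reported above (assuming it lives in the database's default tablespace), a query like this returns the relation name:
psql -d mydb -c "SELECT pg_filenode_relation(0, 12571);"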
Best Practice: Enable data checksums when you initialize each new cluster. To do so, include the --data-checksums option when you invoke initdb:
initdb --data-checksums -D /var/lib/pgsql/data
Step 2: Stop PostgreSQL Immediately
When you find data corruption, you should prevent further damage by halting the service:
sudo systemctl stop postgresql
Then confirm that the service is down:
sudo systemctl status postgresql
postgresql.service - PostgreSQL RDBMS
Active: inactive (dead)
Step 3: Restore from a Known Good Backup
pgBackRest is a robust and efficient backup and restore solution for PostgreSQL that supports full, differential, and incremental backups, with compression, encryption, parallel processing, and offsite storage (for example, S3). pgBackRest is designed to handle large-scale environments with high performance and minimal impact on the database server. pgBackRest also simplifies disaster recovery by offering automated restore processes, archive management, and point-in-time recovery (PITR) capabilities.
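For reference, a minimal pgbackrest.conf for a locally mounted repository might look like the sketch below; the stanza name (main), repository path, and data directory are assumptions that must match your environment:
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2

[main]
pg1-path=/var/lib/pgsql/data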
Clean and Restore the Cluster with pgBackRest
Before you restore, take a backup of the corrupted (old) data directory:
cp -rf /var/lib/pgsql/data /var/lib/pgsql/data_backup
After confirming that the backup is saved, wipe the old data directory:
rm -rf /var/lib/pgsql/data/*
Then, restore from your last known good backup:
pgbackrest --stanza=main restore --pg1-path=/var/lib/pgsql/data
INFO: restore command begin
INFO: restored file base/16384/12571 (16MB, 50%) checksum verified
INFO: restore command end: completed successfully
After the restore completes, ensure the data directory is owned by the PostgreSQL user:
chown -R postgres:postgres /var/lib/pgsql/data
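With ownership corrected, you would typically start the server and spot-check the restored data; the orders table below is just an example:
sudo systemctl start postgresql
psql -d mydb -c "SELECT count(*) FROM orders;"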
Step 4: Use Point-in-Time Recovery (PITR)
A backup strategy that supports point-in-time recovery lets you replay archived WAL and stop just before the corruption occurred.
Configure Recovery
Add the following settings to your postgresql.conf file:
restore_command = 'cp /mnt/backup/wal/%f %p'
recovery_target_time = '2025-03-25 13:59:00 UTC'
Create the recovery trigger:
touch /var/lib/pgsql/data/recovery.signal
When you start PostgreSQL, you can watch the server recover to the point in time that you specified in the recovery_target_time parameter:
sudo systemctl start postgresql
LOG: starting point-in-time recovery to "2025-03-25 13:59:00 UTC"
LOG: restored log file "000000010000000000000005" from archive
LOG: consistent recovery state reached
LOG: recovery stopping before commit of transaction 123
LOG: database system is ready to accept connections
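Note that with the default recovery_target_action = 'pause', the server pauses once the recovery target is reached; after you have verified the data, you can end recovery and open the database for writes with:
psql -c "SELECT pg_wal_replay_resume();"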
Best Practice: Maintain WAL archives and regular base backups so that point-in-time recovery can return you to a clean state just before the corruption occurred.
Step 5: Salvage What You Can
If you don’t have a backup but some tables still work, you can use pg_dump and other Postgres tools to extract what you can.
First, confirm which tables are still readable; a quick row count will fail on a damaged table:
psql -d mydb -c "SELECT count(*) FROM customers;"
Then use pg_dump to save the definitions and data of each readable table:
pg_dump -t customers mydb > customers.sql
Then, create a new cluster:
initdb -D /var/lib/pgsql/new_data --data-checksums
pg_ctl -D /var/lib/pgsql/new_data -o "-p 5433" -l logfile start
Then, restore the salvaged data into the new cluster (running here on port 5433 to avoid clashing with the original instance):
createdb -p 5433 newdb
psql -p 5433 -d newdb < customers.sql
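Roles and other global objects are not included in a per-table pg_dump; if the damaged cluster still accepts connections, consider capturing them as well, for example:
pg_dumpall --globals-only > globals.sql
psql -p 5433 -d postgres < globals.sql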
Best Practice: Maintain a dependable backup strategy for any data that you can't afford to lose. In a crisis, you can use these steps to restore salvaged data, but the restoration may not be complete, and you will still need to manually review and recreate schema objects that may have been damaged. These steps will leave you with a partial recovery in a clean environment.
Step 6: Use pg_resetwal as the Last Resort
pg_resetwal is a low-level PostgreSQL utility used to forcibly reset a database cluster's write-ahead log (WAL), typically used as a last resort when the server cannot start due to missing or corrupted WAL files. This tool should be used cautiously, as it bypasses normal crash recovery and may lead to data inconsistency or loss of recent transactions. It is only safe to run when you are sure the data files are in a consistent state or when you're attempting to salvage partial data from a failed system.
Only use this tool if all else fails. It resets WAL records, risking transaction loss and corruption.
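Before forcing a reset, you can do a dry run, which prints the values pg_resetwal would use without modifying anything:
pg_resetwal -n /var/lib/pgsql/data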
pg_resetwal -f /var/lib/pgsql/data
Write-ahead log reset
sudo systemctl start postgresql
LOG: database system is ready to accept connections
Note: Transactions committed since the last checkpoint may be lost, and the cluster may be left logically inconsistent; proceed only after exhausting the other options and, ideally, after consulting experts.
Step 7: Prevent Future Corruption
Don't let this happen again. PostgreSQL 17 gives you excellent tools to stay protected. In summary, the best practices that help you avoid a disaster, or at least recover from it quickly, are:
Enable checksums when you initialize your cluster
initdb --data-checksums -D /var/lib/pgsql/data
Automate backups with pgBackRest
pgbackrest --stanza=main --type=full backup
pgbackrest --stanza=main --type=incr backup
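For example, a crontab schedule that takes a full backup every Sunday and incremental backups on the other nights (times are an assumption; adjust to your maintenance window) could look like:
0 1 * * 0 pgbackrest --stanza=main --type=full backup
0 1 * * 1-6 pgbackrest --stanza=main --type=incr backup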
Run regular integrity checks with pg_amcheck
pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +%F).log
Create a simple cron job to run pg_amcheck nightly, for example at 2:00 AM:
0 2 * * * pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +\%F).log 2>&1
Step 8: Embrace High Availability and WAL Archiving
If you have configured streaming replication for high availability and maintain standby nodes, you can promote a standby if the primary fails:
pg_ctl promote -D /var/lib/pgsql/standby
Ensure that you have configured WAL Archiving for PITR; in your postgresql.conf file, set:
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'
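After enabling archiving (changing archive_mode requires a server restart), you can confirm that WAL segments are being archived successfully by querying the archiver statistics:
psql -c "SELECT archived_count, last_archived_wal, failed_count FROM pg_stat_archiver;"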
Conclusion
Disaster recovery in PostgreSQL demands quick action and careful planning, and PostgreSQL 17 significantly strengthens your ability to respond. You can handle even the most critical failures with tools like pg_amcheck for corruption detection, pgBackRest for reliable backups and PITR, and pg_resetwal for last-resort recovery. Whether restoring from a clean backup, recovering to a point just before the disaster, or salvaging data from a damaged cluster, this post walks you through every step with actual commands and practical advice.
Remember that recovery doesn't start when something breaks. It begins with preparation. Make it a goal to turn your PostgreSQL database into a resilient, self-defending system by enabling data checksums, automating backups, monitoring for corruption, and setting up high availability with WAL archiving. Disaster may strike, but with the right tools and approach, recovery is well within reach.