8 Steps to Proactively Handle PostgreSQL Database Disaster Recovery
When disaster strikes, whether a natural disaster or a technical event, the impact on your network, database, and end users can mean costly downtime and corrupted data. Data corruption, whether sparked by hardware failures like dying disks or faulty RAM, software glitches such as operating system bugs, or human missteps like accidental overwrites, is a terrifying prospect for any administrator. Yet, it's not a death sentence.
Your PostgreSQL database is typically a dependable cornerstone of your operations. Still, when disaster strikes, it can swiftly morph into an inaccessible liability, bringing applications to a grinding halt and leaving critical data at risk. PostgreSQL 17 arms you with an enhanced arsenal to tackle this challenge head-on, offering built-in tools like pg_amcheck for pinpointing corruption, improved failover slot synchronization to keep replication intact during recovery, and finer-grained Write-Ahead Logging (WAL) control for precise restoration. In this blog, we'll dive deep into the disaster management process, equipping you with real-world commands and expected outputs to diagnose corruption accurately and recover effectively, whether you're restoring from a robust backup or salvaging scraps from a crippled cluster with no safety net. With the right approach, you can turn panic into a plan and restore order to your database.
Step 1: Detecting Corruption in PostgreSQL
Corruption usually doesn’t introduce itself politely; it sneaks in through failed queries, panicked logs, or startup errors. Identifying corruption is the first step towards resolving it.
Check the PostgreSQL Log Files
Start by inspecting the log files. Typically, you'll find them in /var/log/postgresql/ or $PGDATA/log. Within the log files, entry headers indicate the severity level of each entry; for example:
ERROR: could not read block 128 in file "base/16384/12571": read only 0 of 8192 bytes
PANIC: invalid page in block 42 of relation base/16384/16728
LOG: startup process (PID 1234) was terminated by signal 6: Aborted
FATAL: the database system is in recovery mode
Severity levels indicate:
● ERROR: Read failure, possibly a disk-level issue.
● PANIC: Serious corruption; PostgreSQL crashed to prevent further damage.
● FATAL: The server is trying to recover from an unclean shutdown.
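As a quick triage step, you can scan the most recent log for these severities; the log file name below is only an example and will vary by distribution and configuration:
grep -E 'ERROR|PANIC|FATAL' /var/log/postgresql/postgresql-17-main.log | tail -n 20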
Run pg_amcheck
Next, use pg_amcheck to scan the database's tables and indexes for corruption:
pg_amcheck -d mydb --all
If your database is healthy, pg_amcheck returns:
No failures detected in database "mydb"
If pg_amcheck detects corruption, you'll see output like:
heap table "public.orders": block 128 is not valid
btree index "public.orders_idx": block 22 contains invalid tuple
Best Practice: Include the --heapallindexed or --rootdescend flags when you invoke pg_amcheck for deeper validation.
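For example, a deeper scan of the same database could look like this:
pg_amcheck -d mydb --all --heapallindexed --rootdescend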
Optional: Verify your checksums
If you specified the --data-checksums flag when you initialized your PostgreSQL cluster (when running initdb), you can use the pg_checksums tool to detect low-level, file-based corruption across data blocks. pg_checksums provides an integrity safeguard, allowing PostgreSQL to verify whether the data on the disk has been altered unexpectedly—due to bit rot, disk failures, or faulty hardware.
pg_checksums must be run while the PostgreSQL server is stopped and will only work if checksums were enabled during cluster initialization. Running pg_checksums is especially important after unclean shutdowns, system crashes, or if you suspect disk issues.
A clean report indicates that your data blocks are intact, while checksum failures identify specific block-level corruption in table or index files. SQL queries can map these file identifiers (like base/16384/12571) to table names.
The tool doesn't fix anything—it simply reports which blocks are damaged, allowing you to take the appropriate steps to recover (e.g., restore from backup, isolate affected tables, or investigate hardware issues). Always consider enabling checksums in production environments for better observability and earlier corruption detection.
To run a check, stop the server and point pg_checksums at the data directory:
sudo systemctl stop postgresql
pg_checksums -c -D /var/lib/pgsql/data
Checksum verification failed in file "base/16384/12571", block 128
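Once the server is running again, you can map a damaged file back to a relation. For example, for the base/16384/12571 file reported above (assuming it lives in the database's default tablespace), a query like this returns the relation name:
psql -d mydb -c "SELECT pg_filenode_relation(0, 12571);"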
Best Practice: Enable data checksums when you initialize each new cluster. To do so, include the --data-checksums option when you invoke initdb:
initdb --data-checksums -D /var/lib/pgsql/data
Step 2: Stop PostgreSQL Immediately
When you find data corruption, you should prevent further damage by halting the service:
sudo systemctl stop postgresql
Then confirm that the service is down:
sudo systemctl status postgresql
postgresql.service - PostgreSQL RDBMS
Active: inactive (dead)
Step 3: Restore from a Known Good Backup
pgBackRest is a robust and efficient backup and restore solution for PostgreSQL that supports full, differential, and incremental backups, with compression, encryption, parallel processing, and offsite storage (for example, S3). pgBackRest is designed to handle large-scale environments with high performance and minimal impact on the database server. pgBackRest also simplifies disaster recovery by offering automated restore processes, archive management, and point-in-time recovery (PITR) capabilities.
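For reference, a minimal pgbackrest.conf for a locally mounted repository might look like the sketch below; the stanza name (main), repository path, and data directory are assumptions that must match your environment:
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2

[main]
pg1-path=/var/lib/pgsql/data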
Clean and Restore the Cluster with pgBackRest
Before you restore, take a backup of the corrupted (old) data directory:
cp -rf /var/lib/pgsql/data /var/lib/pgsql/data_backup
After confirming that the backup is saved, wipe the old data directory:
rm -rf /var/lib/pgsql/data/*
Then, restore from your last known good backup:
pgbackrest --stanza=main restore --pg1-path=/var/lib/pgsql/data
INFO: restore command begin
INFO: restored file base/16384/12571 (16MB, 50%) checksum verified
INFO: restore command end: completed successfully
After the restore completes, ensure the data directory is owned by the PostgreSQL user:
chown -R postgres:postgres /var/lib/pgsql/data
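With ownership corrected, you would typically start the server and spot-check the restored data; the orders table below is just an example:
sudo systemctl start postgresql
psql -d mydb -c "SELECT count(*) FROM orders;"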
Step 4: Use Point-in-Time Recovery (PITR)
A backup strategy that supports point-in-time recovery lets you replay archived WAL and stop just before the corruption occurred.
Configure Recovery
Add the following settings to your postgresql.conf file:
restore_command = 'cp /mnt/backup/wal/%f %p'
recovery_target_time = '2025-03-25 13:59:00 UTC'
Create the recovery trigger:
touch /var/lib/pgsql/data/recovery.signal
When you start PostgreSQL, you can watch the server recover to the point in time that you specified in the recovery_target_time parameter:
sudo systemctl start postgresql
LOG: starting point-in-time recovery to "2025-03-25 13:59:00 UTC"
LOG: restored log file "000000010000000000000005" from archive
LOG: consistent recovery state reached
LOG: recovery stopping before commit of transaction 123
LOG: database system is ready to accept connections
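Note that with the default recovery_target_action = 'pause', the server pauses once the recovery target is reached; after you have verified the data, you can end recovery and open the database for writes with:
psql -c "SELECT pg_wal_replay_resume();"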
Best Practice: Maintain WAL archives and regular base backups so that point-in-time recovery can return you to a clean state just before the corruption occurred.
Step 5: Salvage What You Can
If you don’t have a backup but some tables still work, you can use pg_dump and other Postgres tools to extract what you can.
First, confirm which tables are still readable; a quick row count will fail on a damaged table:
psql -d mydb -c "SELECT count(*) FROM customers;"
Then use pg_dump to save the definitions and data of each readable table:
pg_dump -t customers mydb > customers.sql
Then, create a new cluster:
initdb -D /var/lib/pgsql/new_data --data-checksums
pg_ctl -D /var/lib/pgsql/new_data -o "-p 5433" -l logfile start
Then, restore the salvaged data into the new cluster (running here on port 5433 to avoid clashing with the original instance):
createdb -p 5433 newdb
psql -p 5433 -d newdb < customers.sql
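Roles and other global objects are not included in a per-table pg_dump; if the damaged cluster still accepts connections, consider capturing them as well, for example:
pg_dumpall --globals-only > globals.sql
psql -p 5433 -d postgres < globals.sql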
Best Practice: Maintain a dependable backup strategy for any data that you can't afford to lose. In a crisis, you can use these steps to restore salvaged data, but the restoration may not be complete, and you will still need to manually review and recreate schema objects that may have been damaged. These steps will leave you with a partial recovery in a clean environment.
Step 6: Use pg_resetwal as the Last Resort
pg_resetwal is a low-level PostgreSQL utility used to forcibly reset a database cluster's write-ahead log (WAL), typically used as a last resort when the server cannot start due to missing or corrupted WAL files. This tool should be used cautiously, as it bypasses normal crash recovery and may lead to data inconsistency or loss of recent transactions. It is only safe to run when you are sure the data files are in a consistent state or when you're attempting to salvage partial data from a failed system.
Only use this tool if all else fails. It resets WAL records, risking transaction loss and corruption.
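Before forcing a reset, you can do a dry run, which prints the values pg_resetwal would use without modifying anything:
pg_resetwal -n /var/lib/pgsql/data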
pg_resetwal -f /var/lib/pgsql/data
Write-ahead log reset
sudo systemctl start postgresql
LOG: database system is ready to accept connections
Note: Transactions committed since the last checkpoint may be lost, and the cluster may be left logically inconsistent; proceed only after exhausting the other options and, ideally, after consulting experts.
Step 7: Prevent Future Corruption
Don't let this happen again. PostgreSQL 17 gives you excellent tools to stay protected. In summary, the best practices that help you avoid a disaster, or at least recover from it quickly, are:
Enable checksums when you initialize your cluster
initdb --data-checksums -D /var/lib/pgsql/data
Automate backups with pgBackRest
pgbackrest --stanza=main --type=full backup
pgbackrest --stanza=main --type=incr backup
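For example, a crontab schedule that takes a full backup every Sunday and incremental backups on the other nights (times are an assumption; adjust to your maintenance window) could look like:
0 1 * * 0 pgbackrest --stanza=main --type=full backup
0 1 * * 1-6 pgbackrest --stanza=main --type=incr backup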
Run regular integrity checks with pg_amcheck
pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +%F).log
Create a simple cron job to run pg_amcheck nightly, for example at 2:00 AM:
0 2 * * * pg_amcheck -d mydb --all > /var/log/pg_amcheck_$(date +\%F).log 2>&1
Step 8: Embrace High Availability and WAL Archiving
If you have configured streaming replication for high availability and maintain standby nodes, you can promote a standby if the primary fails:
pg_ctl promote -D /var/lib/pgsql/standby
Ensure that you have configured WAL Archiving for PITR; in your postgresql.conf file, set:
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'
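After enabling archiving (changing archive_mode requires a server restart), you can confirm that WAL segments are being archived successfully by querying the archiver statistics:
psql -c "SELECT archived_count, last_archived_wal, failed_count FROM pg_stat_archiver;"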
Conclusion
Disaster recovery in PostgreSQL demands quick action and careful planning, and PostgreSQL 17 significantly strengthens your ability to respond. You can handle even the most critical failures with tools like pg_amcheck for corruption detection, pgBackRest for reliable backups and PITR, and pg_resetwal for last-resort recovery. Whether restoring from a clean backup, recovering to a point just before the disaster, or salvaging data from a damaged cluster, this post walks you through every step with actual commands and practical advice.
Remember that recovery doesn't start when something breaks. It begins with preparation. Make it a goal to turn your PostgreSQL database into a resilient, self-defending system by enabling data checksums, automating backups, monitoring for corruption, and setting up high availability with WAL archiving. Disaster may strike, but with the right tools and approach, recovery is well within reach.