Preserving replication slots across major Postgres versions - PostgreSQL high availability for major upgrades
In this blog (the third in my series), I'd like to present yet another new feature in the PostgreSQL 17 release: an enhancement to logical replication functionality in PostgreSQL. The blog also provides a small script that demonstrates how to use this feature when upgrading from Postgres 17 to a future version. In my prior blogs (also published on Planet PostgreSQL and DZone), I have written about other PG-17 features.
PostgreSQL 17 is a really powerful major release from the PG community - with this new release, the community's focus continues to be on making PostgreSQL even more performant, scalable, secure, and enterprise-ready. Postgres 17 also improves the developer experience by adding new features for compatibility and making existing features more powerful and robust.
These features also help products that provide distributed Postgres improve their PostgreSQL high availability (HA) experience, especially for system upgrades across major versions. pgEdge provides a PostgreSQL-based distributed database platform for low latency, ultra-high availability, and data residency. The ultra-high availability capabilities of pgEdge ensure that major PostgreSQL upgrades can be done with nearly zero downtime, so your applications can continue to work without user interruption. There is work in progress that will provide a path to a zero-downtime upgrade by adding and removing nodes from the cluster; this is not the main topic of this blog, but stay tuned for more about this functionality.
The diagram below (from one of my older blogs, updated for PG-17) describes the evolution of the logical replication feature in PostgreSQL. The building blocks for logical replication were added in PostgreSQL 9.4, but the logical replication feature wasn't added until PostgreSQL 10. Since then, there have been a number of important improvements to logical replication.
Preserving Replication Slots
Now, coming back to our topic: this new feature makes it possible to preserve logical replication slots while performing upgrades between major versions of Postgres, eliminating the requirement to resync the data between two nodes that were replicating data using logical replication. Please note that this feature is only available when performing upgrades from Postgres 17 to future major versions. Upgrades from versions prior to Postgres 17 still need to follow the process of recreating the replication slots and recreating the subscriptions, which resyncs the data between the replicating nodes; a sketch of that older workflow is shown below.
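As a rough illustration of the pre-17 path, the subscriber side typically has to be rebuilt against the upgraded publisher, which triggers a full table resync. This is only a sketch: the subscription name sub1, publication name pub1, and connection string are hypothetical, and the exact steps depend on your setup.
-- On the subscriber: drop the old subscription and recreate it against the
-- upgraded publisher. (If the old slot no longer exists on the publisher, you
-- may first need to disable the subscription and run
-- ALTER SUBSCRIPTION sub1 SET (slot_name = NONE).)
DROP SUBSCRIPTION sub1;
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=upgraded-publisher port=5432 dbname=postgres'
    PUBLICATION pub1
    WITH (copy_data = true);  -- re-copies the tables: the resync the new feature avoids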
This patch was authored by Hayato Kuroda and Hou Zhijie and committed by Amit Kapila. Here is the Postgres commit log entry for this feature:
commit 29d0a77fa6606f9c01ba17311fc452dabd3f793d
Author: Amit Kapila <[email protected]>
Date: Thu Oct 26 06:54:16 2023 +0530
Migrate logical slots to the new node during an upgrade.
While reading information from the old cluster, a list of logical
slots is fetched. At the later part of upgrading, pg_upgrade revisits the
list and restores slots by executing pg_create_logical_replication_slot()
on the new cluster. Migration of logical replication slots is only
supported when the old cluster is version 17.0 or later.
If the old node has invalid slots or slots with unconsumed WAL records,
the pg_upgrade fails. These checks are needed to prevent data loss.
The significant advantage of this commit is that it makes it easy to
continue logical replication even after upgrading the publisher node.
Previously, pg_upgrade allowed copying publications to a new node. With
this patch, adjusting the connection string to the new publisher will
cause the apply worker on the subscriber to connect to the new publisher
automatically. This enables seamless continuation of logical replication,
even after an upgrade.
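In practice, once the publisher has been upgraded this way, the subscriber only needs its connection string adjusted. A minimal sketch, assuming a subscription named sub1 and a hypothetical new host name (the PostgreSQL documentation describes the full procedure for upgrading logical replication clusters):
-- On the subscriber, before the old publisher is shut down for pg_upgrade:
ALTER SUBSCRIPTION sub1 DISABLE;
-- After the upgraded publisher is back online, point the subscription at it and resume:
ALTER SUBSCRIPTION sub1 CONNECTION 'host=new-publisher port=5432 dbname=postgres';
ALTER SUBSCRIPTION sub1 ENABLE;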
Sample Script
Now let's write a little script to test this feature; as I've mentioned before, this only works when you are upgrading from Postgres 17 to a future major release. Any replication slots on the old cluster that are invalid or have unconsumed WAL will need to be addressed prior to the upgrade, or the upgrade will fail.
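pg_upgrade runs these checks itself, but you can take an early look at the slots on the old cluster before starting. The query below is only a rough pre-flight sketch: it lists logical slots, any invalidation reason, and how each slot's confirmed position compares with the current WAL insert position.
-- Run on the old cluster before the upgrade.
SELECT slot_name,
       invalidation_reason,
       confirmed_flush_lsn,
       pg_current_wal_insert_lsn() AS current_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';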
Please note that the script below uses Postgres 17.2 for both the old and new clusters to demonstrate the functionality; we'll have to wait for another major version to become available before we can show the feature at its best. The results from running the script are also listed below, showing that the replication slot created in the old cluster has been copied over to the new cluster.
#!/bin/bash
# Declaration part
OLD_BIN=/home/pgedge/pg17/bin
NEW_BIN=/home/pgedge/pg17/bin
# Cleanup previous results
pg_ctl stop -D data_new
pg_ctl stop -D data_old
rm -rf data_* *log
# Initialize old node
initdb -D data_old -U pgedge
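# wal_level must be logical to define a logical replication slot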
cat << EOF >> data_old/postgresql.conf
wal_level=logical
EOF
pg_ctl start -D data_old -l old.log
# Define a logical replication slot
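# test_decoding is the example output plugin shipped in contrib; any output plugin would work here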
psql -U pgedge -d postgres -c "SELECT * FROM pg_create_logical_replication_slot('test', 'test_decoding');"
# Stop the old node to run pg_upgrade
pg_ctl stop -D data_old
# Initialize a new node. This will be a target of the
# pg_upgrade command
initdb -D data_new -U pgedge
# wal_level must be logical to define a logical replication slot
cat << EOF >> data_new/postgresql.conf
wal_level=logical
EOF
# Run pg_upgrade command. If the old node is PG17+, "logical"
# replication slots will be copied to the new node
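# pg_upgrade will fail if any logical slot on the old node is invalid or has unconsumed WAL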
pg_upgrade -b $OLD_BIN -B $NEW_BIN -d data_old -D data_new -U pgedge
# Let's confirm the existence of the slot on the new node
pg_ctl start -D data_new -l new.log
psql -U pgedge -d postgres -c "SELECT * FROM pg_replication_slots;"
pg_ctl stop -D data_new
Results from running the script:
./17_upgrade.sql
pg_ctl: PID file "data_new/postmaster.pid" does not exist
Is server running?
pg_ctl: PID file "data_old/postmaster.pid" does not exist
Is server running?
The files belonging to this database system will be owned by user "pgedge".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory data_old ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... Asia/Karachi
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
pg_ctl -D data_old -l logfile start
waiting for server to start.... done
server started
slot_name | lsn
-----------+-----------
test | 0/14D6490
(1 row)
waiting for server to shut down.... done
server stopped
The files belonging to this database system will be owned by user "pgedge".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory data_new ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... Asia/Karachi
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
Success. You can now start the database server using:
pg_ctl -D data_new -l logfile start
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
Checking database user is the install user ok
Checking database connection settings ok
Checking for prepared transactions ok
Checking for contrib/isn with bigint-passing mismatch ok
Checking for valid logical replication slots ok
Checking for subscription state ok
Checking data type usage ok
Creating dump of global objects ok
Creating dump of database schemas ok
Checking for presence of required libraries ok
Checking database user is the install user ok
Checking for prepared transactions ok
Checking for new cluster tablespace directories ok
Checking for new cluster logical replication slots ok
If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.
Performing Upgrade
------------------
Setting locale and encoding for new cluster ok
Analyzing all rows in the new cluster ok
Freezing all rows in the new cluster ok
Deleting files from new pg_xact ok
Copying old pg_xact to new server ok
Setting oldest XID for new cluster ok
Setting next transaction ID and epoch for new cluster ok
Deleting files from new pg_multixact/offsets ok
Copying old pg_multixact/offsets to new server ok
Deleting files from new pg_multixact/members ok
Copying old pg_multixact/members to new server ok
Setting next multixact ID and offset for new cluster ok
Resetting WAL archives ok
Setting frozenxid and minmxid counters in new cluster ok
Restoring global objects in the new cluster ok
Restoring database schemas in the new cluster ok
Copying user relation files. ok
Setting next OID for new cluster ok
Restoring logical replication slots in the new cluster. ok
Sync data directory to disk ok
Creating script to delete old cluster ok
Checking for extension updates ok
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
/home/pgedge/pg17/bin/vacuumdb -U pgedge --all --analyze-in-stages
Running this script will delete the old cluster's data files:
./delete_old_cluster.sh
waiting for server to start.... done
server started
slot_name           | test
plugin              | test_decoding
slot_type           | logical
datoid              | 5
database            | postgres
temporary           | f
active              | f
active_pid          |
xmin                |
catalog_xmin        | 760
restart_lsn         | 0/80000D8
confirmed_flush_lsn | 0/8000110
wal_status          | reserved
safe_wal_size       |
two_phase           | f
inactive_since      | 2025-01-23 18:13:59.623016+05
conflicting         | f
invalidation_reason |
failover            | f
synced              | f
(1 row)
waiting for server to shut down.... done
server stopped
Conclusion - Distributed PostgreSQL
It is fair to say that logical replication has really matured and stabilized over the years. However, providing true multi-master capabilities with logical replication requires more extensive functionality, including conflict management, automatic DDL replication, and so on. Over the last two years, pgEdge has developed true multi-master capabilities built on top of the logical replication infrastructure. The pgEdge Distributed PostgreSQL solution provides ultra-high availability, low latency, and data residency features. It can be deployed using pgEdge Cloud managed services in the cloud, or you can use the pgEdge Platform Enterprise Edition to deploy on-premises, in your own cloud accounts, and in air-gapped environments. The conflict handling capabilities of a pgEdge Distributed PostgreSQL cluster are robust and sophisticated, with the ability to automatically manage conflicts on INSERTs, UPDATEs, and DELETEs. These capabilities are requirements for a true multi-master setup where an application sends writes to multiple cluster nodes in parallel.
I want to thank the authors Hayato Kuroda and Hou Zhijie for this important feature, and Hayato for helping me with the script to test the feature.