Organizations and end users increasingly demand that applications be always on, always available, and always responsive, ensuring high availability, fault tolerance, and seamless scalability of databases becomes paramount. PostgreSQL, a robust and versatile relational database management system, offers several replication methods to meet these needs. This blog will delve into the intricacies of multi-master replication (MMR) in PostgreSQL, exploring its architectures, solutions, and the challenges it addresses.

PostgreSQL Replication Methods

PostgreSQL supports various replication methods, each tailored to specific requirements and use cases:

Streaming Replication

Streaming replication is a popular method for real-time data replication. It involves a primary server sending data changes to one or more standby servers. This method ensures high availability and load balancing, making it suitable for environments requiring real-time data consistency and minimal downtime.

Synchronous vs. Asynchronous Replication

  • Synchronous Replication: In this method, transactions must be confirmed by both the primary and standby servers before being committed. This ensures data consistency but may affect performance due to the added confirmation step.

Synchronous Replication Disgram

  • Asynchronous Replication: This method is faster as it does not require confirmation from the standby servers before committing data. However, it does not guarantee immediate consistency across servers.

Asynchronous Replication diagram

Logical Replication

Logical replication allows selective data replication at the table level, offering the flexibility to replicate specific tables or rows. This is beneficial when upgrading systems with minimal downtime and when integrating data across different PostgreSQL versions.

Replication Use Cases

Replication in PostgreSQL supports various use cases, including:

  • High Availability: Ensuring continuous database operation by automatically failing over to a standby database if the primary database fails.

  • Data Latency: Minimizing the delay in data replication across geographical locations to ensure timely access to data.

  • Data Residency: Replicating data to servers in specific locations to comply with legal or policy requirements regarding data storage.

  • Near Zero Downtime for Major Upgrades: Upgrading PostgreSQL versions with minimal service interruption by replicating data to a newer version instance and switching operations to it.

  • Load Balancing / Query Routing: Distributing read queries among several database replicas to optimize performance and resource utilization.

  • Data Migration: Moving data from one database to another, potentially across different storage systems, locations, or schemas.

  • ETL (Extract, Transform, Load): Using replication to extract data from operational databases, transform it as needed, and load it into a data warehouse or analytical system.

  • Data Warehousing: Aggregating data from various sources into a central repository to support business intelligence and reporting activities.

Physical Replication

Physical replication in PostgreSQL involves copying and synchronizing data from a primary server to standby servers in real time. This method ensures data consistency and up-to-date replicas by transferring real-time Write-Ahead Log (WAL) records. Standby servers can operate in hot standby mode, handling read-only queries while replicating changes.

Physical replication can be configured as synchronous, ensuring strict data integrity, or asynchronous, improving write performance. Physical replication facilitates automatic failover by promoting a standby server to primary in case of primary server failure.

Logical Replication

Logical replication involves copying data objects and changes based on replication identity, offering fine-grained control over data replication and security. It follows a publisher/subscriber model where one or more subscriber nodes subscribe to one or more publisher nodes. Data is copied in a format interpretable by other systems using logical decoding plugins.

Setting Up Logical Replication

To configure logical replication, modify your PostgreSQL instance by setting the following parameters in the postgresql.conf file:

wal_level = logical

max_replication_slots = 4

max_wal_senders = 4

After modifying postgresql.conf, restart the PostgreSQL server so the changes take effect.

Creating a Publication:

On the primary server, create a publication to specify the database changes to replicate. For example:

CREATE PUBLICATION my_publication FOR ALL TABLES;

Or for specific tables:

CREATE PUBLICATION my_publication FOR TABLE table1, table2;

Creating a Subscription:

On the subscriber database, create a subscription to the publication, initiating the replication process:

CREATE SUBSCRIPTION my_subscription CONNECTION 'host=source_host port=5432 dbname=source_db user=replicator password=secret' PUBLICATION my_publication;

PostgreSQL Logical Replication Slots

Logical replication slots ensure efficient replication by retaining changes until all subscribers confirm that a transaction has been replicated. They play a crucial role in managing database changes in replication systems, allowing for seamless data streaming and maintaining data integrity and consistency across multiple subscriber nodes. Monitoring and managing replication slots are essential for ensuring seamless replication and data consistency.

Active-Standby vs. Active-Active Replication

Active-Standby

In an active standby configuration, one primary server handles write operations while standby servers handle read operations. This ensures high availability, load balancing, and a low risk of data loss when using external tools for automatic failover.

Active Active

Active-active replication (as offered by pgEdge)  involves multiple primary servers (also known as master nodes), replicating data between each other. Unlike the traditional master-slave (or primary-standby) configurations, active-active replication allows all nodes to handle read and write operations simultaneously. This configuration enhances write availability and scalability by distributing read and write transactions across multiple nodes.

Benefits of Active Active Replication

  • Simultaneous Writes: Multiple nodes can accept write operations simultaneously, increasing the overall write throughput of the system.

  • Conflict Detection and Resolution: Since multiple nodes can write simultaneously, conflicts can occur when different nodes modify the same data. Efficient conflict detection and resolution mechanisms are essential.

  • High Availability: In the event of a node failure, other nodes can continue to process writes, ensuring continuous availability.

  • Data Residency and Latency: By distributing nodes geographically, data can be stored closer to the users, reducing latency and meeting data residency requirements.

  • Near Zero-Downtime Upgrades: Shifting traffic between nodes during the upgrade process allows for seamless upgrades with minimal downtime.

Implementing Active Active Replication in PostgreSQL

PostgreSQL does not support active-active replication natively. However, it can be achieved using third-party tools and extensions such as pglogical and pgEdge Spock.

pgEdge Distributed Postgres

pgEdge Distributed Postgres is another powerful tool for active-active replication in PostgreSQL, designed to handle high transaction volumes and ensure minimal latency. pgEdge Distributed Postgres and its Spock extension is available as an open-source project, as a graphical interface for your Cloud clusters, or as a command-line tool (pgEdge Platform) that simplifies the installation and management of your custom active active cluster.  (A fully hosted and managed option is also available through pgEdge Cloud).

Key Features of pgEdge Distributed Postgres:
  • Bi-directional Replication: Supports bi-directional replication, allowing data to flow between multiple masters.

  • Conflict Resolution: Provides advanced conflict resolution mechanisms, including user-defined resolution strategies.

  • High Performance: Optimized for high performance, ensuring efficient data replication with minimal overhead.

  • Scalability: Scales easily accommodate additional nodes and increased data volumes.

Configuring pgEdge Distributed Postgres for Active-Active Replication

Install the pgEdge Spock Extension:

CREATE EXTENSION spock;

Configure the Nodes: Configure the necessary settings in postgresql.conf for each node:

wal_level = logical

max_replication_slots = 10

max_wal_senders = 10

Create a Spock Node: Define each node in the Spock cluster:

SELECT spock.node_create(node_name := 'node1', dsn := 'host=localhost dbname=mydb');

Add Replication Sets: Define the tables and sequences to be replicated:

SELECT spock.replication_set_add_all_tables('default', ARRAY['public']);

Create Subscriptions: Set up subscriptions on each node to subscribe to replication sets from other nodes:

SELECT spock.subscription_create(sub_name := 'sub1', provider_dsn := 'host=node2 dbname=mydb', replication_sets := ARRAY['default']);

Manage Conflicts: Configure conflict handling strategies using Spock’s conflict resolution features:

SELECT spock.alter_node_add_interface(node_name := 'node1', interface_name := 'iface1', dsn := 'host=node1 dbname=mydb');

pglogical

pglogical is a logical replication system for PostgreSQL, providing a robust active-active replication framework. It uses a publish-subscribe model to replicate data changes between nodes.

Key Features of pglogical:
  • Flexible Replication: Allows selective replication of tables and databases, providing fine-grained control over the replicated data.

  • Cascading Replication: Supports cascading replication setups where changes can be propagated through multiple levels of nodes.

  • Customizable: Allows for custom conflict resolution logic and supports hooks for integrating with external systems.

Configuring pglogical for Active-Active Replication

Install the pglogical Extension:

CREATE EXTENSION pglogical;

Configure the Nodes: Configure each PostgreSQL node to use logical replication by setting the following parameters in the postgresql.conf file:

wal_level = logical

max_replication_slots = 10

max_wal_senders = 10

Create a Replication Set: On each node, create a replication set that defines the data to be replicated:

SELECT pglogical.create_node(node_name := 'node1', dsn := 'host=localhost dbname=mydb');

Add Replication Sets: Define the tables to be included in the replication set:

SELECT pglogical.replication_set_add_all_tables('default', ARRAY['public']);

Create Subscriptions: Create subscriptions on each node to subscribe to replication sets from other nodes:

SELECT pglogical.create_subscription(subscription_name := 'sub1', provider_dsn := 'host=node2 dbname=mydb', replication_sets := ARRAY['default']);

Handle Conflicts: Implement conflict resolution strategies as needed, using built-in options or custom logic:

SELECT pglogical.alter_node_add_interface(node_name := 'node1', interface_name := 'iface1', dsn := 'host=node1 dbname=mydb');

Conflict Resolution Strategies

Conflict resolution in MMR can be achieved through various strategies:

  • Last Write Wins (LWW): Accepting data from the most recent write based on timestamp.

  • Version Vectors: Keeping track of the version history of each data item and merging changes accordingly.

  • Operational Transformation (OT): Transforming operations to achieve a consistent state across all nodes.

  • Conflict-Free Replicated Data Types (CRDTs): Using data structures designed to handle data consistency in a decentralized manner.

  • Manual Override: Providing mechanisms for database administrators to resolve manual conflicts.

Performance Optimization in MMR

Optimizing performance using a multi-master replication cluster involves employing conflict resolution mechanisms, strict data consistency checks, and efficient data distribution strategies to balance the workload and prevent bottlenecks.

Ensuring Data Integrity

Ensuring data integrity when using a multi-master replication system involves maintaining consistency across nodes while preventing data conflicts. Best practices include implementing robust data validation mechanisms, performing regular audits, and utilizing checksums to detect anomalies. Tools like pgEdge’s Spock, pglogical, Bucardo, and PostgreSQL's built-in features validate data consistency in distributed nodes.

High Availability in PostgreSQL Using Multi-Master Replication (MMR)

Implementing Multi-Master Replication (MMR) in PostgreSQL significantly enhances high availability, ensuring continuous operation and minimal downtime. MMR allows multiple database instances to handle write operations simultaneously, providing several advantages for high availability, including fault tolerance, load balancing, and disaster recovery. Here’s a detailed look at how MMR achieves high availability in PostgreSQL.

High Availability in PostgreSQL using Multi-Master Replication Diagram

Fault Tolerance

Fault tolerance is the ability of a system to continue functioning even when one or more components fail. In a MMR setup, if one master node fails, the other nodes can continue to operate without interruption. This redundancy ensures that no single point of failure can bring down the entire database system.

Key Points:

  • Redundant Master Nodes: Multiple master nodes act as backups for each other, ensuring that the failure of one node does not affect the overall system availability.

  • Automatic Failover: Mechanisms can be implemented to automatically redirect traffic to the remaining operational nodes, ensuring seamless continuity.

Load Balancing

Load balancing provides a way to distribute database requests across multiple nodes, optimizing resource utilization and enhancing performance. By spreading the load, MMR reduces the risk of any single node becoming a bottleneck.

Key Points:

  • Write Load Distribution: With multiple masters, write operations can be distributed, reducing the load on individual nodes and preventing performance degradation.

  • Read Load Balancing: Read requests can be distributed across all nodes, including standby nodes configured for read operations, improving query response times and overall system performance.

Disaster Recovery

MMR facilitates disaster recovery by ensuring that data is replicated across multiple geographically distributed nodes. In the event of a catastrophic failure at one location, other nodes can continue to provide access to the data, minimizing downtime and data loss.

Key Points:

  • Geographical Distribution: Distributing master nodes across different data centers or regions ensures that a local disaster does not impact the entire database system.

  • Data Redundancy: Continuous replication ensures that data is up-to-date across all nodes, providing a reliable backup in case of data loss at any single node.

Conclusion

Multi-master replication in PostgreSQL offers a robust solution for high availability, fault tolerance, and scalability. Understanding the different replication methods, use cases, conflict resolution, and performance optimization strategies is essential for effectively implementing and managing MMR in PostgreSQL. By leveraging the right tools and techniques, organizations can ensure seamless data replication, maintain data integrity, and achieve their operational goals with minimal downtime and maximum efficiency.

 To learn more, watch the full conference presentation or download the slides here.  To give active-active replication with pgEdge distributed Postgres, go to https://www.pgedge.com/get-started/cloud.