Introduction
PostgreSQL stands as a resilient and adaptable open-source relational database management system. In an era where data requirements constantly shift, organizations are increasingly delving into the realm of distributed databases to unlock scalability, enhance high availability, and boost overall performance. pgEdge Platform is one such solution that empowers you to implement a fully-distributed multi-master PostgreSQL environment. In this comprehensive guide, we'll walk you through the process of installing pgEdge Platform. pgEdge Platform is self-hosted, self-managed downloadable distributed PostgreSQL database. pgEdge also offers pgEdge Cloud, a fully managed database-as-a-service (DBaaS) – but this post focuses on pgEdge Platform.
Understanding Distributed PostgreSQL
PostgreSQL, by its fundamental design, operates as a single-node relational database that resides on a single host. In this configuration, all data resides on a solitary node; even partitioned tables are likely to remain in close proximity.
To adapt PostgreSQL for use in distributed environments, third-party extensions come into play. These extensions offer functionalities like connection pooling, load balancing, and sharding, and when combined, allow a PostgreSQL database to span multiple servers. The distributed configuration grants an organization the power of scalability, ensuring that as data volumes grow, the database can seamlessly expand to accommodate the load. Moreover, high availability becomes a reality, as data redundancy across nodes minimizes the risk of a single point of failure, guaranteeing uninterrupted service. Distributed PostgreSQL also tackles key challenges, such as mitigating latency and optimizing query performance by storing data closer to where the data is needed, ultimately enhancing the user experience and system efficiency. It's vital to note that PostgreSQL isn't inherently tailored for distributed systems, and utilizing these extensions introduces inherent challenges, including the complexity of managing such a setup and potential performance trade-offs.
Introducing pgEdge Platform
pgEdge Platform is the most complete and advanced distribution of open-source fully-distributed PostgreSQL, supporting multi-master (active-active) replication.
Database nodes operating with the pgEdge Platform have the ability to join clusters that extend across data centers and major cloud providers such as AWS, Azure, and Google Cloud, and can deploy on-premises on Equinix Metal. pgEdge Platform seamlessly functions on a wide range of standard hardware and operating systems.
If you’re a Docker fan and would like to orchestrate a quick container-based two-node multi-master setup, see the instructions on this blog by my colleague Cady Motyka.
A high-level architecture of pgEdge Platform
Preparing Your Environment
The prerequisites for installing pgEdge Platform are:
A local server or cloud instance running RHEL/CentOS/Rocky 9 or Ubuntu-22.04, with network traffic allowed on port 5432. AWS Cloud users can find the recommended Rocky 9 ami here.
Passwordless sudo privileges via a non-root user
Passwordless SSH access into the server via a non-root user
Installing pgEdge Platform
Installation is a simple 2-step process:
In any directory owned by non-root user, invoke the following command to create the pgedge directory and install nodectl:
python3 -c "$(curl -fsSL https://pgedge-download.s3.amazonaws.com/REPO/install.py)"
Then, move into the pgedge directory and initialize the database cluster:
cd pgedge
./nodectl install pgedge -U superuser_name -P superuser_password -d database_name
Replace the italicized placeholders in the above command with:
● a name for the database superuser
● a password for the database superuser
● a name for the database.
Names cannot be the name of an operating system superuser, 'pgEdge', or any of the PostgreSQL reserved words.
Configuring PostgreSQL Instances for Multi-master Replication
NodeCTL is the pgEdge Command Line Interface (CLI), used for installing, deploying, and managing components. The Spock extension provides multi-master (multi-active) replication for PostgreSQL versions 15 and 16. NodeCTL provides commands that allow you seamless interaction with the Spock extension.
Using NodeCTL on each node, create the Spock components needed for replication. Here's the flow:
First, we create a spock node. In our command, we provide a name for the node and a connection string that includes the network address, the name of an operating system user with root privileges (in our example, admin), and the database name (demo). The connection string is also followed by the database name. For example:
./nodectl spock node-create <node name> '<dsn>' <dbname>
Then, we create a replication set. In our command, we provide the replication set name and the database name. The node name (n1) and the replication set name (demo_replication_set) can be set to any valid value you choose, but you will have to reference them in future commands. For example:
./nodectl spock repset-create <replication_set_name> <dbname>
Then, we create the subscriptions. In our command, we provide a unique subscription name for each node, followed by a connection string that specifies the network address for the other node in the subscription. For example:
./nodectl spock sub-create <subscription_name> '<dsn>' <dbname>
High Availability and Failover
The multi-Cloud, multi-master (active-active) architecture of pgEdge inherently ensures extremely high availability in a straightforward manner. If one or more nodes become unavailable, traffic seamlessly shifts to the remaining active nodes, eliminating any delays or lost transactions associated with passive nodes deliberating to elect a new active node. With pgEdge, disaster recovery and failover are seamlessly integrated, making it easy to plan and deploy. Unlike other high availability solutions, pgEdge significantly simplifies cross-cloud region failover support. For mission-critical applications, the idea of telling users, "Sorry, but our cloud provider's Eastern region is experiencing issues," is simply not acceptable. And the best part: zero downtime during maintenance!
Data Backup and Recovery
Backup and recovery is a critical concern for any database management platform, and in the case of the pgEdge Platform, we understand these functions are of paramount importance. We have semi-automated and designed pgEdge Platform with strict attention to Recovery Point Objective (RPO) and Recovery Time Objective (RTO) requirements. By customizing backup schedules and utilizing efficient recovery mechanisms, organizations can align their backup and recovery strategies with their RPO and RTO goals, ensuring that they meet their specific data recovery needs and minimize potential disruptions.
Using NodeCTL to help with management is really straightforward - just install the backrest extension (our forked version of pgBackRest), and boom, you're ready to take backups. For example:
./nodectl install backrest
Conclusion
Distributed PostgreSQL, facilitated by pgEdge Platform, is a powerful solution for organizations seeking to unlock the full potential of their relational databases. This comprehensive guide has provided you with the knowledge and steps necessary to install, configure, and start using pgEdge Platform, the fully-distributed PostgreSQL environment. As data continues to play a central role in business operations, the ability to scale, maintain high availability, and optimize performance is critical. By following the steps outlined in this guide, you're well on your way to harnessing the benefits of distributed PostgreSQL with confidence and expertise.
Additional Resources
Author: Hari Kiran