Zero-downtime PostgreSQL upgrade

PostgreSQL, the popular open-source object-relational database system, is ending support for version 9.6 after just over five years. As a follow-up to that announcement, Amazon Relational Database Service (RDS) has also shared its end-of-life (EOL) plan. To align with the PostgreSQL 9.6 EOL, Freshworks made a business decision to upgrade to PostgreSQL 13.4.

We established a number of requirements before proceeding with the upgrade in production. Here are the key ones:

  • Zero downtime upgrade
  • Automated migration
  • Minimal replication lag

We ruled out upgrade options such as an in-place upgrade or a snapshot-based approach, as both result in downtime for the business. We unanimously decided to use logical replication with the pglogical extension. With this approach, downtime is limited to the time it takes to point the application to the new database instances.

Logical replication is a method of replicating data objects and their changes based upon their replication identity (usually a primary key). This is in contrast to physical replication, which uses exact block addresses and byte-by-byte replication. Logical replication allows you to stream changes from a database in real time to a target database. This replication sends data to the replica in a logical format. Additionally, it can send data for a single table, database, or specific columns in a table.
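As an illustration, the pglogical extension (used later in this post) lets you add a single table, or even a subset of its columns, to a replication set. The snippet below is only a sketch, with a hypothetical accounts table and the default replication set:

-- A sketch with a hypothetical "accounts" table: replicate only selected columns
-- of a single table by adding it to a replication set with a column list.
SELECT pglogical.replication_set_add_table(
    'default',                         -- replication set name
    'public.accounts',                 -- table to replicate
    true,                              -- synchronize existing data
    ARRAY['id', 'name', 'updated_at']  -- replicate only these columns
);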

The PostgreSQL major version upgrade project was planned in various phases to accomplish a smooth execution in production:

Milestone 1: Ensure prerequisites are met

Milestone 2: Automation 

Milestone 3: Test application and all clients against new major version 

Milestone 4: Upgrade test using staging snapshots  

Milestone 5: Upgrade test using production snapshots

Milestone 6: Migrate staging, load test, and production environments

Prerequisites

To complete the PostgreSQL upgrade using the pglogical extension, we took care of the following prerequisites:

Primary key

Tables must have a primary key or unique key for replicating data changes. The following query lists tables in a PostgreSQL database without primary keys:

SELECT tab.table_schema,
       tab.table_name
FROM   information_schema.tables tab
LEFT JOIN information_schema.table_constraints tco
       ON  tab.table_schema = tco.table_schema
       AND tab.table_name = tco.table_name
       AND tco.constraint_type = 'PRIMARY KEY'
WHERE  tab.table_type = 'BASE TABLE'
  AND  tab.table_schema NOT IN ('pg_catalog', 'information_schema')
  AND  tco.constraint_name IS NULL
ORDER BY table_schema,
         table_name;

In production, primary keys were created for the tables that were missing them, without downtime. As part of this exercise, we used techniques such as the one shown below to ensure no downtime for the application.

CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS <tblname>_pkey ON <tblname> USING btree (<colname>);

ALTER TABLE <tblname> ADD PRIMARY KEY USING INDEX <tblname>_pkey;

pglogical extension

In Amazon RDS for PostgreSQL, the pglogical extension is supported from version 9.6.11. As part of regular maintenance, our production databases had already been moved to 9.6.15. The server version can be checked using the following command:

SHOW server_version;

The following parameters were set to enable the pglogical extension on the source database:

wal_level = 'logical'
track_commit_timestamp = on
max_worker_processes = 10
max_replication_slots = 10
max_wal_senders = 10
shared_preload_libraries = 'pglogical'
rds.logical_replication = 1

Please note that the above values will vary based on the number of databases and replication sets. Changing these values requires a reboot of the database server. In multi-AZ deployments, instances are unavailable only for the time it takes the instance to fail over (usually about 60 seconds). This was the only step that required some unavailability in services, but it was unavoidable. pglogical works at the per-database level, not at the whole-server level like physical streaming replication, so the extension needs to be created in each database.
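For example (a minimal sketch, run in each database on both the source and the target):

-- Create the extension in every database that will be replicated.
CREATE EXTENSION IF NOT EXISTS pglogical;

-- Sanity check that the extension is installed in the current database.
SELECT extname, extversion FROM pg_extension WHERE extname = 'pglogical';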

pgBouncer

In our deployment, pgBouncer serves as the connection pooler, and all primary and replica traffic flows through it. This allowed us to switch application server connections to the new PostgreSQL version seamlessly. Here is the high-level production deployment model:

[Figure: high-level production deployment model, along with traffic and HTTP stats of the production servers during the traffic switch]

Automation

We automated the following tasks:

  1. Create the PG 13 parameter group
  2. Create the PG 13 target database
  3. Generate a script to create users 
  4. Configure the WAL parameters
  5. Create the pglogical extension
  6. Create the source (provider) node
  7. Generate script dumps based on source database
    • Create table script
    • Create primary keys
    • Create indexes
    • Create constraints
  8. Execute script to create the tables on the target database
  9. Create the replication set
  10. Prepare the target node for replication
  11. Create a subscription node on the target database
  12. Verify if the subscription is active
  13. After initial copy, create primary keys, indexes, and constraints
  14. Verify the data replication is running
  15. Set sequences
  16. Update pgBouncer to switch traffic
  17. Drop the subscription

As part of this migration, all the standard steps were followed: create the source (provider) node, create the replication set, prepare the target node, create the subscription, and enable logical replication. We ensured that there were no long-running autovacuum jobs during the creation of the replication set. During the testing phase using snapshots, we observed that the initial copy of the larger databases took longer, so we came up with a few optimizations to reduce the overall copy time.
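The following sketch outlines those standard steps with hypothetical node, set, subscription, and DSN values; the exact names and connection strings depend on the database being migrated:

-- On the PostgreSQL 9.6 source (provider); names and DSNs are illustrative.
SELECT pglogical.create_node(
    node_name := 'provider_node',
    dsn       := 'host=pg96-primary port=5432 dbname=app_db user=replicator'
);
SELECT pglogical.create_replication_set('migration_set');
SELECT pglogical.replication_set_add_all_tables('migration_set', ARRAY['public']);

-- On the PostgreSQL 13.4 target (subscriber).
SELECT pglogical.create_node(
    node_name := 'subscriber_node',
    dsn       := 'host=pg13-primary port=5432 dbname=app_db user=replicator'
);
SELECT pglogical.create_subscription(
    subscription_name := 'app_db_subscription',
    provider_dsn      := 'host=pg96-primary port=5432 dbname=app_db user=replicator',
    replication_sets  := ARRAY['migration_set'],
    synchronize_data  := true
);

-- Verify the subscription reaches the 'replicating' state.
SELECT * FROM pglogical.show_subscription_status();

In our flow, the target tables were pre-created from the generated scripts (so the structure is not synchronized here), and the initial data copy ran as part of the subscription.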

Copy optimization

While performing internal test executions using production database snapshots, we found that the COPY operation took more than five days for one of the databases. That database was more than 1.5 TB in size and had approximately 2.35 billion rows. A replication period exceeding five days was not acceptable for us, as it would increase the overall replication period along with the disk space consumed by WAL files on the publisher side. Hence, we took extra steps to optimize the COPY operation time and made the following changes:

  • Change database parameters and revert them after the copy
    • Increase max_wal_size
    • Increase checkpoint_timeout
    • Turn off autovacuum
  • Copy table data without any primary keys, constraints, or indexes

After the copy operation, the subscription was paused to create the needed primary keys, indexes, and constraints. The maintenance_work_mem value was increased to speed up the creation of indexes and primary keys. For large databases, these operations were performed in parallel, for example creating indexes on multiple tables from different sessions.
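A sketch of what one of those parallel sessions looked like, with illustrative settings and hypothetical table and index names:

-- Session-level settings used while rebuilding indexes; size maintenance_work_mem
-- to the instance's available memory (the value here is illustrative).
SET maintenance_work_mem = '2GB';
SET statement_timeout = 0;

-- Each session handled a different set of tables; hypothetical objects shown here.
CREATE UNIQUE INDEX IF NOT EXISTS tickets_pkey ON tickets USING btree (id);
ALTER TABLE tickets ADD PRIMARY KEY USING INDEX tickets_pkey;
CREATE INDEX IF NOT EXISTS index_tickets_on_account_id ON tickets (account_id);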

Below are the sample SQL snippets used to generate the scripts for the creation of primary keys, indexes, and constraints after the copy operation:

SELECT 'ALTER TABLE ONLY ' || conrelid::regclass || ' ADD CONSTRAINT "' || conname || '" ' || pg_get_constraintdef(oid) || ';'
FROM   pg_constraint
WHERE  contype IN ('p')
AND    connamespace = 'public'::regnamespace;

SELECT indexdef || ';'
FROM   pg_indexes
WHERE  schemaname = 'public'
AND    indexname NOT IN (SELECT conname FROM pg_constraint WHERE contype IN ('p', 'u'));

SELECT 'CREATE UNIQUE INDEX IF NOT EXISTS ' || conname || ' ON ' || conrelid::regclass || SUBSTRING(pg_get_constraintdef(oid) FROM 8) || ';'
FROM   pg_constraint
WHERE  contype IN ('u')
AND    connamespace = 'public'::regnamespace;

SELECT 'ALTER TABLE IF EXISTS ONLY ' || conrelid::regclass || ' ADD CONSTRAINT ' || conname || ' UNIQUE USING INDEX ' || conname || ';'
FROM   pg_constraint
WHERE  contype IN ('u')
AND    connamespace = 'public'::regnamespace;

SELECT 'ALTER TABLE ONLY ' || conrelid::regclass || ' ADD CONSTRAINT "' || conname || '" ' || pg_get_constraintdef(oid) || ' NOT VALID;'
FROM   pg_constraint
WHERE  contype IN ('f')
AND    connamespace = 'public'::regnamespace;

SELECT 'ALTER TABLE ONLY ' || conrelid::regclass || ' VALIDATE CONSTRAINT "' || conname || '";'
FROM   pg_constraint
WHERE  contype IN ('f')
AND    connamespace = 'public'::regnamespace;

Below is the code snippet used to disable the subscription immediately after the completion of the copy operation:

SET statement_timeout = 0;

SELECT pglogical.wait_for_subscription_sync_complete('${SUBSCRIPTION_NAME}');

SELECT pglogical.alter_subscription_disable(subscription_name := '${SUBSCRIPTION_NAME}', immediate := false);

This function call waits for the subscription to asynchronously start replicating and complete any needed schema and/or data sync. After disabling the subscription, we executed the generated scripts to create primary keys, indexes, and all the needed constraints.

With the above approach, we managed to bring down the initial COPY operation time from more than five days to a few hours for large databases.

After the successful COPY operation, we re-enabled autovacuum and analyzed the tables. Before opening up for traffic, we ensured observability was in place for all the newly migrated instances and created the needed read replica instances.
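A minimal example of the statistics refresh on the target (autovacuum itself was re-enabled through the parameter group):

-- Refresh planner statistics for all tables in the current database.
ANALYZE VERBOSE;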

Traffic switch

In production, pgBouncer was deployed for connection pooling, connecting all application servers to the database servers.

We decided to perform the traffic switch during non-peak hours. As part of our production optimization settings, idle_in_transaction_session_timeout had already been in place on the servers for a long time. This parameter helped ensure that there were no long-running transactions on the database server.

Here is the list of high-level steps that were performed as part of the traffic switch, after ensuring the replication lag was minimal:

  • Set sequences
  • PAUSE pgBouncer
  • Flush all pending replication traffic
  • RELOAD pgBouncer
  • RESUME pgBouncer
  • Drop subscription

Sequence

We gathered the sequence growth trend for all tables in the production database servers. Based on the trend, we executed a script to set each sequence value with the needed delta on the target database.
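A minimal sketch for one sequence, assuming a hypothetical tickets_id_seq, a last value of 987654321 captured from the source, and an illustrative safety delta of 10,000 derived from the growth trend:

-- On the target: set the sequence ahead of the source's last observed value.
SELECT setval('tickets_id_seq', 987654321 + 10000);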

PAUSE pgBouncer

pgBouncer's PAUSE command waits for ongoing transactions to complete on all connections before it finishes. In production, we ensured that idle_in_transaction_session_timeout was less than 30 seconds, so there were no long-running transactions on the database server.

Flush all pending replication traffic

pglogical provides the function pglogical.wait_slot_confirm_lsn(NULL, NULL) to ensure all subscribers have received changes up to a certain point on the provider. In this case, it ensures that the lag on the server is zero.
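A sketch of the flush and an optional lag check on the 9.6 provider (the pg_xlog_* function names shown here were renamed in PostgreSQL 10):

-- Block until every subscriber has confirmed replay up to the provider's current LSN.
SELECT pglogical.wait_slot_confirm_lsn(NULL, NULL);

-- Optional sanity check: remaining lag per replication slot, in bytes.
SELECT slot_name,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), confirmed_flush_lsn)) AS pending
FROM   pg_replication_slots;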

RELOAD pgBouncer

Once we stopped all the traffic to the PostgreSQL 9.6 instance, we needed to update the PG 13.4 server names in the pgBouncer configuration and execute a reload. With that change, pgBouncer started sending traffic to the new PostgreSQL version.

RESUME pgBouncer

pgBouncer's RESUME command ensures traffic flows to the PostgreSQL 13.4 servers. Between PAUSE and RESUME, pgBouncer does not serve any SQL traffic, so application servers wait for connections from pgBouncer. Therefore, we needed to keep this window as short as possible to avoid any application hangs.
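Putting the pgBouncer side together, the switch boils down to a short sequence on the pgBouncer admin console (issued, for example, with psql against the special pgbouncer database); the comments below are annotations on the sequence rather than literal console input:

PAUSE;     -- waits for in-flight transactions to finish, then holds new client queries
           -- edit the [databases] entries to point at the PG 13.4 endpoints,
           -- then flush replication with pglogical.wait_slot_confirm_lsn(NULL, NULL)
RELOAD;    -- re-read the configuration with the new server names
RESUME;    -- release the held clients, now against PostgreSQL 13.4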

The above steps were repeated for the database servers in the production environment, which helped us achieve a zero downtime upgrade of the PostgreSQL server.

In conclusion

In this post, we shared our experience migrating Amazon RDS PostgreSQL 9.6 to major version 13.4 with zero downtime using the pglogical extension. We strongly believe that detailed planning, automation, and multiple test executions using production snapshots helped us achieve our goal. Before the traffic switch, we made sure our backups were healthy, our monitoring was all green, and all maintenance tools and scripts were adjusted appropriately.

We have many projects underway on the rigorous road to reliability, and we need contributions from amazing people to get there. If site reliability engineering at Freshworks interests you, let’s have a conversation about the projects we are working on and opportunities for you to become part of our efforts toward four nines.

Veera Chenna co-authored this piece. Veera is a lead software engineer for site reliability at Freshworks.