How to migrate to Redis from ElastiCache with zero downtime

Freshservice, the cloud-based ITSM service desk software from the Freshworks suite of products, helps companies manage incidents, assets, and other ITIL processes.

Freshservice uses Redis as a datastore. We needed a datastore with fast reads and writes, and Redis provided that. A few examples of our use cases are feature flags, counters, and Sidekiq for asynchronous job processing, which relies entirely on Redis.

When we launched the product, we self-hosted the Redis servers. We got our hands dirty: we manually scaled instances, monitored them, and applied updates. Needless to say, all the manual intervention became tedious. We moved to AWS ElastiCache, which took care of maintaining the Redis instances and managed the servers for us. It was easy to deploy and use Redis in the cloud. However, it had a limitation: we could not upgrade the Redis version without incurring downtime. We wanted a better solution that allowed horizontal scaling and, most importantly, zero downtime while upgrading Redis versions.

We decided to migrate to Redis, the most popular in-memory database, offered as a highly available, performant, reliable, and seamlessly scalable service.

However, we found that ElastiCache does not support the SLAVEOF command, which would have let us change the replication settings of a replica on the fly. We would have had to take snapshots manually to migrate the data from ElastiCache to Redis, causing downtime. That in turn would mean service disruption, since some of our critical features depend on Redis.

In a world where everything needs to run instantly, we wanted to migrate from ElastiCache to Redis with zero downtime.

Here’s how we did it:

The gist of the solution

  • Start monitoring the changes being made to the Redis server. Redis provides the MONITOR command, which streams back every command processed by the server
  • Create a point-in-time snapshot of the source Redis server
  • Apply the changes from the monitor command to the destination servers to keep the servers in sync
  • Verify random sets of data on both servers and check for consistency
  • Switch traffic to the destination servers

Monitoring changes made to the Redis server

The MONITOR command gives us every command executed on the server. Its output is pushed into SQS to preserve the sequence of operations performed on the source Redis server. The following script pushes the output of the MONITOR command from the source Redis (ElastiCache) into SQS (any other queue service can be used as well).
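
Here is a minimal sketch of such a script in Python, using redis-py and boto3. The host name, queue URL, and the read-only command filter are illustrative assumptions, not the exact script we ran.

```python
import json

import boto3
import redis

# Placeholders: replace with your ElastiCache endpoint and SQS queue URL.
SOURCE_REDIS_HOST = "source-elasticache.example.com"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/redis-sync"

# Read-only commands do not change state, so they do not need to be replayed.
READ_ONLY_COMMANDS = {"get", "mget", "exists", "ttl", "pttl", "scan", "keys", "info", "ping", "monitor"}

source = redis.Redis(host=SOURCE_REDIS_HOST, port=6379)
sqs = boto3.client("sqs")

with source.monitor() as monitor:
    for event in monitor.listen():
        # event["command"] holds the full command line, e.g. SET foo bar
        command = event["command"]
        if command.split(" ", 1)[0].lower() in READ_ONLY_COMMANDS:
            continue
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"ts": event["time"], "command": command}),
        )
```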

Note: An SQS FIFO queue has a throughput limit of 300 messages per second (without batching), which might not be enough. A standard queue can deliver duplicate or out-of-order messages. You need to decide whether that is acceptable for your workload.

Creating a Redis snapshot

We note the timestamp and copy all keys in the source Redis into the destination. This serves as the point-in-time copy of the source server in the destination.
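
One way to take this copy is to iterate the keyspace with SCAN and move each key with DUMP/RESTORE, which works for any data type. A sketch, with placeholder endpoints:

```python
import time

import redis

source = redis.Redis(host="source-elasticache.example.com", port=6379)       # placeholder
destination = redis.Redis(host="destination-redis.example.com", port=6379)   # placeholder

# Record the timestamp: only MONITOR events newer than this need to be replayed later.
snapshot_started_at = time.time()

for key in source.scan_iter(count=1000):
    payload = source.dump(key)
    if payload is None:          # the key expired between SCAN and DUMP
        continue
    ttl_ms = source.pttl(key)
    ttl_ms = ttl_ms if ttl_ms > 0 else 0     # 0 means "no expiry" for RESTORE
    destination.restore(key, ttl_ms, payload, replace=True)

print(f"Point-in-time copy complete; replay events recorded after {snapshot_started_at}")
```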

Keeping servers in sync

Now that we have a point-in-time copy of our source Redis server in the destination, we use the following script to apply the changes accumulated in SQS.

The script polls the commands pushed to SQS and executes them on the destination server. Since we already have a point-in-time copy in the destination, we only need to replay the commands pushed after the timestamp we recorded before the snapshot.
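
A sketch of that replay loop, assuming the messages were enqueued by the MONITOR script above; the queue URL, endpoint, and timestamp value are placeholders:

```python
import json
import shlex

import boto3
import redis

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/redis-sync"   # placeholder
SNAPSHOT_STARTED_AT = 1700000000.0   # the timestamp recorded before the point-in-time copy

destination = redis.Redis(host="destination-redis.example.com", port=6379)  # placeholder
sqs = boto3.client("sqs")

while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5
    )
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])
        # Commands captured before the snapshot are already reflected in the copy.
        if body["ts"] >= SNAPSHOT_STARTED_AT:
            # shlex splits the command line into arguments; values containing quoted
            # whitespace or escapes may need more careful parsing.
            args = shlex.split(body["command"])
            destination.execute_command(*args)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```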

Once the number of messages in the queue drops to roughly zero, the destination Redis server effectively becomes a live replica of the ElastiCache source Redis server.
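
To know when that point has been reached, a simple check of the queue depth is enough; a sketch, assuming the same SQS queue:

```python
import time

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/redis-sync"   # placeholder
sqs = boto3.client("sqs")

while True:
    attributes = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attributes["Attributes"]["ApproximateNumberOfMessages"])
    print(f"Pending sync messages: {backlog}")
    if backlog == 0:
        break
    time.sleep(5)
```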

Verifying data and checking for consistency

With live replication in place, the following script is run to compare random keys on both servers and evaluate the migration.
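
A sketch of the comparison, sampling random keys and reading each value in a type-appropriate way; the endpoints and sample size are placeholders:

```python
import redis

source = redis.Redis(host="source-elasticache.example.com", port=6379)       # placeholder
destination = redis.Redis(host="destination-redis.example.com", port=6379)   # placeholder

SAMPLE_SIZE = 1000   # adjust to the size of your keyspace


def read_value(client, key):
    """Fetch a key's value in a type-appropriate, comparable form."""
    key_type = client.type(key)
    if key_type == b"string":
        return client.get(key)
    if key_type == b"hash":
        return client.hgetall(key)
    if key_type == b"list":
        return client.lrange(key, 0, -1)
    if key_type == b"set":
        return client.smembers(key)
    if key_type == b"zset":
        return client.zrange(key, 0, -1, withscores=True)
    return None


mismatches = []
for _ in range(SAMPLE_SIZE):
    key = source.randomkey()
    if key is None:
        break
    if read_value(source, key) != read_value(destination, key):
        mismatches.append(key)

print(f"Sampled {SAMPLE_SIZE} keys, found {len(mismatches)} mismatches")
for key in mismatches:
    print("mismatch:", key)
```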

Traffic switch

Once we have verified that live replication is running smoothly, we can switch the traffic to the destination Redis server and then stop the replication. Once the app is stable with the new Redis server, we take a final snapshot of the old Redis server before deleting it, in case we ever need to roll back.
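
Assuming the source is an ElastiCache replication group, that final snapshot can be taken through the AWS API; the identifiers below are placeholders:

```python
import boto3

elasticache = boto3.client("elasticache")

# Placeholder identifiers; use CacheClusterId instead of ReplicationGroupId
# if the source is a single-node cluster rather than a replication group.
elasticache.create_snapshot(
    ReplicationGroupId="source-redis-group",
    SnapshotName="pre-deletion-final-snapshot",
)
```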

Our learnings

Once all the cases had been considered for the database migration, we executed the steps above. During the migration, we had to run 10 sync scripts simultaneously to keep up with the traffic in our US data center and achieve live replication. Also, keys that are updated frequently (e.g. counter variables) showed up as non-matching keys in the verification script. The script picks a key and reads the value from the source server first, then from the destination. This leaves room for a race condition: while the script is running, the value on the source server changes and is then synced to the destination. In such cases, we had to verify the keys manually.

Overall, the migration was successful – it was done with live production traffic and was completed with zero downtime.