Etched in memory: How we designed our DBaaS backup and recovery process

[Parts I and II of this blog series gave an overview of our DBaaS design and described our DB fail-over handling. In this blog (Part III), we discuss the motivation behind the Backup and Recovery procedure for databases hosted on DBaaS, along with its high-level design and the architecture of the components that constitute it.]

Design Goals

These are the high-level motivations/goals for the Backup and Recovery design for DBaaS.

  • The backup and recovery process is designed for maximum data durability, with the ability to recover from the loss of one or more of the DB servers without losing data.
  • The general goal is to minimize both RPO (Recovery Point Objective) and RTO (Recovery Time Objective) as much as possible.
  • The backup mechanism should be flexible, with customizable backup intervals depending on the requirements of the application using the service.
  • The recovery mechanism should be fast and robust enough to create a new DB instance from a DB backup quickly, and join the new instance to an existing DB shard using replication.
  • Point-in-Time Recovery (PITR) should be supported, so that it is possible to create a new DB instance from the backup with data up to a specific point in time.

Design Components

There are two major design components:

  1. Backup Lambda
  2. Recovery Automation Scripts

Backup Design

The high-level architecture of DB backups is depicted in the diagram below.

  • DB backups are triggered using an AWS lambda function we call the Backup Lambda Function.
  • The Backup Lambda function runs the backup process at regular intervals, as specified by the cron expression in the Amazon EventBridge rule.
  • Each DB shard is created with its own EventBridge specification for the invocation interval of the Backup Lambda function.
  • The orchestrator is an open-source tool from GitHub that helps manage MySQL clusters, providing high availability and replication management. In DBaaS, the orchestrator provides quick fail-over to the candidate primary if any issues are detected in the primary server. We run a high-availability cluster of orchestrator nodes across three availability zones for robustness.

DBaaS supports two types of backups.

  • Automated regular backups:

     Regular periodic backups triggered by Lambda functions at scheduled intervals.

  • On-demand backups:

     On-demand backups are triggered by a DBA using Rundeck jobs whenever needed. The Rundeck job triggers the DB backup Lambda to take a DB snapshot at the current time.
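
As a rough illustration of the on-demand path, the sketch below shows how a job could invoke the backup Lambda asynchronously with Boto3. The function name `dbaas-backup`, the payload key `secret_id`, and the helper `trigger_on_demand_backup` are hypothetical; the actual Rundeck job and event format are not shown in this post.

```python
import json


def trigger_on_demand_backup(lambda_client, function_name, secret_id):
    """Asynchronously invoke the backup Lambda for one DB shard.

    lambda_client is a boto3 Lambda client (or any object exposing the same
    invoke() method). The payload shape is an assumption for illustration.
    """
    payload = json.dumps({"secret_id": secret_id}).encode("utf-8")
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # asynchronous: Lambda queues the event and returns 202
        Payload=payload,
    )
    return response["StatusCode"] == 202


# Example call (hypothetical names, requires AWS credentials):
#   import boto3
#   trigger_on_demand_backup(boto3.client("lambda"), "dbaas-backup", "dbaas/shard-1")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub, without touching AWS.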

Backup Lambda Function Design

In DBaaS, DB backups are taken using the “backup” AWS Lambda function, hosted in the DBaaS VPC. Whenever a DB shard is created, a schedule for invoking this Lambda function is set up using Terraform templates. The following diagram depicts the overview of the backup Lambda’s design for a particular DB shard.

  • The Lambda function is created from the Terraform templates used for creating the shard. The Amazon EventBridge trigger invokes the Lambda function at a specified time, given as a cron expression.
    • For example, the expression cron(55 23 * * ? *) triggers the Lambda every day at 23:55 (5 minutes before midnight).
  • The backup Lambda function is implemented in Python 3. It uses Boto3, the AWS SDK for Python, to take DB snapshots.
  • The backup Lambda function has two layers:
    • The first layer is the Python code for implementing the backup functionality.  
    • The second layer contains the “MySQL” client package, which is necessary for communicating with MySQL server.
  • The backup Lambda function takes one argument: the ID of the AWS Secrets Manager entry that contains all the information the Lambda needs, such as the credentials to connect to the DB, the orchestrator cluster name, and the AWS region for the backup.
  • The backup Lambda first chooses which DB instance in the DB shard (primary, candidate primary, or replica) to back up. It does this by calling the Orchestrator API to get the right instance from the cluster.
    • If the candidate primary is available, then it will be the choice for taking the backup.
    • If for any reason the candidate primary is not available, then the primary will be chosen.
  • Once the backup DB instance is chosen, the backup Lambda opens a MySQL session to that DB server and does the following:
    • Takes a read lock and flushes the tables using the SQL command:
      • FLUSH TABLES WITH READ LOCK
      • This flushes in-memory table data to the EBS volume attached to the instance and blocks writes, so that no writes are happening when the DB snapshot is triggered.
    • Gets the MySQL GTID of the DB server at this moment using the SQL command:
      • select @@GLOBAL.gtid_executed
      • This returns the set of GTIDs for all transactions executed on the server so far.
  • Once the above steps are completed, the DB is ready for backup. At this point, the backup Lambda triggers an EBS snapshot of the MySQL EBS volume.
  • Once the EBS snapshot has started, the DB server is unlocked so that normal DB operations can continue. Triggering the EBS snapshot takes only a few milliseconds.
  • The EBS snapshot will be tagged with the following information.
    • GTID: <GTID of the DB server when the snapshot is triggered>
    • Cluster-name: <name of the orchestrator cluster of the DB instance>
  • These tags are important for recovering DB instances from snapshots. The GTID (Global Transaction ID) uniquely identifies a transaction across the DB servers in a cluster, which makes it easy to set up replication during recovery.
  • EBS snapshots are taken at the block level and are incremental: only the blocks that changed since the previous snapshot are backed up. So, the time and cost of a snapshot depend on how much data has changed since the last snapshot of the EBS volume.
  • The EBS volume can still be used for I/O when the EBS snapshot is in progress. Once the EBS snapshot begins, it’s no longer affected by changes in the EBS volumes.
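
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Lambda code: the instance dictionaries (`role`, `available`), the helper names, and the use of a DB-API cursor that returns tuples are assumptions, and `FLUSH TABLES WITH READ LOCK` is assumed to be the lock-and-flush statement the steps refer to. The Boto3 call `create_snapshot` and the `GTID` / `Cluster-name` tags follow the description above.

```python
def choose_backup_instance(instances):
    """Prefer an available candidate primary; otherwise fall back to the primary.

    `instances` is a hypothetical, already-parsed view of the shard topology,
    e.g. [{"host": "db-1", "role": "primary", "available": True}, ...].
    """
    for role in ("candidate-primary", "primary"):
        for inst in instances:
            if inst["role"] == role and inst["available"]:
                return inst
    raise RuntimeError("no suitable instance available for backup")


def snapshot_with_gtid(cursor, ec2_client, volume_id, cluster_name):
    """Lock the DB, record the GTID set, start the EBS snapshot, then unlock."""
    cursor.execute("FLUSH TABLES WITH READ LOCK")
    try:
        cursor.execute("SELECT @@GLOBAL.gtid_executed")
        # The GTID set can span several lines when it covers multiple servers.
        gtid = cursor.fetchone()[0].replace("\n", "")
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"DBaaS backup of {cluster_name}",
            TagSpecifications=[{
                "ResourceType": "snapshot",
                "Tags": [
                    {"Key": "GTID", "Value": gtid},
                    {"Key": "Cluster-name", "Value": cluster_name},
                ],
            }],
        )
    finally:
        # Runs whether or not the snapshot call succeeded, so the lock
        # is never left held on the DB server.
        cursor.execute("UNLOCK TABLES")
    return response["SnapshotId"], gtid
```

Releasing the lock in a `finally` block means a failed snapshot request cannot leave the database read-locked.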

Recovery Design

Recovery is the process of re-creating DB instances from backup snapshots and joining them to a DB cluster with replication set up. In DBaaS, this is done by Ansible playbooks. The recovery process must avoid replication errors and ensure the integrity of the data. The diagram below shows an overview of how a DB instance is recovered from a snapshot.

The recovery process involves two important steps.

  1. DB Instance Creation: The DB instance and the EBS volume are created from the snapshot using Terraform templates.

  2. DB Instance Configuration: An Ansible playbook configures the MySQL volume and sets up replication.

Recovery DB instance creation with Terraform 

DBaaS uses Terraform to create all new infrastructure. The recovery process involves two critical steps in creating the infrastructure:

  1. Creating a new EC2 instance to run the new DB server, using the MySQL Terraform module. This module places the EC2 instances it creates in different AWS Availability Zones (AZs), so that the DB instances are spread across AZs for high availability.

  2. Creating a new MySQL volume from the DB backup EBS snapshot. The module takes the EBS snapshot ID as a parameter variable, creates an EBS volume from that snapshot, and attaches it to the DB instance created for recovery.

Recovery Rundeck Jobs:

Recovery of DB instances from snapshots is implemented as Rundeck jobs that are used by DBAs.

  • Rundeck jobs are used in DBaaS to automate routine DBA activities grouped under different categories like Backup, Recovery, etc.
  • Rundeck jobs internally call Ansible playbooks to configure the MySQL DB instances and set up replication with the current MySQL primary in the shard.
  • There are two steps in running recovery Rundeck jobs:
    • Use the dedicated Rundeck job to find the latest snapshot taken for the DB shard in which recovery needs to be done.
    • Run the recovery job using the snapshot found in the previous step.
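
To make the first step concrete, here is a hedged sketch of how the latest snapshot for a shard could be looked up with Boto3, using the `Cluster-name` and `GTID` tags written at backup time. The helper names are hypothetical and the real Rundeck job may work differently; `describe_snapshots` and its `tag:` and `status` filters are standard EC2 API features.

```python
def latest_snapshot_for_cluster(ec2_client, cluster_name):
    """Return the newest completed snapshot tagged with the cluster name, or None."""
    paginator = ec2_client.get_paginator("describe_snapshots")
    pages = paginator.paginate(
        OwnerIds=["self"],
        Filters=[
            {"Name": "tag:Cluster-name", "Values": [cluster_name]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = [snap for page in pages for snap in page["Snapshots"]]
    if not snapshots:
        return None
    # StartTime is when the snapshot was initiated, i.e. the backup point.
    return max(snapshots, key=lambda snap: snap["StartTime"])


def gtid_of(snapshot):
    """Read the GTID tag written at backup time, or None if it is missing."""
    tags = {tag["Key"]: tag["Value"] for tag in snapshot.get("Tags", [])}
    return tags.get("GTID")
```

The snapshot ID returned here is what would be passed to the Terraform module, and the GTID tag is what the replication setup needs later.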

Recovery Ansible Playbooks:

In DBaaS, Ansible Playbooks are used for configuring MySQL and setting up replication among other tasks.

  • The mysql playbook is used to configure the MySQL database with default configurations and to set up MySQL replication with the MySQL primary.
  • The mysql playbook finds the GTID of the DB snapshot used for recovery and uses it for setting up replication. This is the GTID of the MySQL server at the moment the snapshot was taken, which identifies the transactions that had completed by then. It is necessary for replication setup.
  • The following replication-related MySQL commands are used for setting up the replica:
    • A reset command clears any previous replication information carried over from the recovered snapshot, so that new replication information can be configured.
    • A GTID command sets the current GTID to the last committed GTID in the MySQL volume recovered from the snapshot. This is important so that the primary DB can stream binlogs from this point onwards.
    • CHANGE MASTER: This command joins the recovered instance to the current MySQL primary of the DB shard.


  • Once replication is set up correctly, the primary DB streams binlogs to the newly created replica. This may take some time, depending on the volume of binlogs that need to be streamed from the primary DB. Replication progress can be monitored using the SHOW SLAVE STATUS command.
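
As an illustration only, the replication setup could be expressed as the statement sequence below. The post names only CHANGE MASTER; the use of `RESET SLAVE ALL`, `RESET MASTER`, `SET GLOBAL gtid_purged`, `MASTER_AUTO_POSITION` and `START SLAVE` is an assumption about a typical GTID-based setup, and the function and parameter names are hypothetical.

```python
def replication_statements(gtid, primary_host, repl_user, repl_password, port=3306):
    """Build SQL that joins a restored instance to the primary using GTIDs."""

    def quote(value):
        # Minimal escaping for a sketch; real code should use driver parameters.
        return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"

    return [
        "STOP SLAVE",
        # Drop replication settings carried over inside the snapshot.
        "RESET SLAVE ALL",
        # Delete local binlogs and clear gtid_executed before seeding it.
        "RESET MASTER",
        # Mark everything up to the snapshot's GTID set as already applied.
        f"SET GLOBAL gtid_purged = {quote(gtid)}",
        "CHANGE MASTER TO "
        f"MASTER_HOST = {quote(primary_host)}, "
        f"MASTER_PORT = {int(port)}, "
        f"MASTER_USER = {quote(repl_user)}, "
        f"MASTER_PASSWORD = {quote(repl_password)}, "
        "MASTER_AUTO_POSITION = 1",
        "START SLAVE",
    ]
```

With auto-positioning, the replica tells the primary which GTIDs it already has, so the primary streams only the transactions committed after the snapshot was taken.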


Regular backups and a robust recovery mechanism are essential for any database management system. By having an adaptive backup policy that can be customised based on the application’s needs, and the ability to create DB instances from backups quickly, we ensure the smooth running of DBaaS with maximum durability and robustness.

Cover image: Vignesh Rajan