How we perfected the design for our DBaaS disaster recovery mechanism

Parts I, II, and III of this blog series gave an overview of our DBaaS design, failover handling, and backup mechanism, respectively. In this blog (Part IV), we will discuss the motivation and high-level architecture of the DBaaS Disaster Recovery (DR) mechanism for databases hosted on DBaaS.

Design Goals

These are the high-level motivations/goals for the Disaster Recovery design for DBaaS. 

  • A disaster event is any event that makes databases hosted in one or more Availability Zones (AZs), or an entire region, inaccessible to applications. The Disaster Recovery (DR) process should be designed for maximum data durability and the ability to recover from the loss of a complete AWS region or of one or more AZs within a region.
  • The motivation for the DR mechanism is to minimize both the RPO (Recovery Point Objective) and the RTO (Recovery Time Objective) for the databases, as much as possible.
  • The DR mechanism should be flexible, with customizable backup intervals depending on the requirements of the application using the service. The recovery mechanism should be fast and robust enough to create a new DB instance from a DB backup quickly.
  • Point-in-Time Recovery (PITR) should be supported, so that it is possible to create a new DB instance from the backup with data up to a specific point in time.

Architecture

DBaaS Disaster Recovery mechanism

The DBaaS Disaster Recovery architecture consists of two components, explained below: the DR Backup services and the DR Recovery mechanism.

DR Backup Services

The Backup services are responsible for backing up DB data from the MySQL nodes in both the current host region and the DR region. This mitigates the risk of losing an entire AWS region to a disaster event: data is always available in more than one geographical region to prevent data loss. DR Backup services are deployed on all MySQL nodes of all shards as systemd services. The backup process consists of two services, explained below: the DB Snapshot service and the Binlogs-to-S3 service.

DB Snapshot Service

The Snapshot service, as the name suggests, takes a point-in-time snapshot of the data in the DB for backup. It is a systemd service deployed on all MySQL nodes of a DB cluster, regularly backing up the node's DB volume. This service is responsible for taking both periodic and on-demand backups of MySQL databases.

DB Snapshot service has two primary modes of backup:

  • Regular Backups
  • Regular backups are triggered as per a cron schedule provided by the DBA at service deployment time. The default schedule is “55 23 * * *”, which triggers a snapshot at 23:55 every day. The schedule can be changed at any time to match the application's requirements.
  • On-demand Backups
  • On-demand backups are taken whenever the DBA issues an explicit snapshot request for a particular shard, capturing a point-in-time snapshot of the DB's state. The user/DBA can initiate this backup at any time.
  • The DB Snapshot service is deployed on all nodes of a MySQL shard, but it is active only on the Candidate Primary node of the cluster. If the cluster has no candidate primary node, the backup is triggered on the primary node instead.
  • The DB Snapshot service determines the role of each node in the MySQL shard using Orchestrator (https://github.com/openark/orchestrator), an open-source tool for managing MySQL clusters with high availability and replication management. In DBaaS, Orchestrator provides quick failover to the Candidate Primary if any issues are detected on the primary server.
  • The DB Snapshot service takes backups using EBS snapshots after locking the tables and flushing any pending transactions, using the method described in Part III of this blog series. It maintains the state of the process in a file in S3, so it can restart from the previous state if the snapshot process is interrupted, and it copies the snapshot to another region.
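The node-selection rule described above (prefer the candidate primary, fall back to the primary) can be sketched as a small Go function. The `Node`/`Role` types and names here are illustrative assumptions; the real service derives roles from Orchestrator's topology API.

```go
package main

import "fmt"

// Node is a simplified view of a MySQL shard member as reported by
// Orchestrator. The Role strings are illustrative, not Orchestrator's
// actual schema.
type Node struct {
	Host string
	Role string // "primary", "candidate-primary", or "replica"
}

// pickBackupNode implements the selection rule from the text: take the
// backup on the candidate primary if one exists, otherwise fall back to
// the primary node of the cluster.
func pickBackupNode(nodes []Node) (Node, error) {
	var primary *Node
	for i := range nodes {
		switch nodes[i].Role {
		case "candidate-primary":
			return nodes[i], nil // preferred backup target
		case "primary":
			primary = &nodes[i]
		}
	}
	if primary != nil {
		return *primary, nil // no candidate primary: use the primary
	}
	return Node{}, fmt.Errorf("no primary or candidate primary in shard")
}

func main() {
	shard := []Node{
		{Host: "db-1", Role: "primary"},
		{Host: "db-2", Role: "candidate-primary"},
		{Host: "db-3", Role: "replica"},
	}
	n, _ := pickBackupNode(shard)
	fmt.Println(n.Host) // db-2: the candidate primary wins over the primary
}
```

Keeping the backup off the active primary (when a candidate exists) is what lets the table lock and flush happen without impacting live traffic on the primary.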

Binlogs to S3 Service

The MySQL binary log (binlog) is a set of log files that record the data modifications made on a MySQL server for the transactions happening at that time. The MySQL server maintains these log files to keep a record of transactions since the last backup. Binlogs_to_S3 is a systemd service deployed on all MySQL nodes of a DB cluster to stream these binlog files to an S3 bucket for backup and Disaster Recovery (DR).

  • On a busy DB server with many transactions, the MySQL server creates many binlog files to record them. These binlog files need to be backed up to an S3 bucket with Cross-Region Replication (CRR) to mitigate the risk of connectivity or hardware failures in one or more AZs, or in the whole region where the DB server is hosted. This ensures that the transactions are available in more than one physical location and can always be accessed to recover from a disaster event.
  • The Binlogs_to_s3 service runs an infinite loop in its main function, which performs the following tasks:
    • The service invokes the Orchestrator API once every 5 minutes to check whether it is running on the primary node. This ensures the service stays active on the primary node even after a failover.
    • The service on the primary node uses a SQLite database to keep track of the binlog files that have already been backed up. Any file not in the SQLite list is then uploaded to S3 using a worker pool of goroutines. The S3 key for each binlog file follows the format:

shardName/year/month/day/<server-uuid>/<timestamp>_<binlog_name>

Example binlog file in S3 bucket:

dbaas-stg-shard_1/2021/April/1/9d74852a-830d-11ea-b87b-0ac50996c5a3/1617235403939395577_mysql-bin.005075

  • The DR S3 bucket is configured with Cross-Region Replication (CRR), so any object written to the bucket in the host region is automatically replicated to the DR region. This ensures that the binlog files remain available for restoration even if one AWS region is completely down.
  • The binlog service exports metrics, such as the number of binlog files uploaded, binlog sizes, and any upload errors, to a Prometheus server. A Grafana dashboard shows the latest status of metrics from both the db_snapshot and binlogs_to_s3 services.
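The upload loop described above (diff the binlog directory against the tracked set, then fan out to a worker pool of goroutines) can be sketched as follows. To stay self-contained, the SQLite table is replaced by an in-memory set and the S3 upload by a caller-supplied function; both substitutions are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// uploadPending fans the binlog files that are not yet in the tracked set
// out to a fixed pool of worker goroutines. The real service records
// completed uploads in SQLite and calls the S3 API; here `tracked` is an
// in-memory set and `upload` is any caller-supplied function.
func uploadPending(files []string, tracked map[string]bool, workers int, upload func(string) error) []string {
	pending := make(chan string)
	var mu sync.Mutex
	var done []string
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range pending {
				if err := upload(f); err != nil {
					continue // the real service would retry and export an error metric
				}
				mu.Lock()
				done = append(done, f) // the real service inserts a row into SQLite here
				mu.Unlock()
			}
		}()
	}
	for _, f := range files {
		if !tracked[f] { // skip files already backed up
			pending <- f
		}
	}
	close(pending)
	wg.Wait()
	sort.Strings(done) // workers finish in arbitrary order
	return done
}

func main() {
	tracked := map[string]bool{"mysql-bin.005073": true}
	files := []string{"mysql-bin.005073", "mysql-bin.005074", "mysql-bin.005075"}
	up := func(f string) error { return nil } // stand-in for the S3 upload
	fmt.Println(uploadPending(files, tracked, 4, up))
	// [mysql-bin.005074 mysql-bin.005075]
}
```

A bounded worker pool keeps the number of concurrent S3 uploads fixed, so a burst of binlog rotation on a busy server cannot exhaust network or file-descriptor limits.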

DR Recovery Service

The DR Recovery service is the procedure that re-creates a DB instance in the DR region, after a disaster event makes the host region unavailable, using the data backed up from the DB instance in the host region. This process creates a new DB instance from the DR backup snapshots and binlog backups created for a product/shard in the DR region. The DR recovery process must avoid any replication errors and ensure the integrity of the data.

In DBaaS, the DR Recovery procedure is implemented with Terraform templates and Ansible playbooks. The DR recovery process follows the steps below:

1. DR Instance creation

  • A new DR instance is created in the DR region using Terraform Enterprise Workspace for the given product/shard.
  • MySQL volume is created from the latest available DR snapshot and attached to the DR instance. 
  • The Terraform templates assign a special IAM role for the DR recovery process, which needs more permissions than a DB instance normally has, for example, to create a new EBS volume and attach it to the DR instance. Once the recovery is done, the instance's IAM role is reverted to the normal IAM role for DB operations.

2. DR Recovery Playbook

  • The DR Recovery playbook is used to set up the DR instance so that it is updated with all the latest binlogs from the DR S3 bucket. 
  • It creates a temporary MySQL server inside a Docker container on the same instance.
  • It temporarily sets up the local MySQL server as a replication client of the MySQL server in the Docker container, so it can apply the latest binlogs from the DR S3 bucket for the product/shard.

3. DR Binlog streamer docker

  • The binlog streamer is a Docker container created by the DR recovery playbook.
  • It downloads from the DR S3 bucket the binlogs created after the time at which the DR snapshot was taken. These binlog files are copied into the binlogs folder of the MySQL server inside the DR binlog streamer container, and a binlog index file called mysql-bin.index is created.
  • A temporary MySQL server is started within the container, ready to serve the binlogs from its binlog folder to clients.
  • Once the client has replicated the latest binlog files, the container is stopped and the temporary volume created for it is removed.
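The selection step above, picking only the binlogs created after the snapshot time, can be sketched with a helper that parses the `<timestamp>_<binlog_name>` suffix of each S3 key (the key layout from the backup section). Function and parameter names are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// binlogsAfter returns the binlog object keys whose embedded timestamp is
// at or after the snapshot time, i.e. the files the streamer must download
// and replay. Keys are assumed to end in "<timestamp>_<binlog_name>".
func binlogsAfter(objects []string, snapshotTS int64) []string {
	var out []string
	for _, o := range objects {
		base := o[strings.LastIndex(o, "/")+1:] // strip the key prefix
		us := strings.Index(base, "_")
		if us < 0 {
			continue // not a binlog object; skip
		}
		ts, err := strconv.ParseInt(base[:us], 10, 64)
		if err != nil {
			continue // malformed timestamp; skip
		}
		if ts >= snapshotTS {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	objs := []string{
		"shard/2021/April/1/uuid/100_mysql-bin.005074",
		"shard/2021/April/1/uuid/200_mysql-bin.005075",
	}
	fmt.Println(binlogsAfter(objs, 150))
	// [shard/2021/April/1/uuid/200_mysql-bin.005075]
}
```

Replaying only binlogs newer than the snapshot is what turns the snapshot-plus-binlog combination into point-in-time recovery: the snapshot provides the base state, and the filtered binlogs bring it forward to the desired moment.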

4. DR Recovery Rundeck Jobs

  • The DR Recovery Rundeck job is the UI provided for the DBA to start the DR recovery process, taking inputs such as the product/shard for which the DR instance needs to be created.
  • The Rundeck job internally calls the DR playbook with the options the user provided in the UI.

Conclusion

Regular backups and a robust recovery mechanism are essential for any database management system. For disaster readiness, a backup and recovery policy that keeps DB backups ready in another region ensures that a DB instance can be recovered there at any time, for data safety and business continuity.