How we switched the AWS account for an Elasticsearch cluster with no impact on users

The Observability platform within Freshworks takes care of observability (logging, metrics, tracing) for all the product and platform teams. Earlier, we operated two Elasticsearch clusters serving live traffic, with similar use cases, in two different AWS accounts. However, for cost savings and easier maintenance it made more sense to run a single cluster in one AWS account. The challenge was to do this with zero downtime and no impact on end users.

Initial Setup

We orchestrate our Elasticsearch clusters using AWS OpsWorks with Chef cookbooks. The clusters were running Elasticsearch versions 6.8.12 and 6.8.4, both in the same region: us-east-1. In both clusters, the master, client, and data layers are separate.

Read and write traffic is load balanced to the API layer, which connects to the ES client nodes and performs the required operations.

As you can see in the diagram below, the architecture was identical in both accounts.

Strategy

Since there was only a minor version difference between the clusters, we did not expect any breaking changes between the two ES versions. After testing compatibility of index data and mappings on version 6.8.12, we decided to consolidate on a single cluster running this version.

The diagram above represents the state we wanted to achieve: a single cluster running in our primary account (A), with all the traffic from account B routed to A through HAProxy. Follow the green arrows to see where new requests flow. Routing through HAProxy was needed because we could not break the setup for our existing users, who required the same endpoint.

We also changed the behavior of the API service running in A so that API contracts did not change for users of the service in B.

For migrating data from 6.8.4 to 6.8.12 we used snapshot and restore, a built-in feature of Elasticsearch that is not covered in this post. More details are here.

Below, we detail the steps we performed to move the ES cluster and switch traffic.

Setting up the network connectivity

To move a live cluster from one account to another, we planned to boot new nodes in ‘A’ and stop the nodes in ‘B’ after the data transfer. This required nodes to discover each other across accounts, so we set up VPC peering between the VPCs where the clusters lived. Detailed steps to create VPC peering can be found here.

Once the peering was done, we needed to make sure the route table entries in both accounts targeted the peering connection ID, so that a route existed between the relevant subnets. Once done, an instance booted in ‘A’ was able to ping an instance in ‘B’. If this doesn’t work for you, check your security group settings.
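As a rough sketch, the route table entries can be added with the AWS CLI; the route table IDs, peering connection ID, and CIDR blocks below are placeholders, not values from our actual setup:

```shell
# Hypothetical IDs and CIDRs -- substitute your own.
# In account A: send traffic destined for B's VPC CIDR via the peering connection.
aws ec2 create-route \
  --route-table-id rtb-0aaaaaaaaaaaaaaaa \
  --destination-cidr-block 10.99.0.0/16 \
  --vpc-peering-connection-id pcx-0cccccccccccccccc

# In account B: the mirror route back to A's VPC CIDR.
aws ec2 create-route \
  --route-table-id rtb-0bbbbbbbbbbbbbbbb \
  --destination-cidr-block 10.10.0.0/16 \
  --vpc-peering-connection-id pcx-0cccccccccccccccc
```

The same route must exist on both sides, otherwise packets flow one way but replies are dropped.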

Data Migration

1- To boot nodes in A and have them join the cluster in B, we tweaked our Chef recipes so that the list of the masters’ IPs was fetched and the attribute discovery.zen.ping.unicast.hosts was set to these IPs in elasticsearch.yml. We made sure the new nodes used the same cluster name as the cluster already running. Once up, these nodes joined the cluster and we observed shards being moved to the new nodes.
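The relevant fragment of elasticsearch.yml on a new node in A would look something like this (the cluster name and IPs are illustrative, not our actual values):

```yaml
# Must match the name of the cluster already running in account B,
# or the new node will form its own cluster instead of joining.
cluster.name: observability-es

# Master node IPs in account B, reachable over the VPC peering connection.
discovery.zen.ping.unicast.hosts: ["10.99.22.10", "10.99.22.11", "10.99.22.12"]
```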


2- We controlled this movement and did not allow live (write) indices to move, since moving write indices takes more time. We disabled allocation for the live indices using the command below.

curl -X PUT "localhost:9200/live-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "routing.allocation.enable" : "none"
  }
}
'

This ensured the live indices could not be moved to other instances.

3- To move all the indices except the live indices off the old nodes, we excluded all the old data nodes using the command below.

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.99.22.26,10.99.22.76,10.99.22.177"
  }
}
'

4- Once all indices except the “live” ones were moved, we rolled over the live index. Since we had excluded the old nodes, all the new shards were allocated only to the new nodes. At this point it is possible to re-enable allocation for the remaining indices so that they are moved to the new nodes as well. With this, all data is migrated to the new nodes in A.
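A sketch of these two calls, assuming “live-index” is a write alias and “live-index-000001” is the concrete index behind it (both names are illustrative):

```shell
# Roll over the write alias: new writes go to a fresh index, which
# the allocation-exclude rule forces onto the new nodes in account A.
curl -X POST "localhost:9200/live-index/_rollover" -H 'Content-Type: application/json'

# Re-enable allocation on the old live index so its shards can now move too.
curl -X PUT "localhost:9200/live-index-000001/_settings" -H 'Content-Type: application/json' -d'
{
  "index" : {
    "routing.allocation.enable" : "all"
  }
}
'
```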

5- We booted the new master and client nodes in account ‘A’. We made sure they joined the cluster, that their roles were correct, and that discovery.zen.ping.unicast.hosts was set correctly.

Setting up HAProxy and switching traffic to Account A

We booted the HAProxy instances, which route the traffic to the load balancer in A, and attached them to a target group.

During testing we caught an issue with HAProxy: by default, it does DNS resolution only on startup. We were providing the DNS name of the LB in A, and once the LB’s IPs changed the system threw 503 errors, because the new IPs were never resolved. Adding the configuration below to HAProxy solved the issue for us.

resolvers myresolver
        nameserver dns1 dns_server_ip
        resolve_retries 30
        timeout retry   30s
        hold valid      30s
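For the resolver to take effect, the backend server line must reference it; a sketch, with the backend name and LB DNS name as placeholders:

```
backend es_api_a
    # 'resolvers myresolver' makes HAProxy re-resolve the LB's DNS at runtime
    # instead of only at startup.
    server lb_a internal-lb-in-a.example.com:443 resolvers myresolver check
```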

After testing connectivity and API responses through HAProxy, we switched traffic in the load balancer in B from the API service to the HAProxy target group.

Monitoring

We already had monitoring in place. Once the new nodes were booted in account A, the data migration was visible on the dashboards for the new nodes. Once the traffic switch was done, we could see traffic, and successful requests, arriving at our service in account A.

Clean up

Once the whole cluster and service were moved to account A, we made sure that no shards remained on the nodes in account B. We then stopped all nodes in account B, including the ES nodes and the proxy service. We stopped the masters in account B one at a time and made sure a new master was elected in account A.
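A quick way to verify that no shards remain on the old nodes is the _cat/shards API (the IPs below are the illustrative old-node addresses from step 3):

```shell
# List every shard with its node IP, then filter for the old data nodes.
# An empty result means no shards remain in account B.
curl -s "localhost:9200/_cat/shards?h=index,shard,prirep,node,ip" \
  | grep -E '10\.99\.22\.(26|76|177)'
```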

The VPC peering and related route table changes are still intact, as traffic still flows between the VPCs.