Moving Elasticsearch clusters from OpsWorks to self-managed EC2 instances

The observability platform within Freshworks manages telemetry events such as logs, metrics, and traces generated by Freshworks product and platform teams. Elasticsearch is the data store for all log events: we run clusters totaling 600+ nodes and holding 2+ petabytes of data. The Elasticsearch clusters are hosted in AWS OpsWorks and orchestrated using Chef cookbooks. Long before AWS announced the end of life for OpsWorks, our goal was to move away from it because of a few drawbacks:

  • Limited support for different types of instances
  • High instance setup time

Existing setup

We orchestrate our Elasticsearch clusters in AWS OpsWorks with Chef cookbooks (Chef 11), running Elasticsearch 6.8.4. The OpsWorks stack has master, client, and data layers. The Elasticsearch client layer receives read traffic from the API layer and write traffic from worker nodes.

New solution

The solution uses AWS EC2 launch templates + Rundeck + Ansible playbooks.

We considered running Chef cookbooks on machines booted from EC2 launch templates, but that has a few drawbacks:

  • We need to install the Chef client on each machine, which increases boot time
  • We miss out on OpsWorks lifecycle management
  • Remote execution is better with Ansible

Ansible:

Ansible is an open-source automation tool that allows us to define infrastructure as code using Ansible playbooks. Playbooks specify the desired state of our systems, configurations, and applications.

Rundeck:

Rundeck is an open-source software platform that provides a centralized interface for automation and orchestration across various systems.

EC2 launch templates:

Launch templates provide a way to define the configurations for your instances, including the instance type, Amazon Machine Image (AMI), security groups, key pairs, and other settings. This makes it easier to maintain consistency across your instances and helps automate the process of instance provisioning.

The solution is to boot an instance with the desired configuration from a launch template and then, via Rundeck, execute an Ansible playbook that installs and configures Elasticsearch.

Implementation

Adding EC2 launch templates

We’ve created launch templates via Terraform with the required instance type(s), AMI, security groups, subnet, and the tags needed for cluster formation. Below are a few tags that the Ansible playbooks rely on to replicate what we previously derived from OpsWorks:

  • cl_name - cluster name
  • cl_layer - node type in the cluster (master/client/data)
  • cl_nodetype - storage tier (hot/cold) for data nodes
  • Name - unique instance name

Earlier we derived this information from the OpsWorks layer name. For example, a layer named cluster-1-data-cold indicates an Elasticsearch data node in the cold tier that belongs to cluster-1.
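
How the playbooks pick these tags up at run time can be sketched roughly as below. This is a minimal, illustrative snippet rather than our production playbook: it assumes a recent amazon.aws collection, boto3 on the instance, and an instance profile allowed to call ec2:DescribeTags; the registered variable names are ours for illustration.

# Gather instance metadata (instance id, availability zone) from the EC2 metadata service
- name: Gather EC2 metadata facts
  amazon.aws.ec2_metadata_facts:

# Look up the tags attached to this instance; the region is derived by trimming
# the AZ letter (e.g. us-east-1a -> us-east-1)
- name: Fetch the tags of the current instance
  amazon.aws.ec2_tag_info:
    region: "{{ ansible_ec2_placement_availability_zone[:-1] }}"
    resource: "{{ ansible_ec2_instance_id }}"
  register: instance_tags

# Derive cluster facts from the tags; this replaces parsing the OpsWorks layer name
- name: Set cluster facts from tags
  set_fact:
    cl_name: "{{ instance_tags.tags.cl_name }}"
    cl_layer: "{{ instance_tags.tags.cl_layer }}"
    cl_nodetype: "{{ instance_tags.tags.cl_nodetype | default('') }}"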

Ansible playbooks

For Elasticsearch installation and configuration, the existing recipe functionality is replicated using Ansible playbooks. The playbooks use EC2 metadata (availability zone) and the custom instance tags (name, cluster, layer: master/client/data) to render the elasticsearch.yml file.
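
As an illustration, a trimmed-down elasticsearch.yml template using the tag-derived variables from the previous snippet might look like the sketch below (the attribute names such as box_type and the exact set of settings are illustrative, not a copy of our production template):

# templates/elasticsearch.yml.j2 - simplified sketch
cluster.name: "{{ cl_name }}"
node.name: "{{ instance_tags.tags.Name }}"

# Node roles derived from the cl_layer tag (client nodes end up with both set to false)
node.master: {{ 'true' if cl_layer == 'master' else 'false' }}
node.data: {{ 'true' if cl_layer == 'data' else 'false' }}

# Zone awareness from EC2 metadata; hot/cold tier from the cl_nodetype tag
node.attr.aws_availability_zone: "{{ ansible_ec2_placement_availability_zone }}"
cluster.routing.allocation.awareness.attributes: aws_availability_zone
{% if cl_layer == 'data' %}
node.attr.box_type: "{{ cl_nodetype }}"
{% endif %}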

Since we already have the recipes, we just need to write Ansible playbooks with the same functionality. Writing playbooks is similar to writing recipes; only the syntax differs. Here is an example of enabling and restarting Elasticsearch as a system service in Chef and in Ansible:

Chef:

Chef::Log.info "ES service: Adding es service to chkconfig - centos"
service 'elasticsearch' do
    supports :start => true, :stop => true, :status => true, :restart => true
    action [ :enable, :restart ]
end

Ansible:

---
- name: es_service
  tags: es_service
  block:
  - name: Enable service elasticsearch, and not touch the running state
    service:
      name: elasticsearch
      enabled: yes
  - name: Restart service elasticsearch
    service:
      name: elasticsearch
      state: restarted

In OpsWorks, we used the discovery.zen.ping.unicast.hosts list for master node discovery.

In the new setup, we use the Elasticsearch EC2 discovery plugin for master node discovery. These are the settings used in the configuration:

discovery.zen.minimum_master_nodes: 3
discovery.zen.fd.ping_retries: 5
discovery.zen.hosts_provider: ec2
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
discovery.ec2.tag.cl_layer: master
discovery.ec2.tag.cl_name: cluster-1

We’ve installed the EC2 discovery plugin and added cl_layer and cl_name tags to master nodes.
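
Installing the plugin itself is a single idempotent task in the playbook; a minimal sketch, assuming the standard RPM/DEB install paths:

# Install the discovery-ec2 plugin once; 'creates' makes the task idempotent.
# --batch auto-confirms the additional permissions the plugin requests.
- name: Install the Elasticsearch EC2 discovery plugin
  command: /usr/share/elasticsearch/bin/elasticsearch-plugin install discovery-ec2 --batch
  args:
    creates: /usr/share/elasticsearch/plugins/discovery-ec2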

Rundeck

A Rundeck job is added to run the desired Ansible playbook on the required EC2 machines. This is remote execution over SSH, using the same EC2 key pair as the EC2 machines.
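
For illustration, a stripped-down Rundeck job definition (in the exported YAML format) might look like the sketch below; the job name, option, inventory handling, and key path are placeholders rather than our actual job:

# Hypothetical Rundeck job: runs the playbook against hosts passed in as a job option.
- name: elasticsearch-setup
  description: Install and configure Elasticsearch on newly booted EC2 nodes
  defaultTab: nodes
  executionEnabled: true
  loglevel: INFO
  options:
    - name: target_hosts
      description: Comma-separated IPs of the new nodes
      required: true
  sequence:
    keepgoing: false
    strategy: node-first
    commands:
      # Ansible connects to the new nodes over SSH with the shared EC2 key pair
      - exec: ansible-playbook -i '${option.target_hosts},' --private-key /path/to/ec2_keypair.pem playbooks/elasticsearch.yml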

Migration

Once the implementation was done, we migrated the clusters in phases:

  1. Added cl_name and cl_layer tags to the existing master nodes in OpsWorks so that newly booted EC2 nodes could discover them.
  2. Added a few client nodes with the above approach to verify that they were able to join the same cluster.
  3. Moved data nodes in sets, adding new nodes and excluding old ones, to ensure there was no performance impact (see the sketch after the settings below).
  4. Stopped old nodes as soon as their shard allocation dropped to zero.
  5. Repeated steps 3 and 4 until all data nodes were moved.
  6. Moved the remaining client nodes.
  7. Moved the master nodes.

We’ve done this in a controlled manner using the following cluster settings:

curl -X PUT "localhost:9200/_cluster/settings?pretty" \
  -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "node_concurrent_incoming_recoveries": "1",
          "cluster_concurrent_rebalance": "2",
          "node_concurrent_recoveries": "1",
          "node_concurrent_outgoing_recoveries": "1"
        }
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "250mb"
      }
    }
  }
}
'
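
For steps 3 and 4, the old-node exclusion and the drain check can also be expressed as Ansible tasks. The sketch below is illustrative: old_node_ips is a hypothetical variable that would come from inventory, and the check simply prints the per-node shard counts.

# Exclude old data nodes by IP so their shards move to the new nodes
- name: Exclude old data nodes from shard allocation
  uri:
    url: "http://localhost:9200/_cluster/settings"
    method: PUT
    body_format: json
    body:
      transient:
        cluster.routing.allocation.exclude._ip: "{{ old_node_ips | join(',') }}"

# A node is safe to stop once it reports zero shards here
- name: Fetch shard allocation per node
  uri:
    url: "http://localhost:9200/_cat/allocation?v=true"
    return_content: true
  register: shard_allocation

- name: Show remaining shards per node
  debug:
    msg: "{{ shard_allocation.content }}"

As described in step 4, an old node was stopped only after its allocation dropped to zero.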

In all, we moved the 600+ node clusters, holding more than 2 petabytes of data, within about a week.

Conclusion

  • It took roughly a week to migrate all 600+ node clusters
  • A critical part of the migration was master node discovery, which we handled using the Elasticsearch EC2 discovery plugin
  • While migrating Elasticsearch clusters, the data transfer rate needs to be throttled so the clusters aren’t impacted
  • Though we did this for datastore instances, the same approach works for any stateless service as well