How we saved 20% on EC2 costs by migrating to AWS R6i instances

In the competitive landscape of cloud services, cost optimization is a perpetual goal for site reliability engineering (SRE) teams. Our recent effort involved a strategic shift in production workload from AWS R5 instances to a more cost-effective and performant solution. This blog post delves into our journey, the rigorous analysis, the challenges faced, and the substantial cost savings and performance gains we achieved by migrating to AWS R6i instances.

The selection process

Our journey began with a critical assessment of our existing infrastructure. We were running hundreds of AWS R5 instances to support our workload. While the R5 instances had served us well, we recognized that evolving technologies and pricing models might present better options. Looking ahead, and considering our anticipated growth in traffic based on past trends, we aimed to optimize both the cost and performance of our application. This required finding a solution that was not only cost-effective but also scalable and capable of handling a more intensive workload. 

With our application’s high CPU and memory demands in mind, we set out to identify instance types that would meet these specific needs. Our attention was drawn to the AWS Graviton instances, specifically the R6g and R7g, which offered a promising price-to-performance ratio due to their ARM-based architecture. We also considered the latest Intel instances, the R6i series, renowned for their improved performance at a comparable cost.

The selection process among these instances was thorough and data-driven. We established a series of benchmarks and load tests to emulate the most challenging aspects of our production environment. These tests were meticulously designed to push the instances to their limits and to compare their capabilities side by side. We concentrated on CPU-intensive tasks such as HTML parsing to establish a baseline performance metric, and we tested various API request patterns that our Rails application typically processes to ensure a comprehensive comparison.

Diving deep into load test results: Why R6i stood out

The load test results were eye-opening. The R6i instances demonstrated a clear advantage over Graviton instances as well as R5 instances (as claimed in AWS docs) in CPU-intensive operations. For instance, when running HTML parsing, a single-threaded test with 10,000 iterations showed that the R6i instances completed tasks significantly faster across various Ruby versions.
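A single-threaded microbenchmark of this kind can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: `count_open_tags` is a hypothetical stand-in for the real HTML parser (which is not named here), and `SAMPLE_HTML` is a made-up document. Only the 10,000-iteration count comes from the test described above.

```ruby
require "benchmark"
require "strscan"

# Hypothetical stand-in for the real HTML parser: counts opening tags with a
# StringScanner so that the sketch needs nothing beyond the standard library.
def count_open_tags(html)
  scanner = StringScanner.new(html)
  count = 0
  until scanner.eos?
    if scanner.scan(/<[a-zA-Z][^>]*>/)
      count += 1
    else
      scanner.getch # skip one character (text or a closing tag)
    end
  end
  count
end

# Runs the workload `iterations` times on the calling thread and returns the
# elapsed wall-clock time in seconds.
def time_parsing(html, iterations: 10_000)
  Benchmark.realtime do
    iterations.times { count_open_tags(html) }
  end
end

# Made-up sample document, purely for illustration.
SAMPLE_HTML = "<html><body>" + ("<div><p>hello</p></div>" * 20) + "</body></html>"

if __FILE__ == $PROGRAM_NAME
  elapsed = time_parsing(SAMPLE_HTML)
  puts format("Ruby %s on %s: 10,000 iterations in %.3fs",
              RUBY_VERSION, RUBY_PLATFORM, elapsed)
end
```

Running the same script on each instance type and Ruby version gives directly comparable timings, because the workload is fixed and uses a single core.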

Note: This test compares the performance of Graviton and Intel instances on a CPU-intensive operation (HTML parsing) across various Ruby versions.

During the read path tests of our Rails application (GET and LIST APIs), the R6i instances continued to impress. With a 30-minute test involving nine concurrent users and eight distinct URLs, the R6i instances exhibited lower latencies and higher throughput than the R5 and Graviton instances. The P90, P95, and P99 latencies were consistently better, and the average CPU utilization was lower, indicating a more efficient use of resources.

The create/update API tests further solidified R6i’s lead. With 20 concurrent users generating a maximum of 400 requests per minute, the R6i instances not only had lower latencies but also maintained higher throughput and lower CPU utilization compared to the other instance types.
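For readers less familiar with the P90, P95, and P99 figures used in these comparisons, here is a minimal sketch of how such percentiles can be derived from raw latency samples. It uses the nearest-rank definition, which is an assumption: load-testing tools may interpolate differently, and the method names are illustrative.

```ruby
# Nearest-rank percentile: the smallest sample such that at least `pct`
# percent of all samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, "samples must not be empty" if samples.empty?
  raise ArgumentError, "pct must be in (0, 100]" unless pct > 0 && pct <= 100

  sorted = samples.sort
  rank = (pct * sorted.length / 100.0).ceil # 1-based rank
  sorted[rank - 1]
end

# Summarizes a list of request latencies (in milliseconds) using the three
# percentiles compared across instance types.
def latency_summary(samples_ms)
  {
    p90: percentile(samples_ms, 90),
    p95: percentile(samples_ms, 95),
    p99: percentile(samples_ms, 99)
  }
end
```

Comparing tail percentiles rather than averages matters here because a small number of slow requests can hide behind a healthy-looking mean.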

The baseline of 400 requests per minute from 20 concurrent users for the defined test cases was finalized and replicated on the other instance types because that was the breaking point on R5, where CPU usage reached one full core.

The background job performance tests observed the behavior of one of our busiest background jobs, a Sidekiq real-time processing job, under load. They showed that the R6i instances had the highest throughput and the lowest latency and CPU usage. This testing was critical for ensuring that background processes did not become a bottleneck in our system.

The 30-minute test shows that R6i is better in all aspects. However, we noticed fluctuations in average memory usage caused by garbage collection (GC) frequency. These evened out in our three-hour test, which covered more GC cycles.
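The 400-requests-per-minute baseline with 20 concurrent users works out to 20 requests per minute per user, or one request every 3 seconds. The sketch below shows that arithmetic and one way a simulated user could pace itself to stay at that rate. The method names and the injectable `sleeper` are assumptions made for illustration, not the load-testing tool actually used.

```ruby
# Seconds between request starts for each simulated user, given an overall
# request rate shared evenly across all users.
def per_user_interval_seconds(total_requests_per_minute:, concurrent_users:)
  per_user_rate = total_requests_per_minute.to_f / concurrent_users
  60.0 / per_user_rate
end

# Issues `request_count` requests (the block), sleeping after each one for
# whatever remains of the interval so that the user does not exceed its rate.
# `sleeper` is injectable so the pacing logic can be exercised without waiting.
def run_paced(request_count:, interval_seconds:, sleeper: ->(s) { sleep(s) })
  request_count.times do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    remaining = interval_seconds - elapsed
    sleeper.call(remaining) if remaining > 0
  end
end

if __FILE__ == $PROGRAM_NAME
  interval = per_user_interval_seconds(total_requests_per_minute: 400,
                                       concurrent_users: 20)
  puts "Each user sends one request every #{interval} seconds"
end
```

Holding the request rate fixed across instance types is what makes the CPU and latency numbers comparable: every instance is asked to do the same amount of work.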

Preparing for a multi-architecture setup

While we ultimately selected the R6i instances for their superior performance and cost savings, conducting this load testing first required making our main application and the other components of our infrastructure ARM-compatible.

We began by addressing compatibility within our Rails application. This process involved identifying and updating gems that were not ARM-compatible. We made strategic changes, such as switching to the ‘cld’ gem for language detection, upgrading ‘curb’ for HTTP requests, replacing ‘therubyracer’ with Node.js for JavaScript runtime, and patching ‘authlogic’ to use an updated ‘scrypt’ gem.
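One way to find candidates for this kind of audit is to list the installed gems that compile native extensions or ship platform-specific binaries, since pure-Ruby gems are generally architecture-independent. The following is a rough sketch using RubyGems' own metadata, not necessarily how the audit was actually carried out. The method names `arm64_host?` and `gems_to_audit` are illustrative.

```ruby
require "rubygems"
require "rbconfig"

# True when the current Ruby was built for a 64-bit ARM CPU
# (reported as "aarch64" on Linux and "arm64" on macOS).
def arm64_host?
  %w[aarch64 arm64].include?(RbConfig::CONFIG["host_cpu"])
end

# Returns "name version" strings for gems that either build a native
# extension or are packaged for a specific platform. These are the gems
# most likely to need attention when moving to a new CPU architecture.
def gems_to_audit(specs = Gem::Specification.to_a)
  specs
    .select { |spec| !spec.extensions.empty? || spec.platform != Gem::Platform::RUBY }
    .map { |spec| "#{spec.name} #{spec.version}" }
    .uniq
    .sort
end

if __FILE__ == $PROGRAM_NAME
  puts "Running on ARM64: #{arm64_host?}"
  puts gems_to_audit
end
```

A gem appearing in this list is not necessarily broken on ARM. It only means the gem contains compiled code, so it is worth installing and testing on a Graviton instance.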

The ‘amazonlinux’ base image version that our Docker image was built on did not support ARM, prompting us to upgrade to ‘amazonlinux:2’. We replaced specific compilers and manually installed tools like ‘tar’, ‘gzip’, and ‘make’ that were not pre-installed. We also worked through a series of binary replacements and configuration changes, such as upgrading ‘exiftool’, switching to ‘protobuf-devel.aarch64’, and modifying Nginx installation options.

Then we built the app in an ARM-based instance. Building the Docker image on a Graviton instance involved setting up Docker and Git, cloning the necessary repositories, and executing a Docker build with the appropriate arguments. We verified the functionality by running the Docker container and making successful curl requests to our application endpoints.

The Kubernetes cluster setup required us to ensure that all dependencies, including sidecars, daemonsets, and additional deployments, were ARM-compatible. We built Docker images for each dependency, updating Dockerfiles with ARM-compatible binaries as needed. This process also involved upgrading base images and installing ARM-compatible versions of required binaries and development packages. Apart from the main app, more than 30 Docker images had to be made ARM-compatible using the same process.

Conclusion

Our migration to AWS R6i instances was a calculated, data-driven decision that resulted in 20% annual cost savings and a 15-20% improvement in overall latency.

[Figure: AWS, before (with R5)]
[Figure: AWS, after (with R6i)]

You can see considerable improvements in all aspects of the infrastructure.

The transition not only reflects our commitment to efficiency and performance but also underscores the importance of regular infrastructure reviews. We recommend that readers adopt a proactive approach to their infrastructure management. It’s crucial to evaluate usage patterns and stay informed about new options in the market. We suggest conducting a check at least once a quarter to ensure that your infrastructure aligns with your current needs and takes advantage of the latest technological advancements. 

In sharing our story, we hope to inspire others to closely examine their setups, run tests, and be open to making changes that could lead to better performance and savings. Remember, the cloud landscape is ever-evolving, and staying ahead is crucial.