Scaling the Freshdesk web app for high availability

[Graduating from startup to scaleup means having to constantly evolve your applications to keep pace with your customers’ growth. In this Freshworks Engineering series, Rails@Scale, we talk about some of the techniques we employ to scale up our products.]

At Freshworks, we believe in making customers for life by ensuring good quality, an exceptional user experience, and high availability of our products. The Site Reliability Engineering (SRE) team specializes in high availability and runs our products with minimal downtime or interruption, thereby enabling our customers to deliver moments of wow to their own customers.

This article covers how the Freshworks SRE team scales the monolithic Rails app behind our Freshdesk product. At its peak, this application serves ~500K requests per minute (RPM). In upcoming posts, we will discuss the various tools and techniques used by our developers under the following themes:

  • Availability
  • Backtracking
  • Performance

In this article, we’ll provide a deep dive into Availability.

Tackling the Availability challenge with the Freshdesk web app

Freshdesk is a multi-tenant SaaS application with a sharded database layer. Incoming web requests are served by Passenger processes, which are single-threaded and can therefore serve only one user request at a time. Depending on the host where the application runs, multiple processes are spawned using copy-on-write, which enables us to handle multiple incoming web requests simultaneously.

High-level breakdown of Freshdesk web request ingress (Figure 1).

Freshdesk follows a subdomain-based multi-tenancy model where every tenant is assigned a unique subdomain. For instance, if customer1 and customer2 are two tenants, their assigned subdomains will be customer1.freshdesk.com and customer2.freshdesk.com respectively.
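To make the model concrete, the following is a minimal sketch of how a subdomain-based tenant lookup could work in a Rails controller. It is illustrative only; the Account model and its `subdomain` column are assumptions for the example, not Freshdesk's actual implementation:

# Illustrative only: resolve the tenant from the request subdomain.
# The Account model and its `subdomain` column are hypothetical stand-ins.
class ApplicationController < ActionController::Base
  before_action :set_current_account

  private

  def set_current_account
    subdomain = request.subdomains.first            # "customer1" for customer1.freshdesk.com
    @current_account = Account.find_by(subdomain: subdomain)
    head :not_found unless @current_account         # unknown tenant
  end
end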

As you can see in the image above, we use Amazon Route 53 for DNS, and the record sets corresponding to subdomains are mapped to one of the ALBs (1..n). TLS termination occurs at the ALB. When a request reaches an ALB, it is forwarded to one of the HAProxy instances (1..n) attached to that ALB. HAProxy in turn sends the request to an Envoy sidecar proxy, which forwards it to the Passenger Rails application container, whose breakdown is explained below.

Traffic flow inside the application container (Figure 2).

As explained, we use Envoy as a sidecar proxy and have configured Passenger with Nginx. Once a request reaches Envoy, it forwards the request to the Nginx workers, which then send it to the Passenger core. The Passenger core maintains a pool of Passenger application processes and forwards the request to one of the available processes in the pool.

The limitation of this single-threaded model is that at any given time a Passenger process handles only one user request. Spawning new processes has a time overhead and is limited by system resource constraints. As a result, if the available process pool is held up and takes longer than expected to process requests, it reduces our capacity to serve new requests. At ~500K RPM, even a small variance in response time can cause request queuing and result in a poor user experience for customers.
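For context, the size of this process pool and the depth of the request queue are controlled through Passenger's Nginx integration. The snippet below uses real Passenger directives, but the values are illustrative placeholders rather than our production settings:

# Illustrative Passenger (Nginx integration) tuning; values are placeholders.
passenger_max_pool_size 30;            # upper bound on application processes per host
passenger_min_instances 10;            # keep a warm pool to avoid spawn latency
passenger_spawn_method smart;          # copy-on-write (forked) process spawning
passenger_max_request_queue_size 100;  # fail fast instead of queuing indefinitely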

Hence, the challenge for us is to optimize and avoid long-running requests. A request can take a long time for many reasons, but the most common ones are:

  1. Long-running database transactions
  2. Unresponsive database shard
  3. Slow outbound network calls (third-party/AWS services)
  4. Spam and DoS for one tenant impacting others
  5. Other problems related to application code

Scenarios and solutions

A solution has been implemented for each of the above scenarios. Let's discuss each scenario and its solution in more detail in this section.

1) Long-running database transactions

Database transaction locks and long-running read queries that consume high CPU impact the performance of other queries being processed in parallel, and this latency can cascade back to the application, resulting in poor performance.

Solution

To solve this cascading problem, we use MySQL DB events to periodically check for long-running transactions or queries and abort them. When a transaction is aborted, the underlying TCP connection to the application is closed and the database carries out a rollback.

The following is a snippet of the DB event used to abort long-running transactions. It periodically checks the INNODB_TRX table and kills transactions that exceed the maximum allowed time.


DELIMITER $$
-- Replace x with the desired schedule interval (in seconds)
CREATE EVENT kill_long_transactions ON SCHEDULE EVERY x SECOND
DO
  BEGIN
      -- Transactions older than max_transaction_time seconds are killed
      DECLARE max_transaction_time INT DEFAULT 25;
      DECLARE done INT DEFAULT 0;
      DECLARE killed_id BIGINT;
      DECLARE killed_user VARCHAR(16);
      DECLARE killed_host VARCHAR(64);
      DECLARE time_taken VARCHAR(20);
      DECLARE running_stmt TEXT;

      -- Long-running transactions, excluding maintenance and system users
      DECLARE long_transactions CURSOR FOR
         SELECT
              trx.trx_mysql_thread_id AS thd_id,
              ps.user,
              ps.host,
              CURRENT_TIMESTAMP - trx.trx_started,
              trx.trx_query
         FROM INFORMATION_SCHEMA.INNODB_TRX trx
         JOIN INFORMATION_SCHEMA.PROCESSLIST ps ON trx.trx_mysql_thread_id = ps.id
        WHERE trx.trx_started < (CURRENT_TIMESTAMP - INTERVAL max_transaction_time SECOND)
          AND ps.user NOT IN ('migration', 'system user', 'event_scheduler')
          AND ps.COMMAND != 'Killed';

      DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

      OPEN long_transactions;
        kill_loop: LOOP
          FETCH long_transactions INTO
            killed_id, killed_user, killed_host, time_taken, running_stmt;

          IF done THEN
            LEAVE kill_loop;
          END IF;

          -- rds_kill closes the connection; InnoDB then rolls the transaction back
          SET @kill := killed_id;
          CALL mysql.rds_kill(@kill);
        END LOOP;
      CLOSE long_transactions;
  END$$
DELIMITER ;

Gotchas

A note of caution on aborting transactions: the application and the ORM must be tested against rollbacks and connection failures. A complete sanity check is recommended to guard against stale or orphaned data in the database. We have observed that older versions of Rails swallow errors during connection failures, which leads to orphaned records in the database.
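As a concrete illustration of the kind of check we mean, the sketch below simulates a connection killed mid-transaction and asserts that nothing was partially persisted. It is a hypothetical RSpec-style example with a stand-in Ticket model, not a test from our suite:

# Hypothetical RSpec example; Ticket is a stand-in model.
it "rolls back cleanly when the connection is killed mid-transaction" do
  expect {
    ActiveRecord::Base.transaction do
      Ticket.create!(subject: "example")
      # Simulate the server-side kill: the next statement fails on a dead connection
      raise ActiveRecord::StatementInvalid, "Mysql2::Error: MySQL server has gone away"
    end
  }.to raise_error(ActiveRecord::StatementInvalid)

  # The whole transaction must have rolled back; no partial rows should remain
  expect(Ticket.where(subject: "example")).to be_empty
end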

2) Unresponsive shard

We use Amazon RDS for MySQL, and a database shard can become unresponsive in scenarios such as:

  • A Multi-AZ failure with a switchover
  • Connections to the shard being interrupted
  • A network partition

Solution

Request queuing and capacity loss when one shard is unresponsive can be handled with a circuit breaker in the application layer, which immediately fails requests to the unresponsive shard.

We were looking for a circuit breaker gem with adapters for MySQL, Redis, and HTTP. On further analysis, we found that Semian, built by Shopify, is the best production-tested circuit breaker gem for a Rails application.
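As a rough illustration of how Semian's MySQL circuit breaker can be wired into a Rails app, the snippet below follows the Rails integration described in Semian's README. The shard entry and thresholds are illustrative placeholders, not our production values:

# Illustrative database.yml entry (per Semian's documented Rails integration);
# the host and thresholds are placeholders, not production values.
production_shard_1:
  adapter: mysql2
  host: shard1.example.internal
  semian:
    tickets: 1            # bulkhead: concurrent connections allowed per process
    success_threshold: 2  # successes required to close the circuit again
    error_threshold: 3    # errors required to open the circuit
    error_timeout: 10     # seconds the circuit stays open before a retry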

We wanted to ensure there were no false positives in production, so we built a dry-run mode in which circuit state transitions are only logged, without affecting the real network connection. This helped us identify bugs and unforeseen cases, which are listed below:

  • A MySQL timeout error was not captured, and we fixed it with a PR to the source repository.
  • APM tools integrated into the application, such as New Relic, maintain their own DB connection pool to run explain queries, and these queries also count toward Semian. We configured New Relic to turn off explain queries.

Prolonged request queuing observed without Semian (Figure 3)

 

Limited request queuing observed with Semian (Figure 4)

Gotchas

Although Semian helped us fail faster whenever there was a DB network connectivity issue, we realized it is not the best solution for the following cases:

  1. A database can become slow or unresponsive when CPU consumption is high, or when a transaction locks just a row or a table. Locks primarily happen in a small subset of busy tables, and Semian cannot be configured to fail requests only for a subset of tables in the database. Semian works best when the entire DB is unhealthy on the network side, for example during a Multi-AZ failover.
  2. The Semian circuit breaker works at the per-process level; distributed or per-host instrumentation is still under development.

We continue to have Semian enabled in production, and it has helped us as shown in Figures 3 and 4. As a next iteration, we are evaluating a ProxySQL sidecar with every application pod, which will give us better connection and query control along with capabilities like a per-host circuit breaker, caching, and observability.

3) Slow outbound network calls

Like most modern web applications, Freshdesk makes calls to various microservices and downstream services, mostly over HTTP(S). A connectivity issue or latency in any of the downstream services blocks new requests from being processed, which in turn contributes to request queuing in the application.

Solution

We configure strict connect, read, and write timeouts on all HTTP client libraries used in the application. For frequently used microservices like the search service, we also have a circuit breaker on top of HTTP, which helps fail requests faster.
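As a simple illustration of the kind of timeouts we mean, here is a sketch using Ruby's standard Net::HTTP client. The host and values are placeholders; the actual client libraries and settings vary across our services:

require "net/http"

# Illustrative only: strict timeouts on an outbound HTTP call.
# The host and timeout values are placeholders, not production settings.
http = Net::HTTP.new("search.internal.example.com", 443)
http.use_ssl = true
http.open_timeout  = 1   # seconds to establish the TCP/TLS connection
http.read_timeout  = 2   # seconds to wait for a response chunk
http.write_timeout = 2   # seconds to write the request body (Ruby 2.6+)

response = http.get("/v1/search?q=example")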

We also proxy all outbound calls to our internal microservices through an Envoy sidecar proxy, which gives us better visibility and controls such as rate limiting, connection pooling, timeouts, retries, and response caching.

Request flow for a search request, where the application makes an outbound call to the search service via the Envoy sidecar proxy (Figure 5)

 

Metrics for outbound external calls captured using the Envoy sidecar proxy

4) Spam and DoS for one tenant impacting others

Request spam targeting a single tenant, or abnormal use or misuse by tenants, on a shared multi-tenant cloud architecture causes latency and performance problems, since much of the capacity is consumed by these abnormal requests. Throttling spam or overuse is therefore a necessity.

We have throttling capabilities for both web requests and background jobs. For web requests, we have an Envoy sidecar proxy with the ability to throttle incoming requests based on parameters like domain and IP. For background jobs, we use Sidekiq and have server-side middleware to pause jobs for a given tenant, and we are currently evaluating on-the-fly throttling capabilities for any given Sidekiq job kind.
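To illustrate what such a server-side middleware might look like, here is a simplified sketch. The Redis set of paused tenants and the assumption that the first job argument identifies the tenant are illustrative choices, not Freshdesk's actual implementation:

# Simplified sketch of a Sidekiq server middleware that pauses jobs per tenant.
# The "paused_tenants" Redis set and the tenant-id argument are assumptions.
class PauseTenantJobs
  RETRY_DELAY = 5 * 60 # seconds to push a paused tenant's job into the future

  def call(worker, job, _queue)
    tenant_id = job["args"].first # assumes the first argument identifies the tenant

    if paused?(tenant_id)
      # Re-enqueue for later instead of executing now
      worker.class.perform_in(RETRY_DELAY, *job["args"])
      return
    end

    yield # tenant is not paused; run the job normally
  end

  private

  def paused?(tenant_id)
    Sidekiq.redis { |conn| conn.sismember("paused_tenants", tenant_id) }
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add PauseTenantJobs
  end
end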

In addition to throttling, we have configured HAProxy to deny requests for blacklisted subdomains and IPs.

A recap from Figure 1:

  • All our ingress to the web app flows through ALB -> HAProxy -> Envoy -> Application container.
  • Freshdesk follows a sub-domain strategy for tenants, i.e., if a tenant `abc` signs up for Freshdesk, their assigned domain will be `abc.freshdesk.com`.

The following is a snippet of the HAProxy configuration that helps us block requests for abusive subdomains and malicious IP addresses:


# Simple HAProxy config to block requests by client IP or subdomain
# (is_bad_request is an ACL defined elsewhere in the configuration)
acl blacklist_ip hdr_ip(x-forwarded-for,-1),map_ip(blacklisted_ips.lst) -m found
acl blacklist_domain hdr(host),map_str(blacklisted_domains.lst) -m found
http-request deny if is_bad_request or blacklist_ip or blacklist_domain

For cases where the subdomain cannot be blocked and the abuse originates from varying client IPs, we have dynamic throttling capabilities built in, using a combination of the Envoy sidecar proxy and our in-house throttling platform service, 'Fluffy'.

5) Other problems related to application code

An unintended loop in business logic, a bug in a dependent gem, or a badly implemented regex are some of the cases in which an incoming request can get stuck forever.

Solution

To identify such cases in production, we have configured long-running request alerts in Nagios and capture a thread dump of the affected process using GDB. After the thread dump, we send a SIGKILL to the long-running process, and the Passenger core spawns a new process to maintain the minimum pool size required to handle incoming requests.
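For reference, capturing a backtrace and recycling a stuck process typically looks like the sketch below; the process ID is supplied as an argument, and the exact commands in our tooling may differ:

# Illustrative only: dump all thread backtraces of a stuck process, then kill it.
# Usage: ./dump_and_kill.sh <pid-of-stuck-passenger-process>
PID=$1
gdb -p "$PID" -batch -ex "thread apply all bt" > "/tmp/passenger_${PID}_backtrace.txt"
kill -9 "$PID"   # Passenger core spawns a replacement process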

Conclusion

All of the above measures have helped us reduce the impact of long-running requests and ensure high availability. If you have thoughts or questions, feel free to reach out to us. Stay tuned for our next post, where we will take you through the tools we use for backtracking and debugging production incidents.

Related links

For Semian, Marginalia, and much more on scaling Freshdesk, watch the video of our tech talk at Ruby Conference India. You can also read this blog post, where we use various system debugging tools.

Acknowledgements

We thank the Shopify Engineering team for their open source work on Semian and their corresponding blog articles on implementation details.