Scaling the Freshdesk web app for high availability
[Graduating from startup to scaleup means having to constantly evolve your applications to keep pace with your customers’ growth. In this Freshworks Engineering series, Rails@Scale, we talk about some of the techniques we employ to scale up our products.]
At Freshworks, we believe in making customers for life by ensuring good quality, an exceptional user experience, and high availability of our products. Our Site Reliability Engineering (SRE) team specializes in high availability and runs our products with minimal downtime or interruption, thereby enabling our customers to deliver moments of wow to their own customers.
This article covers how the Freshworks SRE team scales our monolithic Rails app for our Freshdesk product. At its peak, this application serves ~500K RPM. In upcoming posts, we will discuss the various tools and techniques used by our developers across several themes.
In this article, we’ll provide a deep dive into Availability.
Tackling the availability challenge with the Freshdesk web app
Freshdesk is a multi-tenant SaaS application with a sharded database layer. Incoming web requests are served by Passenger processes, which are single-threaded: at any point in time, each process can serve only one user request. Therefore, depending on the resources of the host where the application runs, multiple processes are spawned using copy-on-write, which enables us to handle multiple incoming web requests simultaneously.
Freshdesk follows a subdomain-based multi-tenancy model where every tenant is assigned a unique subdomain. For instance, if customer1 and customer2 are two tenants, then their assigned subdomains will be customer1.freshdesk.com and customer2.freshdesk.com respectively.
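Conceptually, resolving a tenant from the request's Host header looks like the following sketch; the `TENANTS` lookup table and method name are illustrative, not Freshdesk's actual code:

```ruby
# Illustrative subdomain-based tenant lookup; in production this would be
# backed by a datastore mapping subdomains to tenant/shard metadata.
TENANTS = {
  'customer1' => { shard: 'shard_1' },
  'customer2' => { shard: 'shard_2' },
}.freeze

# Resolve the tenant from the request's Host header.
def tenant_for(host)
  subdomain = host.to_s.split('.').first
  TENANTS[subdomain]
end

tenant_for('customer1.freshdesk.com') # => { shard: 'shard_1' }
```

Each resolved tenant record would then tell the application which database shard to route the request to.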
As you can see in the above image, we use Amazon Route53 for DNS, and the record sets corresponding to subdomains are mapped to one of the ALBs (1..n). TLS termination occurs on the ALB. When a request reaches an ALB, it is forwarded to one of the HAProxy instances (1..n) attached to the ALBs. HAProxy in turn sends the request to an Envoy sidecar proxy, which forwards it to the Passenger Rails app container, whose breakdown is explained below.
As explained, we use Envoy as a sidecar proxy and have configured Passenger with Nginx. Once the request reaches Envoy, it forwards the request to the Nginx workers, which then send it to the Passenger core. The Passenger core maintains a pool of Passenger application processes and forwards the request to one of the available processes in the pool.
The limitation of this single-threaded model is that at any given time only one user request is handled by one Passenger process. Spawning new processes has a time overhead and is limited by system resource constraints. As a result, if the available process pool is held up and requests take longer than expected to process, the capacity to serve requests is impacted. At ~500K RPM, even a small percentage variance in response time can cause request queuing and result in a poor user experience for our customers.
Hence, the challenge for us is to optimize for and avoid long-running requests. A request can take a long time for many reasons, but the most common ones are:
- Long-running database transactions
- Unresponsive database shard
- Slow outbound network calls (3rd party/AWS services)
- Spam and DoS for one tenant impacting others
- Other problems related to application code
Scenarios and solutions
For each of the above scenarios, we implemented a solution. Let’s discuss the scenarios and their solutions in more detail in this section.
1) Long-running database transactions
Database transaction locks and long-running read queries that consume high CPU impact the performance of other queries running in parallel, and this latency can cascade back to the application, resulting in poor performance.
To solve this cascading problem, we use MySQL events to periodically check for long-running transactions or queries and abort them. Aborting closes the underlying TCP connection to the application, and the database then carries out a rollback.
The following is a snippet of the database event used to abort long-running transactions. It periodically checks the INNODB_TRX table and aborts transactions that exceed the maximum allowed time.
```sql
DELIMITER $$
CREATE EVENT kill_long_transactions
ON SCHEDULE EVERY x SECOND
DO
BEGIN
  DECLARE max_transaction_time INT DEFAULT 25;
  DECLARE done INT DEFAULT 0;
  DECLARE killed_id BIGINT;
  DECLARE killed_user VARCHAR(16);
  DECLARE killed_host VARCHAR(64);
  DECLARE kill_stmt VARCHAR(20);
  DECLARE time_taken VARCHAR(20);
  DECLARE running_stmt TEXT;
  DECLARE long_transactions CURSOR FOR
    SELECT trx.trx_mysql_thread_id thd_id,
           ps.user,
           ps.host,
           CURRENT_TIMESTAMP - trx.trx_started,
           trx.trx_query
    FROM INFORMATION_SCHEMA.INNODB_TRX trx
    JOIN INFORMATION_SCHEMA.PROCESSLIST ps
      ON trx.trx_mysql_thread_id = ps.id
    WHERE (trx.trx_started < (CURRENT_TIMESTAMP - INTERVAL x SECOND))
      AND (ps.user != 'migration')
      AND (ps.user != 'system user')
      AND (ps.user != 'event_scheduler')
      AND (ps.COMMAND != 'Killed');
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  OPEN long_transactions;
  kill_loop: LOOP
    FETCH long_transactions INTO killed_id, killed_user, killed_host, time_taken, running_stmt;
    IF done THEN
      LEAVE kill_loop;
    END IF;
    SET @kill := killed_id;
    CALL mysql.rds_kill(@kill);
  END LOOP;
  CLOSE long_transactions;
END$$
DELIMITER ;
```
A note of caution on aborting transactions: the application and the ORM must be tested against rollbacks and connection failures. A complete sanity check for stale or orphaned data in the database is recommended. We have observed that older versions of Rails swallow errors during connection failures, which leads to orphaned records in the database.
2) Unresponsive shard
We use Amazon RDS for MySQL, and a database shard can become unresponsive in scenarios such as:
- A multi-AZ failure with a switchover
- When connections to the shard are interrupted
- A network partition
The request queuing and capacity loss caused by an unresponsive shard can be handled with a circuit breaker in the application layer, which fails requests to the unresponsive shard immediately instead of letting them queue.
We were looking for a circuit breaker gem with adapters for MySQL, Redis, and HTTP. On further analysis, we found that Semian, built by Shopify, is the best production-tested circuit breaker gem for a Rails application.
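As a rough sketch, Semian is configured per protected resource; the resource name and threshold values below are illustrative examples, not our production settings. With Semian's mysql2 adapter, these options are passed through the client's `semian:` key.

```ruby
# Illustrative circuit breaker settings for one database shard.
SEMIAN_SHARD_OPTIONS = {
  name: 'shard_1',        # one circuit per shard
  tickets: 3,             # bulkhead: concurrent resource acquisitions allowed
  success_threshold: 2,   # successes needed to close a half-open circuit
  error_threshold: 3,     # consecutive errors that open the circuit
  error_timeout: 10,      # seconds the circuit stays open before half-open
}.freeze

# With the mysql2 adapter, this would be wired up roughly as:
#   require 'semian/mysql2'
#   Mysql2::Client.new(host: '...', semian: SEMIAN_SHARD_OPTIONS)
```

Once the circuit opens for a shard, requests to it fail immediately until the error timeout elapses, protecting the Passenger process pool from being held up.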
We wanted to ensure there were no false positives in production, so we built a dry-run mode in which circuit state transitions are only logged, without affecting the real network connection. This helped us identify bugs and unforeseen cases, listed below:
- A MySQL timeout error was not captured; we fixed it with a PR to the source repository.
- APM tools integrated into the application, such as New Relic, maintain their own DB connection pool to run EXPLAIN queries. These queries count toward Semian’s thresholds, so we configured New Relic to turn off EXPLAIN queries.
Although Semian helped us fail faster whenever there was a DB network connectivity issue, we realized it is not the best solution for the following cases:
- A database can become slow or unresponsive when CPU consumption is high, or when a transaction locks only a row or a table. Locks primarily happen in a small subset of busy tables, and Semian cannot be configured to fail requests only to a subset of tables in the database. Semian works best when the entire database is unhealthy at the network level, for example during a Multi-AZ failover.
- The Semian circuit breaker works at the per-process level; distributed or per-host instrumentation is still under development.
We continue to run Semian in production, and it has helped us as shown in Fig 3 and Fig 4. As an iteration, we are evaluating a ProxySQL sidecar with every application pod, which would give us better connection and query control along with capabilities like per-host circuit breaking, caching, and observability.
3) Slow outbound network calls
Like most modern web applications, Freshdesk makes calls to various microservices and downstream services. Most of these calls are over HTTP(S). A connectivity issue or latency in any of the downstream services blocks new requests from being processed, which in turn contributes to request queuing on the application.
We configure strict connect, read, and write timeouts on all HTTP client libraries used in the application. For frequently used microservices like the search service, we add a circuit breaker on top of HTTP, which helps fail requests faster.
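For instance, with Ruby's standard Net::HTTP the three timeouts can be pinned explicitly; the host and values below are examples for illustration, not our production settings:

```ruby
require 'net/http'
require 'uri'

# Hypothetical outbound call to an internal search service with strict timeouts.
uri = URI('https://search.internal.example.com/query')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
http.open_timeout  = 2 # seconds allowed to establish the TCP/TLS connection
http.read_timeout  = 5 # seconds allowed to wait for response data
http.write_timeout = 2 # seconds allowed to send the request (Ruby >= 2.6)
```

Without explicit values, some clients default to very long or infinite timeouts, which is exactly what lets one slow downstream service hold Passenger processes hostage.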
We also proxy all outbound calls to our internal microservices through the Envoy sidecar proxy, which gives us better visibility and controls like rate limiting, connection pooling, timeouts, retries, and response caching.
4) Spam and DoS for one tenant impacting others
Request spam targeting a single tenant, or abnormal use or misuse by tenants on a shared multi-tenant cloud architecture, causes latency and performance problems, since most of the capacity is consumed by these abnormal requests. Throttling spam and overuse is therefore a necessity.
We have throttling capabilities for both web requests and background jobs. For web requests, the Envoy sidecar proxy can throttle incoming requests based on parameters like domain and IP. For background jobs, we use Sidekiq and have a server-side middleware to pause jobs for a given tenant; we are currently evaluating on-the-fly throttling capabilities for any given Sidekiq job kind.
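A minimal sketch of such a server-side middleware follows; the in-memory paused-tenant set and the assumption that a job's first argument is the account id are both illustrative, not Freshdesk's actual implementation:

```ruby
require 'set'

# Hypothetical Sidekiq server middleware that skips jobs for paused tenants.
# In production the paused set would live in a shared store such as Redis.
class PauseTenantMiddleware
  PAUSED_TENANTS = Set.new

  def call(_worker, job, _queue)
    account_id = job['args']&.first
    return if PAUSED_TENANTS.include?(account_id) # skip/defer paused tenants
    yield # run the job as usual
  end
end

# Usage sketch (normally registered via Sidekiq.configure_server):
middleware = PauseTenantMiddleware.new
PauseTenantMiddleware::PAUSED_TENANTS << 42
ran = false
middleware.call(nil, { 'args' => [42] }, 'default') { ran = true }
ran # => false, the paused tenant's job never executes
```

A real implementation would typically re-enqueue or dead-letter the skipped job rather than silently dropping it.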
In addition to throttling, we have configured HAProxy to deny requests for blacklisted subdomains and IPs.
A recap from Figure 1:
- All our ingress to the Web App is through ALB -> HAProxy -> Envoy -> Application container.
- Freshdesk follows a sub-domain strategy for tenants, i.e., if a tenant `abc` signs up for Freshdesk, their assigned domain will be `abc.freshdesk.com`.
The following is a snippet of the HAProxy configuration that helps us block requests from abusive sub-domains and malicious IP addresses:
```
# Simple HAProxy config to block IPs and domains
acl blacklist_ip hdr_ip(x-forwarded-for,-1),map_ip(blacklisted_ips.lst) -m found
acl blacklist_domain hdr(host),map_str(blacklisted_domains.lst) -m found
http-request deny if is_bad_request or blacklist_ip or blacklist_domain
```
For cases where the sub-domain cannot be blocked and the abuse originates from varying client IPs, we have built-in dynamic throttling using a combination of the Envoy sidecar proxy and our in-house throttling platform service, ‘Fluffy’.
5) Other problems related to application code
An unintended loop in business logic, a bug in a dependent gem, or a badly implemented regex are some of the cases where an incoming request can get stuck forever.
To identify such cases in production, we have configured long-running-request alerts in Nagios and capture a thread dump of the process using GDB. After the thread dump, we send SIGKILL to the long-running process, and the Passenger core spawns a new process to maintain the minimum pool size required to handle incoming requests.
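The dump-then-kill step can be sketched as below; the helper name, pid discovery, and dump command are assumptions for illustration, not our actual tooling:

```ruby
# Capture a backtrace for post-mortem analysis, then kill the stuck process.
# Passenger core respawns a replacement to maintain the minimum pool size.
def dump_and_kill(pid, dump_cmd: nil)
  # e.g. dump_cmd = "gdb -p #{pid} -batch -ex 'thread apply all bt'"
  system(dump_cmd) if dump_cmd
  Process.kill('KILL', pid)
end
```

In practice, the dump output is shipped to storage before the process is reaped, so the stack trace survives for debugging.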
All of the above measures have helped us reduce the impact of long-running requests and ensure high availability. If you have thoughts or questions, feel free to reach out to us. Stay tuned for our next post, where we will take you through the tools we use for backtracking and debugging production incidents.