How Freshdesk uses HAProxy at scale
Freshdesk believes in delighting customers with a top-of-the-line customer experience, and reliability plays a major role in that. As more and more organizations adopt cloud computing for greater innovation at lower costs, there is a rising anxiety about reliability, security, and stability. In a blog on Google Cloud, Dave Rensin (Director, Customer Reliability Engineering) relates that customer anxiety is inversely proportional to reliability, and hence it is important for us to focus on reliability.
Freshdesk is multi-tenant. Multi-tenancy means that multiple customers of a cloud vendor are using the same compute resources. In a multi-tenant architecture, an issue for a single customer can easily cascade and impact all its neighbors. As a matter of fact, there have been incidents in the past where a simple failure for one customer at the database level has threatened to take the entire web cluster down.
To build a reliable product, blast radius must be minimized. Blast radius is the maximum impact that might be sustained in the event of a failure. To minimize the blast radius and increase availability, shell architecture was introduced in Freshdesk. Shell architecture is a logical grouping of compute resources. Apart from providing high availability, shell architecture also gives the flexibility to increase compute capacity for specific use cases.
Freshdesk is powered by hundreds of web servers hosted in AWS. An application load balancer is used to receive all web traffic. In Freshdesk, HAProxy, a high availability proxy, has been deployed between the application load balancer and web servers.
HAProxy plays a major role in shell architecture and is the place where the routing logic for shells has been implemented.
In Freshdesk, HAProxy supports five types of shell routing:
- Host or domain based routing
- Path based routing
- Subscription type based routing
- Shard based routing
- Header based routing
Host or domain based routing
Freshdesk’s end-user traffic (customer’s customer) is unpredictable in most cases. There is usually a huge surge in traffic whenever there is a problem with the customer’s business. In such a scenario, the entire traffic for that customer can be routed to a separate shell, which gives the flexibility to have a dedicated compute capacity for that customer and minimize latency issues for other tenants.
Path based routing
In Freshdesk's HAProxy setup, maintenance activity URLs are routed to a separate shell. Because the background jobs these activities enqueue can be drained more slowly than real-time background jobs, a separate shell makes it possible to use a separate Redis endpoint without affecting the processing of real-time jobs.
Subscription type based routing
Freshdesk's HAProxy setup can route requests from customers with a specific subscription type to a separate shell.
Shard based routing
A shard is a horizontal partition of data in a database. Each shard is held on a separate database server instance to spread the load.
Whenever there is degraded performance or an outage at the shard level, that shard's traffic can be routed to a separate shell to isolate the problem. Freshdesk maintains a dedicated shell, called the 'buffer shell', for this purpose. Once the issue is resolved, the traffic is routed back to the original shell.
Apart from this, Freshdesk's web traffic is split across separate shells based on each shard's revenue and RPM (requests per minute).
Header based routing
During migration of customer data, there is usually a spike in background jobs, which slows the processing of jobs enqueued from live customer traffic. To solve this, a particular header is set on all migration requests so that they can be routed to a separate shell with a separate Redis endpoint for Sidekiq.
One might wonder, ‘Why header based routing, why not route that customer’s entire traffic to a separate shell?’
Well, most of the time, the migration happens after the customer has gone live. In such cases, domain based routing would not help much as both live traffic and migration traffic for that particular customer will end up in the same shell.
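The five rule types above can be pictured as an ordered series of checks against each request. The Python sketch below is illustrative only: Freshdesk's real rules live in HAProxy and Lua, and the rule order, shell names, `X-Migration` header, `/maintenance` prefix, and request fields used here are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Request:
    host: str
    path: str
    headers: dict = field(default_factory=dict)
    subscription: str = ""  # hypothetical: plan looked up for the account
    shard: str = ""         # hypothetical: database shard of the account


# Hypothetical rule tables; every name and value here is made up.
HEADER_RULES = {"X-Migration": "migration-shell"}
HOST_RULES = {"bigcustomer.example.com": "dedicated-shell"}
PATH_RULES = {"/maintenance": "maintenance-shell"}
SUBSCRIPTION_RULES = {"trial": "trial-shell"}
SHARD_RULES = {"shard_7": "buffer-shell"}  # e.g. a degraded shard


def match_shell(req: Request) -> Optional[str]:
    """Return the shell of the first matching rule, or None if none apply."""
    # Header rules are checked first so that migration traffic is separated
    # even when the same domain also carries live traffic. This ordering is
    # an assumption, not something the article states.
    for name, shell in HEADER_RULES.items():
        if name in req.headers:
            return shell
    if req.host in HOST_RULES:
        return HOST_RULES[req.host]
    for prefix, shell in PATH_RULES.items():
        if req.path.startswith(prefix):
            return shell
    if req.subscription in SUBSCRIPTION_RULES:
        return SUBSCRIPTION_RULES[req.subscription]
    return SHARD_RULES.get(req.shard)
```

With these made-up tables, a migration request for `bigcustomer.example.com` lands in `migration-shell` while the same customer's live traffic lands in `dedicated-shell`, which is the separation that domain based routing alone cannot provide.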
Now, let’s dig a level deeper and see how things are designed within a single shell. Within a shell, there are multiple layers, segregated by business use case and technical use case. A layer is a logical grouping of web servers that serve the same end-user functionality.
Though all the web servers run the same application, the requests to these web servers have been segregated based on the business requirements and nature of the web requests.
Business use case
Freshdesk is a B2B helpdesk SaaS product, and traffic from its customers' customers (the end-user support portal) is given the highest importance from an availability standpoint. Freshdesk wants its customers to give their own customers the best possible end-user experience in terms of availability and performance.
Freshdesk’s direct customers will log in to the product as agents and their customers will log in to the support portal or just use the support portal to get some queries clarified.
These 2 URLs below belong to the same account but serve different sets of users. The first one is for end users, and the second one is for support agents.
Freshdesk has a separate layer to handle end-user traffic, which is the support layer. With path based layer routing rules in HAProxy, support URL requests are routed to the support layer.
Though Freshdesk has autoscaling in place, a sudden burst in traffic may, at times, increase the time these web requests wait in the queue. Each layer is given its own headroom capacity, which provides a better cushion to absorb unexpected traffic and thus ensures better availability.
Apart from this, requests are also segregated from a technical standpoint.
Technical use case
Take a hypothetical example: a Tickets page might take ~100ms to load, whereas a Reports page might take ~300ms. If these two kinds of requests land in the same layer, a Tickets page request might have to wait in the queue behind a Reports page request that takes about 3x as long to process.
By segregating and routing these requests to separate layers, Freshdesk ensures a consistent performance for requests of more importance.
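The effect can be checked with a little arithmetic. The sketch below is a toy model, not a measurement: it assumes all requests arrive at the same instant, are served first-in first-out, and use the hypothetical 100ms and 300ms timings from the example above. It compares one shared layer with two workers against two separate layers with one worker each, so the total capacity is the same in both cases.

```python
import heapq


def queue_waits(service_ms, workers):
    """Wait time per request when all arrive at t=0 and are served FIFO."""
    free_at = [0] * workers  # time at which each worker next becomes free
    heapq.heapify(free_at)
    waits = []
    for service in service_ms:
        start = heapq.heappop(free_at)  # earliest available worker
        waits.append(start)
        heapq.heappush(free_at, start + service)
    return waits


TICKETS_MS, REPORTS_MS = 100, 300  # hypothetical timings from the text

# Shared layer: two Reports requests are queued ahead of two Tickets requests.
shared = queue_waits([REPORTS_MS, REPORTS_MS, TICKETS_MS, TICKETS_MS], workers=2)

# Separate layers: Tickets requests only queue behind other Tickets requests.
tickets_layer = queue_waits([TICKETS_MS, TICKETS_MS], workers=1)

print(shared[2:])     # [300, 300]
print(tickets_layer)  # [0, 100]
```

In this toy setup both Tickets requests wait 300ms in the shared layer, but at most 100ms once they have a layer of their own.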
This type of segregation also improves the debuggability of an issue that affects just one set of traffic. For example, if request queueing, memory usage or CPU is high only in the search layer, this helps in narrowing down the focus to that.
Incoming request flow
The diagram below explains, on a high level, what happens to any incoming request in HAProxy.
For any incoming request, a Lua function is used to choose the backend. Its first step is to choose a shell by checking the shell routing rules. If none of the rules apply, the default shell is used, which is currently configured in a JSON file maintained in Git.
After the shell selection, the layer routing rules are checked. If none of the rules apply, the default layer is used.
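The two-stage selection can be sketched as follows. This is a Python illustration of the flow rather than the actual Lua code: the JSON keys, shell and layer names, rule predicates, and the `shell_layer` backend naming scheme are assumptions, and the article only says that the default shell (not necessarily the default layer) comes from a JSON file.

```python
import json

# Hypothetical defaults; the real file's structure is not described.
DEFAULTS_JSON = '{"default_shell": "shell-1", "default_layer": "app"}'


def first_match(rules, request):
    """Return the target of the first rule whose predicate matches."""
    for predicate, target in rules:
        if predicate(request):
            return target
    return None


def choose_backend(request, shell_rules, layer_rules, defaults):
    """Pick a shell first, then a layer, falling back to the defaults."""
    shell = first_match(shell_rules, request) or defaults["default_shell"]
    layer = first_match(layer_rules, request) or defaults["default_layer"]
    return f"{shell}_{layer}"  # hypothetical backend naming scheme


defaults = json.loads(DEFAULTS_JSON)
shell_rules = [(lambda r: r["host"] == "bigcustomer.example.com", "dedicated")]
layer_rules = [(lambda r: r["path"].startswith("/support"), "support")]

print(choose_backend({"host": "a.example.com", "path": "/tickets"},
                     shell_rules, layer_rules, defaults))
# shell-1_app
print(choose_backend({"host": "bigcustomer.example.com", "path": "/support/home"},
                     shell_rules, layer_rules, defaults))
# dedicated_support
```

Keeping the shell decision and the layer decision independent means a request can fall back to the default on one axis while still matching a rule on the other.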
After the shell and the layer have been chosen, the web server where the request needs to be routed is selected using the least connection algorithm — which maintains a record of active server connections and forwards a new connection to the server with the least number of active connections.
In the diagram above, there are multiple web servers running on port 81. Web server 4 has the least number of active connections, and hence the request is routed there.
When more than one server has the same number of active connections, HAProxy falls back to round robin among them, a load balancing algorithm that selects servers in turn.
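The selection logic described above can be modeled in a few lines. This is a simplified Python model for illustration, not HAProxy's implementation: the class name, server names, and connection counts are hypothetical, and server weights and queued requests are ignored.

```python
class LeastConnBalancer:
    """Least connections, with ties broken by rotating through the servers."""

    def __init__(self, servers):
        self.order = list(servers)
        self.active = {s: 0 for s in self.order}  # active connections per server
        self._next = 0  # index where the next tie-breaking scan starts

    def acquire(self):
        """Pick a server for a new connection and count it as active."""
        lowest = min(self.active.values())
        n = len(self.order)
        for i in range(n):
            server = self.order[(self._next + i) % n]
            if self.active[server] == lowest:
                # Start the next scan just after this server, so servers
                # that are tied on load take turns.
                self._next = (self._next + i + 1) % n
                self.active[server] += 1
                return server

    def release(self, server):
        """Mark one connection on the given server as finished."""
        self.active[server] -= 1


lb = LeastConnBalancer(["web1", "web2", "web3", "web4"])
lb.active.update({"web1": 3, "web2": 2, "web3": 2, "web4": 1})

picks = [lb.acquire() for _ in range(4)]
print(picks)  # ['web4', 'web2', 'web3', 'web4']
```

Here `web4` is chosen first because it has the fewest connections. After that, `web2`, `web3`, and `web4` are tied at two connections each, and the tie is resolved by taking them in turn.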
In the next post, we’ll focus on how, fueled by HAProxy, Freshdesk achieves controlled rollouts, quick rollbacks, and reliable and consistent behavior for a particular domain.
Cover and image design: Vignesh Rajan