Behind the scenes: Safeguarding Freshdesk’s infra from BG jobs

Freshdesk uses a shell architecture to reduce the blast radius. Please read about how we use HAProxy to route foreground requests and isolate customers here.

The inspiration behind our shell architecture is Shopify's pod architecture. A shell separates customer traffic into individual cells, or logically isolated infrastructure components, to avoid the noisy neighbor problem.

The diagram below captures how we use shells to isolate customers. We keep around 60K requests per minute (RPM) in one shell and group multiple shards together. Issues in one shell will not affect another, reducing the blast radius.

While shell architecture helps us reduce the blast radius, we rely heavily on Sidekiq for processing and managing the threads for background jobs. Freshdesk processes more than 100M background jobs every day, which we classify for better efficiency (the architecture for this was covered in an earlier post). While using custom middleware and managing the queues has given us better efficiency, it has also increased our blast radius and limited our reaction time.

Before we jump into how we reduced the outages and scaled our background jobs, here is a brief introduction to Freshdesk database architecture.

Freshdesk database architecture

Freshdesk is a read-heavy application; we use MySQL replication and distribute reads among different replicas for high availability. Customers are partitioned across different shards to scale horizontally, and every shard has a few replicas to serve different kinds of traffic. Please refer to this earlier post for how we horizontally scale with sharding.

We have a primary and two secondary replicas for every shard to handle the load. All database writes go to the primary, and replica reads are partitioned into two: one replica serves API and front-facing traffic, and the other serves our background jobs, reporting, and analytics. We traded a little consistency for better availability (CAP theorem).
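As an illustration of this split, here is a minimal sketch of how such read/write routing can be expressed with Rails' multi-database support; the connection, role, and model names are placeholders rather than our actual configuration.

```ruby
# app/models/application_record.rb
# Minimal sketch of primary/replica routing with Rails 6+ multi-database
# support. The connection names (:primary, :primary_replica) are placeholders
# and must match the entries in database.yml.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Writes go to the primary; reads can be routed to a replica role.
  connects_to database: { writing: :primary, reading: :primary_replica }
end

# Explicitly running a read against the replica, e.g. from a background job
# or a reporting code path (Ticket is an illustrative model):
ActiveRecord::Base.connected_to(role: :reading) do
  Ticket.where(status: "open").count
end
```

In practice, different replicas can back different read paths (one for API traffic, another for background jobs and analytics), which is the split described above.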

After analyzing outage data, we observed that background jobs with suboptimal database queries running in parallel caused 70% of our database-related outages. 

We have categorized the issues into:

  1. High write volume causing lag between the primary and replicas 
  2. Sub-optimal queries from workers running concurrently 

At Freshworks, we took proactive and reactive measures to address the problem. 

Performing many writes on the primary can cause massive lag on the replicas, resulting in considerable performance degradation and a poor customer experience. After much analysis, we discovered that only a few workers were responsible for many of these writes. We classified those workers and ensured that we throttled their writes to our MySQL database.

Solution 

For throttling, we needed a service that could identify primary-replica lag and give feedback to the background app servers to pause or postpone writes. MySQL exposes a command that comes to our rescue here: using SHOW REPLICA STATUS, we can check how far a replica is behind the primary and use this as feedback for our application. Initially, we wrote our own service that executed SHOW REPLICA STATUS periodically, and our background applications used its output to decide whether to throttle writes or continue. The drawback of that approach is that the service itself needs high availability, along with other best practices, for the solution to extend to any database store. Another option was to resort to ProxySQL, but ProxySQL lacks a REST API that applications can make use of. After a brief search in the open-source world, we landed on Freno.
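To make that first approach concrete, here is a minimal sketch of the kind of check such a service performs; the connection details and threshold are illustrative, and on MySQL versions older than 8.0.22 the command and column are SHOW SLAVE STATUS and Seconds_Behind_Master instead.

```ruby
require "mysql2"

# Illustrative connection settings and threshold; not our production values.
LAG_THRESHOLD_SECONDS = 5

replica = Mysql2::Client.new(
  host: "replica.example.internal",
  username: "monitor",
  password: ENV["REPLICA_MONITOR_PASSWORD"]
)

# SHOW REPLICA STATUS returns one row per replication channel;
# Seconds_Behind_Source is NULL when replication is stopped or broken,
# so a missing value is treated as "throttle".
row = replica.query("SHOW REPLICA STATUS").first
lag = row && row["Seconds_Behind_Source"]

throttle_writes = lag.nil? || lag > LAG_THRESHOLD_SECONDS
```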

Enter Freno: What is Freno?

Freno is a cooperative, highly available throttler service. Clients use Freno to throttle writes to a resource. The current implementation can throttle writes to (multiple) MySQL clusters based on the replication status of those clusters. Freno will throttle cooperative clients when replication lag exceeds a pre-defined threshold. Freno dynamically adapts to changes in server inventory; the user can further control it to force the throttling of certain apps. Freno is highly available and uses the Raft consensus protocol to decide leadership and pass user events between member nodes. You can read more about Freno here.

(Image credit: Freno GitHub)

The above diagram represents the Freno request flow. Our Rails app consults Freno to check whether it can write to the backend databases.

Freno also has a Ruby client, which we can use to throttle our BG jobs; alternatively, we can call the REST endpoints exposed by Freno to achieve the same. We can write a sample Sidekiq worker that postpones its operations based on the answer from the Freno service.
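For instance, with the REST interface a client issues GET /check/&lt;app&gt;/&lt;store-type&gt;/&lt;store-name&gt; (per Freno's documentation) and treats an HTTP 200 as permission to write; the host, app, and store names below are placeholders.

```ruby
require "net/http"
require "uri"

# Placeholder Freno endpoint; the app ("sidekiq") and store ("shard1")
# names are illustrative.
FRENO_URL = "http://freno.example.internal:8111".freeze

def write_allowed?(app: "sidekiq", store: "shard1")
  uri = URI("#{FRENO_URL}/check/#{app}/mysql/#{store}")
  # Freno answers 200 when it is OK to write; any other status means hold off.
  Net::HTTP.get_response(uri).is_a?(Net::HTTPOK)
end
```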

A sample worker is attached here.
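A minimal sketch of such a worker, assuming the freno-client gem's check? and replication_delay interface and placeholder app, store, and queue names, could look like this:

```ruby
require "sidekiq"
require "faraday"
require "freno/client"

class ThrottledWriteWorker
  include Sidekiq::Worker
  sidekiq_options queue: :heavy_writes # placeholder queue name

  # Placeholder Freno endpoint, app, and store names.
  FRENO = Freno::Client.new(Faraday.new(url: "http://freno.example.internal:8111"))
  FRENO_APP  = :sidekiq
  STORE_NAME = :shard1
  MIN_DELAY_SECONDS = 60

  def perform(*args)
    unless FRENO.check?(app: FRENO_APP, store_name: STORE_NAME)
      # Replica is lagging: re-enqueue after the observed lag or 60 seconds,
      # whichever is higher.
      lag = FRENO.replication_delay(app: FRENO_APP, store_name: STORE_NAME)
      self.class.perform_in([lag.to_f.ceil, MIN_DELAY_SECONDS].max, *args)
      return
    end

    # ... the actual write-heavy work goes here ...
  end
end
```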

What the worker does is pretty simple: before execution, it checks whether there is replication lag and proceeds with the action if there is none. If there is lag, we re-enqueue the job after a delay of either the observed lag in seconds or 60 seconds, whichever is higher. Because we check for lag before starting execution, Sidekiq jobs do not cause lag in the replica. Next, we will discuss how we tackle sudden spikes in the DB caused by BG jobs.

Problem

Bad queries and workers running concurrently cause CPU spikes in the DB.

Although we use sharding to scale horizontally, multiple tenants residing in the same shard can still suffer from noisy neighbor problems. A Sidekiq job that is benign during low-throughput hours can suddenly become a disruptive, CPU-heavy job for the DB: a sudden inflow of events, or a newly introduced worker with a bad query executed at high concurrency, can bring down the DB.

Solution

For the replica lag scenario, we use Freno to throttle the workers that we know are responsible for massive writes to the primary. This proactive approach stops replica lag before it becomes too substantial to handle. For sudden spikes in DB CPU, we have observed that the spikes arise from many Sidekiq jobs, both old and new. By default, Sidekiq provides mechanisms to pause a queue and delete a worker's jobs. However, we needed more granular control over Sidekiq jobs at the tenant level, so we devised three approaches to control them:

  1. Functionality to drop or blackhole a worker at the tenant level. Blackholing simply means preventing an already enqueued job from executing. Both the Sidekiq client and server middleware implement this feature.
  2. Ability to drop a worker class altogether. In this case, the jobs for every tenant get removed or blackholed. 
  3. Ability to reroute a job to another queue or a shell if required (high priority for business-critical jobs).

Points 1 and 2 are useful when the jobs are not business-critical and can be triggered again later. Point 3 applies when we cannot drop jobs that may be business-critical for our customers but can drain them slowly with less concurrency. At Freshworks, a buffer shell comes in handy in this case: we maintain a buffer shell for rogue jobs that can drain slowly. Sample code that achieves this by tweaking the Sidekiq server-side middleware is attached. We can also do the same on the client side to enhance the process.
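As a rough illustration of this kind of middleware (not our production code), the sketch below drops a worker class entirely, blackholes it for specific tenants, or reroutes it to a buffer queue; the control lists, worker names, and the assumption that the tenant id is the first job argument are all placeholders.

```ruby
require "sidekiq"

# Placeholder control lists; in practice these live in a fast config store
# (e.g. Redis) so they can be flipped quickly during an incident.
BLACKHOLED_WORKERS = ["ReindexWorker"].freeze                  # drop for every tenant
BLACKHOLED_TENANTS = { "ReportWorker" => [42, 1001] }.freeze   # drop for specific tenants
REROUTED_WORKERS   = { "ExportWorker" => "buffer_shell_queue" }.freeze

class JobControlMiddleware
  # Sidekiq server middleware: wraps the execution of every job.
  def call(worker, job, queue)
    klass     = job["class"]
    tenant_id = job["args"]&.first # assumes the first argument is the tenant id

    # 2. Drop (blackhole) the worker class altogether.
    return if BLACKHOLED_WORKERS.include?(klass)

    # 1. Blackhole the worker only for specific tenants.
    return if BLACKHOLED_TENANTS.fetch(klass, []).include?(tenant_id)

    # 3. Reroute business-critical jobs to a buffer queue/shell that is
    #    drained with low concurrency.
    if (target_queue = REROUTED_WORKERS[klass]) && queue != target_queue
      worker.class.set(queue: target_queue).perform_async(*job["args"])
      return
    end

    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add JobControlMiddleware
  end
end
```

Doing the same checks in client middleware prevents such jobs from ever being enqueued, which is the client-side enhancement mentioned above.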

What is the specific role of Freno in this architecture?

Impact: Having Freno throttle jobs has eliminated our primary-replica lag, and today Freshdesk can serve customers with heavy write traffic without taking a hit. As a rule of thumb, isolate all the Sidekiq workers, classify them separately, and throttle them where possible; tools like Freno help achieve this and prevent replica lag.

For jobs that cause massive CPU spikes in the DB, we can tweak the Sidekiq middleware to blackhole them. If the jobs are critical, we can reroute them to a shell or a layer that runs with less concurrency, reducing the load on downstream systems like the DB and preventing an outage. This approach has saved us from many outages and has brought our mean time to recovery to within about 30 minutes of an incident.

Summary 

Freshworks uses a shell architecture to reduce the blast radius, but we need other mechanisms to enhance our reliability when it comes to background jobs. We employ both proactive and reactive measures to improve our availability. As a proactive measure, we use Freno to ensure our Rails apps can write to the backend databases without causing lag; our Sidekiq workers wait for Freno's answer before proceeding with their operations. To prevent CPU spikes in the DB, we isolate Sidekiq workers, classify them separately, and throttle them where possible. For jobs that cause massive CPU spikes, we can blackhole them or reroute them to a shell with less concurrency; both of these approaches are reactive. Overall, these defense mechanisms have improved our reliability drastically. Such mechanisms keep evolving, and new measures will be needed to protect the infrastructure.