Sidekiq queue management using custom middleware

[Ruby on Rails is a great web application framework for startups that want to see their ideas evolve into products quickly. It’s for this reason that most products at Freshworks are built using Ruby on Rails. Moving from startup to scale-up means having to constantly evolve your applications so they can keep pace with your customers’ growth. In this new Freshworks Engineering series, Rails@Scale, we will talk about some of the techniques and patterns we employ in our products to tune their performance and help them scale.]

Freshdesk is a cloud-based helpdesk product from Freshworks that’s designed to help companies wow their customers and support teams. The main application powering Freshdesk is a Rails monolith. In this article, we will cover Freshdesk’s usage of Sidekiq, the background job processing framework written in Ruby, and how we use custom middleware to manage the ever-growing number of queues in our Rails app.

In Freshdesk, background processing is achieved using a mix of SQS, Kafka, and Sidekiq consumers. The bulk of in-app background processing (jobs enqueued and processed by the Rails app) happens using Sidekiq. Why? For starters, Sidekiq is a super-fast Redis-based queueing system (Redis is an in-memory data structure store used as a database, cache, and message broker). Sidekiq uses threads to handle many jobs concurrently within the same process. For a product as big as Freshdesk, we have more than 200 background queues for various kinds of functionality, and that number will only grow as we build more features.
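For context, here is a minimal, hypothetical Sidekiq worker pinned to its own named queue (the worker class, queue name, and arguments are illustrative, not from our codebase); every new piece of functionality tends to add workers and queues like this, which is how the queue count keeps growing.

require 'sidekiq'

# Hypothetical worker: each functionality typically gets its own queue.
class TicketExportWorker
  include Sidekiq::Worker
  sidekiq_options queue: :ticket_export, retry: 3

  def perform(ticket_id)
    # export logic for the given ticket would go here
  end
end

# Enqueued from the app; the job lands in the 'ticket_export' queue in Redis.
TicketExportWorker.perform_async(42)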

We are hosted on Amazon Web Services and use its OpsWorks layering mechanism to create layers of EC2 instances, the virtual servers that run applications on AWS infrastructure. We use Chef recipes to route queues to different layers (the Chef platform automates infrastructure on AWS). What started as a round-robin distribution of queues across layers went out of control as we grew exponentially; at times it degenerated into an essentially random assignment of queues to layers.

At one point, we faced overutilization of instances in some layers and below-par utilization in others. Some layers were running at less than 10% utilization, while busier layers executing more jobs were processing at 60-70% capacity. We worked around this by booting a few extra machines in some layers, but that made the setup complex and unmanageable. Monitoring such a setup across four data centers was a nightmare.

Peak-time traffic in each DC

DC                   RPM
United States        240,000
Europe (Frankfurt)   45,000
Australia            10,000
India                40,000

The table shows why a single layering scheme and machine mix would not suffice: our traffic pattern differs from one DC to another. We wanted to solve this problem because monitoring and scaling the existing setup took a lot of our time. The nature of the application and its functionality also demanded more background processing, since we have strict guidelines around latency and customer experience.

This blog details the steps we took to manage this issue better:

  1. Analysis
  2. Approach
  3. Categorization
  4. Routing using Sidekiq middleware
  5. Layering

Identifying the problems

We did a whiteboard analysis of all the problems we faced because of the large number of queues. We narrowed them down to the following:

  • Identifying a queue backlog before customers start noticing the lag;
  • Optimal utilization of machines;
  • Priority execution of certain queues;
  • Number of processes assigned to a queue;
  • Configuration based on traffic patterns in each DC.

For example, one of our customers in the food delivery business sees an increase in orders during weekends, so we get a proportional increase in traffic from them then. They trigger exports, run reports on metrics, and assign tickets via round-robin, all of which run in background workers and have to adhere to strict service level agreements. On average, we process about 40,000 jobs an hour from them.

The right approach

We started drafting various options to solve these problems:

  1. Reducing the number of queues;
  2. Grouping jobs;
  3. Alternative to Sidekiq.

Reducing the number of queues

This option was to see if we could move some of the background processing into the request transaction itself. Processing a job in the foreground would make failures and delays easier to handle, but it would bring its own problems, such as increased response times and handling retries inside transactions. We have very strict SLAs for our response times, so we chose not to go ahead with this option.

Grouping jobs

Grouping multiple jobs together into a category could help us reduce the number of queues to maintain, and would address our primary maintenance problem. For example, some jobs don’t make many database queries: they might be sending out an email, pushing data to downstream systems, or hitting a webhook configured by a customer. Others involve heavy DB work, such as running automation rules and updating the database, exporting data triggered by a customer, and running SLA escalations.

Alternative to Sidekiq

Sidekiq was performing well for our needs; we did not find any problem with Sidekiq itself. Rather, the real issue was that an application as big as Freshdesk serves lots of customers with strict SLAs and response times, and that problem wasn’t going to be solved by moving to a new queueing system. So we were not keen on this approach.

We decided to go with grouping of queues into buckets and categorized them.

Next step: Categorization

After looking at what every queue does, we classified them into 10 buckets based on the type of work, with self-explanatory names such as ‘realtime’, ‘occasional’, ‘scheduled’, and ‘external’. The buckets are based on a combination of factors such as RPM, runtime, and SLA. These again change based on the DC and the environment the application is running in.

Here is the list of categories and the approximate number of queues inside each. Our staging setup is minimal, so we have fewer buckets in staging. We kept the business logic generic: all it needs is an array of buckets and a YAML file that maps queues to buckets (YAML is a human-friendly data serialization standard supported by virtually all programming languages). A sketch of such a mapping follows the table below.

Bucket         Approx. no. of queues inside the bucket
Realtime       40
Occasional     30
Scheduled      20
External       20
Bulk jobs      10
Maintenance    50
Email          10
Archive        5
Central        10
Free           10
Trial          10
Exports        10
Common         Remaining queues
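As a sketch of what such a mapping might look like (the file path, queue names, and loading code below are assumptions for illustration, not our exact configuration):

require 'yaml'

# config/sidekiq_classification.yml -- hypothetical layout, one entry per queue:
#
#   ticket_observer: realtime
#   webhook_worker:  external
#   data_export:     exports
#   sla_escalation:  scheduled
#
# Loaded once during app initialization; queues not listed keep their own names.
SIDEKIQ_CLASSIFICATION_MAPPING = YAML.load_file(
  Rails.root.join('config', "sidekiq_classification_#{Rails.env}.yml")
).freeze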

Routing using Sidekiq middleware

Before we get into the details of how we implemented this categorization, let’s look at how the application pushes jobs to Sidekiq and how they get picked up. When the app pushes a job, Sidekiq’s client middleware kicks in and the job is written to Redis under the given queue name. On the consumer side, the Sidekiq server polls Redis at frequent intervals and executes the jobs it fetches.

We decided to use Sidekiq’s middleware feature to build the categorization. We wrote a middleware that overrides the queue name with its classification name before the job is pushed to Redis. The queue-to-classification mapping is stored in a YAML file, and we cache the lookup to avoid frequent reads. Here is a small gist that finds and changes the queue name dynamically:


def queue_from_classification(queue_name)
  # cache locally for 5 mins to avoid Redis calls on every push
  members = []
  begin
    members = Rails.cache.fetch('SIDEKIQ_CLASSIFICATION', expires_in: 5.minutes) do
      $redis.smembers('SIDEKIQ_QUEUE_CLASSIFICATION_LIST')
    end
    members.include?(queue_name) ? SIDEKIQ_CLASSIFICATION_MAPPING[queue_name] : queue_name
  rescue StandardError => e
    Rails.logger.info "Error in fetching classification... #{e.message}"
    # members is still empty here, so the original queue name is returned unchanged
    members.include?(queue_name) ? SIDEKIQ_CLASSIFICATION_MAPPING[queue_name] : queue_name
  end
end

SIDEKIQ_CLASSIFICATION_MAPPING in the snippet above is loaded once from a YAML file during app initialization. The code is environment-agnostic: staging and production can categorize queues differently simply by providing different YAML files.
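Wired together, the client middleware might look roughly like the sketch below. The class name and registration code are illustrative rather than our exact production code, and they assume the queue_from_classification helper above is available to the middleware (for example, mixed in from a helper module).

# Sketch of a Sidekiq client middleware that rewrites the queue name to its
# bucket just before the job is pushed to Redis.
class SidekiqClassificationMiddleware
  def call(worker_class, job, queue, redis_pool)
    job['queue'] = queue_from_classification(queue)
    yield
  end
end

# Registered on the client side so every perform_async passes through it.
Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add SidekiqClassificationMiddleware
  end
end

Jobs enqueued from inside other jobs go through the Sidekiq server process, so the same chain would also need to be registered under Sidekiq.configure_server for those pushes to be classified.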

Sequence diagram of how a request flows from app to Sidekiq

Routing flow before and after the change (before/after diagrams)

Layering for optimal utilization

We wrote Chef recipes to create a layer for each Sidekiq classification, so the Sidekiq processes in a given layer listen only to that classification’s queue. Using this method, we were able to create 10 layers and remove the extra ones that weren’t needed. Because we knew the type of job each queue category does, we were able to boot an appropriate instance type in each layer, leading to optimal utilization of all machines.
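As a rough illustration of such a recipe (the attribute names, paths, and template are hypothetical), each layer carries a bucket attribute and Chef renders a Sidekiq configuration that subscribes only to that bucket’s queue:

# Hypothetical Chef recipe snippet: render a sidekiq.yml for this layer that
# listens only to the bucket assigned to it (e.g. 'realtime').
bucket = node['sidekiq']['bucket']

template '/etc/sidekiq/sidekiq.yml' do
  source 'sidekiq.yml.erb'
  owner  'deploy'
  group  'deploy'
  mode   '0644'
  variables(queues: [bucket], concurrency: node['sidekiq']['concurrency'])
end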

Finally, higher overall utilization

We took this live, and after a week we found that we had higher overall utilization and needed about 20% less infrastructure to handle our load. Based on the traffic at each of our DCs, we were able to allocate resources accordingly.

For debugging and maintenance, we added machine-type prefixes to the logs. This, along with the ability to set proper alerts, helps us isolate and debug issues quickly. The icing on the cake was that developers didn’t have to change the way they wrote a new background worker; the only additional change required was to add their queue to the YAML file when their code went for testing.