Delivering delight and tracking it by the millisecond

At Freshdesk, we are obsessed with providing a great experience to our customers, so speed is critical to us. To achieve this, both the front end, which loads assets and paints the UI, and the back end, which serves data to the web and API clients, must perform well. In this post, we explain how we at Freshdesk track performance from a back-end perspective.

Freshdesk is predominantly built on Ruby on Rails. Over a span of 10 years, the codebase has grown considerably and now serves more than 1,600 URLs (controller#action pairs). We wanted to ensure that adding new features and improvements did not degrade the application's performance: customers should be delighted while using the application and should never be distracted by it being slow. With a large codebase to which over a hundred developers add features and improvements in each release, some performance degradation is more or less inevitable. Since performance is key to customer delight, we started tracking it as Delight Metrics.

We run a Delight Metrics review, a.k.a. a performance review, of our application every week. We track and report performance by modules such as dashboard, ticket list, and configs, and each module is owned by a different development team. Each module has its own set of URL routes. If a module's performance has degraded, one or more URLs in that module have started performing badly. This could be because of new code changes pushed to production in one of the recent deployments, or because of other issues such as infrastructure constraints. The corresponding team debugs the issue and fixes it immediately.

As of now, we track more than 85% of Freshdesk's traffic, which is served by our top 10% of routes. This way, we can keep tabs on our performance and provide a better experience to our customers.

Our tracking method

For each HTTP request, we generate access logs and write them to a file called application.log. Each entry contains details such as the controller and action the request hits, the originating IP, the time taken in components such as the DB, Redis, and Memcache, the number of Redis and Memcache calls made as part of the request flow, and the total duration for processing the request, all recorded as comma-separated values. It also contains details such as the account id and the user behind the request. We rely heavily on Redis and Memcache, so it makes sense for us to log their durations as well. With the help of the Ruby library time_bandits, we calculate the time spent in Redis and Memcache, which helps us debug performance issues.
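As a minimal sketch of the idea (not our production logger), a Rails app can hook into ActiveSupport::Notifications to collect per-request timings and append a comma-separated line to the access log. The initializer file name is hypothetical, and the Redis/Memcache and account fields that time_bandits and the request context provide are only indicated by a comment.

# e.g. in a Rails initializer such as config/initializers/access_log.rb (hypothetical)
ActiveSupport::Notifications.subscribe("process_action.action_controller") do |*args|
  event   = ActiveSupport::Notifications::Event.new(*args)
  payload = event.payload

  fields = [
    Time.now.utc.iso8601,
    payload[:controller],                   # e.g. "TicketsController"
    payload[:action],                       # e.g. "show"
    payload[:status],
    payload[:db_runtime].to_f.round(2),     # time spent in the database (ms)
    payload[:view_runtime].to_f.round(2),   # time spent rendering views (ms)
    event.duration.round(2)                 # total request duration (ms)
  ]
  # The account id, user, originating IP, and the Redis/Memcache durations and
  # call counts from time_bandits would be appended here from the request context.

  File.open("log/application.log", "a") { |f| f.puts fields.join(",") }
end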

We run a logrotate container as a sidecar alongside the app container. Logrotate takes care of pushing the application logs accumulated over the last 15 minutes to a bucket in S3. Baikal, our data lake platform, pulls the application logs from S3, parses them, and ingests them into a table called application_logs.

With that, our application access logs are available in a queryable format for analytics and reporting. We also set an appropriate threshold for each controller#action pair (each route). For example, the Ticket Show API has to respond within 150 ms (milliseconds) for all our customers.

Defining thresholds

Based on our product team's analysis and suggestions, we set the back-end thresholds very carefully. The analysis goes as follows:

There are three main time limits (which are determined by human perceptual abilities) to keep in mind when optimizing web and application performance.

  • 0.1 second is the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
  • 1.0 second is the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary for delays between 0.1 and 1.0 second, but the user does lose the feeling of operating directly on the data.
  • 10 seconds is the limit for keeping the user’s attention focused on the dialogue. During longer delays, users drift to other tasks, so they should be given feedback indicating when the computer expects to finish the task. Feedback during the delay is especially important if the response time is likely to be highly variable since users will then not know what to expect.

(Also read: Powers of 10: Time Scales in User Experience and Best Practices for Response Times and Latency)

Apart from this, we also made the following assumptions while defining thresholds on the server side:

  • We need to add a round-trip time to the URL's threshold to arrive at the final user interaction time. Looking at average AWS latency, and assuming the user is located near one of our 4 data centers, a few hundred milliseconds is an appropriate round-trip time to add to the API response time on the server (see the sketch after this list).
  • UI rendering also takes time, especially when a large number of elements are rendered.
  • Freshdesk operates in a way that the first load might take time, but after that each interaction is more like a diff applied on top of the application already loaded in the browser.
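To make this concrete, here is a rough back-of-the-envelope sketch; every number in it is an indicative assumption, not a measurement:

backend_threshold_ms = 200   # assumed server-side threshold for a frequent action
round_trip_ms        = 200   # assumed network round trip to the nearest data center
render_ms            = 150   # assumed client-side rendering cost

perceived_ms = backend_threshold_ms + round_trip_ms + render_ms
puts "Approximate user-perceived time: #{perceived_ms} ms"
# ~550 ms, which stays within the 1-second flow-of-thought limit above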

Based on these, we categorize the back-end thresholds into four buckets (the numbers are indicative); a sketch of the buckets in code follows the list.

1) Some actions need to seem instantaneous: the user should not feel like they waited at all. For these actions we define the strictest, extremely low thresholds.

2) For actions that happen often and whose completion users wait for impatiently, the delight threshold is 200 ms.

3) For actions that happen often but where the user is aware that a list is being loaded, the delight threshold is around 600 ms.

4) For actions that carry on in the background (not initiated by the user, and with no impact on the user or their workflow), the delight threshold is 1500 ms.
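A minimal sketch of how these buckets could be expressed in code; the bucket names are made up, the 100 ms value for the instantaneous bucket is an assumption derived from the 0.1-second limit above, and the other values come from the buckets just described:

DELIGHT_THRESHOLDS_MS = {
  instantaneous: 100,    # actions that must feel immediate (assumed value)
  interactive:   200,    # frequent actions users wait on impatiently
  list_load:     600,    # frequent actions where a list is visibly loading
  background:    1500    # background work that does not block the user's workflow
}.freeze

# Hypothetical lookup for a route that has been assigned a bucket:
def threshold_for(bucket)
  DELIGHT_THRESHOLDS_MS.fetch(bucket)
end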

Report generation

We have a thresholds table that maps each controller#action pair to its threshold and the module it belongs to.

For example:

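Here is an illustrative sketch of such a mapping; apart from the 150 ms Ticket Show threshold mentioned earlier, the controller names, module names, and values below are hypothetical:

THRESHOLDS = [
  { controller: "TicketsController",   action: "show",  module: "Tickets",     threshold_ms: 150 },
  { controller: "TicketsController",   action: "index", module: "Ticket list", threshold_ms: 600 },
  { controller: "DashboardController", action: "index", module: "Dashboard",   threshold_ms: 200 }
].freeze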

For every action, we aim for 95% of requests to fall within the threshold in order to consider ourselves to be performing well on that action. We calculate this metric using a query like the sample below:

SELECT ROUND(100 * SUM(IF(al.duration > t.threshold, 0, 1)) / COUNT(1), 2) AS delight,
       COUNT(1) AS no_of_requests,
       al.controller,
       al.action,
       t.module AS module
FROM application_logs AS al
JOIN thresholds t ON al.controller = t.controller
AND al.action = t.action
WHERE created_day BETWEEN ${START_DATE} AND ${END_DATE}
  AND status = 200
GROUP BY al.controller,
         al.action,
         t.module;

Our reporting also follows this line. We calculate what percentage of requests fall within the thresholds for each action and report it module-wise by taking the weighted average across the actions mapped to that module, weighted by request count. We use our in-house tool Supreme to generate this report every week. A sample report looks like the one below (the numbers in the screenshot are indicative).

(Screenshot: a sample weekly, module-wise delight report. The numbers are indicative.)

A weighted average shows a dip if two or more actions under it, or a single action with a huge number of requests, perform badly. This way, we can focus on the most important problems.
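As a minimal sketch of this roll-up, assuming per-action rows shaped like the output of the query above (the values are made up):

rows = [
  { module: "Tickets", delight: 97.2, no_of_requests: 120_000 },
  { module: "Tickets", delight: 88.0, no_of_requests: 900_000 }
]

delight_by_module = rows.group_by { |r| r[:module] }.map do |mod, actions|
  total_requests = actions.sum { |r| r[:no_of_requests] }
  weighted       = actions.sum { |r| r[:delight] * r[:no_of_requests] } / total_requests.to_f
  [mod, weighted.round(2)]
end.to_h
# => { "Tickets" => 89.08 }: the single heavy action drags the module's number down.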

For example, the last module performed badly in the first two weeks of June: its weighted averages were only 88.73 and 88.81 respectively, as shown in the screenshot above. So we had to examine the module to see which actions were performing badly and fix the issues associated with them.

Supreme also provides us with weighted-average metrics in daily reports, which helps us find out how long an issue has existed whenever there is a performance dip. Apart from this, Supreme provides various high-level metrics such as the weighted average of all modules over the last 90 days and the last 28 days, which gives us an overview of our overall performance over the last month and the last three months. We get region-wise metrics as well, as it is important to find out the regions where customers could be facing slowness in case of any issues.

In this way, we are able to detect performance issues and act upon them sooner. This is a win-win for our customers and ourselves. It helps us keep our infrastructure in check: we don't need to add infrastructure, and thereby incur more cost, because of performance issues. At the same time, we are not merely putting Band-Aids on performance issues brought up by customers; we act proactively, which keeps our app's performance good in the long run. This helps us develop an internal culture of being mindful of performance even while developing products, a means to winning great customers and having them for life.

Should a report show a dip in performance, how do we debug where the problem is? We will let you know soon.