Caching ActiveRecord Objects Efficiently
[Ruby on Rails is a great web application framework for startups to see their ideas evolve into products quickly. It’s for this reason that most products at Freshworks are built using Ruby on Rails. Moving from startup to scaleup means constantly evolving your applications so they can keep pace with your customers’ growth. In this new Freshworks Engineering series, Rails@Scale, we talk about some of the techniques and patterns that we employ in our products to tune their performance and help them scale up.]
Freshservice is a cloud-based IT help desk and service management solution that enables organizations to simplify IT operations. Freshservice provides ITIL-ready components that help administrators manage Assets, Incidents, Problems, Change, and Releases. The Asset Management component helps organizations exercise control over their IT assets.
Freshservice is a SaaS product powered by Ruby on Rails. It is backed by a MySQL database for persistent storage and uses Memcached and Redis extensively for caching and config storage. Commonly used data types such as String, Integer, Hash, PORO, and ActiveRecord objects are cached for different use cases. For example, data that is read often but updated rarely, such as tenant preferences and custom form fields, is cached in Memcached for fast access.
Freshservice was initially built as a monolith, meaning there were no boundaries between distinct functionalities in the codebase. As we built more functionality, the codebase grew. To organize the code better, we started refactoring the codebase into components, which meant changing the namespace of classes. In simple terms, a class named OldModule::OldClass was changed to either NewModule::NewClass or NewModule::OldClass. It worked out well, but we learned some lessons about caching ActiveRecord objects along the way.
Once we deployed the changes to our non-live servers, we started seeing Dalli::RingError: No server available on every Memcached#get call for internal traffic. At first, it looked like our Memcached servers were down, but we confirmed that they were up and running. Looking at the client library code that raised this exception, it became clear that the Dalli client had assumed, for whatever reason, that the server was down. We had to hold off rolling out this change to production until we found the root cause. The issue went undetected in testing because our tests usually start with a clean cache.
Note: We were using the Dalli gem (version 2.7.8) as the client for accessing the Memcached servers.
It took a while to drill down into the problem, primarily for two reasons. Firstly, none of the refactored classes were intentionally cached. Secondly, seemingly random reads from the cache were failing, and it took us a while to identify any pattern. After a deep dive into the gem’s code, we noticed that once the failure count exceeds the socket_max_failures setting, the client assumes the server is unavailable for down_retry_delay seconds and makes no further calls to the Memcached servers during that time.
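The back-off behavior described above can be pictured with a small toy model. This is a simplified sketch and not Dalli's implementation: the ToyServer class, its methods, and the choice to mark the server down once the failure count reaches the limit are all assumptions made for illustration. Only the two option names come from the text. Time is passed in explicitly so the example runs without sleeping.

```ruby
# Toy model of a client that stops contacting a server after repeated
# failures. NOT Dalli's code; only the option names mirror the text above.
class ToyServer
  attr_reader :calls

  def initialize(socket_max_failures:, down_retry_delay:)
    @socket_max_failures = socket_max_failures
    @down_retry_delay = down_retry_delay
    @fail_count = 0
    @down_at = nil
    @calls = 0 # how many requests actually reached the "server"
  end

  # The server counts as down until down_retry_delay seconds have passed.
  def alive?(now)
    return true if @down_at.nil?
    return false if now < @down_at + @down_retry_delay

    @down_at = nil # delay elapsed: allow another attempt
    @fail_count = 0
    true
  end

  def request(now)
    raise "No server available" unless alive?(now)

    @calls += 1
    yield
  rescue IOError
    @fail_count += 1
    @down_at = now if @fail_count >= @socket_max_failures
    raise
  end
end

server = ToyServer.new(socket_max_failures: 2, down_retry_delay: 60)
2.times do
  begin
    server.request(0) { raise IOError, "simulated failure" }
  rescue IOError
    # swallowed for the demo
  end
end

puts server.alive?(30) # still inside the retry delay
puts server.alive?(61) # delay has elapsed
```

In this model, every request made while the server is marked down fails immediately without touching the server, which matches the symptom of all reads failing even though Memcached itself was healthy.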
We next looked at the second problem, the random cache read errors. After manually analyzing some raw cached objects, we found that in some cases the refactored classes had been cached. Deserializing those objects raised NameError, since the classes no longer existed under the old namespace. Dalli handled this error in a peculiar way: rather than surfacing it as a NameError, it treated it like a network error.
We now knew the issue, but how were those classes serialized in the first place? Let’s discuss that with a sample app. This app has a few model classes: Tenant, DomainConfig, Subscription, Ticket, and Tag. Ours is a multi-tenant app, so all the classes have a belongs_to relation with Tenant, and most of the models are loaded through the Tenant association.
Check out the association_cache tag and run Script 1 in the Rails console. This will warm up the cache with a few objects. We load a few associations of the Tenant object before we cache DomainConfig.
irb> tenant1 = Tenant.first
irb> tenant1.subscription # Load tenant subscription
irb> tenant1.clear_domain_config_cache # Clear from cache if already exists
irb> tenant1.domain_config_from_cache # Load DomainConfig through cache
Now check out the association_cache_refactor tag and load data from the cache using Script 2. In this tag, we refactored the Subscription class to Billing::Subscription (compare association_cache vs association_cache_refactor). This will load the DomainConfig object cached by Script 1.
Tenant.first.domain_config_from_cache # Load DomainConfig through cache
You can see the error message Unexpected exception during Dalli request: NameError: uninitialized constant Subscription.
From the error message, it is clear that the Subscription object was serialized along with the DomainConfig, even though we never intended to cache it. This is because any loaded association is stored in the @association_cache instance variable, which is serialized when calling Marshal.dump(object) (Marshal is the default serializer in Dalli). You can load other associated objects before caching and check the serialized data.
Note: Dalli fixed this error handling in PR #728. The fix is available from version 2.7.11, which raises an appropriate error message instead.
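The effect can be reproduced without Rails or Memcached. The sketch below uses plain-Ruby stand-ins for the models and a hand-rolled @association_cache hash, so it only approximates what ActiveRecord does. The back-reference from DomainConfig to its tenant is an assumption made to mirror the behavior described above. Removing the Subscription constant simulates the rename. Note that plain Ruby reports the missing class as an ArgumentError from Marshal.load, whereas the Rails app in this post surfaced it as a NameError.

```ruby
# Plain-Ruby stand-ins for the models; this is not ActiveRecord code.
class Subscription
  def initialize(plan)
    @plan = plan
  end
end

class Tenant
  def initialize(name)
    @name = name
    @association_cache = {}
  end

  # Mimics a loaded association being remembered on the owner.
  def subscription
    @association_cache[:subscription] ||= Subscription.new("pro")
  end
end

class DomainConfig
  def initialize(tenant, domain)
    @domain = domain
    # Assumed back-reference to the owner, kept like a loaded association.
    @association_cache = { tenant: tenant }
  end
end

tenant = Tenant.new("acme")
tenant.subscription # load the association before caching
config = DomainConfig.new(tenant, "acme.example.com")

# Marshal walks every instance variable, so the tenant and its
# subscription are written into the payload as well.
PAYLOAD = Marshal.dump(config)
puts PAYLOAD.include?("Subscription")

# Simulate the rename: the old constant is gone when the value is read back.
Object.send(:remove_const, :Subscription)
LOAD_ERROR =
  begin
    Marshal.load(PAYLOAD)
    nil
  rescue ArgumentError, NameError => e
    e
  end
puts LOAD_ERROR.class
puts LOAD_ERROR.message
```

The DomainConfig was the only object handed to Marshal.dump, yet the payload cannot be read back once Subscription is gone, which is the same shape of failure as in the sample app.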
To avoid serializing the associated objects, we should make sure that @association_cache is cleared from the tenant object before calling Cache#set. But clearing @association_cache may cost extra DB queries later in the same transaction, when those associations are needed again. To avoid this penalty, we went ahead with loading the ActiveRecord object directly.
Instead of fetching DomainConfig as tenant.domain_config, we fetch it as DomainConfig.find_by(tenant_id: tenant_id). This way, no other association is attached to the DomainConfig object, and there is no penalty query for the tenant’s other associations. The downside is that even if the association was already loaded on the tenant, we still query the DB. Since this happens only on a cache miss, we accepted this trade-off over caching all the loaded associations.
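To make the difference concrete, here is a rough comparison of the two loading strategies using plain-Ruby stand-ins. Nothing here is ActiveRecord: the find_by class method is a hand-written imitation, the tenant back-reference set in Tenant#domain_config is an assumption standing in for a loaded association, and the two helper methods are hypothetical names. The sketch only shows how the serialized payload changes.

```ruby
# Plain-Ruby stand-ins; the real models are ActiveRecord classes.
class Subscription
  def initialize(tenant_id)
    @tenant_id = tenant_id
    @plan = "pro" * 50 # padding so the size difference is visible
  end
end

class DomainConfig
  attr_reader :tenant_id, :domain
  attr_accessor :tenant

  def initialize(tenant_id, domain)
    @tenant_id = tenant_id
    @domain = domain
  end

  # Hand-written imitation of DomainConfig.find_by(tenant_id: ...):
  # returns a fresh object with no link back to the tenant.
  def self.find_by(tenant_id:)
    new(tenant_id, "tenant#{tenant_id}.example.com")
  end
end

class Tenant
  attr_reader :id

  def initialize(id)
    @id = id
  end

  def subscription
    @subscription ||= Subscription.new(id)
  end

  # Loading through the tenant links the config back to it (assumed here),
  # and therefore to everything the tenant has already loaded.
  def domain_config
    @domain_config ||= DomainConfig.find_by(tenant_id: id).tap { |c| c.tenant = self }
  end
end

# Hypothetical helper: size of the payload when loading via the association.
def serialized_size_via_association
  tenant = Tenant.new(1)
  tenant.subscription
  Marshal.dump(tenant.domain_config).bytesize
end

# Hypothetical helper: size of the payload when loading the record directly.
def serialized_size_via_direct_load
  tenant = Tenant.new(1)
  tenant.subscription
  Marshal.dump(DomainConfig.find_by(tenant_id: tenant.id)).bytesize
end

puts serialized_size_via_association
puts serialized_size_via_direct_load
```

The directly loaded object serializes to a smaller payload and contains no reference to Subscription, so renaming that class cannot break reads of the cached DomainConfig.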
We made this change across the entire codebase and introduced a new namespace for Memcached keys, so that instances running the newer build warm up the cache separately. Values cached in the new namespace do not have the association objects serialized. We can also clear all keys from the old namespace once the new build is deployed on all instances.
To verify that this solution works, check out the ar_direct_load tag and run Script 1 again. This will warm up the cache in a different namespace (compare association_cache vs ar_direct_load). Then check out ar_direct_load_refactor and run Script 2. This time we won’t get a NameError for the Subscription class, because it is not serialized to the cache.
We first deployed the change to load ActiveRecord objects directly. We observed the system for a week after this deployment and saw that the Memcached#get query time dropped significantly.
Note: The change in query time depends on the size of the unnecessary association objects that were previously loaded and cached.
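The namespace switch can be sketched as a key-building helper. The cache_key method, the "v1"/"v2" namespace labels, and the Hash standing in for Memcached are all assumptions for illustration. The post does not show how keys are actually built or how old keys are removed from Memcached.

```ruby
# Hypothetical key builder: the namespace changes whenever the shape of
# the cached values changes, so old and new builds never share entries.
def cache_key(namespace, name, tenant_id)
  "#{namespace}:#{name}:#{tenant_id}"
end

# A Hash stands in for Memcached in this sketch.
store = {}
store[cache_key("v1", "domain_config", 42)] = "old payload (with associations)"

# The new build reads and writes only under its own namespace, so it sees
# a cache miss and warms the cache with the new, smaller value.
new_key = cache_key("v2", "domain_config", 42)
store[new_key] ||= "new payload (record only)"

# Once every instance runs the new build, old-namespace entries can go.
store.delete_if { |key, _value| key.start_with?("v1:") }

puts store.keys.inspect
```

Because the two builds never read each other's keys, instances on the old build keep working with their old values during the rollout, while new instances never deserialize a payload that contains stale class names.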
Below are the screenshots of a few transactions showing a reduction in query time.
We cleared all the keys of the old namespace from Memcached a week after the first deployment. Comparing the item size before and after, we saw a reduction of around 50% in the average item size.
From these observations, it is evident that loading ActiveRecord objects directly caches only the required object. This reduces the size of the cached object, which in turn reduces the time taken for serialization and deserialization. Together, these reductions bring down the latency of Memcached get calls and improve the performance of the whole application. One shouldn’t blindly cache an object on the assumption that doing so will improve performance. We should evaluate the use case and validate that the data really needs to be cached. A cache is meant to speed up retrievals and shouldn’t be detrimental to overall performance.
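One way to check what would travel along with an object before caching it is to walk its instance variables. The helper below is a hypothetical plain-Ruby diagnostic, not something from the sample app or from Dalli. The Plan and Account classes are made-up examples. It follows instance variables, arrays, and hashes only, so it approximates rather than reproduces what Marshal.dump encodes.

```ruby
require "set"

# Collects the names of all classes reachable from obj through instance
# variables, arrays, and hashes. Cycles are handled via the seen set.
def reachable_class_names(obj, seen = Set.new, names = Set.new)
  return names if seen.include?(obj.object_id)

  seen << obj.object_id
  names << obj.class.name.to_s

  case obj
  when Array
    obj.each { |element| reachable_class_names(element, seen, names) }
  when Hash
    obj.each do |key, value|
      reachable_class_names(key, seen, names)
      reachable_class_names(value, seen, names)
    end
  end

  obj.instance_variables.each do |ivar|
    reachable_class_names(obj.instance_variable_get(ivar), seen, names)
  end
  names
end

# Made-up example classes for the demo.
class Plan
  def initialize
    @name = "pro"
  end
end

class Account
  def initialize
    @association_cache = { plan: Plan.new }
  end
end

puts reachable_class_names(Account.new).to_a.sort.inspect
```

If a class name shows up in this list that was never meant to be cached, that is a hint the object is carrying more than intended.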