How feature toggles allow us to experiment at scale

[Ruby on Rails is a great web application framework for startups to see their ideas evolve to products quickly. It’s for this reason that most products at Freshworks are built using Ruby on Rails. Moving from startup to scaleup means having to constantly evolve your applications so they can scale up to keep pace with your customers’ growth. In this new Freshworks Engineering series, Rails@Scale, we will talk about some of the techniques and patterns that we employ in our products to tune their performance and help them scale up.]

In the early days of a startup, rapid development of product enhancements and features is relatively easy. As a company grows, while the need for speed persists, demand for stability increases considerably.

At Freshworks, as we continue to churn out newer features for our products, the code and our database schemas have also evolved. When we had just a handful of customers, performing an online schema migration to manage incremental changes in product features was quite straightforward, irrespective of the data store and its version. As the number of customers and the data we were trusted with increased, we started using the LHM gem to handle our migrations.

Schema migrations are a small part of the equation when it comes to how a startup product evolves. In our early days, we made frequent changes to our codebase and modules and, at times, needed to change the way we stored data.

That’s why, about six years ago, we started using ‘feature toggles’—a technique that allows for a feature being developed or tested to be enabled or disabled for specific accounts—whenever we needed a large set of data to be migrated along with a code change. The code would work only with the new data model (data in a different column/table).

Even when migrations are not involved, large features are best delivered through continuous deployment. Rolling out a feature gradually behind a feature toggle greatly reduces risk and makes it easier to catch bugs early.

We already had a solution baked into the code: our gem that populated features based on an account’s subscription plan. We quickly jumped on the feature toggle bandwagon, and it is now integral to how we think about software development.

Freshdesk, Freshworks’ flagship product, is a multi-tenant app where we treat each account created by our customer as a tenant. When we want to enable a feature, it is always for an account and never for a single user within that account. If ever it is for a single user, we let that be controlled as a preference for that user. What an account can have and do will be based on the subscribed plan and the experiments we as developers enable for it.

Why we switched from our original feature flipper gem

When an experiment is ready for testing in production, we try it out with our test and demo accounts before proceeding to our own support account. Slowly, we release the feature to a small, handpicked set of accounts, wait a while to ensure stability, and then open the gates for all the accounts.

If one customer reports an issue with the new feature, we roll it back for that one account. If more customers report similar issues, we immediately roll back the feature for everyone.

Since the main goal was experimentation with rapid iteration, all supporting infrastructure needed to be simple and fast. Our earlier plan-based feature gem required looping through all accounts to enable an experimental feature one by one, which is not a recipe for rapid iteration. More importantly, reverting a feature had to be quick.

Removing code without cleaning up the features mapped to an account was also not possible due to the way the gem worked.

At times, we resorted to global flags in Redis to toggle a feature completely. We quickly realized that this was polluting both the Redis server and our code with spaghetti keys and checks.

Welcome to the LaunchParty

We wanted a simpler means to manage the feature toggles. We wanted the API for this to be straightforward and did not want to introduce loads of documentation in the code or the developer onboarding process.

Before even implementing the gem, we designed how it was supposed to work within the product.

We named this gem ‘LaunchParty’.

Most important are the following APIs to control:
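As a rough illustration only, an account-scoped toggle API of this kind could look like the sketch below. The class name `ToggleRegistry`, its method names, and the in-memory storage are all assumptions made for this example and are not LaunchParty’s actual interface.

```ruby
require 'set'

# Hypothetical account-scoped toggle API. All names here are illustrative
# assumptions, not LaunchParty's real interface.
class ToggleRegistry
  def initialize
    @per_account = Hash.new { |hash, key| hash[key] = Set.new }
    @everyone = Set.new
  end

  # Enable a feature for a single account.
  def launch(account_id, feature)
    @per_account[account_id].add(feature.to_sym)
    self
  end

  # Enable a feature for every account at once.
  def launch_for_everyone(feature)
    @everyone.add(feature.to_sym)
    self
  end

  # Disable a feature for a single account. A feature launched for
  # everyone stays on until it is rolled back globally.
  def rollback(account_id, feature)
    @per_account[account_id].delete(feature.to_sym)
    self
  end

  # Is the feature on for this account?
  def launched?(account_id, feature)
    feature = feature.to_sym
    @everyone.include?(feature) || @per_account[account_id].include?(feature)
  end

  # Everything currently launched for an account, in a stable order.
  def launched_features(account_id)
    (@per_account[account_id] | @everyone).to_a.sort
  end
end
```

At a call site, a check would then read something like `registry.launched?(account.id, :new_ticket_view)`, where `account` and the feature name are again placeholders.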

Whenever we are on the cusp of releasing new features, we get requests from our friends in the non-developer teams—product managers/sales and support teams—to be involved. To enable them to quickly launch an experimental feature for a customer account, we added an option in our internal admin tool to launch a feature from a whitelisted set of experiments or features that can be toggled.
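A whitelist guard like the one described could be sketched as follows. `AdminLauncher`, `NotWhitelisted`, and the block-based launcher are hypothetical names chosen for this example; the real admin tool is not shown in this article.

```ruby
require 'set'

# Hypothetical guard for an internal admin tool: only features on an
# explicit whitelist may be launched from it.
class AdminLauncher
  class NotWhitelisted < StandardError; end

  # whitelist: feature names that non-developer teams may toggle.
  # launcher:  block that performs the actual launch (an assumption here).
  def initialize(whitelist, &launcher)
    @whitelist = whitelist.map(&:to_sym).to_set
    @launcher = launcher
  end

  def launch(account_id, feature)
    feature = feature.to_sym
    unless @whitelist.include?(feature)
      raise NotWhitelisted, "#{feature} cannot be launched from the admin tool"
    end
    @launcher.call(account_id, feature)
  end
end
```

Keeping the whitelist separate from the toggle store means developers can still launch non-whitelisted experiments through their own tooling.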

For experimental releases that are more about non-functional improvements, developers have full control over the rollouts. Other teams are kept informed. We have made sure that at any point the developers would be able to list all the features launched for a given account.

Not every feature needs to be backed by this protocol, which we now call the LaunchParty check. There is a subtle distinction between when to use it and when not to.

For features and experiments that would impact a vast majority of our customers or alter any of our core functionalities such as Ticket Automations, LaunchParty would be preferred. For new features that would have minimal impact on the existing base, and for which we want to push the code exclusively to our test and demo accounts in production environments, we prefer going with plan features.

Though not a major chunk, a few features did meet these criteria. We remain wary of overusing plan features this way, as we still would not have a quick way to roll back a feature if it came to that.

The core process

We used Redis’ built-in data structures to handle most of the operations related to this.

The core method for launching an experiment looks like this:

On toggling an experiment, we make the changes in two different sets—an account-wise set and a feature-wise set.

When we fetch the list of experiments for an account, we read the account-wise set along with another set that holds the experiments launched for all accounts. Since reads far outnumber every other operation here, it is important to keep the number of Redis calls to a bare minimum, which currently stands at two.

When we retire or roll back a toggle completely, we use the feature-wise set to make sure we clean it up for every account involved. Since rollbacks are infrequent, iterating through the list is acceptable here.
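The set layout described above can be sketched roughly as follows. This is not the gem’s code: the key names, `ExperimentStore`, and its methods are assumptions, and `FakeRedis` is an in-memory stand-in whose methods mirror the Redis commands SADD, SREM and SMEMBERS so that the example runs without a server.

```ruby
require 'set'

# In-memory stand-in for the three Redis set commands used below.
# It also counts calls so the cost of a read is visible.
class FakeRedis
  attr_reader :calls

  def initialize
    @sets = Hash.new { |hash, key| hash[key] = Set.new }
    @calls = 0
  end

  def sadd(key, member)
    @calls += 1
    @sets[key].add(member)
  end

  def srem(key, member)
    @calls += 1
    @sets[key].delete(member)
  end

  def smembers(key)
    @calls += 1
    @sets[key].to_a
  end
end

# Hypothetical store; key and method names are illustrative assumptions.
class ExperimentStore
  EVERYONE_KEY = 'launched:everyone'

  def initialize(redis)
    @redis = redis
  end

  # Toggling on writes to both the account-wise and the feature-wise set.
  def launch(account_id, feature)
    @redis.sadd(account_key(account_id), feature)
    @redis.sadd(feature_key(feature), account_id)
  end

  def launch_for_everyone(feature)
    @redis.sadd(EVERYONE_KEY, feature)
  end

  # The hot path: two reads, one per set.
  def launched_for(account_id)
    (@redis.smembers(account_key(account_id)) | @redis.smembers(EVERYONE_KEY)).sort
  end

  # Retiring walks the feature-wise set to clean up every account involved.
  def retire(feature)
    @redis.smembers(feature_key(feature)).each do |account_id|
      @redis.srem(account_key(account_id), feature)
      @redis.srem(feature_key(feature), account_id)
    end
    @redis.srem(EVERYONE_KEY, feature)
  end

  private

  def account_key(account_id)
    "launched:account:#{account_id}"
  end

  def feature_key(feature)
    "launched:feature:#{feature}"
  end
end
```

The feature-wise set is never touched on the read path; it exists only so that a retirement can find the affected accounts without scanning all of them.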

Regular rapid releases

Though we adopted feature toggles a little before adopting the agile methodology, the full extent of the benefits of the latter soon became clearer.

As an evolving product, it was important to get each new iteration of the product out to users quickly. We wanted to be able to release right after every user story. LaunchParty, and feature toggles in general, help us adopt the ‘Release early, release often’ strategy in two ways:

  1. Increasing our confidence in the stability of the releases, which is as important as maintaining our momentum; and
  2. Getting minimum viable versions of the feature out to our customers, enabling us to gather regular and early feedback.

In software development, the smaller a pull request is, the easier it is to review, test and release it. But features and enhancements cannot always be built with smaller pull requests.

Multiple user stories, probably developed by more than one developer, combine to deliver a feature or an epic. Separately, these user stories would not add much real value to the end-users. With feature toggles in place, it is easier for teams to hide behind the checks and rapidly iterate to deliver a feature in the most impactful manner, and yet be nimble enough to course correct whenever necessary.

In the development phases, these feature checks are launched for most accounts in staging environments, and just for a handful of internal accounts in the production environment. Not all developers involved are given full control of the release process and its timing, especially when dealing with a large monolithic application. But when a feature goes live, it can very well be controlled by teams owning that epic.

Operating at scale

Feature checks                                          Average
New feature checks introduced every month               5
Number of feature checks at any given time              125
Requests handled per minute (web and background jobs)   ~70,000

This does not take into consideration the checks in our frontend app, which has its own set.

Avoiding potential pitfalls

While the use of feature toggles is an absolute necessity for experimentation at scale, there are a few things to take care of so that you don’t end up buried in tech debt.

These checks should always be short-lived. Because they are so easy to use, their number can grow rapidly. Every toggle you introduce is tech debt that needs to be paid off.

Hiding a feature behind a check does not mean it is also safe from a security standpoint. Never let toggles get in the way of strong and secure coding principles.

While we do have some tech debt on this front, we now add a technical story to remove LaunchParty checks and similar toggles, and treat it as a gating item before we close an epic and move to the next one.

When you have elaborate setups for pre-production environments, it is quite easy to forget to enable there a feature that is already available to all accounts in production, so quality assurance ends up testing without it.
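One way to catch this kind of drift is a small parity check between environments. The sketch below is only an illustration; `missing_toggles` and the environment names are hypothetical, and how each environment’s list is fetched is left out.

```ruby
# Hypothetical parity check: for each pre-production environment, report the
# features that are launched for everyone in production but not enabled there.
#
# production:   array of feature names launched for all accounts in production
# environments: hash of environment name => array of enabled feature names
def missing_toggles(production, environments)
  environments.each_with_object({}) do |(name, features), report|
    missing = (production - features).sort
    report[name] = missing unless missing.empty?
  end
end
```

For example, `missing_toggles(%w[fast_search new_ticket_view], { 'staging' => %w[fast_search] })` reports that `new_ticket_view` is missing in `staging`.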

Complexity grows in direct proportion to the number of feature checks. While plan-based feature checks are necessitated by business demands, experimentation toggles are entirely at your discretion.

As they did for Freshdesk and Freshworks, feature toggles or LaunchParty checks can help developers get changes out faster.

Unless they are used wisely, however, the testing, stability and maintenance of an application will all suffer.