How we cracked the code of everyday deployment

Freshworks is a company that lives and breathes SaaS. We are blessed with customers who were mostly born in the age of the Internet and expect us to roll out updates at a constant clip: problems must be fixed shortly after they are identified, and gratifying feature advances must ship regularly.

When I was building on-premise enterprise products, users reluctantly accepted waiting 6-12 months for a version update, especially if they could not negotiate receiving a fix-pack sooner. These days, however, the expectations are a lot different.

I belong to the engineering team within Freshworks that builds a Developer Platform, enabling developers around the world to craft integrations, productivity apps, and completely new experiences within the Freshworks products. Our primary product is the Freshworks Marketplace. This post is the story of how we matured into a team that deploys as often as every day.


Deployments by the team over a period of 18 months

Coming to Life

The Developer Platform took its baby steps as a branch off our marquee product, Freshdesk. We were no more than just another squad in Freshdesk engineering. While most of our customer-facing code lived within the Freshdesk codebase, the developer-facing code stayed independent. Even so, our agility and pace could not be separated from Freshdesk’s. We were happy to ride on Freshdesk’s coat-tails, watch our first few users build apps, and revolutionize their Freshdesk experience.

As Freshworks started building more products, newer products also wanted in on the apps strategy. Before we knew it, we had to transform into a multi-product platform ourselves. We had to develop the ability to move faster and serve multiple products. As new use-cases opened up, app developers demanded a rapid cadence. We had to find a way to deploy fast and keep up with the demand from within and without.

The Bottlenecks

Because hindsight is 20/20, we can conveniently look back and identify the obstacles that stood in our way at the time. These bottlenecks did not reveal themselves to the team all at once; they unraveled over time. We believe that most teams attempting the journey towards everyday deployments will run into similar challenges:

  • Code Integration Workflows: How does source code move from an engineer’s laptop to production, and what are the puddles and pit stops it must navigate to get there?
  • Automation: What proportion of the steps involved in moving code — especially deploying to test or production environments — is automated?
  • Culture: Does everyone feel empowered to make calculated decisions and deploy code whenever they feel confident?
  • Architecture: How loosely coupled is your architecture, and what degrees of freedom does it afford your engineers?

As a team that drives 24×7 businesses, we have an added responsibility — move fast, but do not break things.

Let us pick through each of the bottlenecks:

Code Integration Philosophy

Like almost everyone in the industry, we used to follow the widely popular GitFlow feature-branching philosophy as the basis of our integration workflow. In brief, GitFlow prescribes building and testing changes in a feature branch that is merged into a stable branch such as ‘develop’ when appropriate. After enough features are merged and integrated, a release is cut by merging develop into master.

GitFlow presented a couple of challenges for our team:

  • Branch Test Environments: Each feature branch had to be deployed and tested in its own test environment. The number of features we could test concurrently was a function of how many environments we could manage, not of how many engineers were available. This was a challenge, given that our environments were mostly an extension of Freshdesk, but we were not really building Freshdesk. Keeping the environments functional was a constant game of catch-up, and setting them up from scratch each time was too cumbersome.
  • Painful Merges: With features being rolled out at a rapid clip, merges into the stable branch (called pre-staging) were always riddled with conflicts. Resolutions often took too long or led to regressions. To minimize regressions, we ended up double-testing features in pre-staging. This created back-pressure, threatening to slow developers down.


Three independent features merging toward production using our version of GitFlow

As we slowly pulled away from Freshdesk, we were free to choose a philosophy that suited our team better. Our platform was primarily built on a service-oriented architecture (as opposed to a monolith), and our team was far smaller than Freshdesk’s. We realized we had the advantage of a low center of gravity (usually termed agility in software engineering parlance) and decided to adopt Three-Flow.

Three-Flow simplifies the GitFlow branching strategy, optimizing for identifying merge conflicts significantly earlier in the development process.


Features merging toward production using Three-Flow


Three-Flow promptly imparted the following benefits:

  • Single Testing Environment: We needed to manage only a ‘develop’ environment to test all changes committed by engineers.
  • Immediate Conflict Resolution: Instead of learning of conflicts after a test cycle is complete, engineers identify potential conflicts as soon as they begin to commit to ‘develop’.

Three-Flow does require greater developer discipline in terms of predicting conflicts and allowing code to ship to production behind feature toggles. A lack of discipline has to be compensated for with over-communication and a careful shipping dance for the commits lined up. In our experience, this was still a lot smoother and less of a hassle than what we used to have.
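As an illustration, a toggle can be as simple as an environment-driven switch, so an unfinished code path can merge into ‘develop’ and ship to production while staying dark. Here is a minimal Python sketch (the feature name and helper function are hypothetical, not lifted from our codebase):

    import os

    def is_enabled(feature: str) -> bool:
        """Read the toggle state from the environment, defaulting to off."""
        return os.getenv(f"FEATURE_{feature.upper()}", "off") == "on"

    def render_app_gallery() -> str:
        # The new gallery is merged and deployed, but stays dormant until the toggle flips.
        if is_enabled("new_app_gallery"):
            return "new gallery"
        return "legacy gallery"

    if __name__ == "__main__":
        # Prints "legacy gallery" unless FEATURE_NEW_APP_GALLERY=on is set.
        print(render_app_gallery())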

Tooling and Automation

Because a majority of the work involved in testing and deployment was manual, we struggled to prepare changes for deployment and then ship them as soon as they were ready. Fortunately, it turned out that there is a tipping point in the degree of automation beyond which frequent shipping becomes possible. In our experience, the following factors were essential to reach that tipping point:

  • Ship baby ship! This is the name of the stage in each of our Jenkins pipelines that deploys the changes made to a single microservice to an associated environment, including production. It had to be simple and reliable for anyone on the team to release a change.
  • Configuration as code: Too much of our configuration was explicit and outside git, so much so that a large fraction of deployments required manual setup before ‘Ship baby ship!’ could run. We began pulling configuration back into each microservice’s code repository, except for global environment settings and secure parameters (those belonged in more secure key management stores). Eventually, most deployments were completely automated. A sketch of the resulting split appears after this list.
  • Regression automation: Changes that make it to the staging environment must quickly find their way to production. Once most of our suites crossed 85% test-case automation, we found ourselves pushing changes in and out of staging within hours. The catch was maintaining that level of coverage even as new features continued to roll out.
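To make the split concrete, here is a hedged Python sketch (the file layout and keys are hypothetical): non-sensitive settings are versioned inside the microservice’s repository and travel with every deployment, while secure parameters are injected at deploy time from a key management store.

    import json
    import os
    from pathlib import Path

    def load_config(service_root: str, environment: str) -> dict:
        """Combine versioned, per-environment settings with secrets injected at deploy time."""
        # Non-sensitive settings live in the service's own repository,
        # e.g. <service_root>/config/staging.json (hypothetical layout).
        config_file = Path(service_root) / "config" / f"{environment}.json"
        config = json.loads(config_file.read_text())

        # Secure parameters never live in git; the pipeline injects them into the
        # process environment from a key management store.
        config["db_password"] = os.environ["DB_PASSWORD"]
        return config

With everything the pipeline needs either in the repository or in the environment, a stage like ‘Ship baby ship!’ no longer depends on manual setup.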


Stages in an individual deployment pipeline in our Jenkins server

Once a team gets used to automated deployments, every new service starts rolling out with a fully automated pipeline, leveraging existing templates.

Architecture and Degrees of Freedom

When a change is ready to be deployed to production, what typically stands in its way is conflict: another change vying to be deployed that could potentially disrupt the original one. These conflicts are usually resolved by identifying a sequence in which the associated changes can be rolled out without disruption. This, of course, comes at the cost of holding up your release pipeline and applying back-pressure. Productive engineers dislike back-pressure.

Modular architecture


When changes are separated by modules or components, or better still, by service boundaries, the probability of conflict is significantly diminished. The more granular your codebase and deployment architecture, the smaller the chance of two changes conflicting with each other. This does not imply that a highly fragmented architecture is necessarily a good thing: comprehending, managing, and implementing changes becomes increasingly challenging as fragmentation increases.

As mentioned earlier, we were fortunate to have started with a service-oriented architecture. It allowed us to plan changes so that they were distributed across our services and shipped independently whenever possible.

Coupling

We did have another set of problems stemming from coupling with a Freshworks product that consumes our services. Unfortunately, our initial architecture was too tightly coupled with Freshdesk. This mandated deploying in lock-step with Freshdesk, and we had become used to that. As we turned our eyes to onboarding more Freshworks products, deploying in lock-step across multiple products was clearly a non-starter.

We consciously decided to undertake an architectural cleansing along the many interfaces the platform has to Freshworks products in production. This has been a mammoth undertaking (we aren’t done yet), but each decoupled interface or service has afforded us an additional degree of freedom. With each decoupling, we are able to deploy more often, whenever we wish to, rather than coordinating with a deployment from a Freshworks product (or worse, multiple Freshworks products).
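To make the pattern concrete, here is a hedged Python sketch (the class and method names are hypothetical, not our actual interfaces): the platform codes against a narrow contract, and each product supplies its own adapter, so either side can deploy on its own schedule as long as the contract holds.

    from abc import ABC, abstractmethod

    class ProductGateway(ABC):
        """Narrow contract the platform depends on; each product implements it independently."""

        @abstractmethod
        def fetch_installed_apps(self, account_id: str) -> list[str]:
            ...

    class FreshdeskGateway(ProductGateway):
        """Product-specific adapter; it can evolve and deploy without touching the platform."""

        def fetch_installed_apps(self, account_id: str) -> list[str]:
            # A real adapter would call the product's API; stubbed here for illustration.
            return ["slack-notifier", "crm-sync"]

    def list_apps(gateway: ProductGateway, account_id: str) -> list[str]:
        # Platform code is written against the contract, not a concrete product.
        return gateway.fetch_installed_apps(account_id)

    if __name__ == "__main__":
        print(list_apps(FreshdeskGateway(), "acct-42"))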

Investing in a sufficiently modular and decoupled architecture happens to be a critical piece in the puzzle of getting to deploy every day.

People and Culture

Among all the factors that enable frequent deployments, a culture of finding joy in shipping changes is probably the most important, yet most neglected. A lot can be achieved by a group of engineers driven by a singular mindset to deploy as often as they can. Over-communication resolves conflicts, stringent checklists stand in for missing automation, and smart risk-taking replaces over-indexing on end-to-end testability.


This kind of regular hustle would, of course, lead to fatigue if the majority of the mundane work is not eventually automated and recurring hurdles are not addressed. That makes a team and its culture a necessary but insufficient condition for sustained deployment nirvana.

A culture of deploying as often as possible demonstrates itself in myriad ways and cannot be simply prescribed. Here are some examples of how it panned out in our team.

  • Bite-sizing: This mindset is probably the game-changer. Bite-sized chunks are independently deployable changes that will integrate into production in a future deployment. Large, end-to-end integrated work items lead to elongated test cycles, apply back-pressure on adjacent changes, and end with painful, big-bang deployments. An approach that breaks a large work item into bite-sized chunks, on the other hand, is relatively hassle-free and moves faster. Over time, the team learns the nuances of bite-sizing: deploying interfaces separately from implementations, shipping unreachable code, and hiding features behind a toggle.
  • Checklists: Each deployable change is associated with a documented checklist that describes the change (mostly using links to the issue tracker or git) and specifies the steps that must be followed to deploy it. Besides mitigating human error, the checklists allow anyone — not just the story owners — to deploy the changes.
  • Notifications: Each deployment is announced on the team’s group chat, informing stakeholders that a change has gone live. Celebratory emoticons often accompany important releases. (A sketch of such a notification appears after this list.)
  • Deployment Report: Daily stand-ups begin with a roll call of what was deployed the previous day and what is expected to be deployed on that day.
  • Friday Deployments: Most deployments become so mundane that they are a non-event unless a major infrastructure overhaul or migration is necessary. Deployments on a Friday or the day before a long holiday no longer call for much audacity or introspection.
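Such a notification can be a single pipeline step. Here is a hedged Python sketch that posts an announcement to a chat webhook (the URL shape, payload format, and names are placeholders, not our actual integration):

    import json
    from urllib import request

    def notify_deployment(webhook_url: str, service: str, version: str, author: str) -> None:
        """Post a short deployment announcement to the team's chat webhook."""
        payload = {"text": f":rocket: {service} {version} deployed to production by {author}"}
        req = request.Request(
            webhook_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        request.urlopen(req)  # fire-and-forget; failures would surface in the pipeline log

A step like this could run right after the deployment stage succeeds, so the announcement is never forgotten.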


Chat notifications cheering a deployment

To Nirvana

A continuous deployment journey is never really done. Your team matures and thirsts for greater automation, new features undo some of the good work done through architectural changes, virtualization technologies evolve and force a rethink, or you simply grow restless and aspire to a higher deployment frequency. We are far from done ourselves.

The following goals are of particular interest to us in the coming months:

  • Enable taking some (backwards-compatible) changes from a developer to production in a completely automated fashion, without human (specifically QA) intervention.
  • Enable testing in a completely decoupled developer environment where Freshworks products and external services are replaced with mock services (see the sketch after this list).
  • Enable taking any change into production in a completely automated fashion, after it has been merged to staging.
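As a hint of what that decoupled testing could look like, here is a hedged Python sketch (the function and names are hypothetical): platform logic is exercised against a mock standing in for a product API, so no live Freshworks product or external service is needed to run the test.

    import unittest
    from unittest import mock

    def installed_app_count(product_client) -> int:
        """Platform logic under test: count the apps a product reports as installed."""
        return len(product_client.fetch_installed_apps("acct-42"))

    class InstalledAppCountTest(unittest.TestCase):
        def test_counts_apps_without_a_live_product(self):
            # The product API is replaced with a mock, so the test runs fully decoupled.
            product_client = mock.Mock()
            product_client.fetch_installed_apps.return_value = ["slack-notifier", "crm-sync"]

            self.assertEqual(installed_app_count(product_client), 2)
            product_client.fetch_installed_apps.assert_called_once_with("acct-42")

    if __name__ == "__main__":
        unittest.main()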

We look forward to sharing the next chapter of our continuous deployment journey here in the near future.

In the meantime, what does your journey towards Continuous Deployment look like?