Our lessons from Kustomize optimization for Freshservice

Freshservice, the cloud-based ITSM service desk software from the Freshworks suite of products, helps companies manage incidents, assets, and other Information Technology Infrastructure Library (ITIL) features. It is deployed as containers managed by Kubernetes, and the Kubernetes manifest files are created and managed with kustomize.

Kustomize

With kustomize, we can create raw, template-free YAML files for multiple purposes, leaving the original YAML untouched and usable. We can manage configuration variants using overlays that modify a common base, such as development, staging, and production.

Refer to the following structure:

.
├── manifests
│   ├── base
│   │   └── app
│   │       ├── base_template_for_deployments.yaml
│   │       ├── kustomization.yaml   - has the list of deployments defined
│   │       ├── base_patches.yaml
│   │       └── generators for template
│   │           ├── deployment_1_arguments.yaml
│   │           ├── deployment_2_arguments.yaml
│   │           └── ...
│   ├── staging
│   │   ├── base
│   │   │   ├── kustomization.yaml
│   │   │   └── staging_config.yaml
│   │   └── app
│   │       ├── kustomization.yaml
│   │       ├── staging_patches.yaml
│   │       └── staging_replicas.yaml
│   └── production
│       ├── base
│       │   ├── kustomization.yaml
│       │   └── production_config.yaml
│       └── app
│           ├── blue
│           │   ├── kustomization.yaml
│           │   ├── production_patches.yaml
│           │   ├── production_replicas.yaml
│           │   └── blue_config.yaml
│           └── green
│               ├── kustomization.yaml
│               ├── production_patches.yaml
│               ├── production_replicas.yaml
│               └── green_config.yaml


This structure works well for putting common settings at the base level, where they are reused to reduce redundancy, and keeping specific settings only where they are needed. Each kustomization.yaml can refer to other files at its own level, and each can use a base-level kustomization.yaml as its base. At the base level there is a single template for deployments, along with various generators for that template, each specific to a different deployment; together they form the application we want to deploy.
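
For instance, a staging overlay's kustomization.yaml might look roughly like the following sketch (only files that appear in the structure above are referenced; the exact fields depend on the kustomize version):

# manifests/staging/app/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Inherit everything defined by the base app kustomization
resources:
  - ../../base/app

# Apply the staging-specific settings kept at this level
patchesStrategicMerge:
  - staging_patches.yaml
  - staging_replicas.yaml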

The problem

We have multiple versions of the application running in each environment, each with a different set of deployments. However, the list of deployments for the application is defined at the base level, and a child-level kustomization cannot directly refer to a non-kustomization file from the base. This means manifests/base/app/kustomization.yaml has the complete list of generators defined, from deployment_1_arguments.yaml through deployment_n_arguments.yaml.
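
In other words, the base kustomization ends up looking roughly like the sketch below (the name of the directory holding the generator argument files is illustrative):

# manifests/base/app/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Every deployment of every environment has to be listed here
generators:
  - generators/deployment_1_arguments.yaml
  - generators/deployment_2_arguments.yaml
  # ...
  - generators/deployment_n_arguments.yaml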

We cannot list only the deployments required for a given environment in production/app/blue/kustomization.yaml, because that would mean referring directly to a non-kustomization file from the base, and the deployment template is common, so it lives at the base. As a result, all deployments are created for all environments by default, and we can set the replicas of the unneeded deployments to 0. We decided to go one step further and delete the extra deployments in the production patches file, so each environment ends up with only the deployments it needs.
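
The deletion can be expressed as a strategic merge patch in the production patches file, along these lines (the deployment name is a placeholder):

# Fragment of production_patches.yaml (illustrative sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-2   # a deployment this environment does not need
$patch: delete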

The optimization

As the number of deployments, the number of environments, and the size of each environment grew, the codebase became too complicated, with many deployments redundantly created and then deleted again through patches. We decided to optimize away this runtime patching: a hook at deploy time refers to the list of deployment generators defined in each specific folder and pastes it into the base kustomization file, removing the patchwork.

Now each folder has a generators.yaml listing the deployment arguments it requires. An external deployment hook triggers a script that pastes this list into the base kustomization, circumventing the kustomize limitation and letting each environment use its own generators against the same base-level template. The optimization removed a lot of unwanted code and simplified our deployments.
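
The exact hook and script are specific to our pipeline, but conceptually each environment folder now carries something like the following (the listed argument files are placeholders):

# manifests/production/app/blue/generators.yaml (illustrative sketch)
# A deploy-time hook copies this list into the generators field of
# manifests/base/app/kustomization.yaml, so only these deployments are built.
generators:
  - generators/deployment_1_arguments.yaml
  - generators/deployment_2_arguments.yaml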


The downtime

We have a load balancer exposed to the internet to receive web requests and forward them to our internal proxy servers.

From the proxy servers onward, everything runs as pods on our Kubernetes platform. A discovery service populates the proxy servers' IPs as our load balancer's targets; it discovers those IPs through a label added to the proxy. The label is added through the same production patch file (production_patches.yaml) we saw above.
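
The label in question is applied by a small fragment of production_patches.yaml, roughly like the following (the Service name is illustrative; the label key and value match the policy shown later):

# Fragment of production_patches.yaml (illustrative sketch)
apiVersion: v1
kind: Service
metadata:
  name: proxy                    # illustrative name for the proxy Service
  labels:
    proxy.discovery: "true"      # the discovery service selects targets by this label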

The optimization that removed all the unnecessary patches also swept up this necessary one. Among the 16,006 lines removed was one line that was still needed, and given the sheer volume of changes, it was missed in our reviews as well. The line was missing from only one production environment's patch file and still present in the others, so the issue could not be caught in staging. Upon deploying, the label was removed, and the load balancer dropped that proxy server as a target because it no longer carried the label. No requests were forwarded to the proxy server, leading to downtime. We realized this immediately and manually added the label back, bringing the application up again.

Prevention measures

How can a human being check that everything is right in a colossal codebase? The answer is test cases. Just as every application framework has a companion framework for writing tests, we found one for our Kubernetes manifests: Open Policy Agent, an open-source, general-purpose policy engine. We write our tests in Rego as assertions that are checked against the codebase, flagging any violations.

This solves several tricky problems:

  • The manifest for each environment is unique, so the test cases can catch issues specific to each level.
  • Before new code is applied to production, its effect can be tested against the assertions specified.
  • The same checks give us a dry-run mode: we can verify the effect of what we are about to apply in production before deploying.

Implementation

To explain with an example, let's see how we ensured, with a test case, that the problem above cannot happen again.

Case: the proxy server's manifest should always carry the label that the discovery service uses to sync its IPs as load balancer targets.

Rego policies are defined in policy/<file_name>.rego and picked up by conftest, a utility that runs these tests against the generated manifests. Save the following as policy/service.rego (for the Kubernetes Service in front of the proxy server pods):

package main

# Check 1: the discovery label must be present on the Service
deny[msg] {
  input.kind == "Service"
  name := input.metadata.name
  not input.metadata.labels["proxy.discovery"]
  msg := sprintf("Discovery label missing for service %v", [name])
}

# Check 2: the discovery label must carry the expected value
deny[msg] {
  input.kind == "Service"
  name := input.metadata.name
  labels := input.metadata.labels
  labels["proxy.discovery"] != "true"
  msg := sprintf("Discovery label wrong for service %v", [name])
}

Here we have two checks: 

  1. Check if the label exists
  2. Check if the correct value exists in the label

Include a check in your deployment logic like the following:

kustomize build <target_kustomization_directory> | conftest test -

Here kustomize build generates the manifests for the specified target, and conftest test evaluates them against all the policies under policy/*.rego. So even if a patch or any other required setting is missed, the test cases catch it.

We have around 30 test cases that run against each manifest, and the time taken to run them all is negligible, on the order of a few milliseconds, so the test suite does not add to production rollout time.

Teams planning to use this approach can go through their Kubernetes architecture and manifests, develop a comprehensive list of test cases, refer to the Open Policy Agent and conftest documentation, and implement those cases in Rego. Include the check before your kubectl apply logic so that none of the failure modes you have guarded against slips through again.