The Freshworks way of making apps highly available

Your application should always be ready to serve the requests it receives. The degree to which your application can serve all incoming requests within an acceptable amount of time is the topic we are going to focus on: availability.

Obviously, achieving 100% availability is too good to be true; there are many things that can go wrong. This is why your application should define a Service Level Agreement (SLA) for availability and work to meet it. The SLA gives your customers certainty about the degree to which your application will be available. Teams typically aim for targets such as 99.9% ("three nines") or 99.99% ("four nines") uptime.
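To see what such a target means in practice, here is a minimal sketch (the target percentages and the yearly window are illustrative, not figures from this post) that converts an availability percentage into an allowed downtime budget:

```python
# Convert an availability target into a yearly downtime budget (illustrative figures).
MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime allowed per year for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability -> {downtime_budget_minutes(target):.1f} min/year")
# 99.0%  -> ~5256 min (~3.7 days)
# 99.9%  -> ~526 min (~8.8 hours)
# 99.99% -> ~53 min
```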

Consider the common SaaS architecture:

  • Load balancer
  • Proxy servers
  • Web servers
  • Application servers (VM / container)
  • Database and cache

Things can fail at each level, causing various issues that might lead to your application going down. Some common reasons: 

  • Server hardware failure (at any level in your architecture)
  • Network latencies
  • Issues with deployments
  • Software failures (code bugs / poor memory management / code not scaling)

Code not scaling could mean that your code uses too much memory or takes too long to run, which in turn could be due to inefficient algorithms, complex queries, a high number of queries, not using a cache, and so on. Performance and scalability improvements will be discussed in the next post. Security-related availability is also out of scope for this document and will be covered in a future post.

“500 Internal Server Error”

“503 Service Unavailable”

“504 Gateway Timeout”

These are messages you wish your users never see in your application. What measures should you take to ensure high availability for your application?

1. Redundancy – avoid single points of failure

Identify all single points of failure in your architecture and look for ways to make each of them more resilient. Perform periodic fault tests in your architecture and see how the application fares in each scenario. Based on that, have recovery solutions ready for each component of your architecture.

Typical single points of failure to look for include a lone load balancer, a single database master, a single cache or queue node, and any shared dependency that every request passes through.

Redundancy at the application layer

  • Deploy multiple servers so your application can tolerate individual server failures without becoming unavailable.
  • Distribute servers across multiple zones so that zone-level failures and network issues can be handled.

2. Auto scaling

When the load on your application increases suddenly, it needs to scale up to meet the demand. At other times, when the load is low, it can scale down to save cost. This can also be automated.

All cloud providers have auto-scaling solutions that can be implemented for your application's workload. If you're running containers in Kubernetes, there's the Horizontal Pod Autoscaler for this.

Also, remember that you cannot keep scaling up and down forever. You cannot accommodate spam, and you need to consider the cost involved in scaling. Remember to apply the correct limits for scaling both up and down, as in the sketch below.
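As a minimal sketch, assuming the official kubernetes Python client and an existing Deployment called "web" in the "default" namespace (both illustrative names, not details from this post), a Horizontal Pod Autoscaler with explicit lower and upper replica limits could be created like this:

```python
# Sketch: create a Horizontal Pod Autoscaler with explicit scaling limits.
# Assumes the official `kubernetes` Python client and an existing
# Deployment called "web" in the "default" namespace (illustrative names).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"
        ),
        min_replicas=2,    # never drop below the baseline capacity
        max_replicas=10,   # cap the spend and absorb sudden bursts
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```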

3. Distribution

Group your requests by kind and distribute them among separate groups of servers. This localizes infrastructure faults. For example, consider the following distributed setup of server groups:

  • Normal application requests to one set of servers
  • API requests to a separate set of servers
  • Asynchronous requests to a separate set of servers
  • Long-running requests to another dedicated set of servers with different timeout settings
  • Memory- or compute-heavy requests to different servers with appropriate memory and compute settings

With this, the infrastructure is well distributed and resilient to process- and server-level failures: a fault in one group does not spill over into the others, as the sketch below illustrates.
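As an illustrative sketch (the path prefixes and pool names are assumptions, not from this post), the routing decision at the proxy layer can be as simple as mapping a request class to a dedicated server pool:

```python
# Sketch: map a request class to a dedicated server pool (all names are illustrative).
SERVER_POOLS = {
    "web":   ["web-1.internal", "web-2.internal"],        # normal requests
    "api":   ["api-1.internal", "api-2.internal"],        # API traffic
    "async": ["worker-1.internal", "worker-2.internal"],  # asynchronous jobs
    "batch": ["batch-1.internal"],                        # long-running / heavy requests
}

def pick_pool(path: str) -> list[str]:
    """Route by path prefix; a failure in one pool stays local to that pool."""
    if path.startswith("/api/"):
        return SERVER_POOLS["api"]
    if path.startswith("/jobs/"):
        return SERVER_POOLS["async"]
    if path.startswith("/reports/"):
        return SERVER_POOLS["batch"]
    return SERVER_POOLS["web"]
```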

4. High availability for data stores

Deploy your data stores in a master + read replica architecture: writes go to the master, reads go to a replica, and the replica is kept in sync through live replication of the master. Deploy both across multiple zones. This applies both redundancy and distribution. When the master fails, have the ability to promote the replica immediately as the new master and continue serving your application. This keeps your application's RTO (Recovery Time Objective) low, i.e. faster recovery from disasters.

You can have multiple replicas too, based on your application's scale. With that, you can build a solution to switch between replicas automatically when one goes down. You can even add a proxy between the application and the data layer to handle throttling, failover detection, connection pooling, query routing, and so on.

Make sure your application can fail fast when the database is unreachable and retry with exponential backoff, as in the sketch below.
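A minimal sketch of that pattern, assuming a generic `connect()` callable and illustrative timeout and delay values (none of these names come from this post):

```python
import random
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Fail fast on each attempt, then retry with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            # A short connection timeout makes each individual attempt fail fast.
            return connect(timeout=2.0)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up: fail the request instead of hanging forever
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay))  # jitter avoids thundering herds
```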

Take periodic backups of your data stores and store them separately. In case of bigger disasters, your backups will come in handy to bring up new databases, and the more frequently backups are taken, the lower your application's RPO (Recovery Point Objective) will be.

5. Fail fast

If your application talks to other services or resources, employ a fast-failure mechanism. If any of those services or resources become unresponsive, it is better to stop those connections immediately and fail the requests than to continue serving them in a flawed or slower manner.

Employ a circuit breaker mechanism, as in the sketch below: a circuit breaker promptly stops all calls to a faulty microservice or resource and resumes them once it is available again.
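A minimal circuit breaker sketch (the failure threshold and cool-down period are illustrative assumptions, not values from this post):

```python
import time

class CircuitBreaker:
    """Open after repeated failures; allow calls again only after a cool-down period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast without calling the service")
            self.opened_at = None  # cool-down over: allow a trial call (half-open)
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```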

6. Limits on provisioning resources

  • For requests that are compute- or memory-intensive, put limits on the number of such requests. Identify how much your application can support and throttle the extra requests with a rate-limiting solution (see the sketch after this list).
  • Keep an eye on OS parameters used by your application and dependent services that may lead to scaling problems, e.g. open sockets, file descriptors and so on. Tune them to keep them within acceptable values.
  • Apply compute and memory limits to the processes running on your servers, and restart or kill processes when they exceed them. It is better to fail a single problematic request than to let it run to completion and choke several other requests.
  • You can deploy plugins (developed in-house or third-party) in your monitoring service to log a request’s trace before killing it, which will help with debugging later.
  • Apply proper timeouts to all connections and requests made by your application. Never wait for something that is not going to respond.
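As a minimal sketch of request throttling (the capacity and refill rate are illustrative assumptions), a token-bucket limiter rejects the overflow instead of letting heavy requests pile up:

```python
import time

class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate=50.0, capacity=100.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should respond with HTTP 429 (Too Many Requests)
```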

7. Deployments

A lot of issues in your application come up during deployments, when new changes are rolled out that break only in production. Better ways to deploy:

Canary deployments

Roll out changes to a small initial subset of servers / users and verify the impact and results. Based on success, roll out to more servers / users until the switch is complete.
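A minimal sketch of the traffic split, assuming a stable per-user identifier and an illustrative rollout percentage (both assumptions, not details from this post):

```python
import hashlib

def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Deterministically send a fixed percentage of users to the canary servers."""
    # Hashing keeps the same user on the same version across requests.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Example: start with 5% of users, then raise the percentage as confidence grows.
target = "canary" if in_canary("account-42", 5) else "stable"
```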

Note: If you have database migrations that affect compatibility between the old and new versions of the code, verify this while doing a traffic switch.

Still, no matter how fast your app boots, a few requests may have to wait while servers are being deployed and restarted. As you scale, more requests will queue up.

Rolling deployments

This involves replacing existing servers with new servers containing the new changes, one by one. Introduce a health-check mechanism to ensure each server is up properly before proceeding further, as in the sketch below.
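A minimal health-check gate for such a rollout, assuming each server exposes an HTTP `/health` endpoint (an assumption for the example) and using the `requests` library:

```python
import time
import requests

def wait_until_healthy(host: str, timeout_seconds: int = 120) -> bool:
    """Poll a server's health endpoint before moving the rollout to the next server."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            response = requests.get(f"http://{host}/health", timeout=2)
            if response.status_code == 200:
                return True
        except requests.RequestException:
            pass  # not up yet; keep polling
        time.sleep(3)
    return False  # abort the rolling deployment instead of proceeding blindly

# for host in new_servers:
#     deploy(host)                                          # hypothetical helper
#     assert wait_until_healthy(host), f"{host} failed health checks"
```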

But if your scale is very large and you don't want any users to see potential deployment issues, go for blue-green deployments.

Blue-green deployments

Have two sets of servers: blue and green. At any point in time, only one color is live. During deployment, bring the non-live color's servers up with the new changes, test them completely, point them at test / internal traffic (if any), and monitor for some time before switching them live. This switch can also be done in a phased manner, as in canary deployments, to bring your changes live gradually.

Switching is done at the load balancer (or any front-facing proxy server) by changing the live-color setting. The switch can be made per connection, or it can cover all connections of a tenant (customer / account). Take care that all connections of a particular tenant go to a single version only, so that the changes are either available or not available to that tenant across all of their requests; the sketch below shows one way to keep tenants sticky.
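A minimal sketch of tenant-sticky switching during a phased blue-green cutover (the color names and the set of migrated tenants are illustrative assumptions):

```python
# Sketch: route every request of a tenant to a single color during a phased switch.
LIVE_COLOR = "blue"          # current production version
CANDIDATE_COLOR = "green"    # newly deployed version

# Tenants that have been explicitly moved to the new version so far (illustrative).
migrated_tenants = {"acme", "initech"}

def color_for_tenant(tenant_id: str) -> str:
    """All requests of a tenant land on the same color, never a mix of both."""
    return CANDIDATE_COLOR if tenant_id in migrated_tenants else LIVE_COLOR
```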

You can combine the best points of these strategies into one solution that suits your use case and build it into your CD pipeline.

With all these measures in place, you can make sure your application is always “highly available”. 

In our next post, we will focus on increasing the availability of an application by improving performance and scalability and better handling software failures. After that: cost optimization, security and more.

Cover and inline images: Vignesh Rajan