Building a scalable app in production

In previous posts about building and deploying your application in production, we covered observability and availability.

We focused on why these properties matter for an application running in production. A highly observable and available application can keep growing as it evolves, with more features and a changing architecture. And as your customer base grows, the throughput your application handles grows with it.

What is scalability?

Scalability is the property of a system to handle a growing amount of throughput and data volume. It is the ability to handle requests arriving at higher rates efficiently, without letting availability or performance take a hit. A scalable application keeps its functionality and performance intact irrespective of throughput or data volume.

Why is scalability important?

  • More data and higher throughput can lead to scalability issues, which in turn impact the SLAs defined for performance and availability of any business or application.
  • If scaling issues are addressed only by provisioning more infrastructure, costs increase significantly.
  • If the application doesn’t scale well, customer delight suffers, which curbs growth from existing customers and makes it harder to win new ones.
  • A lot of non-scalable services in your application will drive a high churn rate among your customers.

The ideal state is for your SLAs to hold steady even as your application sees higher throughput and data inflow while scaling up.

How to scale your application?

A lot of planning goes into this. You cannot make your application scalable in a single shot, and there are many aspects to consider: scaling at various layers (database, compute, network, etc.), horizontal vs vertical scaling, design-related improvements, and so on. You need to do capacity planning, ensure you have enough observability to track resource usage across all your layers, and perform load testing to see how far each layer can meet your scaling needs. Let’s take a look at some of the best practices.

Scalability in design

Plan properly: Try to get an understanding of your product, your customers and what a typical workload looks like. Choose the tech stack that best suits your needs:

  • Choose a known framework with a solid programming language and a mature community around it. Consider its learning curve and the expertise available at hand.
  • Evaluate your database needs. Check whether your workload is better suited to a relational database or to NoSQL.
  • Employ caching and asynchronous communication (discussed later).
  • See if you can decouple the frontend and backend with a queue. You can then scale each independently, with the queue acting as a buffer (see the sketch after this list).
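
To make that last point concrete, here is a minimal sketch using only Python’s standard library: a bounded queue.Queue buffers requests between a frontend-facing producer and backend workers. The names handle_request and process_job are hypothetical, and a production system would typically use a broker such as RabbitMQ or Kafka rather than an in-process queue.

    import queue
    import threading

    # Bounded queue: absorbs bursts from the frontend while the
    # backend drains it at its own pace.
    jobs = queue.Queue(maxsize=1000)

    def handle_request(payload):
        # Frontend: enqueue and return immediately instead of doing
        # the heavy work inside the request cycle.
        jobs.put(payload)
        return {"status": "accepted"}

    def process_job(payload):
        # Hypothetical stand-in for the real backend work.
        print("processing", payload)

    def worker():
        # Backend: scale by starting more workers (threads, processes
        # or pods) without touching the frontend at all.
        while True:
            payload = jobs.get()
            process_job(payload)
            jobs.task_done()

    for _ in range(4):
        threading.Thread(target=worker, daemon=True).start()

    for i in range(3):
        handle_request({"job": i})
    jobs.join()   # demo only: wait for the backend to drain the queue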

Visualise a modular, loosely coupled architecture: Smaller, independent components organised into microservices are easier to understand, manage and scale. There is more flexibility in handling each component separately, so each can be scaled independently. This is not the right fit for every workload, though; there are cases where a monolith does better. Evaluate your workload to see which approach fits it best.

Bottlenecks: Identify slow-running transactions and queries, and work on optimising them. Identify the limits of the various resources you have provisioned, and have a plan for what to do when those limits are reached.

Statelessness: If you are using a microservice architecture, identify services in your workload that can be made stateless. A stateless service is inherently fault-tolerant and scales well, because any instance can serve any request.

Scalability in the application layer

Clean code: This is a very important aspect. Ensure that the code you run is optimal. Run benchmark tests to see how well your code (e.g. a particular method) performs at scale. Use static code analysis tools (linters) to flag programming errors, bugs, stylistic errors and suspicious constructs.
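
As a quick way to benchmark a single method, here is a sketch using Python’s standard timeit module; the two functions are hypothetical stand-ins for an old and an optimised implementation of the same logic.

    import timeit

    def join_concat(items):
        # Repeated concatenation: typically quadratic total work.
        out = ""
        for s in items:
            out += s
        return out

    def join_builtin(items):
        # A single pass via str.join: linear total work.
        return "".join(items)

    items = ["x"] * 10_000
    for fn in (join_concat, join_builtin):
        elapsed = timeit.timeit(lambda: fn(items), number=100)
        print(f"{fn.__name__}: {elapsed:.3f}s for 100 runs")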

Decouple frontend: Identify use cases where code can be moved to the frontend and processed there, making your backend lighter. Evaluate at the design stage itself whether a feature can be built separately as a single-page application.

Decouple database: Don’t connect to the database, and don’t execute transactions, unless you have to. Plan where you want each piece of data to live (process memory, cache, database, etc.) and design accordingly. Don’t put every single thing in the database and query for it every time your process needs it; for example, don’t store transient data like user sessions in the database.
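
As a small illustration of the session example, here is a sketch that keeps transient session data in process memory with a TTL instead of hitting the database on every lookup. SessionStore and its methods are hypothetical names; in a multi-server deployment you would use a shared cache such as Redis rather than a per-process dict.

    import time

    class SessionStore:
        """Transient sessions in process memory with a TTL;
        nothing here ever touches the database."""

        def __init__(self, ttl_seconds=1800):
            self._ttl = ttl_seconds
            self._data = {}   # session_id -> (expiry_timestamp, payload)

        def set(self, session_id, payload):
            self._data[session_id] = (time.time() + self._ttl, payload)

        def get(self, session_id):
            entry = self._data.get(session_id)
            if entry is None:
                return None
            expiry, payload = entry
            if time.time() > expiry:   # lazily evict expired sessions
                del self._data[session_id]
                return None
            return payload

    sessions = SessionStore()
    sessions.set("abc123", {"user_id": 42})
    print(sessions.get("abc123"))   # {'user_id': 42}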

Asynchronous communication: Not every piece of functionality needs to be part of the frontend request the user makes. See what you can logically move out of the main request cycle into asynchronous processing (e.g. sending a success email after a long-running export or import operation), so that you improve performance, remove dependencies, and can scale each part independently.
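
To make the export example concrete, here is a sketch using the standard concurrent.futures module; run_export and send_email are hypothetical stand-ins, and a real system would more likely hand the job to a task queue such as Celery.

    from concurrent.futures import ThreadPoolExecutor

    pool = ThreadPoolExecutor(max_workers=2)

    def run_export(user):
        # Hypothetical long-running work.
        return f"export for {user}"

    def send_email(user, subject):
        # Hypothetical stand-in for a call to a mail service.
        print(f"email to {user}: {subject}")

    def export_handler(user):
        data = run_export(user)
        # Fire and forget: the email is sent outside the main request
        # cycle, so the response is not held up by the mail server.
        pool.submit(send_email, user, "Your export has finished")
        return data

    print(export_handler("alice"))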

Pipelining: Parallel and out-of-order execution techniques come in handy when you have a series of operations to perform in a transaction. Consider a process making multiple calls to a service or cache within a transaction. Instead of making one call per piece of data, explore whether you can fetch all of the data in one go, making it a single call per transaction.
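
For example, with Redis and the redis-py client (a sketch that assumes a Redis server on localhost; the keys are made up), a multi-get or a pipeline turns N network round trips into one:

    import redis

    r = redis.Redis(host="localhost", port=6379)
    keys = [f"user:{i}:name" for i in range(100)]

    # Naive: one network round trip per key.
    values = [r.get(k) for k in keys]

    # Better: a single MGET round trip for all the keys.
    values = r.mget(keys)

    # Alternatively, batch arbitrary commands into one round trip.
    pipe = r.pipeline()
    for k in keys:
        pipe.get(k)
    values = pipe.execute()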

Scalability for databases

Choose the right database:

  • Understand your data characteristics
  • Evaluate the available options
  • Choose data storage based on access patterns and optimise later
  • Weigh schema complexity against query complexity, and vertical vs horizontal scaling needs, when making the choice between SQL and NoSQL databases.

Master + Read replica: Write queries go to a master and read queries go to a live replica. Each can be scaled separately, according to traffic patterns. 
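
At the application level, the routing can be as simple as inspecting the statement. Here is a sketch; the pool objects and their execute method are illustrative, not any specific driver’s API.

    class RoutingConnection:
        """Send writes to the master and reads to a replica."""

        def __init__(self, master_pool, replica_pools):
            self.master = master_pool
            self.replicas = replica_pools
            self._next = 0

        def execute(self, sql, params=()):
            if sql.lstrip().lower().startswith("select"):
                # Round-robin reads across replicas; add replicas
                # to scale read traffic without touching the master.
                pool = self.replicas[self._next % len(self.replicas)]
                self._next += 1
            else:
                pool = self.master   # writes always go to the master
            return pool.execute(sql, params)

    class StubPool:
        # Stand-in so the sketch runs; a real pool wraps DB connections.
        def __init__(self, name):
            self.name = name
        def execute(self, sql, params=()):
            print(f"{self.name}: {sql}")

    db = RoutingConnection(StubPool("master"),
                           [StubPool("replica-1"), StubPool("replica-2")])
    db.execute("SELECT * FROM users WHERE id = ?", (1,))             # replica
    db.execute("UPDATE users SET name = ? WHERE id = ?", ("a", 1))   # master

One caveat: replicas lag the master, so reads that must see their own writes should be pinned to the master.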

Caching: Identify frequently loaded objects, slow queries, and so on, and cache them. Frequent, repetitive reads served from a cache take load off the database and improve performance.
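
The standard library’s functools.lru_cache gives you this pattern in one line; a sketch, where load_user_from_db is a hypothetical stand-in for a slow query:

    from functools import lru_cache

    def load_user_from_db(user_id):
        # Hypothetical slow query.
        print(f"querying DB for user {user_id}")
        return {"id": user_id}

    @lru_cache(maxsize=10_000)
    def get_user(user_id):
        # First call per user_id hits the database; repeated
        # reads are served from the in-process cache.
        return load_user_from_db(user_id)

    get_user(1)   # queries the DB
    get_user(1)   # served from the cache

Remember to bound the cache’s size and to invalidate entries when the underlying data changes; stale reads are the classic caching pitfall.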

Indexing: This will drastically reduce the time your database needs to locate and load the data for a given query.
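
A sketch with SQLite from Python’s standard library (the table and columns are made up), showing how an index switches the query plan from a full scan to an index search:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )

    query = "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = ?"
    print(con.execute(query, (42,)).fetchall())   # SCAN orders (full table scan)

    con.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    print(con.execute(query, (42,)).fetchall())   # SEARCH orders USING INDEX ...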

Vertical partitioning: Vertically partition your database tables to contain fewer columns. By splitting a large table into smaller, individual tables, queries that access only a fraction of the data can run faster because there is less data to scan.

Horizontal partitioning: Horizontal partitioning (sharding) divides a table into multiple tables that contain the same columns, but fewer rows. Subsets of the data are split into separate, smaller tables or databases, making them faster to query. A detailed solution about sharding and rebalancing data between shards can be found here.
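
A sketch of hash-based shard routing; the shard count and names are illustrative, and real systems often use consistent hashing so that adding a shard does not remap most keys.

    import hashlib

    NUM_SHARDS = 4
    SHARDS = [f"users_db_{i}" for i in range(NUM_SHARDS)]   # hypothetical shard names

    def shard_for(user_id):
        # A stable hash, so the same user always maps to the same shard.
        digest = hashlib.sha256(str(user_id).encode()).hexdigest()
        return SHARDS[int(digest, 16) % NUM_SHARDS]

    print(shard_for(42))     # always the same shard for user 42
    print(shard_for(1001))   # possibly a different shard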

Scalability in infrastructure

Importance of load testing: Perform regular load tests on your application to identify transactions and resources that perform poorly at scale, bottlenecks and single points of failure in your architecture, and the limits of your various resources. With this exercise, you can come up with a plan to scale each resource properly.
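
Purpose-built tools such as k6, Locust or JMeter are the usual choice, but as a minimal sketch of the idea in plain Python (the URL and numbers are placeholders):

    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    URL = "http://localhost:8080/health"   # placeholder endpoint
    REQUESTS = 200
    CONCURRENCY = 20

    def timed_request(_):
        start = time.perf_counter()
        with urlopen(URL) as resp:
            resp.read()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(timed_request, range(REQUESTS)))

    # Watch how these degrade as you raise CONCURRENCY.
    print("p50:", latencies[len(latencies) // 2])
    print("p95:", latencies[int(len(latencies) * 0.95)])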

Load balancer: Use a load balancer to balance the requests between your application servers based on their capacity. Also, ensure your servers have resources provisioned specific to the type of requests they get. For example, some requests are memory intensive and some require more CPU, so ensure the servers are tailored for that.
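
As a toy sketch of the balancing idea (the server names and weights are made up; in practice you would use HAProxy, NGINX or a cloud load balancer), weighted round-robin sends proportionally more traffic to the bigger server:

    import itertools

    # Two slots for the bigger server, one for the smaller: a 2:1 split.
    SERVERS = ["app-big", "app-big", "app-small"]
    rotation = itertools.cycle(SERVERS)

    def pick_server():
        return next(rotation)

    for _ in range(6):
        print(pick_server())   # app-big, app-big, app-small, repeated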

Containerisation: Build container images for your application and services, and run the required processes as containers for easier and faster horizontal scaling. Running applications as containers also lets you restrict the resources they can consume and tune their access to the host for better performance and security. Deploy these containers through an orchestrator with auto scaling rules configured (discussed later).

Leverage CDN: Static content can be delivered to customers without any of your application servers processing those requests; instead, it is served to users through a CDN. CDN managed services also offer caching in edge locations to improve performance.

Infrastructure as code: We have come a long way from “requesting” a new hardware server to be provisioned, to easily “spawning” virtual machines. How convenient would it be to go all the way and deploy your application in a new environment with just a few clicks or commands? Express your infrastructure as a modular, declarative model of code, and provision the resources in a data center by simply applying it. Put this code under version control to track how your infrastructure evolves over time; this also makes recovery from failure faster. Integrate your application’s release management with your infrastructure too. This lets you spawn new environments easily and helps with scaling.

We have a blog post about infrastructure as code coming up soon. The link will be included in this article.

Horizontal vs vertical scaling

Horizontal scaling (also called scaling out) means adding more individual resources to your pool of resources, whereas vertical scaling (also called scaling up) means adding more power (e.g. CPU, memory) to an existing resource.

In choosing between the two, consider the following: 

  • Cost: Both kinds of scaling increase cost; identify which suits your needs at an optimal cost.
  • Deployments / updates: The more resources you add through horizontal scaling, the more you have to manage, deploy to and update.
  • Every horizontal scaling step provisions at least one whole new server. Instead, consider a small capacity increase in an existing resource (vertical).
  • Horizontal scaling has the added benefit of redundancy.
  • Not all vertical scaling can be applied without a restart (which means downtime), and this needs to be considered.

Auto scaling

What if scaling could happen automatically, without manually executing commands each time? Auto scaling is the ability to dynamically adjust the amount of resources based on load. Once you have a stable monitoring and alerting system, you can build on it to configure rules for scaling up and down as needed. This increases fault tolerance and improves the response to throughput variations and other issues in your workload. For big workloads handling high throughput, the ability to auto scale goes a long way towards making the application highly scalable.

We have our workload running as containers managed by Kubernetes. We have deployed a custom auto scaling solution based on the Horizontal Pod Autoscaler (HPA) available in Kubernetes. The HPA takes scaling decisions automatically, based on metrics provided to it by Prometheus. The scaler works on rules based on request count, and we’re planning to extend it to response times too.
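
For intuition, the core of the HPA’s decision is its documented ratio rule, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A sketch of that calculation (the request-count numbers are made up):

    import math

    def desired_replicas(current_replicas, current_metric, target_metric):
        # The HPA's core rule: scale proportionally to how far the
        # observed metric is from its target.
        return math.ceil(current_replicas * current_metric / target_metric)

    # e.g. 4 pods averaging 150 req/s each, with a target of 100 req/s per pod:
    print(desired_replicas(4, 150, 100))   # -> 6 pods

The real HPA layers tolerances, stabilisation windows and min/max replica bounds on top of this ratio.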

Here’s an article about autoscaling with HPA, in which we’ve discussed a method to leverage the Autoscaler to predict traffic and optimize infra costs. 

Similarly, one can look into other auto scaling options: 

  • Kubernetes Event-driven Autoscaler (KEDA) if you are using Kubernetes
  • AWS Auto Scaling Groups (ASGs) for AWS
  • Load- and time-based instances in AWS OpsWorks for AWS
  • Azure Autoscale if you are using Azure
  • Autoscaling with Managed Instance Groups (MIGs) in GCP if you are using Google Cloud Platform

In our next post, we will focus on increasing an application’s availability by improving performance and handling software failures better. And after that: cost optimisation, security and more.