Leveraging HPA to auto-scale an app on Kubernetes

When the traffic to your applications follows a predictable pattern, it’s easy to scale the infrastructure accordingly. For example, if you are running a commercial shopping website, you can expect a bump in traffic when you introduce offers or during festival seasons. In response, you can scale up your infra ahead of time to handle the load, and scale it back down after the offer.

However, traffic is not predictable in certain SaaS use cases. To protect customers from latency or unavailability caused by sudden surges, you may need to run at full capacity even when traffic is low, which increases infra costs.

Below is the traffic pattern of our service. Sometimes, we see a difference in this pattern when there is crawling/spam.

Below is the picture before autoscaling. Notice that the number of replicas we run remains constant regardless of changes in throughput.

Availability issues may crop up if you cannot handle spam/crawling, even when you have adequate infrastructure and can predict regular traffic. Also, you cannot scale forever to accommodate a sudden surge or spam; you must set proper upper limits for scaling.

Core autoscaling features also lower cost and keep performance reliable by adding and removing instances as demand spikes and drops. As such, autoscaling provides consistency despite the dynamic and, at times, unpredictable demand for applications.

To handle this, we decided to autoscale our pods using the Horizontal Pod Autoscaler (HPA). HPA changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of pods in response to the workload’s CPU or memory consumption, to custom metrics reported from within Kubernetes, or to external metrics from sources outside your cluster.

We scale on a custom metric, Requests Per Minute (RPM), which is a quantifiable measure of traffic. Our pods boot within 90 seconds, so when HPA adds pods in response to an increase in RPM, they are ready to serve traffic shortly afterwards, which makes the scaling effective.

Below are the components in our RPM-based autoscaling architecture:

  1. The Prometheus exporter exposes metrics on port 9394 at the /metrics endpoint.
  2. The Prometheus server scrapes those metrics at regular intervals, using service monitors to find its targets, and makes them available on the server.
  3. HPA polls for metrics through the custom metrics API.
  4. Based on the target configuration, HPA calculates the desired replica count and scales the referenced deployment.

Let’s look further into the above-mentioned components.

Prometheus Exporter

Since our scaling is based on a custom metric (RPM), we need to export that metric from the application.

For this we use the Prometheus exporter, which exposes the application metrics in Prometheus format on port 9394.

Sample metrics exported by prometheus exporter:

Service Monitor

A service monitor is configured for each application service. It tells Prometheus which services to scrape, and at what interval, for the metrics exposed by the Prometheus exporter.

Service monitors are configured with specific labels, which the Prometheus server later matches to decide which metrics to fetch.

kind: ServiceMonitor
apiVersion: monitoring.coreos.com/v1
metadata:
  name: application
  namespace: demo
  labels:
    env: staging
spec:
  selector:
    matchLabels:
      layer: application
      shell: demo
  endpoints:
  - port: metrics # works for different port numbers as long as the name matches
    interval: 15s # scrape the endpoint every 15 seconds

Prometheus Server

We run an in-cluster Prometheus server that is reachable only from inside the cluster, through a ClusterIP service.

This server scrapes the targets of every service monitor whose labels match the label selector configured on the Prometheus server, and makes the resulting metrics available for scaling.

spec:
  serviceMonitorSelector:
    matchLabels:
      env: staging
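
The label matching between the Prometheus server and the service monitors can be pictured with a small sketch. This is only an illustration of equality-based label selection, not the operator's actual code; the second monitor, "other-monitor", is a hypothetical name added for contrast.

```python
def selector_matches(match_labels: dict, labels: dict) -> bool:
    """Return True when every key/value pair in match_labels is present in labels."""
    return all(labels.get(key) == value for key, value in match_labels.items())


# Selector configured on the Prometheus server (from the snippet above).
prometheus_selector = {"env": "staging"}

# Labels carried by service monitors; "other-monitor" is hypothetical.
service_monitors = {
    "application": {"env": "staging"},
    "other-monitor": {"env": "production"},
}

# Only monitors whose labels satisfy the selector are picked up for scraping.
selected = [
    name
    for name, labels in service_monitors.items()
    if selector_matches(prometheus_selector, labels)
]
```

With these inputs only the "application" service monitor is selected, so only its endpoints would be scraped.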

One more service is needed to convert the Prometheus metrics into metrics that Kubernetes can read; for this we use the Prometheus adapter.

Custom Metrics API

The custom metrics API extends the Kubernetes API and makes custom metrics available to consumers such as HPA.

We configure the Prometheus endpoint and the Prometheus queries in the adapter; the queries are executed whenever the custom metrics API is called.

apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: custom-metrics
data:
  config.yaml: |
    rules:
    - seriesQuery: 'ruby_http_requests_total{namespace!="",pod!=""}'
      resources:
        template: "<<.Resource>>"
      name:
        matches: "^ruby_(.*)_total"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(ruby_http_requests_total{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
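
To make the rule above more concrete, here is a rough Python sketch of the two things it does: renaming the counter through the matches/as pair, and turning a growing counter into a per-second rate. The function names and the sample counter values are hypothetical, and simple_rate is a deliberate simplification: the real Prometheus rate() also extrapolates to the window boundaries and handles counter resets.

```python
import re


def exposed_metric_name(series_name: str) -> str:
    """Apply the rename from the rule: ^ruby_(.*)_total -> ${1}_per_second."""
    return re.sub(r"^ruby_(.*)_total", r"\g<1>_per_second", series_name)


def simple_rate(samples):
    """Average per-second increase of a counter over a window.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    Simplified: no extrapolation and no counter-reset handling.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical counter samples covering a 2-minute window, 60 seconds apart.
window = [(0, 100), (60, 1540), (120, 2980)]

metric_name = exposed_metric_name("ruby_http_requests_total")
requests_per_second = simple_rate(window)
```

Here the counter ruby_http_requests_total is exposed as http_requests_per_second, which is the metric name that appears in the custom metrics API path.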

Below is the command to get metrics using custom metrics API:

kubectl get --raw 

Sample response for the above command:

{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/demo_namespace/services/*/http_requests_per_second"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "demo_namespace",
        "name": "demo_service_1",
        "apiVersion": "/v1"
      },
      "metricName": "http_requests_per_second",
      "timestamp": "2021-08-31T13:09:43Z",
      "value": "24004m",
      "selector": null
    },
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "demo_namespace",
        "name": "demo_service_2",
        "apiVersion": "/v1"
      },
      "metricName": "http_requests_per_second",
      "timestamp": "2021-08-31T13:09:43Z",
      "value": "40160m",
      "selector": null
    }
  ]
}

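The values in the response use Kubernetes quantity notation, where the suffix "m" means thousandths, so "24004m" is about 24 requests per second. The sketch below converts the two sample values into requests per second and requests per minute. The helper parse_quantity is hypothetical and handles only the "m" suffix and plain numbers; real quantities support more suffixes.

```python
from decimal import Decimal


def parse_quantity(value: str) -> Decimal:
    """Convert a quantity such as "24004m" into a plain number.

    Only the milli suffix "m" and suffix-less values are handled here.
    """
    if value.endswith("m"):
        return Decimal(value[:-1]) / Decimal(1000)
    return Decimal(value)


# Values copied from the sample response.
raw_values = {
    "demo_service_1": "24004m",
    "demo_service_2": "40160m",
}

per_second = {name: parse_quantity(v) for name, v in raw_values.items()}
# Multiply by 60 to express the same traffic as requests per minute (RPM).
per_minute = {name: rate * 60 for name, rate in per_second.items()}
```

For demo_service_1 this gives 24.004 requests per second, or roughly 1440 requests per minute.
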
Horizontal Pod Autoscaler (HPA)

HPA can scale pods based on resource metrics (CPU, memory), custom metrics (RPM, etc.), or external metrics (SQS queue message count, etc.). Below are the major elements of HPA.

  • Deployment Reference – The deployment to be scaled according to the metrics configuration
  • Targets – The metrics configuration
  • Min Replicas – The minimum number of replicas for the deployment; HPA will not scale down below this
  • Max Replicas – The maximum number of replicas for the deployment; HPA will not scale up beyond this
  • Desired Replicas – The number of replicas currently needed. This is not configurable; HPA calculates it from the targets.
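
As a rough illustration of how the desired replica count relates to the other elements, the Kubernetes documentation gives the core formula as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with the result kept between the minimum and maximum replicas. The sketch below implements just that; the numbers are hypothetical, and the real controller additionally applies a tolerance, stabilization windows, and pod readiness handling that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_value: float,
                     target_value: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Core HPA calculation, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * (current_value / target_value))
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical example: 4 pods, each seeing 30 requests/second on average,
# with a target of 20 requests/second per pod and bounds of 2 to 10 replicas.
scaled_up = desired_replicas(4, 30.0, 20.0, min_replicas=2, max_replicas=10)

# A large spike is capped by Max Replicas; very low traffic stops at Min Replicas.
capped = desired_replicas(4, 100.0, 20.0, min_replicas=2, max_replicas=10)
floored = desired_replicas(4, 1.0, 20.0, min_replicas=2, max_replicas=10)
```

In the first case the load is 1.5 times the target, so 4 pods become 6. The other two cases show why the upper and lower limits matter: they bound the cost of a spam surge and keep a baseline of capacity when traffic is quiet.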


Effective Scaling

HPA has scaled the pods based on the RPM, and the scaling was very effective.

Cost savings

We saw around 50% cost savings after autoscaling the system. Because pods are added only in response to spikes in RPM, the number of pods running at all times is greatly reduced.

As with RPM, we can also consider scaling on request queue time, another key metric that can drive efficient scaling. Beyond foreground requests, we can also scale background workers with an HPA, based on the count of pending background jobs.
