A fresh view on observability for Kubernetes clusters

In this edition of our #KubernetesLearnings series (earlier blogs here, here and here), we talk about how we configured observability for our Kubernetes clusters and made it co-exist with our existing observability setup.

To give some context on the scale, in the US-East region for one app (not including other dependent services), we have:

  • about 5,000 containers,
  • about 200 billion metrics a month with a cardinality of about 5 million, and
  • about 7.5 TB logs per day.

Our pre-Kubernetes infrastructure (AWS OpsWorks-based) already had a metrics/logging setup:

  1. Metrics: InfluxDB + Telegraf
    a. We had an Envoy sidecar in all our apps, which published all the ingress/egress metrics in StatsD format
    b. EC2 and container metrics were collected/published by a Telegraf daemon
    c. We also pulled some metrics from AWS CloudWatch
    d. We had Grafana dashboards to visualize these metrics
    e. Alerts were configured using a TICKscript-based setup
  2. Logging: Custom solution built using Elasticsearch
    a. All app and container logs were ingested
    b. All EC2 system service logs were also ingested via the same pipeline
    c. We had log rotation/retention configured for all these logs (upload to S3 bucket, etc.)
  3. An internal central team provided/managed metrics and logging as a single service. It was built for multi-tenancy and high availability.

Additionally, all these metrics and logs carried appropriate tags:

  • Blue/green to track errors/performance during deployment/releases
  • Shell, which is our fundamental primitive for minimizing blast radius and improving availability, along the lines of a cell-based architecture.

Our existing observability was already configured based on these high-level primitives, and was working reasonably well.

When we moved to Kubernetes, we made an important decision early on that this configuration needed to continue to work as it was. Whether a given shell is powered by OpsWorks or Kubernetes is merely an internal detail and should not have to be exposed all the way to the engineer.

Having made that decision, we came up with the following high-level requirements:

  1. Ability to retain existing pre-Kubernetes high-level primitives regardless of the underlying infrastructure used
  2. Must support multiple Kubernetes clusters
  3. Ability to observe metrics/logging when we shift traffic gradually (over multiple weeks) from OpsWorks to Kubernetes
  4. Ability to treat Kubernetes as a first-class citizen:
  • All containers having Prometheus annotations should be automatically scraped
  • Other Kubernetes metrics such as Pods/Deployments, EKS CNI, and CoreDNS must be collected and/or alerted on
  • All metrics must have appropriate tags (shell, blue/green, cluster name, region, availability zone, etc.); a sketch of how such tags can be attached appears below.
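
To make that tagging requirement concrete, here is a minimal sketch (not our exact configuration) of how such tags can be attached at the agent level: Telegraf's [global_tags] section stamps every metric the agent emits, and Telegraf can expand ${VAR} references from the container environment. The tag names and environment variables below are illustrative assumptions.

```yaml
# Illustrative sketch only: tag names and env-var wiring are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf-global-tags
data:
  global_tags.conf: |
    # Telegraf expands ${VAR} from the container environment, so every
    # metric emitted by this agent carries these tags.
    [global_tags]
      shell = "${SHELL_NAME}"
      color = "${DEPLOY_COLOR}"          # blue/green
      cluster = "${CLUSTER_NAME}"
      region = "${AWS_REGION}"
      az = "${AVAILABILITY_ZONE}"
```

The agent's Deployment/DaemonSet spec would then set SHELL_NAME, DEPLOY_COLOR, and the rest as plain environment variables at deploy time.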

1. Architecture

After formulating the requirements, we came up with the following high-level configuration:

1.1 Metrics

We run three kinds of metrics collection agents (all Telegraf-based).

1. At the cluster level, as a Deployment with one replica. This collects the following metrics:

  • Kubernetes objects metrics via kube-state-metrics
  • Kubernetes API Server metrics
  • Pod Prometheus metrics, automatically discovered/scraped via Prometheus annotations (see the annotation sketch after this list)
  • Kubernetes control plane metrics (from API server, managed by EKS)
  • Certain AWS CloudWatch metrics
  • AWS service-level quotas (ENI usage/threshold, etc.) and IP availability for EKS subnets (these are done via the Telegraf exec plugin)
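
As an example of the annotation-driven scraping mentioned above, a workload only needs the conventional prometheus.io annotations on its pod template; the cluster-level agent then discovers and scrapes it automatically (Telegraf's prometheus input supports this pod discovery, and Prometheus uses the same convention via relabeling). The app name, image, and port below are hypothetical.

```yaml
# Hypothetical workload: only the annotations matter for auto-discovery.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        prometheus.io/scrape: "true"    # opt in to scraping
        prometheus.io/port: "9102"      # port serving the metrics endpoint
        prometheus.io/path: "/metrics"  # scrape path
    spec:
      containers:
        - name: sample-app
          image: example.com/sample-app:latest
          ports:
            - containerPort: 9102
```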

2. At each node, as a DaemonSet (a minimal manifest sketch follows this list). This collects the following metrics:

  • Metrics from Kubelet
  • EC2 metrics (CPU, disk, network, memory, etc.)
  • Docker metrics
  • Many node-specific custom metrics, such as conntrack errors (to detect problems such as this) and OOM kills. (These, too, are done via the Telegraf exec plugin.)
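
Here is a minimal sketch of what the node-level agent can look like; the host mounts and HOST_* variables are what let a containerized Telegraf read node-level CPU/disk/memory and Docker metrics. The names, image tag, and the telegraf-node ConfigMap (which would hold the kubelet, system, Docker, and exec inputs) are assumptions, not our exact manifests.

```yaml
# Sketch only: names, image tag, and mounts are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf-node
spec:
  selector:
    matchLabels:
      app: telegraf-node
  template:
    metadata:
      labels:
        app: telegraf-node
    spec:
      containers:
        - name: telegraf
          image: telegraf:1.20
          env:
            # Point Telegraf's system plugins at the host's /proc and /sys.
            - name: HOST_PROC
              value: /host/proc
            - name: HOST_SYS
              value: /host/sys
          volumeMounts:
            - { name: proc, mountPath: /host/proc, readOnly: true }
            - { name: sys, mountPath: /host/sys, readOnly: true }
            - { name: docker-sock, mountPath: /var/run/docker.sock, readOnly: true }
            - { name: config, mountPath: /etc/telegraf }
      volumes:
        - { name: proc, hostPath: { path: /proc } }
        - { name: sys, hostPath: { path: /sys } }
        - { name: docker-sock, hostPath: { path: /var/run/docker.sock } }
        - { name: config, configMap: { name: telegraf-node } }
```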

3. Within the Pod, as a sidecar (a sketch follows this list). This collects the following metrics:

  • StatsD metrics from Envoy (we are yet to move to the Envoy Prometheus endpoint!)
  • Multiple Prometheus metrics from the application (this sidecar collection can be retired once we expose all these metrics via Prometheus annotations)
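
A sketch of that in-Pod sidecar pattern: Envoy sends StatsD to the Telegraf container over localhost, and the same Telegraf instance scrapes the app's Prometheus endpoints. Container names, images, the 8125 port, and the telegraf-sidecar ConfigMap are illustrative assumptions.

```yaml
# Sketch only: app/Envoy containers and names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
spec:
  containers:
    - name: app
      image: example.com/sample-app:latest
    - name: envoy
      image: envoyproxy/envoy:v1.20.0
      # Envoy is configured (not shown) to emit StatsD to 127.0.0.1:8125.
    - name: telegraf
      image: telegraf:1.20
      ports:
        - containerPort: 8125       # inputs.statsd in telegraf.conf
          protocol: UDP
      volumeMounts:
        - name: telegraf-config
          mountPath: /etc/telegraf
  volumes:
    - name: telegraf-config
      configMap:
        name: telegraf-sidecar      # holds the statsd + prometheus inputs
```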

1.2 Logging

We run two kinds of log collection agents (both Filebeat-based):

1. One as a DaemonSet on each node (a minimal sketch follows this list). This manages the following logs:

  • EC2 system logs
  • Kubelet logs
  • Container logs (including app logs sent via stdout/stderr)
  • Log rotation, retention, S3 upload, etc.
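
A minimal sketch of the node-level log agent, assuming Filebeat's container input and the add_kubernetes_metadata processor; the names, image tag, and mounts are illustrative.

```yaml
# Sketch only: names and image tag are assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat-node
spec:
  selector:
    matchLabels:
      app: filebeat-node
  template:
    metadata:
      labels:
        app: filebeat-node
    spec:
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:7.17.0
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
            # With the Docker runtime the actual log files live here,
            # behind the /var/log/containers symlinks.
            - name: dockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: config
              mountPath: /usr/share/filebeat/filebeat.yml
              subPath: filebeat.yml
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: dockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: config
          configMap:
            name: filebeat-node   # container input on /var/log/containers/*.log,
                                  # plus add_kubernetes_metadata for pod tags
```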

2. Within the Pod, as a sidecar (a sketch follows this list). This manages the following logs:

  • Unstructured application logs. Some of our apps produce a high volume of logs. From our experiments, logging them to stdout and ingesting via the Docker logging driver resulted in high CPU usage, so we decided not to do that. We could have used a hostPath volume and configured the DaemonSet to collect logs from that directory, but then we would have lost important tags. So we decided to go with a sidecar.
  • Log rotation, retention, S3 upload, etc.
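
A sketch of that sidecar pattern: the app writes its high-volume unstructured logs to a shared emptyDir volume instead of stdout, and a Filebeat sidecar in the same Pod tails that directory, so the logs keep the Pod's tags without going through the Docker logging driver or a hostPath. Paths, names, and images are assumptions.

```yaml
# Sketch only: paths, names, and image tags are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: sample-app
spec:
  containers:
    - name: app
      image: example.com/sample-app:latest
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app     # app writes files here, not to stdout
    - name: filebeat
      image: docker.elastic.co/beats/filebeat:7.17.0
      volumeMounts:
        - name: app-logs
          mountPath: /var/log/app
          readOnly: true
        - name: filebeat-config
          mountPath: /usr/share/filebeat/filebeat.yml
          subPath: filebeat.yml
  volumes:
    - name: app-logs
      emptyDir: {}                    # shared between app and sidecar
    - name: filebeat-config
      configMap:
        name: filebeat-sidecar        # log input on /var/log/app/*.log
```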

2. Deployment

Our monitoring deployment is completely managed via a GitOps model using Kustomize. (We will have another post in this series later to explain our GitOps model.)

It looks something like the layout below (leaving out non-metrics/logging-related items). We define one base monitoring/logging setup, which is used across all AWS regions where we operate, with each region overriding only the parts it needs (like the metrics publish endpoint). This way we have a consistent, homogeneous setup globally.
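
As a rough sketch of that layout (directory and file names are assumptions, not our actual repo), each region overlays the shared base and patches only the region-specific bits, such as the metrics publish endpoint:

```yaml
# monitoring/base/kustomization.yaml -- shared, region-agnostic setup
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - telegraf-cluster-deployment.yaml
  - telegraf-node-daemonset.yaml
  - filebeat-node-daemonset.yaml
---
# monitoring/overlays/us-east/kustomization.yaml -- per-region overrides only
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patchesStrategicMerge:
  - telegraf-output-endpoint.yaml   # region-specific metrics publish endpoint
```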

Here are some of our dashboards (picked randomly):

  • Node dashboard

Overall CPU and memory usage, with a breakdown by Pod, also showing container throttling, if any (we try to bin-pack/over-commit, so keeping an eye on this metric is important). Other parts of the dashboard, not shown here, cover similar metrics (network, disk, OOM kills, page faults, context switches, etc.).

  • We use the EKS CNI as our Kubernetes CNI plugin, so we track some of its important metrics (like how many IPs are available in each subnet).

  • We used this dashboard (below) when migrating from OpsWorks to Kubernetes.

The image below shows one such app migration in the US-East region.

A view to the future

With the volume of metrics we are generating, we seem to be hitting InfluxDB's limits, and we are actively looking into how to solve that problem.

As we slowly transition into the microservice world, we are looking to add support for tracing, as logging and metrics alone may not be enough in all circumstances.

Once we completely migrate to Kubernetes, we plan to move away from StatsD to Prometheus, as it is a first-class citizen in the Kubernetes world (auto-discovery, scraping by annotation, etc.).

When we configured the above monitoring setup, we had to re-create many dashboards from scratch, loosely based on the CoreOS Prometheus dashboards, since InfluxDB doesn't support PromQL. We are hoping to move to InfluxDB's native PromQL support, as we could then use open-source dashboards as they are.