Building an app with a focus on observability

Applications deployed in the cloud keep growing in size and complexity, so DevOps engineers need to make sure their applications run as efficiently as possible. The problems, solutions and best practices for keeping an application efficient fall under several categories.

How do you get proper visibility into your applications?

Let’s take a look at methods to improve the observability of your application.

Observability

Observability is the measure of how well your application’s state can be understood, and how clearly inferences can be drawn from it. Design your application so that it provides all the information you need to know what is happening at all times.

To improve observability, make your application debuggable by working on:

  • Metrics
  • Logs
  • Traces

and integrate these into a monitoring + alerting system.

1: Debuggability

Large applications can be difficult to debug. When you’re serving millions of requests that generate millions of lines of unstructured logs, how do you debug such an application?

Build debuggability in right at the design stage of your application. Ensure it generates enough data to help you identify what is going wrong, where, and why. Go through the checklist below to see what you have already done and what you still need to do.

  • Metrics: Identify patterns in your application that will help you understand its state, and record them as structured logs. This could be a per-request log or a periodic state log. Ensure all metrics applicable to your application’s state are included, e.g. response code, response time, custom identifiers, etc.
  • Identifiers: Use X-Request-ID or X-Correlation-ID headers in your requests to correlate requests between client and server. Include a UUID (Universally Unique Identifier, a 128-bit value generated at the client end) in this header for each request. Include it in your request-specific structured log, too, to help track the lifecycle of requests and to support tracing.
  • Key Performance Indicators: Define the important modules and key transactions of your application, establish rules for their performance, and log those metrics to get insights into how well your application is performing.
  • Logs: Enable logging for all the components in your architecture: load balancer, proxy server, web and application servers, background server, database and network. For security, ensure your customers’ data and any other sensitive data is not written in plain text to logs or pushed to any third-party log management service.
  • Log management: Collect these logs in a log management tool by running the tool’s agent on your servers; this helps with analysis, aggregation and reporting. These tools come with query support, which you can use to build dashboards, configure reports for known patterns and configure alerts for anomalies in the reports. Choose a log management tool and check its documentation for installing and configuring the agent.
  • Tracing: Make sure your application provides information about the flow of transactions. Propagate the UUID of your request through all the transactions it goes through. This will help you debug specific requests and track what went wrong where.
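
The identifier and structured-log points above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the log field names and the use of a plain dict for headers (with the exact key `X-Request-ID`) are all assumptions made for the example.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


def request_id_from_headers(headers):
    """Reuse the caller's X-Request-ID if present, otherwise mint a new UUID.

    Assumes `headers` is a plain dict keyed exactly by "X-Request-ID".
    """
    return headers.get("X-Request-ID") or str(uuid.uuid4())


def build_request_log(request_id, method, path, status, duration_ms):
    """Return one structured log record for a single request.

    The field names are illustrative; use whatever your log tool expects.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "method": method,
        "path": path,
        "response_code": status,
        "response_time_ms": duration_ms,
    }


def log_request(logger, record):
    """Emit the record as one JSON line so a log tool can parse and query it."""
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


def outgoing_headers(request_id):
    """Headers to attach to downstream calls so the same ID follows the request."""
    return {"X-Request-ID": request_id}
```

Because every log line and every downstream call carries the same `request_id`, searching the log management tool for that one value returns the full lifecycle of a request.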

If you generate lots of logs, make sure log rotation (for example, with logrotate) is in place. Review its rules for rotation frequency and file size, and make sure they suit your servers, for example the disk space available.
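
logrotate itself is configured on the server, outside the application. If the application writes its own log files, a comparable size-based policy can be sketched in-process with `RotatingFileHandler` from Python’s standard library. This is an alternative shown for illustration, not logrotate; the logger name, the size limit and the number of retained files are assumptions.

```python
import logging
import logging.handlers
import os
import tempfile


def make_rotating_logger(path, max_bytes, backup_count):
    """Build a logger that rolls over near max_bytes and keeps backup_count old files."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backup_count
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger("app.rotating")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    logger.addHandler(handler)
    return logger, handler


def demo(directory):
    """Write far more than the size limit and return the files left on disk."""
    path = os.path.join(directory, "app.log")
    logger, handler = make_rotating_logger(path, max_bytes=200, backup_count=2)
    for i in range(100):
        logger.info("request %03d handled", i)
    handler.close()
    logger.removeHandler(handler)
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Only the current file and the two most recent backups survive.
        print(demo(tmp))
```

For rotation by time rather than size, the same module provides `TimedRotatingFileHandler`.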

2: Monitoring

Continuously gather metrics about your application and infrastructure in a monitoring system. Explore all the monitoring tools available and choose the one that suits you best in features, cost and platform support. Complete the setup needed for the tool and start posting metrics from your application.

Steps to set up a proper monitoring system:

  • Define metrics for the proper functioning of your application (e.g. uptime, request rate, response time, error rate, etc.) and infrastructure (e.g. server health metrics like CPU, memory and storage, and process health metrics like restarts, OOM alerts, etc.). Regularly review and evolve the metrics.
  • Review your code and infrastructure architecture and ensure the metrics cover all your modules and resources.
  • Define thresholds for the metrics. These thresholds should make it clear what normal functioning looks like and what the various levels of degradation are.
  • Include business-level targets such as SLOs and SLAs wherever applicable, and let them define what acceptable performance means for your application.
  • Identify ways to collect the defined metrics in your monitoring tool and to check that they stay within the thresholds. Call out patterns and anomalies.
  • Create queries and build dashboards for visualizing the evolution of your application’s state.
  • Health checks: Expose a lightweight health check endpoint that responds with just a 200 and does no processing. Make health check requests from your monitoring system to individual servers to keep track of active servers, and build a recovery mechanism for servers that go down. With this, you can also make sure requests aren’t forwarded to non-responsive servers.
  • Check whether your tool supports transaction monitoring and distributed tracing, and integrate it to monitor your requests and transactions.
  • Tag all deployments and releases in your monitoring system. With this, you can check the effect of each release by comparing the metrics before and after it.
  • Set up an alerting system with the monitoring in place. Write down the conditions each metric must meet; when they are not met, tag the violation as an incident and send out an alert about it.
  • Set up an incident management plan based on the alerts created. Track the incidents that arise from alerts to learn which issues occur frequently. Set up escalation channels and resolution steps for known incidents.
  • Ensure every infrastructure component of your architecture has some form of monitoring in place.
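
The health check item in the list above can be sketched with Python’s standard library alone. The `/healthz` path, the handler and function names and the 2-second timeout are assumptions for illustration; a real monitoring tool would normally do the probing for you.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with an empty 200 and does no other processing."""

    def do_GET(self):
        status = 200 if self.path == "/healthz" else 404
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep frequent health probes out of the access log


# Probe servers directly, ignoring any proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=2.0):
    """Return True only if the server answers 200 within the timeout."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers refused connections, timeouts and HTTP errors
        return False


def active_servers(urls):
    """Keep only the servers whose health check currently passes."""
    return [url for url in urls if probe(url)]


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/healthz"
    print(url, "healthy:", probe(url))
    server.shutdown()
    server.server_close()
```

The list returned by `active_servers` is what a load balancer or recovery job would act on: servers missing from it stop receiving traffic and become candidates for replacement.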

Are we missing something?

You would monitor only the metrics that you know of. What about all the unknowns? Your monitoring system should always be in a state of continuous improvement. The ideal to aim for: there should never be an incident without an alert about it from your monitoring system.

3: Alerting

Tracking all patterns in your monitoring system is not enough. You need to find the anomalies in those patterns and alert on them. Your monitoring tool will typically support this natively. The lifecycle of an alert is as follows:

  1. Your monitoring system continuously checks that metrics are within their thresholds.
  2. In the event of any violation, an alert is triggered with the required information.
  3. A notification about the alert is sent, as configured.

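The three lifecycle steps above can be sketched as two small functions. This is a simplified model, assuming metrics where a higher value is worse and two severity levels; the class names, metric names and channel mapping are hypothetical, and a monitoring tool would normally provide all of this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    warning: float   # at or above this value: degraded
    critical: float  # at or above this value: failing


@dataclass(frozen=True)
class Alert:
    metric: str
    value: float
    severity: str


def evaluate(metrics, thresholds):
    """Steps 1 and 2: compare each metric with its threshold and collect alerts."""
    alerts = []
    for threshold in thresholds:
        value = metrics.get(threshold.metric)
        if value is None:
            continue  # metric not reported in this sample
        if value >= threshold.critical:
            alerts.append(Alert(threshold.metric, value, "critical"))
        elif value >= threshold.warning:
            alerts.append(Alert(threshold.metric, value, "warning"))
    return alerts


def notify(alerts, channels):
    """Step 3: hand each alert to the channel configured for its severity."""
    delivered = []
    for alert in alerts:
        send = channels.get(alert.severity)
        if send is not None:
            send(alert)
            delivered.append(alert)
    return delivered
```

Here `channels` maps a severity to any callable, so the same loop can send an email, place a call or post to a webhook.
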
Alerts can be simple notifications via emails or calls, but there’s more you can do with them.

Configure the notification as a webhook to your incident management system to open an incident. You can also set up various automations upon receiving alerts, depending on the type of the alert. A few ideas for automated alert responses include:

  • Scaling your servers up and down
  • Automatically healing / replacing failed servers
  • Increasing / decreasing resources available in a pool
  • Detecting failures and preventing recurrence (circuit breakers)
  • Switching traffic to alternate servers
  • Blocking malicious traffic
  • Pre-processing alerts to drop false positives

Implementation of the automated response depends on the nature of the alert. Some of these can be received by your application and processed there. Some can be resolved by serverless execution offered by your cloud provider. Some can be automated with your incident management system.
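
As one possible shape for such automation, the sketch below first filters out one-off spikes and then routes each alert type to a handler, falling back to opening an incident. The alert type names, the handlers and the rule of three consecutive violations are assumptions for illustration; real handlers would call your cloud provider or incident management system instead of returning a string.

```python
from collections import defaultdict


class ConsecutiveFilter:
    """Pre-processing step: let an alert through only after N violations in a row."""

    def __init__(self, required):
        self.required = required
        self.streaks = defaultdict(int)

    def observe(self, key, violated):
        """Record one check result and return True when the alert should fire."""
        if not violated:
            self.streaks[key] = 0  # a healthy sample resets the streak
            return False
        self.streaks[key] += 1
        return self.streaks[key] >= self.required


def scale_up(payload):
    """Hypothetical handler: a real one would call a scaling API."""
    return f"scale up pool {payload['pool']}"


def replace_server(payload):
    """Hypothetical handler: a real one would terminate and relaunch the server."""
    return f"replace server {payload['server']}"


def open_incident(payload):
    """Fallback: anything without an automated response becomes an incident."""
    return f"open incident for {payload}"


HANDLERS = {"high_load": scale_up, "server_down": replace_server}


def respond(alert_type, payload, handlers=HANDLERS, fallback=open_incident):
    """Route an alert to its automated response, or to the fallback."""
    return handlers.get(alert_type, fallback)(payload)
```

Keeping the routing in a dictionary means a new automated response is one new entry, and alert types without an entry still reach a human through the fallback.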

Why alert management?

Keeping a record of the alerts you receive in an incident management system is good practice. It helps you track, identify, manage, analyse and correct the alerts generated by your application. A dedicated incident response team with a system for storing and working with alerts lets you improve your solutions, reduce the time taken for resolution and identify patterns in the alerts.

With this in place, you can further work on:

  • Avoiding recurring alerts
  • Identifying new alerts for which automated responses can be built
  • Defining Standard Operating Procedures for alert responses for your incident response team
  • Maintaining root cause analysis of the incidents for audit and tracking
  • Developing a status page with uptime and incident history that can also have public subscriptions
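
Once alerts are stored, finding recurring ones and measuring resolution time are simple aggregations. The sketch below assumes each stored alert is a dict with hypothetical `service`, `name`, `opened_min` and `resolved_min` fields (times in minutes); an incident management system would expose its own schema and reports.

```python
from collections import Counter


def recurring_alerts(history, min_count=2):
    """Return ((service, name), count) pairs seen at least min_count times."""
    counts = Counter((alert["service"], alert["name"]) for alert in history)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


def mean_resolution_minutes(history):
    """Average time to resolve, ignoring alerts that are still open."""
    durations = [
        alert["resolved_min"] - alert["opened_min"]
        for alert in history
        if alert.get("resolved_min") is not None
    ]
    if not durations:
        return None
    return sum(durations) / len(durations)
```

Alerts that top the recurring list are the natural first candidates for an automated response or a Standard Operating Procedure.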

There are various free and paid alert / incident management systems; choose the one that best suits your use case.

To wrap up, making your application observable is important for understanding its state, detecting failures and recovering from them. This post covered various observability practices. Pick the points that apply to your workload and implement them.

Upcoming topics in the series about deploying and scaling your application in production:

  • Availability
  • Scalability
  • Cost optimization and more