How You Can Monitor Self-Hosted Cassandra

Have you ever managed a Cassandra cluster? If the answer is yes, you know the importance of monitoring Cassandra. If not, read through this blog and you will understand it. You will see how to generate, collect, visualize, and set alerts on all metrics related to Cassandra: a complete monitoring solution for a Cassandra cluster.

Cassandra is an open-source NoSQL distributed database that is highly scalable and highly available without compromising performance. Since Cassandra is a complicated, distributed system, managing a self-hosted cluster requires managing everything from low-level bare-metal components up to high-level software. Such a cluster consists of many nodes, so monitoring involves the machines on which Cassandra runs, the JVM environment in which it runs, and the dedicated Cassandra metrics.

Hypertrail Managed Cassandra Cluster

We at Hypertrail have developed a pipeline for monitoring the Cassandra cluster. Hypertrail is a service that stores and retrieves activity timelines, and we use Cassandra as our primary database. To give an idea of our cluster size, we have around 15 nodes in the US region alone, each holding about 1.3 TB of data. We are continuously adding nodes to cater to our growing load. Managing all of this requires a well-managed, scalable monitoring system.

We have built a monitoring system for production-level Cassandra entirely with open-source technologies. Prometheus and Grafana are well-known monitoring solutions. To collect data, we use a Telegraf plugin, and to extract the data from Cassandra, we use Jolokia.

What are Cassandra Metrics?

Cassandra exposes many metrics for performance monitoring, describing how the system and its parts perform. These include the DroppedMessage metrics, Compaction metrics, CommitLog metrics, HintedHandoff metrics, Storage metrics, Cache metrics, and many more. The DroppedMessage metrics track dropped messages for different request types. Cache metrics track the effectiveness of the caches. Compaction metrics are specific to compaction work. These are some of the dedicated Cassandra metrics.

Other metrics that we can monitor are JVM metrics. This would also allow us to track the health of the underlying JVM where Cassandra is running.

Machine-level metrics can be exported through the Prometheus node exporter. Node exporter can expose various hardware and kernel-related metrics for Linux machines. This helps in monitoring infrastructure.

High-level diagram of the monitoring data collection points.

Exporting these metrics to Prometheus would allow us to see how the system performs in real time and extract trends on how the system behaves historically for varying loads.

You can see the list of all the Cassandra metrics in the official Cassandra documentation.

How are Cassandra Metrics exposed?

Like most Java applications, Cassandra exposes metrics on availability and performance via JMX, which is built around MBeans. An MBean is a managed Java object. At the core of JMX is the MBean server, an element that acts as an intermediary between the MBeans and the outside world and mediates all interaction with them. For non-Java applications, Jolokia provides access to the JMX API.

What is Jolokia?

Jolokia is open-source software with two components: the Jolokia agent and the Jolokia client. Since JMX can only be consumed from Java applications, this is where Jolokia proves helpful.

Jolokia Agent

The Jolokia agent can be deployed on a JVM to expose its MBeans through a REST-like HTTP endpoint, making all this information readily available to non-Java applications running on the same host. We will deploy the agent as a normal JVM agent in this post.
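To make the protocol concrete, here is a minimal sketch of what a Jolokia read request and its JSON reply look like. The MBean name is a real Cassandra storage metric, but the response value is made up and no live agent is contacted:

```python
import json

# A Jolokia "read" request for one Cassandra MBean attribute.
request = {
    "type": "read",
    "mbean": "org.apache.cassandra.metrics:type=Storage,name=Load",
    "attribute": "Count",
}

# An illustrative JSON body the agent would return over HTTP (value is made up).
response_body = json.dumps({
    "request": request,
    "value": 1300000000000,  # bytes of data on this node (fabricated figure)
    "status": 200,           # 200 means the read succeeded
})

reply = json.loads(response_body)
assert reply["status"] == 200
print(reply["value"])
```

A real client simply POSTs the request JSON to the agent's HTTP endpoint and parses a reply of this shape.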

Jolokia Client

The Jolokia client listens to the Jolokia agent and generates the metrics for us. We will use the Telegraf Jolokia2 input plugin as the Jolokia client. The plugin communicates with the Jolokia agent and fetches the metrics from the MBean server. Telegraf then exposes Prometheus metrics on a specific endpoint (like localhost:9273/metrics). This endpoint can be configured in the Telegraf output section of telegraf.conf.

[[outputs.prometheus_client]]
  listen = ":9273"

Prometheus can scrape those metrics from the Telegraf output endpoint.

This image depicts a detailed diagram of the Cassandra-Jolokia interaction.


Installing Jolokia as JVM Agent with Cassandra Node

This is very simple. You can follow this blog to install Jolokia. The JVM agent exposes metrics at http://localhost:7777/jolokia.
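For reference, attaching the agent typically means adding a single line to cassandra-env.sh. The jar path below is a placeholder for wherever you placed the Jolokia jar, and port 7777 matches the endpoint above:

```shell
# cassandra-env.sh: attach the Jolokia JVM agent on port 7777
JVM_OPTS="$JVM_OPTS -javaagent:/path/to/jolokia-jvm-agent.jar=port=7777,host=localhost"
```

A restart of the Cassandra process is needed for the agent to load.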

NOTE: The Telegraf jolokia2 input plugin needs POST method access. Please ensure that you add POST to the security policy of the Jolokia agent. Otherwise, you can get a 403 status.

[inputs.jolokia2_agent] Error in plugin: Unexpected status in response from target : 403

Allowing POST in the security policy will solve this.

<http>
  <method>post</method>
</http>

Configure Jolokia2 Input Plugin On Cassandra Node

These metrics are served over HTTP, but they are not Prometheus metrics. To convert the Jolokia agent metrics to Prometheus metrics, we will use Telegraf's Jolokia2 input plugin.

The Jolokia2 input plugin will read from the Jolokia JVM agent at localhost:7777, and Telegraf will expose the resulting metrics at localhost:9273.

For this, install Telegraf on the node and make the changes in telegraf.conf. You can refer to the sample jolokia2 input plugin configuration for Cassandra here.

[[outputs.prometheus_client]]
  listen = ":9273"

[[inputs.jolokia2_agent]]
  urls = ["http://localhost:7777/jolokia"]
  name_prefix = "cassandra_"

  [[inputs.jolokia2_agent.metric]]
    name = "Cache"
    mbean = "org.apache.cassandra.metrics:name=*,scope=*,type=Cache"
    tag_keys = ["name", "scope"]
    field_prefix = "$1_"
You can verify the metrics by running curl localhost:9273/metrics.
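As a quick sanity check of the output format (the sample line below is illustrative; the exact metric names depend on your Telegraf configuration), each line of the Prometheus text exposition splits into a name, labels, and a value:

```python
import re

# One illustrative line as Telegraf's prometheus_client output might emit it.
sample = 'cassandra_Cache_Hits_Count{name="KeyCache",scope="all"} 12345'

# Pattern: metric_name{label="value",...} numeric_value
m = re.match(r'^(\w+)\{(.*)\}\s+(\S+)$', sample)
name, labels_raw, value = m.group(1), m.group(2), float(m.group(3))
labels = dict(re.findall(r'(\w+)="([^"]*)"', labels_raw))

print(name, labels, value)
```

If curl returns lines of this shape, the Telegraf pipeline is working and Prometheus will be able to scrape it.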

Configure Prometheus to scrape

Finally, you can add a scrape job in prometheus.yml to scrape from localhost:9273/metrics.

scrape_configs:
  - job_name: telegraf_metrics
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "localhost:9273"

Add Node Exporter to Prometheus

We have added a node exporter job in the Prometheus config. The node exporter exposes machine- or host-level metrics like CPU, load, memory, and storage, and Prometheus scrapes them. All the metrics the node exporter exposes are helpful in monitoring that particular node's performance.
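For completeness, a scrape job for the node exporter can sit alongside the Telegraf job. The job name here is our choice, and 9100 is the node exporter's default port, not a value taken from our exact config:

```yaml
scrape_configs:
  - job_name: node_exporter
    static_configs:
      - targets:
          - "localhost:9100"   # node exporter default port
```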

Grafana

We have configured Grafana for dashboards. Grafana is a powerful visualization tool with an excellent interface and rich visualization options. We can configure several data sources in it; for our use case, we have only the Prometheus data source. We can configure that and start creating great dashboards in Grafana. You can get started here.


Alerts

Monitoring is incomplete without alerts. Alerting notifies us of problems when they happen and can even indicate potential failures in the future. We use Alertmanager for configuring alerts. It can be configured to send alerts over various mediums like Slack and email, with a lot of flexibility in routing different alerts to different channels.
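As a sketch of such a setup (the receiver name, channel, and webhook URL below are placeholders, not our production values), a minimal Alertmanager route to Slack looks like this:

```yaml
route:
  receiver: slack-alerts
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: "https://hooks.slack.com/services/XXX"  # placeholder webhook
        channel: "#cassandra-alerts"
```

Additional receivers and routing rules can then direct different alert severities to different channels.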

That’s how we have managed our self-hosted Cassandra cluster monitoring.

Further Reading

The author also writes technical blog posts on Blog Your Code.

Design: Sneka Balakrishnan