At Gojek, Kafka serves as the backbone of all our microservices, and knowing how it is performing every minute enables us to improve the overall experience for our end consumers. At Gojek, we use Ziggurat
to consume events from Kafka.
Although Ziggurat publishes its own metrics like ingestion lag and throughput, these metrics are restricted to the application level. We had been lacking a peek into the internals of the Kafka Streams threads and tasks running in each of the VMs and pods across Gojek. To get visibility at a stream-thread or stream-task level, we had to read the metrics published by the Kafka Streams client itself and push them to our monitoring backend.
We have had numerous production issues where we were not fully aware of what was happening inside the Kafka Streams clients running on our VMs or pods. We would often observe lag on one particular partition without knowing which stream thread or stream task it belonged to. Digging this information out of logs is tedious and time-consuming.
We wanted a one-stop monitoring dashboard that would give us all the information about a Streams client without having to search through the logs.
Kafka Streams client metrics to the rescue
The Kafka Streams client exposes a range of useful metrics, from the number of alive stream threads to process rates and end-to-end latencies. The metrics have three recording levels: info, debug, and trace.
Only the task-level and processor-node-level metrics require the debug recording level. The level can be controlled with the metrics.recording.level="<level>" property in the Streams config.
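As a minimal sketch of setting that property in code (the application id and bootstrap servers below are placeholders, not values from our setup):

```java
import java.util.Properties;

public class StreamsMetricsConfig {
    public static Properties buildProps() {
        Properties props = new Properties();
        // Placeholder application id and bootstrap servers
        props.put("application.id", "my-streams-app");
        props.put("bootstrap.servers", "localhost:9092");
        // Enable task- and processor-node-level metrics
        props.put("metrics.recording.level", "debug");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        System.out.println(props.getProperty("metrics.recording.level"));
    }
}
```

These properties would then be passed to the KafkaStreams constructor as usual.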
These metrics are exposed via JMX, which requires some setup before they can be visualised on a dashboard.
How to publish and visualise the Kafka Stream metrics
Pre-requisites to publish and visualise the metrics
- Enabling JMX for the Kafka Streams app
- Telegraf agent with Jolokia2 Plugin and Prometheus remote write Plugin
- Prometheus deployment
- Grafana deployment
Modify the Java App startup command to expose JMX metrics
/usr/bin/java \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.local.only=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -javaagent:/opt/jolokia.jar=port=7777,host=127.0.0.1 \
  -server -jar /opt/jmx_metrics_test/jmx_metrics_test/<app_name>.jar
Telegraf setup
Add the following config to the jolokia2.conf file
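The metric definitions below assume a parent jolokia2_agent input that points Telegraf at the Jolokia endpoint the app exposes; a minimal sketch (the URL matches the port used in the javaagent flag above):

```toml
[[inputs.jolokia2_agent]]
  urls = ["http://127.0.0.1:7777/jolokia"]
```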
# For exporting Stream Client level metrics
[[inputs.jolokia2_agent.metric]]
name = "kafka_streams"
mbean = "kafka.streams:type=stream-metrics,client-id=*"
tag_keys = ["client-id"]

# For exporting Stream Thread level metrics
[[inputs.jolokia2_agent.metric]]
name = "kafka_threads"
mbean = "kafka.streams:type=stream-thread-metrics,thread-id=*"
tag_keys = ["client-id","thread-id"]

# For exporting Stream Task level metrics
[[inputs.jolokia2_agent.metric]]
name = "kafka_tasks"
mbean = "kafka.streams:type=stream-task-metrics,thread-id=*,task-id=*"
tag_keys = ["client-id","thread-id","task-id"]
The name property in the Jolokia config maps to the metric name in your timeseries database. For a complete list of metrics, you can refer to this page.
Depending on which timeseries database you use, configure the output plugins accordingly:
[[outputs.influxdb]]
database = "infra-monitoring-integration"
retention_policy = ""
timeout = "5s"
url = "http://<influx_host>:8086"
write_consistency = "any"

[[outputs.prometheus_remote_write]]
bearer_token = "XXXXXXXX"
retry_for_client_errors = false
url = "https://<prom_host>/v1/prom/metrics"
Creating the Grafana Dashboard
With the metrics flowing into Prometheus, the thread- and task-level metrics for each Streams client can be visualised in Grafana.
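As an illustrative sketch (the exact series and label names are assumptions here, since they depend on how Telegraf and the remote-write path sanitise the metric names), a Grafana panel charting per-thread processing rate might use a query along these lines:

```promql
# Assumed series name: the "kafka_threads" input name from the
# Telegraf config combined with the process-rate attribute
kafka_threads_process_rate{thread_id=~"$thread"}
```

A dashboard templated on the client-id, thread-id, and task-id tags exported above then lets you drill down from a lagging partition to the exact stream thread or task handling it.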