The Hidden Problem with Kafka Lag Monitoring

Your alert fires at 3 AM: "Consumer group orders-processor has 50,000 messages lag." You race to your laptop, anxious. Is this 10 seconds, 10 minutes, or 10 hours behind? The alert doesn't say. You check the dashboard - still 50,000 messages.
But what does that actually mean for your users?
The metric everyone tracks - offset lag - tells you almost nothing about actual consumer health. A lag of 1,000 messages could represent 1 second or 1 day of delay, depending on your topic's throughput. Yet this is the number plastered across dashboards everywhere.
The metric that actually matters is time lag - how many seconds behind your consumer really is. But most monitoring tools don't provide it accurately, or at all.
Let me explain why, and what you can do about it.
The offset lag illusion
Let's start with the basics. Offset lag is calculated as:
offset_lag = latest_offset - committed_offset
Where:
- latest_offset is the position of the next message that will be written (high watermark)
- committed_offset is where your consumer has processed up to
Simple, right? Two numbers, subtract one from the other, done.
This simplicity is exactly why every monitoring tool shows it. The Kafka Admin API gives you both numbers directly - no message fetching required, no timestamp parsing, just metadata.
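To make the "just metadata" part concrete, here is a minimal sketch of the calculation, assuming Python and the confluent-kafka client purely for illustration; the broker address, topic, and group name are placeholders.

```python
# Sketch: offset lag from metadata only - no message is ever fetched.
# Assumes the group has committed offsets for this partition.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orders-processor",     # the group whose lag we inspect
    "enable.auto.commit": False,
})

tp = TopicPartition("orders", 0)
_, latest_offset = consumer.get_watermark_offsets(tp, timeout=10)   # high watermark
committed_offset = consumer.committed([tp], timeout=10)[0].offset

print("offset lag:", latest_offset - committed_offset)
```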
But here's where things fall apart.
Offset lag is completely disconnected from time. Consider these two scenarios:

- Topic A handles 50,000 messages per second. A lag of 50,000 messages means the consumer is roughly one second behind.
- Topic B receives a handful of messages per day. A lag of just 50 messages means the consumer hasn't processed anything in days.

Same number on the dashboard, vastly different severity. That 3 AM alert for 50,000 messages? On a busy topic, your on-call engineer might lose sleep over nothing. On a quiet topic, even 50 messages could mean your consumer has been dead for days.
The idle producer trap
It gets worse. What happens when a producer stops sending messages?
Your offset lag stays flat. The committed offset doesn't move because there's nothing new to consume. The latest offset doesn't move because nothing's being produced. The delta? Zero change.
Meanwhile, the actual delay keeps growing. If that last message was produced an hour ago and your consumer is still sitting at that offset, you're an hour behind in real time - even though your offset lag says everything is fine.
I've seen teams get burned by this exact scenario. A consumer on a low-traffic topic fed by a once-daily batch job showed "stable lag" for a week. It turned out the consumer had crashed, but since the producer only sent messages once a day, the lag number barely moved. Users discovered the problem when their daily reports stopped arriving.
Why time lag is hard to get right
If time lag is so important, why doesn't every tool show it?
Because calculating it properly is expensive and tricky.
Time lag is defined as:
time_lag = current_time - timestamp_of_message_at_committed_offset
To get that timestamp, you need to actually read the message at the committed offset. That means:
- Creating a consumer
- Seeking to the committed offset
- Fetching the message
- Parsing its timestamp
- Doing this for every partition with lag
For a cluster with hundreds of consumer groups across thousands of partitions, this adds up fast. Most tools avoid it entirely.
The interpolation shortcut (and why it fails)
Some tools try to be clever. The popular kafka-lag-exporter (written in Scala) builds a lookup table of (offset, timestamp) pairs over time. When you need to estimate the timestamp for a given offset, it interpolates between known points.
Sounds reasonable. Here's why it breaks:
- Cold start problem: The lookup table starts empty. Until enough data points accumulate, the metric effectively has no value.
- Idle producer edge case: When production stops, the lookup table compresses repeated offsets into a flat segment. Interpolating within a flat segment causes division by zero (dy = 0 in the prediction formula), resulting in NaN for time lag.
- Variable throughput: Linear interpolation assumes constant message rate between data points. Bursts occurring within a single poll interval (default 30s) lead to inaccurate time estimates, as the actual non-linear production pattern isn't captured.
When interpolation fails (cold start, insufficient data), the system reports NaN for time lag. In dashboards, this appears as missing data points rather than zero. The offset lag metric (which doesn't rely on interpolation) remains accurate even when time lag cannot be calculated.
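To see the failure mode concretely, here is a toy version of offset-to-timestamp interpolation; the table layout and function name are illustrative, not kafka-lag-exporter's actual code.

```python
# Toy linear interpolation over (offset, timestamp_ms) samples collected each poll.
def interpolate_timestamp(table, offset):
    for (o1, t1), (o2, t2) in zip(table, table[1:]):
        if o1 <= offset <= o2:
            dy = o2 - o1                  # offset delta between samples
            if dy == 0:
                return float("nan")       # idle producer: flat segment, no slope
            return t1 + (offset - o1) * (t2 - t1) / dy
    return float("nan")                   # cold start: offset outside the table

# Producer went idle: the latest offset sits at 100 across three polls.
table = [(100, 1_000), (100, 31_000), (100, 61_000)]
print(interpolate_timestamp(table, 100))  # nan -> time lag becomes a missing data point
```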
The right way: Direct timestamp sampling
There's no shortcut to accuracy. If you want real time lag, you need to read actual message timestamps.
Here's the approach implemented in klag-exporter:
For each partition with lag > 0:
- Create a consumer and seek to the committed offset
- Fetch one message
- Extract its timestamp
- Calculate: time_lag = now - message_timestamp
- Cache the result to avoid hammering Kafka
The key insight is that you only need to fetch timestamps for partitions that actually have lag. A consumer that's caught up (lag = 0) doesn't need a timestamp - we know it's current.
For partitions with lag, the timestamp fetch is cached with a TTL. If the consumer hasn't moved (same committed offset), the cached timestamp is still valid. Only when the consumer commits a new offset do we fetch a fresh one.
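Here is a minimal sketch of that sampling-and-caching loop, written in Python with the confluent-kafka client purely for illustration (klag-exporter itself is Rust); sample_time_lag, the cache layout, and the TTL value are made up for the example.

```python
import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "lag-sampler",            # separate group; offsets are never committed
    "enable.auto.commit": False,
})

# (topic, partition, committed_offset) -> {"ts": message timestamp ms, "fetched_at": ms}
timestamp_cache = {}
CACHE_TTL_MS = 60_000

def sample_time_lag(topic, partition, committed_offset, offset_lag):
    """Time lag in seconds for one partition; 0.0 if the consumer is caught up."""
    if offset_lag <= 0:
        return 0.0                         # caught up: no timestamp fetch needed

    now_ms = int(time.time() * 1000)
    key = (topic, partition, committed_offset)
    cached = timestamp_cache.get(key)
    if cached and now_ms - cached["fetched_at"] < CACHE_TTL_MS:
        return (now_ms - cached["ts"]) / 1000.0   # lag keeps growing from the cached timestamp

    # Seek to the committed offset and read exactly one message.
    consumer.assign([TopicPartition(topic, partition, committed_offset)])
    msg = consumer.poll(timeout=5.0)
    consumer.unassign()
    if msg is None or msg.error():
        return float("nan")                # couldn't sample this round

    _, ts_ms = msg.timestamp()             # (timestamp_type, value in milliseconds)
    timestamp_cache[key] = {"ts": ts_ms, "fetched_at": now_ms}
    return (now_ms - ts_ms) / 1000.0
```

Because the cache key includes the committed offset, a new commit is automatically a cache miss, and the cached timestamp only short-circuits the Kafka fetch, not the lag calculation itself.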
This approach gives you:
- Accurate time lag regardless of topic throughput
- Correct behavior for idle producers (time lag keeps growing as it should)
- Bounded Kafka load through caching
What about edge cases?
Two situations can throw off timestamp accuracy:
Log compaction: If the topic uses cleanup.policy=compact, the message at your committed offset might have been compacted away. Kafka returns the next available message, which has a later timestamp. Your reported lag will be slightly lower than reality.
Retention deletion: If your committed offset points to a message that was deleted by retention policy, the same problem occurs - Kafka returns a newer message.
Both cases result in understated lag (showing less delay than actual). A proper implementation detects these conditions and flags them so you know the number might be off.
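A sketch of how that detection can work, reusing msg and committed_offset from the sampling example above (the function name is illustrative, not klag-exporter's API): if the broker hands back a message with a higher offset than the one you asked for, the original message is gone and the sampled timestamp belongs to something newer.

```python
def timestamp_is_skewed(msg, committed_offset):
    """True if compaction or retention removed the message at committed_offset."""
    # The broker returned the next surviving message, so the sampled timestamp
    # is newer than the real one and the reported time lag understates the delay.
    return msg.offset() > committed_offset
```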
TL;DR: Time lag requires reading actual messages. Interpolation is tempting but fails for idle/low-throughput topics. Direct sampling with caching gives you accurate numbers without killing your brokers.
The metrics you actually need
Here's what your Kafka monitoring should include:
Per partition - for debugging specific issues
kafka_consumergroup_group_lag{group="orders", topic="orders", partition="0"} 1523
kafka_consumergroup_group_lag_seconds{group="orders", topic="orders", partition="0"} 12.5
Per group - for alerting
kafka_consumergroup_group_max_lag{group="orders"} 2847
kafka_consumergroup_group_max_lag_seconds{group="orders"} 45.2
Offset lag (group_lag) is still useful—it tells you how many messages need processing. But it should be paired with time lag (group_lag_seconds) to understand severity.
For alerting, use max_lag_seconds across all partitions. This gives you the worst-case delay for the group:
Alert when any partition is more than 60 seconds behind
kafka_consumergroup_group_max_lag_seconds{group="orders-processor"} > 60
This alert fires based on actual delay, not arbitrary message counts. An SLA that says "process orders within 30 seconds" can finally be monitored properly.
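As a concrete example, a Prometheus alerting rule built on that expression could look like the following; the rule name, for: duration, and labels are placeholders, not anything shipped with klag-exporter.

```yaml
groups:
  - name: kafka-lag
    rules:
      - alert: ConsumerGroupBehindSla
        expr: kafka_consumergroup_group_max_lag_seconds{group="orders-processor"} > 60
        for: 2m                       # require the delay to persist before paging
        labels:
          severity: page
        annotations:
          summary: "orders-processor is {{ $value }}s behind its 60s SLA"
```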
Introducing klag-exporter
We built klag-exporter to solve these problems. It's written in Rust for performance and uses direct timestamp sampling instead of interpolation.

Key differences from existing tools:

- Language: klag-exporter is written in Rust; kafka-lag-exporter is Scala.
- Time lag calculation: klag-exporter samples actual message timestamps; kafka-lag-exporter interpolates from an (offset, timestamp) lookup table.
- Idle producers: klag-exporter keeps reporting growing time lag; interpolation yields NaN and missing data points.
The exporter also detects compaction and retention edge cases, flagging affected partitions so you know when to take the time lag with a grain of salt.
Quick start with Docker
Let's start with the simplest possible setup and build from there.
Minimal setup
- Create a basic config

- Run the exporter

Replace your-kafka-network with whatever Docker network your Kafka is on, and update bootstrap_servers to point to your broker(s).
- Verify it works

You should see metrics like:
kafka_consumergroup_group_lag{cluster_name="my-cluster",group="my-group",topic="my-topic",partition="0"} 1523
kafka_consumergroup_group_lag_seconds{cluster_name="my-cluster",group="my-group",topic="my-topic",partition="0"} 12.5
Complete demo stack
Want to see everything working together - Kafka, producers, consumers, Prometheus, and Grafana - with pre-built dashboards? The repo includes a ready-to-run demo:


The demo includes producers and consumers that generate realistic lag patterns, so you can immediately see time lag and offset lag metrics in action.
Configuration deep dive
- Exporter settings

Granularity trade-offs:

For most setups, start with partition granularity. You can always aggregate in Prometheus if needed.
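For example, rolling partition-level lag up to per-topic totals is a one-line PromQL query (metric and label names taken from the examples above):

```
sum by (cluster_name, group, topic) (kafka_consumergroup_group_lag)
```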
- Cluster configuration

The whitelist/blacklist patterns are regular expressions. They're applied in order: whitelist first, then blacklist removes matches.
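In other words, the selection logic behaves like this sketch (plain Python for illustration; the patterns are examples, not defaults):

```python
import re

whitelist = [re.compile(r"^orders-.*")]
blacklist = [re.compile(r".*-staging$")]

def keep_group(group):
    if whitelist and not any(p.search(group) for p in whitelist):
        return False                                    # not whitelisted
    return not any(p.search(group) for p in blacklist)  # drop blacklisted matches

print([g for g in ["orders-processor", "orders-staging", "billing"] if keep_group(g)])
# -> ['orders-processor']
```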
Authentication methods
klag-exporter supports all standard Kafka authentication mechanisms through consumer_properties.
SASL/PLAIN (username/password):
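The exact shape of consumer_properties depends on klag-exporter's config schema, so treat this as a rough sketch: these are the standard librdkafka property names for SASL/PLAIN, on the assumption that the exporter (being Rust) passes them straight through to the Kafka client.

```
security.protocol = "SASL_SSL"
sasl.mechanism = "PLAIN"
sasl.username = "${KAFKA_USERNAME}"
sasl.password = "${KAFKA_PASSWORD}"
```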

SASL/SCRAM (with TLS):

mTLS (mutual TLS, no username):

AWS MSK with IAM:

Environment variable substitution
All config values support ${VAR_NAME} syntax - no need to bake secrets into the config file:
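For example, a secret can stay in the environment instead of the config file (the property name is reused from the SASL sketch above, purely as an illustration), with KAFKA_PASSWORD exported in the exporter's environment:

```
sasl.password = "${KAFKA_PASSWORD}"
```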

Kubernetes deployment
For production, you'll typically run klag-exporter in Kubernetes. Our GitHub repo includes ready-to-use Helm charts to deploy klag-exporter in your cluster.
Grafana dashboard
Importing the dashboard
Option A: Copy the JSON file
From the repo

Option B: Import via Grafana UI
- Go to Dashboards → Import
- Upload kafka-lag.json or paste its contents
- Select your Prometheus datasource
- Click Import
Key panels
The included dashboard has several sections:
Overview stats:
- Total Consumer Lag (sum across all groups)
- Max Time Lag (worst-case delay)
- Exporter Status (UP/DOWN)
- Scrape Duration (collection performance)
- Consumer Groups (count of monitored groups)
Time series:
- Consumer Group Total Lag - trend over time
- Time Lag in Seconds - the metric that matters
- Per-Partition Lag - for debugging specific issues
Throughput & distribution:
- Partition Latest Offsets - distribution across partitions
- Topic Message Rate - produce throughput over time
Dashboard variables
The dashboard includes variables for filtering:

Use these to drill down when investigating issues.
Multi-cluster setup
Klag-exporter can monitor multiple Kafka clusters from a single instance. Just add more [[clusters]] blocks:
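A hedged sketch of what that looks like: the [[clusters]] header comes from the sentence above, and cluster_name / bootstrap_servers mirror the metric label and setting mentioned earlier; other keys are omitted because they depend on klag-exporter's actual schema.

```toml
[[clusters]]
cluster_name = "prod-eu"
bootstrap_servers = "kafka-eu-1:9092,kafka-eu-2:9092"

[[clusters]]
cluster_name = "prod-us"
bootstrap_servers = "kafka-us-1:9092,kafka-us-2:9092"
```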

Each cluster gets its own cluster_name label in metrics, so you can filter by cluster in Grafana and alerts.
When to use one exporter vs. many:

OpenTelemetry integration
Klag-exporter can also push metrics via OTEL in addition to (or instead of) the Prometheus endpoint.

This is useful if you're using Datadog, New Relic, Honeycomb, or other backends that support OTEL ingestion.
The test stack includes an OpenTelemetry Collector that receives these metrics and exports them to both debug output (logs) and a Prometheus endpoint.
Troubleshooting
Exporter shows no metrics
- Check connectivity to Kafka:

- Check exporter logs:

- Verify authentication:
Look for SASL or SSL errors in logs. Double-check credentials and certificates.
Metrics exist but no consumer groups
- Check your whitelist/blacklist patterns: If you have restrictive patterns, groups might be filtered out.
- Verify consumer groups exist: kafka-consumer-groups --bootstrap-server kafka:9092 --list
- Check if consumers have committed offsets: klag-exporter only shows groups that have committed offsets.
Time lag is always 0
- Verify timestamp sampling is enabled:

- Check if consumers have lag: If group_lag is 0, group_lag_seconds will also be 0 (caught up = no time lag).
Grafana shows "No data"
- Check Prometheus targets: http://localhost:9090/targets - is klag-exporter UP?
- Verify the datasource: Grafana → Configuration → Data Sources → Test
- Check metric names: The dashboard expects specific metric names. If you customized them, update the queries.
What you should have now
If you followed this guide, you've got:
- klag-exporter collecting both offset lag and time lag metrics
- Prometheus scraping and storing those metrics
- Grafana dashboards showing lag trends and distributions
- Alerts configured for your SLAs
- Awareness of edge cases (compaction/retention detection)
Your on-call engineers can now respond to "Consumer X is 45 seconds behind" instead of "Consumer X has 50,000 messages lag" - a far more actionable alert.
Key takeaways
- Offset lag tells you how many messages behind; time lag tells you how far behind in real time. Both matter, but time lag is what your SLAs are measured in.
- Interpolation-based tools fail for idle or low-throughput topics. They report no time lag at all (NaN) when you might be hours behind.
- Direct timestamp sampling is more accurate but needs caching. Fetch only for partitions with lag, and cache results with a TTL.
- Alert on time lag, not offset lag. Your on-call engineers will thank you when they can immediately understand severity.
If you're only monitoring offset lag today, you're flying partially blind. Adding time lag to your dashboards is the single most impactful improvement you can make to your Kafka monitoring.
The messages aren't going anywhere. But your users are waiting - and they notice the delay.
Reviewed by: Grzegorz Kocur
