The Hidden Problem with Kafka Lag Monitoring

Your alert fires at 3 AM: "Consumer group orders-processor has 50,000 messages lag." You race to your laptop, anxious. Is this 10 seconds, 10 minutes, or 10 hours behind? The alert doesn't say. You check the dashboard - still 50,000 messages.
But what does that actually mean for your users?
The metric everyone tracks - offset lag - tells you almost nothing about actual consumer health. A lag of 1,000 messages could represent 1 second or 1 day of delay, depending on your topic's throughput. Yet this is the number plastered across dashboards everywhere.
The metric that actually matters is time lag - how many seconds behind your consumer really is. But most monitoring tools don't provide it accurately, or at all.
Let me explain why, and what you can do about it.
The offset lag illusion
Let's start with the basics. Offset lag is calculated as:
offset_lag = latest_offset - committed_offset
Where:
- latest_offset is the position of the next message that will be written (high watermark)
- committed_offset is where your consumer has processed up to
Simple, right? Two numbers, subtract one from the other, done.
This simplicity is exactly why every monitoring tool shows it. The Kafka Admin API gives you both numbers directly - no message fetching required, no timestamp parsing, just metadata.
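To make the "just metadata" part concrete, here is a minimal sketch of the calculation, assuming Python and the confluent-kafka client purely for illustration; the broker address, topic, and group name are placeholders.

```python
# Sketch: offset lag from metadata only - no message is ever fetched.
# Assumes the group has committed offsets for this partition.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "orders-processor",     # the group whose lag we inspect
    "enable.auto.commit": False,
})

tp = TopicPartition("orders", 0)
_, latest_offset = consumer.get_watermark_offsets(tp, timeout=10)   # high watermark
committed_offset = consumer.committed([tp], timeout=10)[0].offset

print("offset lag:", latest_offset - committed_offset)
```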
But here's where things fall apart.
Offset lag is completely disconnected from time. Consider these two scenarios:

- Topic A handles 50,000 messages per second. A lag of 50,000 messages means the consumer is roughly one second behind.
- Topic B receives a handful of messages per day. A lag of just 50 messages means the consumer hasn't processed anything in days.

Same number on the dashboard, vastly different severity. That 3 AM alert for 50,000 messages? On a busy topic, your on-call engineer might lose sleep over nothing. On a quiet topic, even 50 messages could mean your consumer has been dead for days.
The idle producer trap
It gets worse. What happens when a producer stops sending messages?
Your offset lag stays flat. The committed offset doesn't move because there's nothing new to consume. The latest offset doesn't move because nothing's being produced. The delta? Zero change.
Meanwhile, the actual delay keeps growing. If that last message was produced an hour ago and your consumer is still sitting at that offset, you're an hour behind in real time - even though your offset lag says everything is fine.
I've seen teams get burned by this exact scenario. A consumer on a low-traffic topic fed by a once-daily batch job showed "stable lag" for a week. It turned out the consumer had crashed, but since the producer only sent messages once a day, the lag number barely moved. Users discovered the problem when their daily reports stopped arriving.
Why time lag is hard to get right
If time lag is so important, why doesn't every tool show it?
Because calculating it properly is expensive and tricky.
Time lag is defined as:
time_lag = current_time - timestamp_of_message_at_committed_offset
To get that timestamp, you need to actually read the message at the committed offset. That means:
- Creating a consumer
- Seeking to the committed offset
- Fetching the message
- Parsing its timestamp
- Doing this for every partition with lag
For a cluster with hundreds of consumer groups across thousands of partitions, this adds up fast. Most tools avoid it entirely.
The interpolation shortcut (and why it fails)
Some tools try to be clever. The popular kafka-lag-exporter (written in Scala) builds a lookup table of (offset, timestamp) pairs over time. When you need to estimate the timestamp for a given offset, it interpolates between known points.
Sounds reasonable. Here's why it breaks:
- Cold start problem: The lookup table starts empty. Until enough data points accumulate, the metric effectively has no value.
- Idle producer edge case: When production stops, the lookup table compresses repeated offsets into a flat segment. Interpolating within a flat segment causes division by zero (dy = 0 in the prediction formula), resulting in NaN for time lag.
- Variable throughput: Linear interpolation assumes constant message rate between data points. Bursts occurring within a single poll interval (default 30s) lead to inaccurate time estimates, as the actual non-linear production pattern isn't captured.
When interpolation fails (cold start, insufficient data), the system reports NaN for time lag. In dashboards, this appears as missing data points rather than zero. The offset lag metric (which doesn't rely on interpolation) remains accurate even when time lag cannot be calculated.
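To see the failure mode concretely, here is a toy version of offset-to-timestamp interpolation; the table layout and function name are illustrative, not kafka-lag-exporter's actual code.

```python
# Toy linear interpolation over (offset, timestamp_ms) samples collected each poll.
def interpolate_timestamp(table, offset):
    for (o1, t1), (o2, t2) in zip(table, table[1:]):
        if o1 <= offset <= o2:
            dy = o2 - o1                  # offset delta between samples
            if dy == 0:
                return float("nan")       # idle producer: flat segment, no slope
            return t1 + (offset - o1) * (t2 - t1) / dy
    return float("nan")                   # cold start: offset outside the table

# Producer went idle: the latest offset sits at 100 across three polls.
table = [(100, 1_000), (100, 31_000), (100, 61_000)]
print(interpolate_timestamp(table, 100))  # nan -> time lag becomes a missing data point
```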
The right way: Direct timestamp sampling
There's no shortcut to accuracy. If you want real time lag, you need to read actual message timestamps.
Here's the approach implemented in klag-exporter:
For each partition with lag > 0:
- Create a consumer and seek to the committed offset
- Fetch one message
- Extract its timestamp
- Calculate: time_lag = now - message_timestamp
- Cache the result to avoid hammering Kafka
The key insight is that you only need to fetch timestamps for partitions that actually have lag. A consumer that's caught up (lag = 0) doesn't need a timestamp - we know it's current.
For partitions with lag, the timestamp fetch is cached with a TTL. If the consumer hasn't moved (same committed offset), the cached timestamp is still valid. Only when the consumer commits a new offset do we fetch a fresh one.
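Here is a minimal sketch of that sampling-and-caching loop, written in Python with the confluent-kafka client purely for illustration (klag-exporter itself is Rust); sample_time_lag, the cache layout, and the TTL value are made up for the example.

```python
import time
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "lag-sampler",            # separate group; offsets are never committed
    "enable.auto.commit": False,
})

# (topic, partition, committed_offset) -> {"ts": message timestamp ms, "fetched_at": ms}
timestamp_cache = {}
CACHE_TTL_MS = 60_000

def sample_time_lag(topic, partition, committed_offset, offset_lag):
    """Time lag in seconds for one partition; 0.0 if the consumer is caught up."""
    if offset_lag <= 0:
        return 0.0                         # caught up: no timestamp fetch needed

    now_ms = int(time.time() * 1000)
    key = (topic, partition, committed_offset)
    cached = timestamp_cache.get(key)
    if cached and now_ms - cached["fetched_at"] < CACHE_TTL_MS:
        return (now_ms - cached["ts"]) / 1000.0   # lag keeps growing from the cached timestamp

    # Seek to the committed offset and read exactly one message.
    consumer.assign([TopicPartition(topic, partition, committed_offset)])
    msg = consumer.poll(timeout=5.0)
    consumer.unassign()
    if msg is None or msg.error():
        return float("nan")                # couldn't sample this round

    _, ts_ms = msg.timestamp()             # (timestamp_type, value in milliseconds)
    timestamp_cache[key] = {"ts": ts_ms, "fetched_at": now_ms}
    return (now_ms - ts_ms) / 1000.0
```

Because the cache key includes the committed offset, a new commit is automatically a cache miss, and the cached timestamp only short-circuits the Kafka fetch, not the lag calculation itself.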
This approach gives you:
- Accurate time lag regardless of topic throughput
- Correct behavior for idle producers (time lag keeps growing as it should)
- Bounded Kafka load through caching
What about edge cases?
Two situations can throw off timestamp accuracy:
Log compaction: If the topic uses cleanup.policy=compact, the message at your committed offset might have been compacted away. Kafka returns the next available message, which has a later timestamp. Your reported lag will be slightly lower than reality.
Retention deletion: If your committed offset points to a message that was deleted by retention policy, the same problem occurs - Kafka returns a newer message.
Both cases result in understated lag (showing less delay than actual). A proper implementation detects these conditions and flags them so you know the number might be off.
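A sketch of how that detection can work, reusing msg and committed_offset from the sampling example above (the function name is illustrative, not klag-exporter's API): if the broker hands back a message with a higher offset than the one you asked for, the original message is gone and the sampled timestamp belongs to something newer.

```python
def timestamp_is_skewed(msg, committed_offset):
    """True if compaction or retention removed the message at committed_offset."""
    # The broker returned the next surviving message, so the sampled timestamp
    # is newer than the real one and the reported time lag understates the delay.
    return msg.offset() > committed_offset
```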
TL;DR: Time lag requires reading actual messages. Interpolation is tempting but fails for idle/low-throughput topics. Direct sampling with caching gives you accurate numbers without killing your brokers.
The metrics you actually need
Here's what your Kafka monitoring should include:
Per partition - for debugging specific issues
kafka_consumergroup_group_lag{group="orders", topic="orders", partition="0"} 1523
kafka_consumergroup_group_lag_seconds{group="orders", topic="orders", partition="0"} 12.5
Per group - for alerting
kafka_consumergroup_group_max_lag{group="orders"} 2847
kafka_consumergroup_group_max_lag_seconds{group="orders"} 45.2
Offset lag (group_lag) is still useful—it tells you how many messages need processing. But it should be paired with time lag (group_lag_seconds) to understand severity.
For alerting, use max_lag_seconds across all partitions. This gives you the worst-case delay for the group:
Alert when any partition is more than 60 seconds behind
kafka_consumergroup_group_max_lag_seconds{group="orders-processor"} > 60
This alert fires based on actual delay, not arbitrary message counts. An SLA that says "process orders within 30 seconds" can finally be monitored properly.
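As a concrete example, a Prometheus alerting rule built on that expression could look like the following; the rule name, for: duration, and labels are placeholders, not anything shipped with klag-exporter.

```yaml
groups:
  - name: kafka-lag
    rules:
      - alert: ConsumerGroupBehindSla
        expr: kafka_consumergroup_group_max_lag_seconds{group="orders-processor"} > 60
        for: 2m                       # require the delay to persist before paging
        labels:
          severity: page
        annotations:
          summary: "orders-processor is {{ $value }}s behind its 60s SLA"
```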
Introducing klag-exporter
We built klag-exporter to solve these problems. It's written in Rust for performance and uses direct timestamp sampling instead of interpolation.

Key differences from existing tools:

- Language: klag-exporter is written in Rust; kafka-lag-exporter is Scala.
- Time lag calculation: klag-exporter samples actual message timestamps; kafka-lag-exporter interpolates from an (offset, timestamp) lookup table.
- Idle producers: klag-exporter keeps reporting growing time lag; interpolation yields NaN and missing data points.
The exporter also detects compaction and retention edge cases, flagging affected partitions so you know when to take the time lag with a grain of salt.
Quick start with Docker
Let's start with the simplest possible setup and build from there.
Minimal setup
- Create a basic config

- Run the exporter

Replace your-kafka-network with whatever Docker network your Kafka is on, and update bootstrap_servers to point to your broker(s).
- Verify it works

You should see metrics like:
kafka_consumergroup_group_lag{cluster_name="my-cluster",group="my-group",topic="my-topic",partition="0"} 1523
kafka_consumergroup_group_lag_seconds{cluster_name="my-cluster",group="my-group",topic="my-topic",partition="0"} 12.5
Complete demo stack
Want to see everything working together - Kafka, producers, consumers, Prometheus, and Grafana - with pre-built dashboards? The repo includes a ready-to-run demo:


The demo includes producers and consumers that generate realistic lag patterns, so you can immediately see time lag and offset lag metrics in action.
Configuration deep dive
- Exporter settings

Granularity trade-offs:

For most setups, start with partition granularity. You can always aggregate in Prometheus if needed.
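For example, rolling partition-level lag up to per-topic totals is a one-line PromQL query (metric and label names taken from the examples above):

```
sum by (cluster_name, group, topic) (kafka_consumergroup_group_lag)
```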
- Cluster configuration

The whitelist/blacklist patterns are regular expressions. They're applied in order: whitelist first, then blacklist removes matches.
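In other words, the selection logic behaves like this sketch (plain Python for illustration; the patterns are examples, not defaults):

```python
import re

whitelist = [re.compile(r"^orders-.*")]
blacklist = [re.compile(r".*-staging$")]

def keep_group(group):
    if whitelist and not any(p.search(group) for p in whitelist):
        return False                                    # not whitelisted
    return not any(p.search(group) for p in blacklist)  # drop blacklisted matches

print([g for g in ["orders-processor", "orders-staging", "billing"] if keep_group(g)])
# -> ['orders-processor']
```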
Authentication methods
klag-exporter supports all standard Kafka authentication mechanisms through consumer_properties.
SASL/PLAIN (username/password):
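The exact shape of consumer_properties depends on klag-exporter's config schema, so treat this as a rough sketch: these are the standard librdkafka property names for SASL/PLAIN, on the assumption that the exporter (being Rust) passes them straight through to the Kafka client.

```
security.protocol = "SASL_SSL"
sasl.mechanism = "PLAIN"
sasl.username = "${KAFKA_USERNAME}"
sasl.password = "${KAFKA_PASSWORD}"
```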

SASL/SCRAM (with TLS):

mTLS (mutual TLS, no username):

AWS MSK with IAM:

Environment variable substitution
All config values support ${VAR_NAME} syntax - no need to bake secrets into the config file:
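For example, a secret can stay in the environment instead of the config file (the property name is reused from the SASL sketch above, purely as an illustration), with KAFKA_PASSWORD exported in the exporter's environment:

```
sasl.password = "${KAFKA_PASSWORD}"
```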

Kubernetes deployment
For production, you'll typically run klag-exporter in Kubernetes. Our GitHub repo includes ready-to-use Helm charts to deploy klag-exporter in your cluster.
Grafana dashboard
Importing the dashboard
Option A: Copy the JSON file
From the repo

Option B: Import via Grafana UI
- Go to Dashboards → Import
- Upload kafka-lag.json or paste its contents
- Select your Prometheus datasource
- Click Import
Key panels
The included dashboard has several sections:
Overview stats:
- Total Consumer Lag (sum across all groups)
- Max Time Lag (worst-case delay)
- Exporter Status (UP/DOWN)
- Scrape Duration (collection performance)
- Consumer Groups (count of monitored groups)
Time series:
- Consumer Group Total Lag - trend over time
- Time Lag in Seconds - the metric that matters
- Per-Partition Lag - for debugging specific issues
Throughput & distribution:
- Partition Latest Offsets - distribution across partitions
- Topic Message Rate - produce throughput over time
Dashboard variables
The dashboard includes variables for filtering:

Use these to drill down when investigating issues.
Multi-cluster setup
Klag-exporter can monitor multiple Kafka clusters from a single instance. Just add more [[clusters]] blocks:
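A hedged sketch of what that looks like: the [[clusters]] header comes from the sentence above, and cluster_name / bootstrap_servers mirror the metric label and setting mentioned earlier; other keys are omitted because they depend on klag-exporter's actual schema.

```toml
[[clusters]]
cluster_name = "prod-eu"
bootstrap_servers = "kafka-eu-1:9092,kafka-eu-2:9092"

[[clusters]]
cluster_name = "prod-us"
bootstrap_servers = "kafka-us-1:9092,kafka-us-2:9092"
```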

Each cluster gets its own cluster_name label in metrics, so you can filter by cluster in Grafana and alerts.
When to use one exporter vs. many:

OpenTelemetry integration
Klag-exporter can also push metrics via OTEL in addition to (or instead of) the Prometheus endpoint.

This is useful if you're using Datadog, New Relic, Honeycomb, or other backends that support OTEL ingestion.
The test stack includes an OpenTelemetry Collector that receives these metrics and exports them to both debug output (logs) and a Prometheus endpoint.
Troubleshooting
Exporter shows no metrics
- Check connectivity to Kafka:

- Check exporter logs:

- Verify authentication:
Look for SASL or SSL errors in logs. Double-check credentials and certificates.
Metrics exist but no consumer groups
- Check your whitelist/blacklist patterns: If you have restrictive patterns, groups might be filtered out.
- Verify consumer groups exist: kafka-consumer-groups --bootstrap-server kafka:9092 --list
- Check if consumers have committed offsets: klag-exporter only shows groups that have committed offsets.
Time lag is always 0
- Verify timestamp sampling is enabled:

- Check if consumers have lag: If group_lag is 0, group_lag_seconds will also be 0 (caught up = no time lag).
Grafana shows "No data"
- Check Prometheus targets: http://localhost:9090/targets - is klag-exporter UP?
- Verify the datasource: Grafana → Configuration → Data Sources → Test
- Check metric names: The dashboard expects specific metric names. If you customized them, update the queries.
What you should have now
If you followed this guide, you've got:
- klag-exporter collecting both offset lag and time lag metrics
- Prometheus scraping and storing those metrics
- Grafana dashboards showing lag trends and distributions
- Alerts configured for your SLAs
- Awareness of edge cases (compaction/retention detection)
Your on-call engineers can now respond to "Consumer X is 45 seconds behind" instead of "Consumer X has 50,000 messages lag" - a far more actionable alert.
Key takeaways
- Offset lag tells you how many messages behind; time lag tells you how far behind in real time. Both matter, but time lag is what your SLAs are measured in.
- Interpolation-based tools fail for idle or low-throughput topics. They report no time lag at all (NaN) when you might be hours behind.
- Direct timestamp sampling is more accurate but needs caching. Fetch only for partitions with lag, and cache results with a TTL.
- Alert on time lag, not offset lag. Your on-call engineers will thank you when they can immediately understand severity.
If you're only monitoring offset lag today, you're flying partially blind. Adding time lag to your dashboards is the single most impactful improvement you can make to your Kafka monitoring.
The messages aren't going anywhere. But your users are waiting - and they notice the delay.
Reviewed by: Grzegorz Kocur
