Why this matters
In streaming systems, problems rarely explode instantly; they drift. Lag grows slowly, throughput dips under peak load, and by the time users complain, data is hours late. Monitoring lag and throughput gives you early warning and the numbers to act fast.
- Keep SLAs: catch rising consumer lag before downstream dashboards go stale.
- Right-size clusters: use ingress/egress throughput trends to plan capacity.
- Diagnose incidents: distinguish producer spikes from consumer slowdowns.
- Protect budgets: scale only when metrics prove it’s needed.
Who this is for
- Data Platform Engineers running streaming clusters (topics/streams, partitions, consumer groups).
- Data Engineers building real-time pipelines feeding warehouses, ML features, or APIs.
- SREs responsible for data freshness, latency SLOs, and on-call response.
Prerequisites
- Basics of streaming concepts: topics/streams, partitions, consumer groups.
- Comfort with metrics (rates, counters, histograms) and basic algebra.
- Ability to read timestamps in logs/records.
Concept explained simply
Lag is how far consumers are behind the latest produced data. Throughput is the rate of data moving through the system (messages/sec or bytes/sec). You monitor both to know if the system keeps up with producers and stays within your data freshness targets.
Mental model: two conveyor belts
Imagine two conveyor belts:
- Belt A (producers) drops boxes at a certain speed (ingress throughput).
- Belt B (consumers) picks boxes at another speed (egress throughput).
The gap between the newest box on A and the last picked box on B is lag (in boxes). If A runs faster than B for a while, the gap grows; if B runs faster, the gap shrinks.
Key metrics and formulas
- Per-partition consumer lag (messages) = latest_offset - committed_offset
- Total lag (messages) = sum of per-partition lag across the consumer group
- Approx time lag (seconds) ≈ lag_messages / producer_rate_messages_per_sec
- Ingress throughput = produced_messages_per_sec × avg_message_size_bytes (or bytes/sec if measured directly)
- Egress throughput = consumed_messages_per_sec × avg_message_size_bytes
- End-to-end latency (event freshness) ≈ now - event_time_of_record_being_consumed (requires event timestamps)
- Backpressure signal: sustained (ingress_throughput > egress_throughput) → lag increases
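To make these formulas concrete, here is a minimal Python sketch; the helper names and the sample offsets, rates, and sizes are illustrative assumptions, not values from any particular system.

```python
# Minimal sketch of the lag/throughput formulas above.
# Helper names and sample values are illustrative assumptions.

def partition_lag(latest_offset: int, committed_offset: int) -> int:
    """Per-partition consumer lag in messages."""
    return latest_offset - committed_offset

def time_lag_seconds(lag_messages: int, producer_rate_msg_per_sec: float) -> float:
    """Approximate time lag: how far behind the consumer is, in seconds."""
    return lag_messages / producer_rate_msg_per_sec

def throughput_bytes_per_sec(messages_per_sec: float, avg_message_size_bytes: float) -> float:
    """Ingress or egress throughput in bytes/sec."""
    return messages_per_sec * avg_message_size_bytes

# Snapshot for one consumer group over three partitions.
latest = {0: 50_000, 1: 48_200, 2: 51_500}
committed = {0: 49_200, 1: 48_000, 2: 51_500}

per_partition = {p: partition_lag(latest[p], committed[p]) for p in latest}
total_lag = sum(per_partition.values())

print(per_partition)                            # {0: 800, 1: 200, 2: 0}
print(total_lag)                                # 1000 messages
print(time_lag_seconds(total_lag, 2_000))       # 0.5 s at 2,000 msg/s produced
print(throughput_bytes_per_sec(5_000, 1_024))   # 5,120,000 B/s ≈ 5.1 MB/s
```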
Monitoring setup basics (vendor-neutral)
- Producer metrics: messages/sec, bytes/sec, send errors/retries.
- Broker/cluster metrics: topic/partition latest offsets, bytes in/out, request latencies, ISR/replication health (if applicable).
- Consumer group metrics: per-partition committed offset, lag, fetch/processing rates, processing latency.
- End-to-end: compute freshness using event timestamps in records, or measure publish-to-consume delay if available.
- Choose SLOs: e.g., 95% of data fresh within 60s. Decide what “fresh” means (event-time vs processing-time).
- Pick primary indicators: total consumer lag, time lag estimate, ingress/egress throughput, per-partition lag max, stalled partition detection.
- Baseline: observe normal peaks/off-peaks for at least a few days.
- Alert rules: use sustained conditions (e.g., 5–10 min) and per-partition max lag thresholds to reduce false positives.
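One way to express a "sustained for N minutes" alert condition is sketched below; the window length, threshold, and samples are assumptions for illustration, not recommended values.

```python
# Sketch: fire an alert only when a breach holds for N consecutive samples.
# Window length, threshold, and samples are illustrative assumptions.
from collections import deque

class SustainedCondition:
    """True once the condition has held for `window_points` consecutive samples."""
    def __init__(self, window_points: int):
        self.window = deque(maxlen=window_points)

    def update(self, breached: bool) -> bool:
        self.window.append(breached)
        return len(self.window) == self.window.maxlen and all(self.window)

# One sample per minute; in production you might require 5-10 consecutive breaches.
lag_alert = SustainedCondition(window_points=3)
LAG_THRESHOLD = 30_000   # messages; derive from your time-lag SLO and ingress rate

for total_lag in [500, 20_000, 45_000, 60_000, 80_000]:
    if lag_alert.update(total_lag > LAG_THRESHOLD):
        print(f"ALERT: total lag {total_lag} above {LAG_THRESHOLD} for a sustained window")
```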
What about partitions?
Lag is uneven across partitions. Always track:
- Max per-partition lag (find hotspots).
- Skew: standard deviation of per-partition lag or consumption rate.
- Assignment churn: frequent rebalances can cause temporary lag spikes.
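A short sketch of those per-partition checks, assuming you already have a lag snapshot per partition; the numbers are made up to show a single hot partition.

```python
# Sketch: max per-partition lag and lag skew from a per-partition snapshot.
# The lag values are made up to illustrate one hot partition.
from statistics import mean, pstdev

per_partition_lag = {0: 120, 1: 95, 2: 14_500, 3: 110, 4: 130, 5: 105}

hot_partition, max_lag = max(per_partition_lag.items(), key=lambda kv: kv[1])
skew = pstdev(per_partition_lag.values())   # standard deviation of per-partition lag

print(f"max lag: partition {hot_partition} with {max_lag} messages")
print(f"mean lag: {mean(per_partition_lag.values()):.0f}, std dev (skew): {skew:.0f}")
# Partition 2 dominates total lag: check key distribution and consumer assignment
# before scaling the whole group.
```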
Worked examples
Example 1 — Compute lag and time behind
Latest offset: 50,000; committed offset: 49,200 → lag = 800 messages. Producer rate: 2,000 msg/s → time lag ≈ 800 / 2000 = 0.4 s.
Example 2 — Throughput check
Consumer processes 12,000 msg/s, avg size 800 bytes → egress throughput ≈ 9.6 MB/s (using 1 MB = 1,000,000 bytes).
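The same two results, re-derived with plain arithmetic so the unit conversions are explicit:

```python
# Re-checking Examples 1 and 2 with plain arithmetic.

# Example 1: lag and time behind
lag = 50_000 - 49_200                 # 800 messages
time_lag = lag / 2_000                # 0.4 s at 2,000 msg/s produced
print(lag, time_lag)

# Example 2: egress throughput
egress_bytes = 12_000 * 800           # 9,600,000 bytes/s
print(egress_bytes / 1_000_000)       # 9.6 MB/s (1 MB = 1,000,000 bytes)
```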
Example 3 — Diagnosing a trend
- Ingress stable at 30 MB/s.
- Egress dips from 30 → 24 MB/s for 15 minutes.
- Total lag climbs steadily from 0 → 27,000 messages.
Interpretation: consumers are running slower than producers, so lag keeps growing until egress recovers. Action: scale out consumers, optimize processing, or reduce per-message work.
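The example gives throughput in MB/s but lag in messages, so the sketch below shows how the deficit accumulates; the roughly 200 KB average message size is an assumption (not stated in the example) chosen so the units line up.

```python
# Sketch: how a sustained ingress/egress gap accumulates into message lag.
# The ~200 KB average message size is an assumption, not given in the example.

ingress_mb_per_sec = 30.0
egress_mb_per_sec = 24.0
duration_sec = 15 * 60
avg_message_size_bytes = 200_000      # assumed so bytes convert to ~27,000 messages

deficit_bytes_per_sec = (ingress_mb_per_sec - egress_mb_per_sec) * 1_000_000
backlog_messages = deficit_bytes_per_sec * duration_sec / avg_message_size_bytes

print(f"backlog after {duration_sec} s: {backlog_messages:,.0f} messages")  # ~27,000
```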
Interpreting common patterns
- Lag up, ingress flat, egress down → consumers throttled or slowed by downstream I/O.
- Lag up, ingress spike, egress steady → producer burst; if short-lived, lag should recover.
- Lag flat but end-to-end latency high → consumers keep up with offsets, but records are old (event time). Possibly upstream delay.
- Lag concentrated in 1–2 partitions → partition skew; check key distribution and consumer assignment.
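These rules of thumb can be encoded as a first-pass triage helper. The sketch below hard-codes simplified "up/flat/down" trend labels that a real system would derive from its time series.

```python
# Sketch: first-pass triage from simplified lag/ingress/egress trends.
# Trend labels ("up"/"flat"/"down") are a simplification; a real system would
# derive them from time series, and per-partition lag still needs checking.

def triage(lag: str, ingress: str, egress: str) -> str:
    if lag == "up" and ingress == "flat" and egress == "down":
        return "consumers throttled or slowed by downstream I/O"
    if lag == "up" and ingress == "up":
        return "producer burst; if short-lived, lag should recover"
    if lag == "flat":
        return "offsets keep up; if freshness is still high, suspect upstream delay"
    return "no clear pattern; inspect max per-partition lag and recent rebalances"

print(triage(lag="up", ingress="flat", egress="down"))
```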
Quick self-check on SLOs
- Did you define freshness using event time or processing time?
- Do alerts fire on sustained conditions, not single spikes?
- Do you alert on max per-partition lag as well as total lag?
Exercises
Do these here, then compare with the solutions.
Exercise 1 — Calculate lag and throughput
Data: latest offset = 10,500; committed offset = 10,320; avg record size = 1.2 KB; producer rate = 400 msg/s; consumer rate = 350 msg/s.
- Compute lag (messages) and time lag (seconds).
- Compute consumer egress throughput (MB/s).
Expected answer format
Lag: X messages; Time lag: Y seconds; Consumer throughput: Z MB/s.
Exercise 2 — Design alert thresholds
SLO: 95% freshness within 30s. Typical ingress 10,000 msg/s, avg size 2 KB, 12 partitions. Consumer rate ~9,500 msg/s at steady state.
- Propose alert thresholds for: total lag, max per-partition lag, and throughput drop.
- Specify sustained durations to limit noise.
Hint
Time lag ≈ total_lag / ingress_rate. Aim for alerts that trigger before hitting 30s, with buffers.
Checklist — before you set alerts live
- Primary metrics: total lag, max per-partition lag, ingress/egress throughput.
- Freshness metric chosen and instrumented (event time or processing time).
- Baselines collected for peaks and off-peaks.
- Sustained conditions (e.g., 5–10 min) added to alerts.
- Runbook links and on-call messages drafted.
Common mistakes (and how to self-check)
- Only tracking total lag. Fix: also track max per-partition lag and skew.
- Alerting on single spikes. Fix: use sustained thresholds and rate-of-change checks.
- Confusing throughput with freshness. Fix: compute time lag and end-to-end latency explicitly.
- Ignoring event time. Fix: store event timestamps and compute freshness from them.
- Byte vs message unit mix-ups. Fix: label dashboards clearly and convert consistently.
Practical projects
- Build a lag dashboard: total and max per-partition lag, plus time lag estimate.
- Freshness SLO panel: percentile of event-time freshness (e.g., 95% within X seconds).
- Auto-recovery experiment: simulate consumer slowdown and verify alerting and catch-up behavior.
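For the freshness SLO panel, here is a minimal sketch of a nearest-rank p95 over event-time freshness; the `event_time` field name and the sample timestamps are assumptions about how your records are shaped.

```python
# Sketch: nearest-rank p95 event-time freshness for a batch of consumed records.
# The `event_time` field and the sample timestamps are assumptions.
import math
import time

now = time.time()
consumed_records = [                      # event timestamps set by the producer
    {"event_time": now - 12.0},
    {"event_time": now - 3.5},
    {"event_time": now - 48.0},
    {"event_time": now - 7.2},
    {"event_time": now - 21.0},
]

freshness = sorted(now - r["event_time"] for r in consumed_records)
p95 = freshness[max(0, math.ceil(0.95 * len(freshness)) - 1)]   # nearest-rank p95

print(f"p95 freshness: {p95:.1f} s (example SLO: 95% within 60 s)")
```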
Learning path
- Next in Streaming Platform Basics: partitioning and keying to reduce skew.
- Then: consumer scaling patterns and backpressure handling.
- Later: exactly-once/at-least-once trade-offs and impact on lag.
Next steps
- Finish the two exercises above.
- Take the Quick Test below.
- Apply thresholds in a staging environment and validate with load.
Mini challenge
Ingress = 25,000 msg/s, consumers = 27,000 msg/s, current total lag = 60,000 messages. If rates stay stable, how long to clear the backlog to near zero?
Answer
Net drain rate ≈ 27,000 - 25,000 = 2,000 msg/s. Time ≈ 60,000 / 2,000 = 30 s.
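The same calculation, generalized into a tiny helper that also guards against the case where consumers are not actually outpacing producers:

```python
# Sketch: time to drain a backlog when consumers outpace producers.

def time_to_drain(backlog_messages: float, consumer_rate: float, producer_rate: float) -> float:
    net_drain = consumer_rate - producer_rate        # messages cleared per second
    if net_drain <= 0:
        raise ValueError("backlog will not shrink: consumers do not outpace producers")
    return backlog_messages / net_drain

print(time_to_drain(60_000, consumer_rate=27_000, producer_rate=25_000))  # 30.0 s
```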
Quick Test