luvv to helpDiscover the Best Free Online Tools
Topic 6 of 8

Monitoring Lag And Throughput

Learn Monitoring Lag And Throughput for free with explanations, exercises, and a quick test (for Data Platform Engineer).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

In streaming systems, bad things rarely explode instantly. They drift. Lag grows slowly, throughput dips under peak, and by the time users complain, data is hours late. Monitoring lag and throughput gives you early warning and the numbers to act fast.

  • Keep SLAs: catch rising consumer lag before downstream dashboards go stale.
  • Right-size clusters: use ingress/egress throughput trends to plan capacity.
  • Diagnose incidents: distinguish producer spikes from consumer slowdowns.
  • Protect budgets: scale only when metrics prove it’s needed.

Who this is for

  • Data Platform Engineers running streaming clusters (topics/streams, partitions, consumer groups).
  • Data Engineers building real-time pipelines feeding warehouses, ML features, or APIs.
  • SREs responsible for data freshness, latency SLOs, and on-call response.

Prerequisites

  • Basics of streaming concepts: topics/streams, partitions, consumer groups.
  • Comfort with metrics (rates, counters, histograms) and basic algebra.
  • Ability to read timestamps in logs/records.

Concept explained simply

Lag is how far consumers are behind the latest produced data. Throughput is the rate of data moving through the system (messages/sec or bytes/sec). You monitor both to know if the system keeps up with producers and stays within your data freshness targets.

Mental model: two conveyor belts

Imagine two conveyor belts:

  • Belt A (producers) drops boxes at a certain speed (ingress throughput).
  • Belt B (consumers) picks boxes at another speed (egress throughput).

The gap between the newest box on A and the last picked box on B is lag (in boxes). If A runs faster than B for a while, the gap grows; if B runs faster, the gap shrinks.

Key metrics and formulas

  • Per-partition consumer lag (messages) = latest_offset - committed_offset
  • Total lag (messages) = sum of per-partition lag across the consumer group
  • Approx time lag (seconds) ≈ lag_messages / producer_rate_messages_per_sec
  • Ingress throughput = produced_messages_per_sec × avg_message_size_bytes (or bytes/sec if measured directly)
  • Egress throughput = consumed_messages_per_sec × avg_message_size_bytes
  • End-to-end latency (event freshness) ≈ now - event_time_of_record_being_consumed (requires event timestamps)
  • Backpressure signal: sustained (ingress_throughput > egress_throughput) → lag increases

Monitoring setup basics (vendor-neutral)

  • Producer metrics: messages/sec, bytes/sec, send errors/retries.
  • Broker/cluster metrics: topic/partition latest offsets, bytes in/out, request latencies, ISR/replication health (if applicable).
  • Consumer group metrics: per-partition committed offset, lag, fetch/processing rates, processing latency.
  • End-to-end: compute freshness using event timestamps in records, or measure publish-to-consume delay if available.
  1. Choose SLOs: e.g., 95% of data fresh within 60s. Decide what “fresh” means (event-time vs processing-time).
  2. Pick primary indicators: total consumer lag, time lag estimate, ingress/egress throughput, per-partition lag max, stalled partition detection.
  3. Baseline: observe normal peaks/off-peaks for at least a few days.
  4. Alert rules: use sustained conditions (e.g., 5–10 min) and per-partition max lag thresholds to reduce false positives.
What about partitions?

Lag is uneven across partitions. Always track:

  • Max per-partition lag (find hotspots).
  • Skew: standard deviation of per-partition lag or consumption rate.
  • Assignment churn: frequent rebalances can cause temporary lag spikes.

Worked examples

Example 1 — Compute lag and time behind

Latest offset: 50,000; committed offset: 49,200 → lag = 800 messages. Producer rate: 2,000 msg/s → time lag ≈ 800 / 2000 = 0.4 s.

Example 2 — Throughput check

Consumer processes 12,000 msg/s, avg size 800 bytes → egress throughput ≈ 9.6 MB/s (using 1 MB = 1,000,000 bytes).

Example 3 — Diagnosing a trend

  • Ingress stable at 30 MB/s.
  • Egress dips from 30 → 24 MB/s for 15 minutes.
  • Total lag climbs steadily from 0 → 27,000 messages.

Interpretation: consumers slower than producers; expect lag growth. Action: scale consumers, optimize processing, or reduce per-message work.

Interpreting common patterns

  • Lag up, ingress flat, egress down → consumers throttled or slowed by downstream I/O.
  • Lag up, ingress spike, egress steady → producer burst; if short-lived, lag should recover.
  • Lag flat but end-to-end latency high → consumers keep up with offsets, but records are old (event time). Possibly upstream delay.
  • Lag concentrated in 1–2 partitions → partition skew; check key distribution and consumer assignment.
Quick self-check on SLOs
  • Did you define freshness using event time or processing time?
  • Do alerts fire on sustained conditions, not single spikes?
  • Do you alert on max per-partition lag as well as total lag?

Exercises

Do these here, then compare with the solutions. Your results won’t auto-save unless you’re logged in.

Exercise 1 — Calculate lag and throughput

Data: latest offset = 10,500; committed offset = 10,320; avg record size = 1.2 KB; producer rate = 400 msg/s; consumer rate = 350 msg/s.

  • Compute lag (messages) and time lag (seconds).
  • Compute consumer egress throughput (MB/s).
Show expected answer format

Lag: X messages; Time lag: Y seconds; Consumer throughput: Z MB/s.

Exercise 2 — Design alert thresholds

SLO: 95% freshness within 30s. Typical ingress 10,000 msg/s, avg size 2 KB, 12 partitions. Consumer rate ~9,500 msg/s at steady state.

  • Propose alert thresholds for: total lag, max per-partition lag, and throughput drop.
  • Specify sustained durations to limit noise.
Hint

Time lag ≈ total_lag / ingress_rate. Aim for alerts that trigger before hitting 30s, with buffers.

Checklist — before you set alerts live

  • Primary metrics: total lag, max per-partition lag, ingress/egress throughput.
  • Freshness metric chosen and instrumented (event time or processing time).
  • Baselines collected for peaks and off-peaks.
  • Sustained conditions (e.g., 5–10 min) added to alerts.
  • Runbook links and on-call messages drafted.

Common mistakes (and how to self-check)

  • Only tracking total lag. Fix: also track max per-partition lag and skew.
  • Alerting on single spikes. Fix: use sustained thresholds and rate-of-change checks.
  • Confusing throughput with freshness. Fix: compute time lag and end-to-end latency explicitly.
  • Ignoring event time. Fix: store event timestamps and compute freshness from them.
  • Byte vs message unit mix-ups. Fix: label dashboards clearly and convert consistently.

Practical projects

  • Build a lag dashboard: total and max per-partition lag, plus time lag estimate.
  • Freshness SLO panel: percentile of event-time freshness (e.g., 95% within X seconds).
  • Auto-recovery experiment: simulate consumer slowdown and verify alerting and catch-up behavior.

Learning path

  • Next in Streaming Platform Basics: partitioning and keying to reduce skew.
  • Then: consumer scaling patterns and backpressure handling.
  • Later: exactly-once/at-least-once trade-offs and impact on lag.

Next steps

  • Finish the two exercises above.
  • Take the Quick Test below. Available to everyone; only logged-in users get saved progress.
  • Apply thresholds in a staging environment and validate with load.

Mini challenge

Ingress = 25,000 msg/s, consumers = 27,000 msg/s, current total lag = 60,000 messages. If rates stay stable, how long to clear the backlog to near zero?

Reveal answer

Net drain rate ≈ 27,000 - 25,000 = 2,000 msg/s. Time ≈ 60,000 / 2,000 = 30 s.

Quick Test

Everyone can take the test; only logged-in users will see saved progress.

Practice Exercises

2 exercises to complete

Instructions

Given:

  • Latest offset = 10,500
  • Committed offset = 10,320
  • Avg record size = 1.2 KB
  • Producer rate = 400 msg/s
  • Consumer rate = 350 msg/s

Tasks:

  • Compute lag (messages) and time lag (seconds).
  • Compute consumer egress throughput (MB/s, use 1 MB = 1,000 KB).
Expected Output
Lag: 180 messages; Time lag: 0.45 s; Consumer throughput: 0.42 MB/s

Monitoring Lag And Throughput — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

7 questions70% to pass

Have questions about Monitoring Lag And Throughput?

AI Assistant

Ask questions about this tool