Why this matters
As a Data Platform Engineer, you will run streaming jobs that feed analytics, alerting, and machine learning features. When incoming data outruns processing capacity, systems apply backpressure to avoid crashes and data loss. Handling it well protects SLAs, keeps consumer lag in check, prevents memory blowups, and keeps downstream services reliable.
- Real task: Your Kafka topic spikes from 20k to 120k msgs/sec. You must keep consumer lag stable without dropping critical data.
- Real task: A Flink job with heavy enrichment stalls during traffic bursts. You need to identify the bottleneck and tune buffers and parallelism.
- Real task: A Spark Structured Streaming pipeline misses latency SLOs because stateful operators are overloaded. You need to add backpressure-friendly limits.
Concept explained simply
Backpressure is the system’s way of saying “slow down, I can’t process as fast as you’re sending.” Producers or upstream operators are asked to reduce rate; buffers help absorb short spikes; if still overwhelmed, the system may pause, shed load, or scale.
Mental model
Imagine water pipes: valves (rate limits), tanks (buffers), filters (operators), and relief valves (drop/park). When filters slow down, upstream valves adjust flow to prevent overflow.
Key signals and knobs
- Signals: growing consumer lag, rising queue sizes, longer end-to-end latency, frequent GC/OOM, increased retries/timeouts.
- Knobs: rate limits, parallelism, batching, buffer sizes, pause/resume, windowing and watermarks, timeouts/retries/DLQs, autoscaling.
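One way to watch the first signal, per-partition consumer lag, is a small probe built on the Kafka AdminClient. This is a minimal sketch; the bootstrap server and group id are placeholders for your own cluster, and in practice you would ship these numbers to your metrics system instead of printing them.

```java
// Sketch: per-partition consumer lag = latest broker offset - committed group offset.
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the consumer group, keyed by partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("analytics-consumers") // placeholder group id
                     .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

            // Lag per partition; a steadily growing value is the backpressure signal.
            committed.forEach((tp, meta) -> {
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```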
Core strategies
- Measure
- Lag and throughput per partition/operator
- Process time vs. arrival time
- Queue depth and GC time
- Control input rate
- Pull-based flow (pause/resume, request-n)
- Max records per poll/micro-batch limits
- Ingress throttling at producers (see the token-bucket sketch after this list)
- Increase capacity
- Parallelism/partitions, autoscaling
- Batching and vectorized processing
- Use buffers wisely
- Enough to smooth bursts, not so large that latency explodes
- Apply backoff policies when nearing limits
- Protect downstream
- Timeouts, retries with jitter
- Idempotency and DLQs for poison messages
- Degrade gracefully
- Prefer drop-oldest for non-critical metrics
- Prioritize critical paths
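One simple way to implement ingress throttling at producers is a token bucket in front of the send loop. The sketch below has no external dependencies; the rate and burst values in the usage comment are illustrative, not recommendations.

```java
// Sketch: ingress throttling with a token bucket. The bucket refills at a fixed rate
// and allows short bursts up to maxBurst; when empty, the producing thread waits.
public final class TokenBucket {
    private final double permitsPerSecond;
    private final double maxBurst;
    private double available;
    private long lastRefillNanos;

    public TokenBucket(double permitsPerSecond, double maxBurst) {
        this.permitsPerSecond = permitsPerSecond;
        this.maxBurst = maxBurst;
        this.available = maxBurst;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Blocks the producing thread until one permit is available. */
    public synchronized void acquire() throws InterruptedException {
        refill();
        while (available < 1.0) {
            // Sleep roughly long enough for one permit to accumulate, then re-check.
            Thread.sleep((long) Math.ceil(1000.0 / permitsPerSecond));
            refill();
        }
        available -= 1.0;
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        available = Math.min(maxBurst, available + elapsedSeconds * permitsPerSecond);
        lastRefillNanos = now;
    }
}

// Usage (illustrative values): throttle a producer loop to ~5,000 records/sec, burst 1,000.
// TokenBucket bucket = new TokenBucket(5_000, 1_000);
// bucket.acquire();           // before each send
// producer.send(record);      // your existing producer call
```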
Worked examples
Example 1: Kafka consumer stabilizing lag
Problem: Consumer lag grows during spikes; downstream service gets overwhelmed.
- Actions:
- Set max.poll.records=500–2000 to cap per-batch load.
- Use pause/resume when internal queue exceeds a threshold.
- Increase consumer group instances to match partitions.
- Batch writes to downstream (e.g., 200 records or 50 ms).
- Expected result: Lag stabilizes and drains after spikes; downstream latency stays within SLO.
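A sketch of these actions in the Java consumer API: a capped poll size, pause/resume driven by an internal queue threshold, and small batched writes downstream. The topic, group id, thresholds, and writeBatch() are placeholders for your own pipeline.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BoundedConsumer {
    private static final int PAUSE_THRESHOLD = 5_000;   // assumption: queue high-water mark
    private static final int RESUME_THRESHOLD = 1_000;  // assumption: queue low-water mark
    private static final int WRITE_BATCH_SIZE = 200;    // matches the example's batch bound

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "enrichment-service");      // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);  // cap per-batch load
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

        List<String> queue = new ArrayList<>();
        boolean paused = false;

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
                for (ConsumerRecord<String, String> r : records) {
                    queue.add(r.value());
                }

                // Backpressure: stop fetching while the internal queue is too deep,
                // but keep polling so the group does not rebalance.
                if (!paused && queue.size() > PAUSE_THRESHOLD) {
                    consumer.pause(consumer.assignment());
                    paused = true;
                } else if (paused && queue.size() < RESUME_THRESHOLD) {
                    consumer.resume(consumer.assignment());
                    paused = false;
                }

                // Drain in small batches so the downstream service sees steady load.
                while (queue.size() >= WRITE_BATCH_SIZE) {
                    List<String> batch = new ArrayList<>(queue.subList(0, WRITE_BATCH_SIZE));
                    queue.subList(0, WRITE_BATCH_SIZE).clear();
                    writeBatch(batch); // placeholder for your downstream write
                }
                // Note: committing here trades delivery guarantees for simplicity;
                // adjust commit placement to your delivery semantics.
                consumer.commitSync();
            }
        }
    }

    private static void writeBatch(List<String> batch) {
        // Placeholder: send the batch to the downstream store or service.
    }
}
```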
Checklist for Kafka
- Keep consumer instances <= partitions; add partitions first if you need more parallelism than that
- max.poll.interval.ms high enough to finish batches without rebalances
- Enable idempotent producer for downstream writes if applicable
- Monitor: consumer lag, poll latency, batch duration
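If the downstream write goes to another Kafka topic, the idempotent-producer item from this checklist is a configuration change. A minimal sketch, with the broker address as a placeholder:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public final class IdempotentProducerFactory {
    public static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // broker de-duplicates retried sends
        props.put(ProducerConfig.ACKS_CONFIG, "all");              // required with idempotence
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        return new KafkaProducer<>(props);
    }
}
```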
Example 2: Flink job with heavy enrichment
Problem: Async external calls slow the pipeline and trigger backpressure.
- Actions:
- Use Flink's Async I/O operator with a bounded capacity (e.g., 100–1000 in-flight requests).
- Increase operator parallelism to match external service QPS budget.
- Tune network buffers; check Flink backpressure UI to locate hotspots.
- Apply circuit breaker + retry with backoff; send persistent failures to DLQ.
- Expected result: Throughput increases without blowing latency; backpressure disappears from upstream operators.
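A sketch of the Async I/O pattern described above. The enrichment call is faked with a CompletableFuture, and the timeout and capacity values are illustrative. When the bound on in-flight requests is reached, the operator backpressures upstream instead of queueing unboundedly.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichmentJob {

    /** Calls the external service without blocking the operator thread. */
    public static class EnrichFunction extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            // Placeholder: replace with your async client call that returns a future.
            CompletableFuture.supplyAsync(() -> "enriched:" + key)
                .whenComplete((value, error) -> {
                    if (error != null) {
                        resultFuture.completeExceptionally(error); // handled by your retry/DLQ policy
                    } else {
                        resultFuture.complete(Collections.singleton(value));
                    }
                });
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> input = env.fromElements("a", "b", "c"); // stand-in for your source

        // capacity (here 500) bounds in-flight requests; a full buffer backpressures upstream.
        DataStream<String> enriched = AsyncDataStream.unorderedWait(
            input, new EnrichFunction(), 5_000, TimeUnit.MILLISECONDS, 500);

        enriched.print();
        env.execute("async-enrichment-sketch");
    }
}
```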
Checklist for Flink
- Balance parallelism with external service limits
- Set a watermark strategy that fits your event-time skew to avoid state buildup
- Use the RocksDB state backend and state TTL if state grows during spikes
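A sketch of the watermark and state-TTL items from this checklist. The 30-second skew, one-hour TTL, and the Event type are assumptions; the RocksDB backend itself is typically selected via state.backend: rocksdb in the cluster configuration rather than in job code.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class StateHygiene {

    // Watermarks bound how late events may arrive, which in turn bounds windowed state.
    static WatermarkStrategy<Event> watermarks() {
        return WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(30)) // assumed skew
            .withTimestampAssigner((event, ts) -> event.timestampMillis);
    }

    // TTL lets per-key state expire when a key goes quiet during or after a spike.
    static ValueStateDescriptor<Long> countDescriptor() {
        StateTtlConfig ttl = StateTtlConfig
            .newBuilder(Time.hours(1)) // assumed retention
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .build();
        ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("per-key-count", Long.class);
        descriptor.enableTimeToLive(ttl);
        return descriptor;
    }

    /** Minimal event type assumed for the sketch. */
    public static class Event {
        public long timestampMillis;
    }
}
```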
Example 3: Spark Structured Streaming micro-batch control
Problem: Kafka ingestion overwhelms stateful aggregate and causes micro-batch delays.
- Actions:
- Set maxOffsetsPerTrigger to cap per-batch data rate.
- Optimize state: reduce key cardinality, prune with watermarks.
- Adjust the trigger (e.g., Trigger.ProcessingTime("1 second")) for steady, smaller batches.
- Autoscale executors; ensure shuffle partitions align with throughput.
- Expected result: Predictable batch times; stable latency and memory usage.
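The same actions expressed in the Structured Streaming Java API. The topic, window, watermark delay, and checkpoint path are placeholders; tune maxOffsetsPerTrigger to your cluster's per-batch capacity.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

public class BoundedMicroBatches {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("bounded-micro-batches").getOrCreate();

        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
            .option("subscribe", "events")                        // placeholder topic
            .option("maxOffsetsPerTrigger", 100000)               // cap records per micro-batch
            .load()
            .selectExpr("CAST(value AS STRING) AS value", "timestamp");

        // The watermark lets Spark prune old state from the windowed aggregation.
        Dataset<Row> counts = events
            .withWatermark("timestamp", "10 minutes")
            .groupBy(window(col("timestamp"), "1 minute"))
            .count();

        counts.writeStream()
            .outputMode("update")
            .format("console")                                         // stand-in sink
            .trigger(Trigger.ProcessingTime("1 second"))               // steady, smaller batches
            .option("checkpointLocation", "/tmp/checkpoints/bounded")  // placeholder
            .start()
            .awaitTermination();
    }
}
```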
Checklist for Spark
- Monitor inputRowsPerSecond and processedRowsPerSecond
- Watermarks to prevent unbounded state
- Use foreachBatch to batch writes to downstream sinks
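A sketch of the foreachBatch item: each micro-batch is written to the sink as one bulk operation instead of row by row. The JDBC sink and its options are assumptions, and the counts Dataset is the aggregation from the previous sketch.

```java
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

public class ForeachBatchSink {
    static StreamingQuery start(Dataset<Row> counts) throws Exception {
        // Writes one bulk batch per micro-batch; batchId can be used for idempotent upserts.
        VoidFunction2<Dataset<Row>, Long> writeBatch = (batch, batchId) ->
            batch.write()
                 .format("jdbc")
                 .option("url", "jdbc:postgresql://db:5432/metrics") // placeholder
                 .option("dbtable", "minute_counts")                 // placeholder
                 .mode("append")
                 .save();

        return counts.writeStream()
            .outputMode("update")
            .foreachBatch(writeBatch)
            .option("checkpointLocation", "/tmp/checkpoints/foreach-batch") // placeholder
            .start();
    }
}
```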
Who this is for
- Data Platform Engineers operating Kafka/Flink/Spark in production
- Developers building near-real-time pipelines and services
- On-call engineers responsible for streaming SLAs
Prerequisites
- Basic streaming concepts (topics, partitions, micro-batches)
- Familiarity with Kafka, Flink, or Spark metrics
- Comfort with scaling and configuration changes
Learning path
- Recognize backpressure signals
- Apply rate control and buffering
- Scale parallelism and optimize state
- Protect downstream and add DLQs
- Automate with autoscaling and alerting
Common mistakes and self-check
- Mistake: Only increasing buffers. Fix: Combine with rate limits; check latency percentiles.
- Mistake: Scaling consumers without enough partitions. Fix: Ensure partitions >= desired parallelism.
- Mistake: Ignoring downstream capacity. Fix: Budget QPS across services; add retries with jitter and circuit breakers.
- Mistake: Unlimited state growth. Fix: Use watermarks/TTLs; verify state size trend.
- Mistake: Treating ingestion and processing as independent. Fix: Use pull-based control like pause/resume or request-n.
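A small sketch of the "retries with jitter" fix: bounded attempts, exponential backoff, and full jitter so that many workers do not retry in lockstep. The attempt count and delays are illustrative, and the enrichmentClient in the usage comment is hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public final class Retry {
    public static <T> T withBackoff(Callable<T> call, int maxAttempts,
                                    long baseDelayMs, long maxDelayMs) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                // Full jitter: sleep a random time between 0 and the exponential cap,
                // so retries from many workers do not arrive in synchronized waves.
                long cap = Math.min(maxDelayMs, baseDelayMs * (1L << attempt));
                Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));
            }
        }
        throw last; // after maxAttempts, hand the record to the DLQ path
    }
}

// Usage (hypothetical client): Retry.withBackoff(() -> enrichmentClient.lookup(key), 5, 100, 5_000);
```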
Self-check
- Does lag stabilize within a target window after a spike?
- Is p95 end-to-end latency within SLO during peak?
- Is GC time under 5–10%, with no OOMs?
- Are retries bounded and DLQ volume acceptable?
Practical projects
- Build a Kafka pipeline with pause/resume and batch writes; run a synthetic spike and verify lag recovery.
- Create a Flink job with Async I/O and measure throughput vs. async capacity.
- Configure a Spark Structured Streaming job with maxOffsetsPerTrigger and observe micro-batch timings.
Exercises
Do these now, then compare your approach against the worked examples and checklists above.
Exercise 1: Cap intake and protect downstream
Goal: Configure a consumer to stabilize lag while keeping downstream latency low.
- Set a per-poll/per-batch cap.
- Implement pause/resume when internal queue exceeds a threshold.
- Batch writes to downstream with a small time/size bound.
- Decide when to scale consumers vs. when to throttle producers.
Exercise 2: Diagnose and fix an operator hotspot
Goal: Use metrics to find the slowest operator and relieve backpressure.
- Inspect stage/operator metrics to find increasing queue/idle ratios.
- Increase parallelism of the bottleneck only.
- Apply watermarks/TTL to limit state if stateful.
- Add retries with backoff to external calls; send failures to DLQ.
Exercise completion checklist
- Lag graph flattens during spike
- Downstream p95 latency within SLO
- No memory errors
- Retries bounded; DLQ captures bad records
Next steps
- Template your tuning playbook (thresholds, limits, alarms).
- Add autoscaling policies triggered by stable metrics (lag growth rate, processing time).
- Run game-days: replay traffic spikes and verify recovery.
Mini challenge
You have a topic with 48 partitions and a consumer group currently at 8 instances. During a burst, lag grows. You can either double instances, increase max.poll.records by 2x, or both. Which do you try first and why? Write a short justification, considering partitioning, downstream capacity, and latency SLOs.
Ready to check yourself? Take the Quick Test below.