Who this is for
- Data Architects who define reliability patterns for batch and streaming pipelines.
- Senior Data Engineers who implement robust ETL/ELT jobs and connectors.
- Platform Engineers who design orchestration, messaging, and data movement services.
Prerequisites
- Basic ETL/ELT concepts (ingest, transform, load, orchestration).
- Familiarity with APIs, message queues, or data lake/warehouse loaders.
- Comfort reading pseudo-configs (YAML/JSON) and reasoning about SLAs/SLOs.
Why this matters
Real tasks you will face:
- Keeping nightly loads stable despite API rate limits and intermittent network issues.
- Preventing duplicate records when a job retries after a partial failure.
- Designing dead-letter queues (DLQ) and replay procedures so bad events do not block the pipeline.
- Choosing correct retry/backoff so downstream systems recover instead of being overwhelmed.
- Meeting data freshness SLOs while containing costs and avoiding infinite retry loops.
Concept explained simply
Errors fall into two broad groups:
- Transient: temporary and often succeed on retry (e.g., 429 rate limit, network hiccup, lock contention).
- Permanent: will not succeed with retry without a change (e.g., schema mismatch, bad credentials, missing table).
Key building blocks:
- Retry policy: attempts, backoff strategy (fixed or exponential), jitter (randomization), and per-try timeout (a minimal sketch follows this list).
- Idempotency: repeating an operation yields the same result as performing it once. Achieved via natural keys, idempotency keys, upserts/merge, or deduplication.
- Circuit breaker: pause calls to a failing dependency for a cool-off window to prevent cascading failures.
- Dead-letter queue (DLQ): isolate bad records/messages for analysis and selective replay.
- Poison pill detection: limit retries per record to avoid infinite loops.
- Compensation: if an operation had side effects, apply a compensating action (or ensure side effects are idempotent).
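To make the retry-policy fields concrete, here is a minimal sketch in plain Python; the field names and default values are illustrative, not a prescribed standard:

```python
import random
from dataclasses import dataclass

@dataclass
class RetryPolicy:
    max_attempts: int = 6        # bounded retries: avoids infinite loops
    base_delay_s: float = 2.0    # first backoff interval
    factor: float = 2.0          # exponential growth per attempt
    cap_s: float = 120.0         # never wait longer than this
    jitter: float = 0.2          # +/-20% randomization
    per_try_timeout_s: float = 30.0

    def delay(self, attempt: int) -> float:
        """Backoff before retry number `attempt` (1-based), capped and jittered."""
        raw = min(self.base_delay_s * self.factor ** (attempt - 1), self.cap_s)
        return raw * random.uniform(1 - self.jitter, 1 + self.jitter)

# Example: delays for 6 attempts, roughly 2s, 4s, 8s, 16s, 32s, 64s (+/-20%)
policy = RetryPolicy()
print([round(policy.delay(a), 1) for a in range(1, policy.max_attempts + 1)])
```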
Mental model
Think of a pipeline as three safety nets:
- Guardrails: validate inputs early; fail fast on permanent errors (schema, auth, config).
- Cushion: for transient errors, apply bounded retries with exponential backoff and jitter, plus per-try timeouts.
- Escape hatch: for stubborn records, send to DLQ with rich context for later replay. The pipeline continues processing others.
Observability ties it together: counters for retries, DLQ rates, success rates; alerts trigger when thresholds are breached.
Design patterns you will use
- Fixed retry: simple, predictable. Use when load is small and dependency is stable. Example: 3 attempts, 10s wait.
- Exponential backoff with jitter: default for most transient errors to reduce thundering herds. Example: factor 2, cap 2 minutes, 20% jitter.
- Per-record retry vs whole-batch retry: prefer per-record in streaming or micro-batches to keep good data flowing.
- Idempotent sink writes: use upsert/merge on natural/business keys or hash-based dedup.
- DLQ + replay: send failed records with reason, attempt count, and trace ID. Provide replay tooling with rate limiting.
- Circuit breaker: open after the error rate spikes; half-open to probe recovery; close when healthy (a simplified sketch follows this list).
- Selective suppression: skip known permanent errors quickly (e.g., unsupported file type) and log them.
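As one way to realize the circuit-breaker pattern, here is a minimal sketch; it uses a simplified consecutive-failure trigger rather than a true error-rate window, and the thresholds are illustrative:

```python
import time

class CircuitBreaker:
    """Open after consecutive failures; half-open after a cool-off to probe recovery."""

    def __init__(self, failure_threshold: int = 5, cool_off_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cool_off_s = cool_off_s
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                                   # closed: normal traffic
        if time.monotonic() - self.opened_at >= self.cool_off_s:
            return True                                   # half-open: let a probe through
        return False                                      # open: shed load

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None                             # close the breaker

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()             # open (or re-open) the breaker
```

Callers check allow_request() before each dependency call and report the outcome with record_success() or record_failure().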
Worked examples
Example 1: API ingest with rate limits
Situation: You pull from a SaaS API that returns HTTP 429 at peak times.
- Classify error: transient.
- Policy: exponential backoff + jitter; respect the Retry-After header if present.
- Per-try timeout: 30s; max attempts: 6; backoff cap: 2 minutes.
- Idempotency: page with a stable cursor so reruns fetch the same data; for writes, include an idempotency key.
- Stop condition: on a permanent 4xx (e.g., 401), fail fast; on 5xx, keep retrying within the cap.
Why jitter?
If many workers retry at once, jitter spreads the retries, reducing load spikes and improving success probability.
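A minimal sketch of this policy for an idempotent GET, assuming the requests library; the URL, status-code classification, and limits mirror the bullets above but are otherwise illustrative:

```python
import random
import time
import requests

def fetch_page(url: str, max_attempts: int = 6, cap_s: float = 120.0) -> dict:
    """Idempotent GET with bounded retries, exponential backoff, jitter, and Retry-After support."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=30)                  # per-try timeout: 30s
        except requests.exceptions.RequestException:
            resp = None                                           # network hiccup: treat as transient
        if resp is not None:
            if resp.status_code == 200:
                return resp.json()
            if resp.status_code == 429 and "Retry-After" in resp.headers:
                time.sleep(min(float(resp.headers["Retry-After"]), cap_s))  # assumes seconds, not a date
                continue
            if 400 <= resp.status_code < 500 and resp.status_code not in (408, 429):
                resp.raise_for_status()                           # permanent 4xx (e.g., 401): fail fast
        delay = min(2.0 * 2 ** (attempt - 1), cap_s)              # transient: 5xx, timeout, bare 429
        time.sleep(delay * random.uniform(0.8, 1.2))              # 20% jitter
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")
```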
Example 2: Batch load to warehouse with upsert
Situation: A nightly job loads 10M rows into a warehouse. Network drops mid-load; rerun risks duplicates.
- Sink strategy: stage files; load into a temp table; merge into target on business key (idempotent upsert).
- Retry scope: re-upload only failed files; use content checksum to skip already staged files.
- Fail-fast rules: schema drift or missing columns — no retry; alert and stop.
- Metrics: files_total, files_loaded, files_skipped_checksum, merge_rows_inserted/updated.
Benefit
Merging on keys makes reruns safe. You can retry the job without duplicate records.
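A minimal sketch of the staged load and merge described above; MERGE syntax varies by warehouse, and the table names, business key, and the stage_file/run_sql callables are illustrative assumptions:

```python
import hashlib
from pathlib import Path
from typing import Callable, Iterable

# Idempotent upsert: merging on the business key makes reruns safe.
MERGE_SQL = """
MERGE INTO orders AS target
USING staging_orders AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET amount = source.amount, updated_at = source.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, updated_at)
  VALUES (source.order_id, source.amount, source.updated_at);
"""

def file_checksum(path: Path) -> str:
    """Content hash used to skip files already staged on a previous, failed run."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def load_nightly(
    files: Iterable[Path],
    already_staged: set,
    stage_file: Callable[[Path], None],    # hypothetical: uploads one file to the staging table
    run_sql: Callable[[str], None],        # hypothetical: executes SQL on the warehouse
) -> None:
    for path in files:
        digest = file_checksum(path)
        if digest in already_staged:
            continue                       # retry scope: re-upload only files not yet staged
        stage_file(path)
        already_staged.add(digest)
    run_sql(MERGE_SQL)                     # safe to re-run: the merge produces no duplicates
```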
Example 3: Streaming with poison-pill handling
Situation: A Kafka consumer transforms events. Some events fail due to unexpected values.
- Per-record retries: attempt up to 3 times with small backoff (e.g., 100ms, 200ms, 400ms).
- Classification: transient errors (downstream timeout) vs permanent (unparseable payload).
- DLQ: send permanent failures with event payload, reason, attempt_count, trace_id.
- Replay: fix transform logic; replay DLQ with throttling; ensure dedup on sink by event_id + event_ts.
- Back-pressure: if the DLQ rate exceeds its threshold, open the circuit breaker and scale consumers or pause partitions.
Tip
Include schema version in DLQ messages to reproduce the failure context during replay.
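A minimal sketch of the per-record handling in Example 3; the exception types used for classification and the sink_write/produce_dlq callables are illustrative assumptions:

```python
import json
import time
from typing import Callable

MAX_RECORD_ATTEMPTS = 3
BACKOFFS_S = [0.1, 0.2, 0.4]                        # small, bounded per-record backoff

def handle_event(event: dict,
                 transform: Callable[[dict], dict],
                 sink_write: Callable[[dict], None],    # dedup on event_id + event_ts happens in the sink
                 produce_dlq: Callable[[bytes], None]) -> None:
    for attempt in range(1, MAX_RECORD_ATTEMPTS + 1):
        try:
            sink_write(transform(event))
            return
        except ValueError as exc:                   # permanent: unparseable/unexpected payload
            _send_to_dlq(event, str(exc), attempt, produce_dlq)
            return
        except TimeoutError:                        # transient: downstream timeout, retry briefly
            if attempt == MAX_RECORD_ATTEMPTS:
                _send_to_dlq(event, "retries_exhausted", attempt, produce_dlq)
                return
            time.sleep(BACKOFFS_S[attempt - 1])

def _send_to_dlq(event: dict, reason: str, attempts: int,
                 produce_dlq: Callable[[bytes], None]) -> None:
    produce_dlq(json.dumps({
        "payload": event,
        "reason": reason,
        "attempt_count": attempts,
        "trace_id": event.get("trace_id"),
        "schema_version": event.get("schema_version"),  # helps reproduce the failure during replay
    }).encode("utf-8"))
```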
Exercises
Do these tasks, then compare with the solutions in the exercise panel below the lesson.
Exercise 1 — Define a resilient retry policy
- Create a retry policy for an idempotent GET call to an external API that sometimes returns 429 and timeouts.
- Include: max_attempts, backoff (exponential), jitter, per_try_timeout, overall_timeout, and stop conditions.
- Show how the policy changes for a non-idempotent POST (hint: idempotency key or no-retry + DLQ).
Exercise 2 — Design a DLQ and replay plan
- For a streaming pipeline, define DLQ message schema, retention, and when to send to DLQ.
- Outline a replay job with throttling and dedup safeguards.
- List the metrics and alerts you will monitor during replay.
Submission checklist
- [ ] Retries bounded with a clear cap and timeouts
- [ ] Jitter applied to avoid synchronized retries
- [ ] Idempotency or dedup strategy at the sink
- [ ] DLQ contains payload, reason, attempts, and trace
- [ ] Replay process is throttled and safe
Common mistakes and how to self-check
- Mistake: Infinite retries on permanent errors.
  Self-check: Do you classify errors and fail fast on schema/auth/config issues?
- Mistake: Fixed backoff without jitter during traffic spikes.
  Self-check: Are retries randomized within a range?
- Mistake: Retrying non-idempotent operations causing duplicates.
  Self-check: Do you use idempotency keys or dedup on the sink?
- Mistake: Whole-batch retry for a few bad records.
  Self-check: Can you isolate and DLQ bad records while continuing?
- Mistake: Missing per-try timeouts leading to thread exhaustion.
  Self-check: Are timeouts shorter than the overall SLA and applied per attempt?
- Mistake: No observability for retries/DLQ.
  Self-check: Do you track retry counts, error rates, DLQ volume, and alert on thresholds?
Practical projects
- Build a resilient file ingest: pull files from an unstable SFTP, stage to object storage, and merge into a table. Include retries with jitter, checksum-based skip, and DLQ for corrupt files.
- Streaming transform with DLQ: consume events, validate schema, transform, load to a warehouse. Poison-pill detection, DLQ with replay job, dedup by event_id.
- Rate-limited API extractor: paginate through an API with 429 handling using exponential backoff and Retry-After. Add circuit breaker and metrics dashboard.
Acceptance criteria checklist
- [ ] Bounded retries with exponential backoff and jitter
- [ ] Idempotent sink writes (upsert/merge or dedup)
- [ ] DLQ with replay and throttling
- [ ] Metrics (success rate, retry count, DLQ rate) and alerts on threshold breach
Learning path
- Next: Data quality checks and SLAs (validate before retrying endlessly).
- Then: Orchestration patterns and failure domains (task vs DAG retries).
- Then: Schema evolution strategies (prevent permanent errors at the source).
- Then: Observability and alerting (measure health, not just failures).
Next steps
- Draft a standard retry/DLQ guideline for your team using the templates from the exercises.
- Add metrics and alerts for retries and DLQ volumes to your existing pipelines.
- Run the Quick Test to confirm understanding. Note: the Quick Test is available to everyone; only logged-in users will see saved progress.
Mini challenge
You onboard a new partner feed. During first load, 8% of records fail due to unexpected date formats; the partner API also rate-limits you during work hours. In 20 minutes, sketch a plan that achieves at-least-once delivery without duplicates and meets a 2-hour freshness target.
See a sample outline
- Classify: date parse failures are permanent-per-record; 429s are transient.
- Ingest policy: exponential backoff (base 2s, cap 120s, jitter 20%), respect Retry-After, per-try timeout 30s, max attempts 6.
- Transform: strict date parser with fallback patterns; if still invalid, route to DLQ with payload, reason, attempt_count, trace_id.
- Sink: stage + merge on business key for idempotent writes; dedup on (partner_id, record_id, event_ts).
- Replay: fix parsing rules; replay DLQ with throttle (e.g., 200 rps), monitor duplicates via a dedup metric.
- SLO: alert if the DLQ rate exceeds 5% for 10 minutes or freshness lag exceeds 90 minutes.
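A minimal sketch of the throttled replay step from the outline; the 200 rps cap comes from the outline above, while the dedup key and the write_record callable are illustrative assumptions:

```python
import time
from typing import Callable, Iterable

def replay_dlq(messages: Iterable[dict],
               write_record: Callable[[dict], None],   # hypothetical idempotent sink write
               max_rps: float = 200.0) -> dict:
    """Replay DLQ messages at a bounded rate, skipping records already seen in this run."""
    seen = set()
    stats = {"replayed": 0, "skipped_duplicate": 0}
    min_interval = 1.0 / max_rps
    for msg in messages:
        record = msg["payload"]
        key = (record.get("partner_id"), record.get("record_id"), record.get("event_ts"))
        if key in seen:
            stats["skipped_duplicate"] += 1            # dedup metric: watch this during replay
            continue
        write_record(record)
        seen.add(key)
        stats["replayed"] += 1
        time.sleep(min_interval)                       # crude throttle: roughly max_rps records/second
    return stats
```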