Why this matters
In real ETL operations, things fail: APIs time out, schemas change, and clusters run out of memory. As an ETL Developer, you must design pipelines that fail safely and recover predictably. This protects SLAs, keeps data trustworthy, and avoids costly manual fixes.
- Meet data SLAs even when upstream services are flaky.
- Backfill and re-run jobs without duplicating rows.
- Alert the right people quickly and give them actionable context.
- Resume from checkpoints so you don’t reprocess everything.
Who this is for
- ETL/ELT developers and data engineers building scheduled pipelines.
- Analysts/analytics engineers who own daily or hourly loads.
- Platform engineers responsible for orchestration reliability.
Prerequisites
- Basic familiarity with an orchestrator (e.g., Airflow/Prefect/Dagster) concepts: tasks, DAG/flow, schedules.
- Comfort with SQL and a scripting language (e.g., Python).
- Understanding of batch vs. streaming at a high level.
Concept explained simply
Failures come in two main flavors: transient and permanent.
- Transient: short-lived issues like network blips, rate limits, or temporary unavailability. Strategy: retry with backoff and jitter.
- Permanent: structural issues like bad credentials, breaking schema changes, or corrupt files. Strategy: fail fast, alert, and require a fix before re-running.
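To make the two strategies concrete, here is a minimal Python sketch; extract_batch, the exception groupings, and the 60-second base delay are illustrative assumptions, not part of any specific client library.
import random
import time

TRANSIENT = (TimeoutError, ConnectionError)   # short-lived: worth retrying
PERMANENT = (PermissionError, ValueError)     # structural: fail fast and let a human fix it

def fetch_with_policy(extract_batch, max_retries=4, base_delay=60):
    # Retry transient failures with exponential backoff + jitter; fail fast on permanent ones
    for attempt in range(max_retries + 1):
        try:
            return extract_batch()
        except PERMANENT:
            raise                                          # e.g. bad credentials, breaking schema change
        except TRANSIENT:
            if attempt == max_retries:
                raise                                      # out of retries: let the orchestrator alert
            delay = base_delay * (2 ** attempt)
            time.sleep(delay * random.uniform(0.9, 1.1))   # ±10% jitter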
Mental model
Think of your pipeline like a delivery route with checkpoints:
- Checkpoints (watermarks) remember the last safely delivered package.
- Deliveries are idempotent: delivering the same package twice doesn’t create a duplicate.
- If a road is temporarily closed, you wait and try again (retry with backoff).
- If the address is invalid, you stop, log it, and put the package aside for review (dead-letter).
Types of failures you will see
- Resource limits: out-of-memory, disk full, concurrency/pool limits.
- Data quality: unexpected nulls, schema drift, out-of-range values.
- Integration: API 429/5xx, timeouts, auth errors, file-not-found.
- Orchestrator: task crashes, worker lost, scheduling delays.
Always classify the failure and choose the right response: retry if transient; fail and alert if permanent.
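For HTTP sources, one simple way to classify is by status code. The groupings below are common conventions rather than a rule from this lesson, so adjust them to your APIs:
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}   # timeouts, rate limits, server-side errors
FATAL_STATUS = {400, 401, 403, 404}                 # bad request, auth problems, missing resource

def should_retry(status_code: int) -> bool:
    # Retry only what can plausibly succeed later; everything else needs a code or config fix first
    return status_code in RETRYABLE_STATUS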
Core patterns for recovery
- Retries with exponential backoff (+ jitter): Avoid retry storms and give upstreams time to recover.
- Idempotent writes: Use upserts/merges keyed by a natural or surrogate key so re-runs don’t duplicate rows.
- Checkpointing/watermarks: Store last processed offset/date in a durable metadata table.
- Transactional/atomic loads: Load to staging, validate, then swap/merge in one transaction (see the sketch after this list).
- Dead-letter queues (DLQ): Route bad records for later inspection without blocking good ones.
- Selective re-runs: Re-run failed tasks/partitions only, not the entire pipeline.
- Alerting with context: Include run id, partition, error type, and next action.
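Most of these patterns appear in the worked examples below; the transactional/atomic load is the one not shown there, so here is a minimal sketch. It assumes a DB-API style connection (conn), a raw.sales source table, and the staging.sales_run/dw.sales names used later in this lesson:
def load_atomically(conn, partition_date):
    # Load to staging, validate, then publish, all inside one transaction
    with conn:   # DB-API style: commit on success, roll back on any exception
        cur = conn.cursor()
        cur.execute("DELETE FROM staging.sales_run")
        cur.execute(
            "INSERT INTO staging.sales_run SELECT * FROM raw.sales WHERE partition_date = %s",
            (partition_date,),
        )
        cur.execute("SELECT COUNT(*) FROM staging.sales_run WHERE amount < 0")
        if cur.fetchone()[0] > 0:
            raise ValueError("Negative amounts in staging; aborting before touching dw.sales")
        # Publish: replace the partition from validated staging data
        cur.execute("DELETE FROM dw.sales WHERE partition_date = %s", (partition_date,))
        cur.execute("INSERT INTO dw.sales SELECT * FROM staging.sales_run")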
Tip: choosing retry counts
Common defaults: 3–5 retries with exponential backoff (e.g., 1m, 2m, 4m, 8m...). Add jitter of ±10–20% to avoid synchronized retries.
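If you want to see what that schedule looks like in code, a tiny helper (illustrative only) can generate the delays:
import random

def backoff_delays(retries=4, base_seconds=60, jitter=0.2):
    # retries=4, base_seconds=60 -> roughly 1m, 2m, 4m, 8m, each randomized by ±20%
    return [base_seconds * (2 ** i) * random.uniform(1 - jitter, 1 + jitter) for i in range(retries)]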
Worked examples
Example 1: Safe retries + idempotent merge
Goal: If the task is retried, it should not create duplicates.
# Pseudocode (Python-like): task-level retry settings in your orchestrator
retries = 4               # give transient failures a few chances, then surface the error
retry_backoff = True      # exponential delays: 1m, 2m, 4m, 8m...
retry_jitter = True       # randomize each delay so workers don't retry in lockstep
# Idempotent load: MERGE by business_key
SQL = """
MERGE INTO dw.sales AS t
USING staging.sales_run AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, created_at) VALUES (s.order_id, s.amount, s.created_at);
"""
Why it works: MERGE ensures replays update/insert cleanly; backoff+jitter reduces pressure on upstreams.
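In a real orchestrator, the first three settings usually become task parameters rather than variables. For example, a Prefect 2-style task might look like the sketch below; the parameter names come from Prefect's task retry options, so double-check them against your version, and run_merge is a placeholder for your own warehouse call:
from prefect import task
from prefect.tasks import exponential_backoff

@task(
    retries=4,
    retry_delay_seconds=exponential_backoff(backoff_factor=60),  # roughly 1m, 2m, 4m, 8m
    retry_jitter_factor=0.2,                                     # randomize each delay
)
def load_sales(partition_date: str):
    run_merge(SQL, partition_date)   # run_merge is your own warehouse helper (placeholder)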
Example 2: Fail fast on data quality + DLQ
Goal: Do not load broken records; capture them for later.
# Validate before load (pseudocode; write and partition_date are placeholders for your own helpers)
invalid = SELECT * FROM staging.users WHERE email NOT LIKE '%@%';
if count(invalid) > 0:
    write(invalid, "dlq/users_bad_email/" + partition_date)    # quarantine bad rows
    raise DataQualityError("Invalid emails detected in " + partition_date)
# If no invalid rows, proceed to the MERGE
Why it works: Good data loads; bad rows are quarantined for inspection. The pipeline fails with a clear reason.
Example 3: Checkpointed resume with a watermark
Goal: Resume from last successful point without reprocessing.
# Pseudocode: metadata.watermarks(entity, last_processed_at) is a durable checkpoint table
last_ts = SELECT last_processed_at FROM metadata.watermarks WHERE entity = 'payments';
new_rows = SELECT * FROM raw.payments WHERE updated_at > last_ts ORDER BY updated_at ASC;
process(new_rows)    # idempotent load (e.g., the MERGE pattern from Example 1)
new_last_ts = SELECT MAX(updated_at) FROM new_rows;
UPDATE metadata.watermarks SET last_processed_at = new_last_ts WHERE entity = 'payments';  # advance only after the load commits
Why it works: The watermark advances only after a successful load, so a failed run resumes from the last good point instead of reprocessing everything.
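If you need a starting point for that metadata table, a minimal shape could be the following (illustrative DDL; adjust names and types to your warehouse). Where the warehouse allows it, advance the watermark in the same transaction as the load so the two can never drift apart.
DDL = """
CREATE TABLE IF NOT EXISTS metadata.watermarks (
    entity            VARCHAR(100) PRIMARY KEY,  -- e.g. 'payments'
    last_processed_at TIMESTAMP    NOT NULL
);
"""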
Example 4: Backfill by partition safely
Goal: Backfill specific dates without double-processing.
# For each partition_date in 2024-10-01..2024-10-07
# Read from raw where partition_date = X
# MERGE into target keyed by (partition_date, business_key)
Why it works: Partition-scoped, idempotent merges make backfills predictable.
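A minimal sketch of that backfill loop, assuming a run_sql helper, a raw.sales source, and a partition-scoped MERGE statement (MERGE_BY_PARTITION_SQL); all three are placeholders rather than a specific API:
from datetime import date, timedelta

def backfill(run_sql, start=date(2024, 10, 1), end=date(2024, 10, 7)):
    day = start
    while day <= end:
        # Stage exactly one partition, then merge it idempotently into the target
        run_sql("DELETE FROM staging.sales_run")
        run_sql("INSERT INTO staging.sales_run SELECT * FROM raw.sales WHERE partition_date = %s", (day,))
        run_sql(MERGE_BY_PARTITION_SQL, (day,))   # MERGE keyed by (partition_date, business_key)
        day += timedelta(days=1)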
Exercises
Complete the exercise below; anyone can do it. The Quick Test is also available to everyone, but only logged-in users will have saved progress.
Exercise ex1 — Build a resilient daily load
Create a daily partitioned job that:
- Retries transient failures with exponential backoff and jitter.
- Loads data idempotently via a MERGE/UPSERT keyed by (partition_date, business_key).
- Stores and uses a watermark to resume after partial failure.
Checklist:
- Retry count and backoff are configured.
- MERGE or equivalent ensures no duplicates on re-run.
- Watermark table updated only on successful completion.
- Alert includes partition date and error summary.
Common mistakes and self-check
- Retrying permanent errors. Self-check: Do your retries target transient issues (timeouts, 429s), and do you fail fast on schema/auth errors?
- No idempotency. Self-check: If you re-run yesterday’s partition, do counts or unique keys change unexpectedly?
- Advancing watermark too early. Self-check: Is the watermark updated only after the load commits?
- Retry storms. Self-check: Do you use exponential backoff, jitter, and concurrency/pool limits?
- Poor alerts. Self-check: Do alerts include run id, partition, owner, and next action?
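For the last self-check, an alert payload with enough context might look like the sketch below; the field names and the send_alert helper are suggestions, not a standard:
alert = {
    "pipeline": "daily_sales_load",
    "run_id": "2024-10-05T06:00:00+00:00",   # take this from your orchestrator's run context
    "partition": "2024-10-05",
    "error_type": "DataQualityError",
    "summary": "37 rows with invalid emails routed to DLQ",
    "owner": "data-eng-oncall",
    "next_action": "Inspect dlq/users_bad_email/2024-10-05, fix upstream, re-run the partition",
}
send_alert(alert)   # send_alert is your own notifier (Slack, PagerDuty, email), not a specific API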
Troubleshooting playbook
- Transient 5xx/timeout: increase backoff window, verify upstream status, limit concurrency.
- Data quality failures: route to DLQ, document rule, fix upstream or transform.
- Schema drift: add compatibility layer (nullable columns, CAST), version checks, or block and coordinate a fix.
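For the schema-drift case, a lightweight check before the load can stop drift from reaching the target. The sketch below queries information_schema (standard SQL); run_sql and the expected column set are placeholders for your own setup:
EXPECTED = {"order_id", "amount", "created_at", "updated_at"}

def check_schema(run_sql, schema="staging", table="sales_run"):
    rows = run_sql(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_schema = %s AND table_name = %s",
        (schema, table),
    )
    actual = {r[0] for r in rows}
    missing = EXPECTED - actual
    extra = actual - EXPECTED
    if missing:
        raise RuntimeError(f"Schema drift: missing columns {sorted(missing)}")   # block and coordinate a fix
    if extra:
        print(f"New columns detected, not yet mapped: {sorted(extra)}")          # candidate nullable additions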
Practical projects
- Build a partitioned ingestion pipeline that backfills a week and proves idempotency with MERGE.
- Create a DLQ process: capture bad rows, log reasons, and add a small script to reprocess after fixes.
- Implement a metadata table for watermarks and demonstrate resume after an injected failure.
Learning path
- Start with retries/backoff and failure classification.
- Add idempotent loads (MERGE/UPSERT) and transactional staging.
- Introduce watermarks/checkpoints and selective re-runs.
- Implement DLQ and data validation gates.
- Improve observability: logging, metrics, and actionable alerts.
Next steps
- Finish the exercise and validate no duplicates on re-run.
- Take the Quick Test below to check your understanding. Anyone can take it; only logged-in users will have saved progress.
- Apply these patterns to one existing pipeline this week.
Mini challenge
Your hourly job ingests from an API that returns 5xx for 15 minutes once a day. Yesterday’s run created duplicates. In one paragraph, describe the changes you’ll make today to ensure: no duplication, controlled retries, and clear alerts with the impacted hour.