
Handling Failures And Recovery

Learn Handling Failures And Recovery for free with explanations, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

In real ETL operations, things fail: APIs time out, schemas change, and clusters run out of memory. As an ETL Developer, you must design pipelines that fail safely and recover predictably. This protects SLAs, keeps data trustworthy, and avoids costly manual fixes.

  • Meet data SLAs even when upstream services are flaky.
  • Backfill and re-run jobs without duplicating rows.
  • Alert the right people quickly and give them actionable context.
  • Resume from checkpoints so you don’t reprocess everything.

Who this is for

  • ETL/ELT developers and data engineers building scheduled pipelines.
  • Analysts/analytics engineers who own daily or hourly loads.
  • Platform engineers responsible for orchestration reliability.

Prerequisites

  • Basic familiarity with an orchestrator (e.g., Airflow/Prefect/Dagster) concepts: tasks, DAG/flow, schedules.
  • Comfort with SQL and a scripting language (e.g., Python).
  • Understanding of batch vs. streaming at a high level.

Concept explained simply

Failures come in two main flavors: transient and permanent.

  • Transient: short-lived issues like network blips, rate limits, or temporary unavailability. Strategy: retry with backoff and jitter.
  • Permanent: structural issues like bad credentials, breaking schema changes, or corrupt files. Strategy: fail fast, alert, and require a fix before re-running.

Mental model

Think of your pipeline like a delivery route with checkpoints:

  • Checkpoints (watermarks) remember the last safely delivered package.
  • Deliveries are idempotent: dropping the same package twice doesn’t create duplicates.
  • If a road is temporarily closed, you wait and try again (retry with backoff).
  • If the address is invalid, you stop, log it, and put the package aside for review (dead-letter).

Types of failures you will see

  • Resource limits: out-of-memory, disk full, concurrency/pool limits.
  • Data quality: unexpected nulls, schema drift, out-of-range values.
  • Integration: API 429/5xx, timeouts, auth errors, file-not-found.
  • Orchestrator: task crashes, worker lost, scheduling delays.

Always classify the failure and choose the right response: retry if transient; fail and alert if permanent.
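
A minimal sketch of that decision for HTTP-style sources; the status-code bucket below is an assumption to adapt per integration, not an exhaustive rule.

# Classify a failure and pick the response (illustrative status codes)
TRANSIENT_STATUS = {408, 429, 500, 502, 503, 504}   # timeouts, rate limits, upstream blips

def next_action(status_code: int) -> str:
    """Map a failure to the response described above."""
    if status_code in TRANSIENT_STATUS:
        return "retry"        # retry with exponential backoff + jitter
    return "fail_fast"        # alert with context; fix before re-running

assert next_action(429) == "retry"
assert next_action(401) == "fail_fast"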

Core patterns for recovery

  1. Retries with exponential backoff (+ jitter): Avoid retry storms and give upstreams time to recover.
  2. Idempotent writes: Use upserts/merges keyed by a natural or surrogate key so re-runs don’t duplicate rows.
  3. Checkpointing/watermarks: Store last processed offset/date in a durable metadata table.
  4. Transactional/atomic loads: Load to staging, validate, then swap/merge in one transaction (see the sketch after this list).
  5. Dead-letter queues (DLQ): Route bad records for later inspection without blocking good ones.
  6. Selective re-runs: Re-run failed tasks/partitions only, not the entire pipeline.
  7. Alerting with context: Include run id, partition, error type, and next action.
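
To make pattern 4 concrete, here is one possible shape of a staged, transactional load. It assumes a psycopg2-style DB-API connection conn and illustrative table names; a warehouse-native MERGE or partition swap works just as well.

def load_partition_atomically(conn, partition_date: str) -> None:
    """Validate staging, then replace one target partition in a single transaction."""
    with conn:                              # psycopg2-style: commit on success, rollback on error
        with conn.cursor() as cur:
            # Validate staging first; abort (and roll back) if anything looks wrong.
            cur.execute("SELECT COUNT(*) FROM staging.sales_run WHERE amount < 0")
            (bad_rows,) = cur.fetchone()
            if bad_rows:
                raise ValueError(f"{bad_rows} staged rows failed validation")

            # Swap the partition: delete-then-insert inside the same transaction.
            cur.execute("DELETE FROM dw.sales WHERE partition_date = %s", (partition_date,))
            cur.execute(
                """
                INSERT INTO dw.sales (order_id, amount, created_at, partition_date)
                SELECT order_id, amount, created_at, partition_date
                FROM staging.sales_run
                WHERE partition_date = %s
                """,
                (partition_date,),
            )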

Tip: choosing retry counts

Common defaults: 3–5 retries with exponential backoff (e.g., 1m, 2m, 4m, 8m...). Add jitter of ±10–20% to avoid synchronized retries.
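
A hand-rolled sketch of these defaults in plain Python; the TransientError type, retry budget, and base delay are assumptions to tune per source.

import random
import time

class TransientError(Exception):
    """Raise (or map client errors to) this for retryable failures only."""

def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 60.0, jitter: float = 0.2):
    """Retry fn() on transient errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise                                        # budget exhausted: surface the failure
            delay = base_delay * 2 ** (attempt - 1)          # 1m, 2m, 4m, 8m...
            delay *= 1 + random.uniform(-jitter, jitter)     # ±20% jitter de-synchronizes retries
            time.sleep(delay)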

Worked examples

Example 1: Safe retries + idempotent merge

Goal: If the task is retried, it should not create duplicates.

# Pseudocode (Python-like)
retries = 4
retry_backoff = True   # exponential
retry_jitter = True

# Idempotent load: MERGE by business_key
SQL = """
MERGE INTO dw.sales AS t
USING staging.sales_run AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT (order_id, amount, created_at) VALUES (s.order_id, s.amount, s.created_at);
"""

Why it works: MERGE ensures replays update/insert cleanly; backoff+jitter reduces pressure on upstreams.
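
If you run on Airflow 2.x, the retry settings above map onto standard BaseOperator arguments; the DAG below is a minimal sketch with a hypothetical loader callable, and exact scheduling arguments vary slightly by version.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_sales_partition(**context):
    """Run the MERGE above for the run's partition; re-runs stay idempotent."""
    ...

with DAG(
    dag_id="daily_sales_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                           # Airflow >= 2.4; older versions use schedule_interval
    catchup=False,
) as dag:
    load_sales = PythonOperator(
        task_id="load_sales",
        python_callable=load_sales_partition,
        retries=4,                               # matches retries = 4 above
        retry_delay=timedelta(minutes=1),        # first wait ~1 minute
        retry_exponential_backoff=True,          # 1m, 2m, 4m, ... between attempts
        max_retry_delay=timedelta(minutes=30),   # cap the backoff
    )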

Example 2: Fail fast on data quality + DLQ

Goal: Do not load broken records; capture them for later.

# Validate before load
invalid = SELECT * FROM staging.users WHERE email NOT LIKE '%@%';
IF COUNT(invalid) > 0:
    WRITE invalid TO dlq/users_bad_email/partition_date
    RAISE DataQualityError("Invalid emails detected")
# If no invalid rows, proceed to MERGE

Why it works: Good data loads; bad rows are quarantined for inspection. The pipeline fails with a clear reason.
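
One concrete version of this gate, assuming a psycopg2-style connection conn and a dlq schema next to staging; all object names are illustrative.

class DataQualityError(Exception):
    """Signals that validation found rows that must not reach the target table."""

def quarantine_bad_emails(conn, partition_date: str) -> None:
    """Copy invalid rows to the DLQ, then fail the run if any were found."""
    with conn:                                   # commit the quarantine even though the run will fail
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO dlq.users_bad_email (user_id, email, partition_date)
                SELECT user_id, email, %s
                FROM staging.users
                WHERE email NOT LIKE '%%@%%'
                """,
                (partition_date,),
            )
            bad_rows = cur.rowcount
    if bad_rows:
        raise DataQualityError(f"{bad_rows} invalid emails quarantined for {partition_date}")
    # No invalid rows: the caller proceeds to the MERGE.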

Example 3: Checkpointed resume with a watermark

Goal: Resume from last successful point without reprocessing.

# metadata.watermarks(entity, last_processed_at)
last_ts = SELECT last_processed_at FROM metadata.watermarks WHERE entity = 'payments';
new_rows = SELECT * FROM raw.payments WHERE updated_at > last_ts ORDER BY updated_at ASC;
process(new_rows)
new_last_ts = MAX(updated_at) FROM new_rows
UPDATE metadata.watermarks SET last_processed_at = new_last_ts WHERE entity = 'payments';

Why it works: the watermark advances only after the load commits, so a failed run simply resumes from the last successful point instead of reprocessing everything.
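
A compact sketch of the same flow, again assuming a psycopg2-style connection; process_rows() is a hypothetical transform that writes the rows and returns the max updated_at it handled.

def run_incremental_payments(conn) -> None:
    """Load new payments and advance the watermark in the same transaction."""
    with conn:                                     # a failure rolls back the load AND the watermark
        with conn.cursor() as cur:
            cur.execute(
                "SELECT last_processed_at FROM metadata.watermarks WHERE entity = %s",
                ("payments",),
            )
            (last_ts,) = cur.fetchone()
            cur.execute(
                "SELECT * FROM raw.payments WHERE updated_at > %s ORDER BY updated_at",
                (last_ts,),
            )
            rows = cur.fetchall()
            if not rows:
                return                             # nothing new: the watermark stays put
            new_last_ts = process_rows(cur, rows)  # hypothetical: loads rows, returns max(updated_at)
            cur.execute(
                "UPDATE metadata.watermarks SET last_processed_at = %s WHERE entity = %s",
                (new_last_ts, "payments"),
            )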

Example 4: Backfill by partition safely

Goal: Backfill specific dates without double-processing.

# For each partition_date in 2024-10-01..2024-10-07
# Read from raw where partition_date = X
# MERGE into target keyed by (partition_date, business_key)

Why it works: Partition-scoped, idempotent merges make backfills predictable.
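
One way to drive that loop, assuming a hypothetical merge_partition(conn, day) that runs the partition-scoped MERGE outlined above.

from datetime import date, timedelta

def backfill(conn, start: date, end: date) -> None:
    """Backfill one partition at a time so a failure only blocks a single day."""
    day = start
    while day <= end:
        merge_partition(conn, day)   # hypothetical; idempotent via (partition_date, business_key)
        day += timedelta(days=1)

# backfill(conn, date(2024, 10, 1), date(2024, 10, 7))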

Exercises

Complete the exercise below; anyone can do it. The Quick Test is also open to everyone, but only logged-in users have their progress saved.

Exercise ex1 — Build a resilient daily load

Create a daily partitioned job that:

  • Retries transient failures with exponential backoff and jitter.
  • Loads data idempotently via a MERGE/UPSERT keyed by (partition_date, business_key).
  • Stores and uses a watermark to resume after partial failure.

Checklist:

  • Retry count and backoff are configured.
  • MERGE or equivalent ensures no duplicates on re-run.
  • Watermark table updated only on successful completion.
  • Alert includes partition date and error summary.

Common mistakes and self-check

  • Retrying permanent errors. Self-check: Do your retries target timeouts/429, and do you fail fast on schema/auth errors?
  • No idempotency. Self-check: If you re-run yesterday’s partition, do counts or unique keys change unexpectedly?
  • Advancing watermark too early. Self-check: Is the watermark updated only after the load commits?
  • Retry storms. Self-check: Do you use exponential backoff, jitter, and concurrency/pool limits?
  • Poor alerts. Self-check: Do alerts include run id, partition, owner, and next action? A minimal payload is sketched below.
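
A minimal sketch of an alert payload carrying that context; the field names and the send_alert() transport are assumptions, not any specific tool's API.

def build_alert(run_id: str, partition_date: str, error: Exception) -> dict:
    """Bundle everything the on-call person needs to act without digging through logs."""
    return {
        "pipeline": "daily_sales_load",
        "run_id": run_id,
        "partition_date": partition_date,
        "error_type": type(error).__name__,
        "message": str(error),
        "owner": "data-eng-oncall",
        "next_action": "Fix the cause, then re-run this partition; the MERGE keeps it idempotent.",
    }

# send_alert(build_alert(run_id, "2024-10-03", err))   # hypothetical transport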

Troubleshooting playbook

  • Transient 5xx/timeout: increase backoff window, verify upstream status, limit concurrency.
  • Data quality failures: route to DLQ, document rule, fix upstream or transform.
  • Schema drift: add compatibility layer (nullable columns, CAST), version checks, or block and coordinate a fix.

Practical projects

  • Build a partitioned ingestion pipeline that backfills a week and proves idempotency with MERGE.
  • Create a DLQ process: capture bad rows, log reasons, and add a small script to reprocess after fixes.
  • Implement a metadata table for watermarks and demonstrate resume after an injected failure.

Learning path

  1. Start with retries/backoff and failure classification.
  2. Add idempotent loads (MERGE/UPSERT) and transactional staging.
  3. Introduce watermarks/checkpoints and selective re-runs.
  4. Implement DLQ and data validation gates.
  5. Improve observability: logging, metrics, and actionable alerts.

Next steps

  • Finish the exercise and validate no duplicates on re-run.
  • Take the Quick Test below to check your understanding. Anyone can take it; only logged-in users will have saved progress.
  • Apply these patterns to one existing pipeline this week.

Mini challenge

Your hourly job ingests from an API that returns 5xx for 15 minutes once a day. Yesterday’s run created duplicates. In one paragraph, describe the changes you’ll make today to ensure no duplication, controlled retries, and clear alerts that name the impacted hour.

Practice Exercises

1 exercise to complete

Instructions

Create a daily job that loads sales data for a given partition_date and can safely retry and resume.

  1. Configure retries: 3–5 attempts with exponential backoff and jitter.
  2. Implement idempotent loading via MERGE/UPSERT on (partition_date, order_id).
  3. Maintain a watermark table metadata.watermarks(entity, last_processed_partition) and update only after a successful transaction.
  4. Inject a failure on the first attempt (e.g., raise an exception after staging load) to simulate partial processing.
  5. Re-run the job and verify no duplicate rows and that the watermark advances correctly.

You may express the solution in SQL + pseudocode or a Python-like script. Keep it self-contained.

Expected Output
A job/DAG or script that, after an injected failure, can be re-run without duplicate inserts and resumes from the last safe partition. The final row counts match a single successful load; the watermark table reflects the processed partition/date.

Handling Failures And Recovery — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

