
Error Handling And Retry Patterns

Learn Error Handling And Retry Patterns for free with explanations, exercises, and a quick test (for Analytics Engineers).

Published: December 23, 2025 | Updated: December 23, 2025

Why this matters

Data pipelines fail in the real world: flaky APIs, late files, schema drifts, and occasional warehouse hiccups. As an Analytics Engineer, you must prevent minor glitches from becoming data incidents. Solid error handling and retry patterns keep dashboards fresh, models trustworthy, and on-call nights quiet.

  • Ingestion: handle 500/429 API errors without losing data.
  • Warehouse loads: quarantine bad rows while loading the good ones.
  • Transformations: re-run safely without duplicates or partial updates.
  • Scheduling: avoid endless retries that block downstream tasks.

Concept explained simply

Think of your pipeline as a delivery service. Errors are roadblocks. Retries are choosing another route later. Idempotency is ensuring a package is delivered once, even if the driver tries twice. Dead-letter queues are safe lockers for damaged packages. Alerts are text messages when something needs human attention.

Mental model

  • Expect failure: plan for transient and permanent errors.
  • Fail small: isolate bad rows/files without stopping the whole run.
  • Recover safely: re-runs must not create duplicates or corrupt data.
  • Be observable: log, count, alert, and keep artifacts for diagnosis.

Core patterns you will use

Retry with backoff and jitter

Retry transient errors (timeouts, 429 rate limits, brief warehouse unavailability). Use exponential backoff (e.g., 1s, 2s, 4s, 8s...) plus jitter (a random offset) to avoid thundering herds. Cap both the number of attempts and the total time spent.

  • Good for: network blips, temporary locks.
  • Avoid for: validation errors, missing credentials (these won’t fix themselves).
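
To make the schedule concrete, here is a minimal Python sketch of backoff with jitter, assuming a requests-style HTTP fetch; the URL, function names, and thresholds are illustrative, not a specific API.

# Retry transient failures with exponential backoff plus jitter, capped by both
# a maximum attempt count and a total time budget.
import random
import time

import requests

TRANSIENT_STATUS = {429, 500, 502, 503, 504}

class TransientError(Exception):
    pass

def fetch_with_retry(url, max_attempts=5, base_delay=2.0, total_cap=120.0):
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=30)          # per-call timeout
            if resp.status_code in TRANSIENT_STATUS:
                raise TransientError(f"HTTP {resp.status_code}")
            resp.raise_for_status()                       # permanent 4xx: fail fast, no retry
            return resp.json()
        except (TransientError, requests.Timeout, requests.ConnectionError):
            # Exponential backoff (2s, 4s, 8s, ...) plus random jitter.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            if attempt == max_attempts or time.monotonic() - start + delay > total_cap:
                raise
            time.sleep(delay)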

Idempotency

Design every step to be safe to re-run.

  • Deterministic file naming: include date/batch_id so re-processing the same file is predictable.
  • Upserts/MERGE into targets keyed by natural keys, not blind INSERT.
  • Stage-then-swap: write to staging, validate, then publish with an atomic swap or MERGE into production.
  • Checksums or load manifests to detect duplicates.
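
As a small sketch of the deterministic-naming bullet above: the same source and run date always map to the same batch_id and staging location, so a re-run overwrites its own output instead of creating duplicates. The bucket and path layout here are hypothetical.

# Deterministic identifiers: re-processing the same day and source is predictable.
from datetime import date

def batch_id(source: str, run_date: date) -> str:
    return f"{source}_{run_date.isoformat()}"            # e.g. "crm_2025-12-23"

def staging_path(source: str, run_date: date) -> str:
    return f"s3://my-bucket/staging/{source}/dt={run_date.isoformat()}/data.parquet"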

Dead-letter quarantine

Separate bad records so the rest can flow. Store rejected rows with a reason, batch_id, and timestamp. Review and fix or reprocess later.
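
An ingestion-side sketch of the same idea, assuming rejected records are appended to a JSON-lines file; the path and field names are illustrative, and the warehouse-side split appears in the SQL snippets below.

# Keep bad records with enough context (reason, batch_id, timestamp) to diagnose
# and reprocess them later instead of silently dropping them.
import json
from datetime import datetime, timezone

def quarantine(record: dict, reason: str, batch: str, path: str = "quarantine/records.jsonl"):
    entry = {
        "record": record,
        "error_reason": reason,
        "batch_id": batch,
        "quarantined_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")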

Circuit breaker

If a dependency is consistently failing, stop hammering it. Open the circuit for a cool-off window, then try again. Prevents infinite retries, rate-limit escalations, and backlog growth.
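
A minimal circuit-breaker sketch in Python; the threshold and cool-off values are placeholders to tune per dependency.

# After N consecutive failures, stop calling the dependency for a cool-off window
# instead of retrying forever; try again once the window has passed.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooloff_seconds=1800):
        self.failure_threshold = failure_threshold
        self.cooloff_seconds = cooloff_seconds
        self.failures = 0
        self.opened_at = None                 # None means the circuit is closed

    def allow_call(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooloff_seconds:
            self.failures = 0
            self.opened_at = None             # cool-off over: allow a fresh try
            return True
        return False                          # still open: skip the call and alert

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()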

Timeouts and budgets

Every call should have a timeout. Combine step timeouts with overall run budgets (e.g., ingestion must finish in 20 minutes) to protect SLAs.
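
A short sketch combining both levels, assuming a list of URLs to ingest; the 20-minute budget mirrors the example above and is only illustrative.

# Every external call gets its own timeout, and the whole step must finish
# within an overall run budget; exceeding the budget fails loudly instead of
# silently eating into the SLA.
import time

import requests

RUN_BUDGET_SECONDS = 20 * 60

def ingest(urls):
    deadline = time.monotonic() + RUN_BUDGET_SECONDS
    results = []
    for url in urls:
        if time.monotonic() > deadline:
            raise TimeoutError("run budget exhausted; stop cleanly and alert")
        results.append(requests.get(url, timeout=30).json())   # per-call timeout
    return results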

Compensating actions

When a partial write slips through, a follow-up action repairs the state (e.g., delete or correct orphan rows). Keep these actions small and auditable.
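
A hedged sketch of one such compensation, assuming a DB-API connection to the warehouse; the table, column, and parameter style depend on your driver and schema.

# Remove orphan rows left behind by a failed batch before re-running it, and
# record how many were deleted so the action stays auditable.
def remove_orphans(conn, batch: str):
    cur = conn.cursor()
    cur.execute("DELETE FROM fact_orders WHERE load_batch_id = %s", (batch,))
    deleted = cur.rowcount
    conn.commit()
    print(f"compensation: deleted {deleted} orphan rows for batch {batch}")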

Observability

Log attempts, reasons, counts of good/bad rows, and batch_ids. Emit metrics and alerts with clear run context.
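
A small sketch of structured run logging, assuming Python's standard logging module; the field names match the metrics listed later under "What to log and alert".

# One structured log line per step, keyed by batch_id, so alerts and dashboards
# can be built from counters instead of parsing free text.
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_load_result(batch: str, total: int, valid: int, invalid: int):
    logging.info(json.dumps({
        "event": "load_finished",
        "batch_id": batch,
        "total_rows": total,
        "valid_rows": valid,
        "invalid_rows": invalid,
        "invalid_rate": round(invalid / total, 4) if total else 0.0,
    }))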

Worked examples

Example 1 — API ingestion with 429s

  1. Decision: Retries are safe because 429 is transient and data is read-only.
  2. Plan: Exponential backoff with jitter: 2s, 4s, 8s, 16s, 32s (max 5 attempts), cap total wait at 2 minutes.
  3. Idempotency: Write fetched pages to staging with batch_id and source_page; MERGE into final table by unique event_id.
  4. Circuit breaker: If 3 consecutive runs exceed 15 minutes, skip this source for 30 minutes and alert.
  5. Observability: Log attempt#, wait time, status_code; emit metrics: pages_fetched, pages_skipped, retries_used.
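
Wiring the plan together, reusing the fetch_with_retry and CircuitBreaker sketches from the pattern sections above; the page URLs and staging step are placeholders.

# Retries absorb individual 429s; the breaker skips the source entirely after
# repeated failures so downstream tasks are not blocked.
breaker = CircuitBreaker(failure_threshold=3, cooloff_seconds=30 * 60)

def run_ingestion(page_urls, batch: str):
    if not breaker.allow_call():
        print(f"circuit open: skipping source for batch {batch} and alerting")
        return
    try:
        pages = [fetch_with_retry(url) for url in page_urls]
        breaker.record_success()
        # write pages to staging with batch_id, then MERGE into final by event_id
    except Exception:
        breaker.record_failure()
        raise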

Example 2 — Warehouse load with bad rows

  1. Load raw into a wide staging table (strings/variants). Avoid strict types on first touch.
  2. Validate in SQL using safe casts and rules (e.g., date_parse, length checks).
  3. Split: INSERT valid rows into clean_staging; INSERT invalid into quarantine with error_reason.
  4. Upsert valid into final with MERGE on business key. This makes re-runs safe.
  5. Report: counts of total, valid, invalid. Alert if invalid_rate > threshold.

Example 3 — Downstream model resilience

  1. Contract: Downstream models should tolerate a missing partition for a day by falling back to the last good partition with a warning flag.
  2. Run budget: If upstream is late, model runs with previous partition and sets freshness_status='stale'.
  3. Compensation: When upstream recovers, re-run the model for affected dates; MERGE ensures correctness.
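
A small sketch of the fallback rule in steps 1–2, where partition_exists is a hypothetical check against the warehouse and the lookback window is illustrative.

# If today's upstream partition is missing, fall back to the most recent good one
# and flag the run as stale so consumers see the warning.
from datetime import date, timedelta

def pick_partition(run_date: date, partition_exists, max_lookback_days: int = 7):
    candidate = run_date
    for _ in range(max_lookback_days + 1):
        if partition_exists(candidate):
            status = "fresh" if candidate == run_date else "stale"
            return candidate, status
        candidate -= timedelta(days=1)
    raise RuntimeError("no usable partition found within lookback window")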

How to implement reliably

  • Always include batch_id (date, source, attempt) in staging.
  • Prefer MERGE/UPSERT over INSERT for targets.
  • Isolate ingestion, validation, and publishing as separate steps.
  • Set task-level retries, and also a pipeline-level cap to avoid retry storms.
  • Quarantine the first occurrence of a novel schema change for review; allow pass-through only after rules are updated.

What to log and alert

  • For retries: error class, attempt#, backoff seconds, jitter.
  • For loads: total_rows, valid_rows, invalid_rows, invalid_rate.
  • For merges: inserted_count, updated_count, unchanged_count.
  • Alert on: invalid_rate above threshold, circuit breaker open, repeated schema drift, run exceeding budget.

Reusable SQL snippets (portable)

Idempotent upsert with deterministic staging
-- Assume clean_staging has unique natural_key and load_batch_id
MERGE INTO dim_customer AS t
USING (
  SELECT natural_key, name, email, updated_at
  FROM clean_staging
  WHERE load_batch_id = :batch_id
) AS s
ON t.natural_key = s.natural_key
WHEN MATCHED AND (t.name != s.name OR t.email != s.email OR t.updated_at < s.updated_at) THEN
  UPDATE SET name = s.name, email = s.email, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (natural_key, name, email, updated_at) VALUES (s.natural_key, s.name, s.email, s.updated_at);

Validation split (good vs bad rows)
-- Stage raw strings, then validate
INSERT INTO quarantine_orders (order_id_raw, error_reason, load_batch_id)
SELECT r.order_id, 'invalid_date', :batch_id
FROM raw_orders r
WHERE TRY_CAST(r.order_date AS DATE) IS NULL;

INSERT INTO clean_staging_orders (order_id, order_date, amount, load_batch_id)
SELECT CAST(r.order_id AS BIGINT), CAST(r.order_date AS DATE), CAST(r.amount AS DECIMAL(18,2)), :batch_id
FROM raw_orders r
WHERE TRY_CAST(r.order_date AS DATE) IS NOT NULL;

Who this is for

  • Analytics Engineers building or maintaining ELT/ETL pipelines.
  • Data/BI Engineers adding reliability to existing workflows.

Prerequisites

  • Comfortable with SQL (joins, MERGE/UPSERT).
  • Basic understanding of batch pipelines and schedulers.
  • Familiarity with staging vs production tables.

Learning path

  1. Learn transient vs permanent error types.
  2. Apply exponential backoff and jitter.
  3. Design idempotent writes with staging and MERGE.
  4. Add quarantine flows and alerting thresholds.
  5. Introduce circuit breakers and run budgets.

Common mistakes and self-check

  • Mistake: Blind INSERTs causing duplicates on re-run. Self-check: Can this task be re-run without side effects?
  • Mistake: Infinite retries. Self-check: Do you have max attempts and a total time cap?
  • Mistake: Dropping bad rows without trace. Self-check: Are bad rows saved with reasons and batch_id?
  • Mistake: Overfitting to one error. Self-check: Do you categorize errors and handle classes, not single messages?
  • Mistake: Alert spam. Self-check: Are alerts aggregated with context (batch_id, counts) and actionable?

Practical projects

  • Build a small ingestion job that pulls from a rate-limited API with retries, jitter, and a circuit breaker; write to staging and MERGE to final.
  • Create a warehouse load that splits valid/invalid rows, quarantines bad records, and reports counts. Add a reprocess script for quarantined data.
  • Add run budgets and a freshness flag to a downstream model; implement a backfill that safely repairs gaps.

Exercises

Do these to make the patterns stick. The Quick Test at the end is available to everyone; log in to save your progress across sessions.

  1. ex1 — Design robust retries for an API
    Scenario: You fetch hourly from a marketing API that often returns 429. Design retries, idempotency, stop conditions, and alerting.
    Deliverable: a short plan with backoff schedule, max attempts, circuit breaker rule, staging approach, and MERGE keys.
  2. ex2 — Write a resilient SQL load
    Scenario: Load daily CSVs where 1–3% rows have bad dates. Good rows must load; bad rows must be quarantined with reason. Re-runs must be idempotent.
    Deliverable: SQL steps for staging, validation split, MERGE to final, and metrics.

Checklist

  • [ ] I used exponential backoff and jitter with a clear max attempts.
  • [ ] My writes are idempotent (MERGE/UPSERT, or stage-then-swap).
  • [ ] I quarantine bad data with reasons and batch_id.
  • [ ] I capped total run time and defined alert thresholds.

Mini challenge

Your S3-to-warehouse load sometimes hits temporary warehouse unavailability for 5–10 minutes, and occasionally a supplier sends a file with an extra column. Propose a plan that: retries intelligently, detects schema drift without blocking good rows, and ensures re-runs are safe. Write 5–7 bullet points.

Next steps

  • Apply these patterns to one production job this week; add only logging and quarantine first, then idempotency.
  • Define standard retry/backoff defaults for your team.
  • Automate row-count and invalid-rate alerts with sensible thresholds.

Quick Test

Take the test to check your understanding. Anyone can take it; log in to save your score.

Practice Exercises

2 exercises to complete

Instructions

You ingest from an API hourly. It often returns 429 and occasionally 500. Design:

  • Retry/backoff (include jitter, max attempts, and a total time cap).
  • When to open a circuit breaker and for how long.
  • Idempotent write pattern (staging + MERGE keys).
  • Logging and alert thresholds.

Deliver a concise plan (6–10 bullet points).

Expected Output
A plan specifying an exponential backoff schedule with jitter, max 5–7 attempts, total cap ~10–15 minutes, circuit breaker cooldown window, staging keyed by natural keys and batch_id, MERGE into final, and clear alert conditions.

Error Handling And Retry Patterns — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

