Why this matters
As an ETL Developer, you will routinely face bad inputs, schema surprises, duplicates, and downstream load failures. Clear error handling rules let you detect issues early, protect data quality, and keep stakeholders informed without guesswork. You’ll use these rules to decide when to reject a record, when to fix it, when to retry, and when to stop the pipeline.
- Real tasks: validating source feeds, quarantining corrupt records, deduplicating, handling late-arriving dimensions, retrying transient failures, and alerting owners.
- Outcomes: consistent data quality, predictable operations, auditable decisions, and faster incident resolution.
Concept explained simply
Error handling rules are if-then decisions for data and operations. They specify what to detect, how severe it is, what action to take, who to notify, and how to measure it.
Mental model
Think of a 5-step loop for every pipeline stage:
- Detect: Identify the condition (e.g., null in required field, duplicate key).
- Classify: Assign severity (info, warning, error, critical) and category (data quality, schema, operational, business rule).
- Decide: Choose action (skip, correct, default, route to quarantine, retry, fail).
- Act: Execute the action consistently and idempotently.
- Record: Log the event with enough context to trace, aggregate metrics, and alert if thresholds are exceeded.
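In code, this loop can be as small as a list of rule objects applied to each record. The sketch below is a minimal Python illustration, not a specific framework; the `Rule` shape and the `order_id` field are assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical rule shape: Detect -> Classify -> Decide -> Act -> Record.
@dataclass
class Rule:
    rule_id: str
    detect: Callable[[dict], bool]    # returns True when the condition is present
    severity: str                     # "info" | "warning" | "error" | "critical"
    action: Callable[[dict], dict]    # e.g., default, correct, flag

def apply_rules(record: dict, rules: list[Rule], log: Callable[[dict], None]) -> dict:
    """Run every rule against one record and log each hit."""
    for rule in rules:
        if rule.detect(record):                       # Detect (+ Classify via the rule's severity)
            record = rule.action(record)              # Decide + Act
            log({"rule_id": rule.rule_id,             # Record
                 "severity": rule.severity,
                 "record_key": record.get("order_id")})
    return record
```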
Common error categories
- Data quality: missing required values, invalid formats, out-of-range numbers, invalid enums.
- Identity/deduplication: duplicate natural keys, conflicting updates.
- Referential integrity: foreign keys pointing to non-existent dimension rows.
- Schema mismatch: unexpected columns, missing required columns, type changes.
- Business rule violations: order date in the future, negative quantity.
- Operational: network timeouts, API rate limits, disk full.
- Load conflicts: primary/unique key violations on target write.
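As a rough illustration, several of these categories reduce to small predicate functions. The field names below (customer_id, email, total_amount, order_date) match the exercise dataset later in this lesson and are otherwise assumptions.

```python
import re
from datetime import date

# Simple format check, not full RFC email validation.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def data_quality_issues(rec: dict) -> list[str]:
    """Detect a few data-quality conditions on one record."""
    issues = []
    if not rec.get("customer_id"):
        issues.append("missing_required:customer_id")
    if rec.get("email") and not EMAIL_RE.match(rec["email"]):
        issues.append("invalid_format:email")
    if rec.get("total_amount") is not None and rec["total_amount"] < 0:
        issues.append("out_of_range:total_amount")
    return issues

def business_rule_issues(rec: dict) -> list[str]:
    """Detect business-rule violations such as an order date in the future."""
    issues = []
    if rec.get("order_date") and rec["order_date"] > date.today():
        issues.append("business_rule:order_date_in_future")
    return issues
```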
Actions and severity
- Record-level actions: correct/standardize, default, reject/quarantine, flag and continue.
- Batch/pipeline actions: retry with backoff, partial continue, fail-fast, circuit-break (stop downstream steps).
- Severities guide behavior: Warning (track, no stop), Error (quarantine/retry), Critical (fail job, immediate alert).
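One way to encode this guidance is a severity-to-behavior map that the pipeline consults at the end of each batch. The sketch below is illustrative; the exact behaviors are assumptions you should tune to your environment.

```python
# A possible mapping from severity to batch-level behavior; tune per pipeline.
SEVERITY_BEHAVIOR = {
    "warning":  {"stop_batch": False, "alert": "dashboard-only"},
    "error":    {"stop_batch": False, "alert": "dashboard-only", "record_action": "quarantine"},
    "critical": {"stop_batch": True,  "alert": "on-call"},
}

def batch_should_fail(severities_seen: set[str]) -> bool:
    """Fail fast as soon as any critical-severity rule fired in the batch."""
    return any(SEVERITY_BEHAVIOR[s]["stop_batch"] for s in severities_seen)
```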
What to log for each rule
- Rule ID and name
- Timestamp and pipeline step
- Sample of offending values (mask PII)
- Record identifiers (e.g., order_id)
- Action taken (rejected, defaulted, retried)
- Correlation/run ID for traceability
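A structured log line carrying these fields might look like the sketch below. Hashing the offending value is one way to avoid leaking PII; the helper name and field names are illustrative, not a required schema.

```python
import json
import hashlib
from datetime import datetime, timezone

def error_event(rule_id: str, rule_name: str, step: str, record_id: str,
                offending_value: str, action: str, run_id: str) -> str:
    """Build one structured log line; the offending value is hashed, never logged raw."""
    return json.dumps({
        "rule_id": rule_id,
        "rule_name": rule_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "record_id": record_id,                        # e.g., order_id
        "offending_value_hash": hashlib.sha256(offending_value.encode()).hexdigest()[:16],
        "action": action,                              # rejected | defaulted | retried | ...
        "run_id": run_id,                              # correlation ID for traceability
    })

# Example:
# print(error_event("ER-001", "invalid_email", "staging", "ORD-1001", "bad@@mail", "rejected", "run-2024-05-01"))
```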
Designing rules
Use a consistent template so your team can review, implement, and audit rules.
Rule template (copy/paste)
- Rule ID: ER-###
- Condition: Describe the exact predicate (e.g., email NOT matching regex).
- Scope: record-level or batch-level
- Severity: warning | error | critical
- Action: default | correct | reject to quarantine | dedupe | route to suspense table | retry | fail
- Retry policy: attempts, backoff, cutoff
- Alerting: none | email/on-call | dashboard-only
- Owner: team or role
- Metrics: counters, error rate thresholds
- Retention: how long to keep quarantined data
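The template translates directly into configuration. Below is one rule expressed as a Python dict; a YAML or JSON file with the same keys works just as well, and the specific values are examples rather than recommendations.

```python
# One rule expressed in the template's fields, here as a Python dict
# (a YAML or JSON file with the same keys works equally well).
RULE_ER_001 = {
    "rule_id": "ER-001",
    "condition": "email IS NOT NULL AND email NOT MATCHING ^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$",
    "scope": "record-level",
    "severity": "error",
    "action": "reject to quarantine",
    "retry_policy": None,                      # not a transient failure, so no retries
    "alerting": "dashboard-only",
    "owner": "data-platform-team",
    "metrics": {"counter": "invalid_email_count", "error_rate_threshold": 0.02},
    "retention": "30 days in quarantine",
}
```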
Thresholds you should define
- Per-rule error rate thresholds (e.g., reject rate > 2% -> fail job)
- Batch-level stop conditions (e.g., missing required column -> fail-fast)
- Operational retry/backoff limits (e.g., 3 attempts, exponential backoff up to 10 minutes)
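For the operational case, a small retry helper makes the cutoff explicit. This is a generic sketch with assumed defaults (3 attempts, exponential backoff capped at 10 minutes), not a library API.

```python
import time
import random

def retry_with_backoff(fn, max_attempts: int = 3, base_seconds: float = 30.0,
                       max_wait_seconds: float = 600.0):
    """Retry a transient operation: up to 3 attempts, exponential backoff capped at 10 minutes."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                                   # cutoff reached: escalate instead of looping
            wait = min(base_seconds * 2 ** (attempt - 1), max_wait_seconds)
            time.sleep(wait + random.uniform(0, 1))     # small jitter to avoid retry storms
```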
Where to place rules in the pipeline
- Ingest/landing: schema validation, format checks, basic type casting
- Staging: data quality checks, deduplication, business rule validation
- Dimensional/serving: referential integrity resolution, upsert conflict handling
Worked examples
Example 1: Invalid emails and missing customer_id
- Condition: customer_id is null OR email invalid format
- Scope: record-level
- Severity: error
- Action: reject to quarantine storage with masked email; increment counters
- Batch rule: if rejected_records_rate > 2%, fail job and alert owner
Why this works
Bad identifiers break joins, and invalid emails are PII-sensitive but typically non-critical unless downstream operations require them. The rate threshold prevents silently ingesting a broken feed.
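A minimal sketch of this rule in Python, assuming the orders fields from the exercise dataset; the regex is a simple format check, not full RFC email validation.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def split_valid_and_rejected(records: list[dict]):
    """Null customer_id or malformed email goes to quarantine; email is masked before logging."""
    valid, rejected = [], []
    for rec in records:
        if not rec.get("customer_id") or not EMAIL_RE.match(rec.get("email") or ""):
            rejected.append({**rec, "email": "***masked***"})   # mask PII before quarantine/logs
        else:
            valid.append(rec)
    return valid, rejected

def enforce_batch_threshold(valid: list, rejected: list, max_reject_rate: float = 0.02):
    """Fail the job when the reject rate exceeds the 2% batch threshold."""
    total = len(valid) + len(rejected)
    if total and len(rejected) / total > max_reject_rate:
        raise RuntimeError(f"Reject rate {len(rejected) / total:.1%} exceeds 2% threshold; failing job")
```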
Example 2: Duplicate order_id within a day
- Condition: multiple records share the same order_id within the same day
- Scope: record-level, resolved via grouping
- Action: deduplicate by keeping the record with the latest updated_at; log duplicates metric
- Severity: warning unless duplicates > 0.5% of batch, then error with alert
Why this works
Many sources emit corrections. A last-write-wins policy, combined with a duplicates metric, balances freshness and safety.
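A possible pandas implementation of last-write-wins, assuming an orders DataFrame with order_id and updated_at columns; the 0.5% threshold mirrors the rule above.

```python
import pandas as pd

def dedupe_last_write_wins(orders: pd.DataFrame) -> pd.DataFrame:
    """Keep the newest record per order_id and report the duplicate rate for the severity decision."""
    deduped = (orders.sort_values("updated_at")
                     .drop_duplicates(subset="order_id", keep="last"))
    dup_rate = 1 - len(deduped) / len(orders) if len(orders) else 0.0
    severity = "error" if dup_rate > 0.005 else "warning"   # 0.5% threshold from the rule
    print(f"duplicates removed: {len(orders) - len(deduped)} ({dup_rate:.2%}), severity={severity}")
    return deduped
```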
Example 3: Missing dimension row (late-arriving customer)
- Condition: orders.customer_id not found in dim_customer
- Scope: record-level
- Action: route to suspense table; schedule retries for up to 72 hours
- Escalation: after expiry, move to a dead-letter table and open an incident ticket
- Severity: error
Why this works
Late-arriving dimensions are common. Suspense and retry preserve integrity without blocking all orders.
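A simplified sketch of the suspense flow in Python; in practice the suspense "table" would live in your warehouse, and the 72-hour TTL comes from the rule above.

```python
from datetime import datetime, timedelta, timezone

SUSPENSE_TTL = timedelta(hours=72)

def route_orders(orders: list[dict], known_customer_ids: set[str]):
    """Load orders whose customer exists; park the rest in suspense with a first-seen timestamp."""
    loadable, suspense = [], []
    now = datetime.now(timezone.utc)
    for order in orders:
        if order["customer_id"] in known_customer_ids:
            loadable.append(order)
        else:
            suspense.append({**order, "first_seen": now})
    return loadable, suspense

def expire_suspense(suspense: list[dict]):
    """Split suspense rows into retry candidates and dead-letter rows past the 72-hour window."""
    now = datetime.now(timezone.utc)
    retry = [r for r in suspense if now - r["first_seen"] <= SUSPENSE_TTL]
    dead_letter = [r for r in suspense if now - r["first_seen"] > SUSPENSE_TTL]
    return retry, dead_letter
```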
Example 4: Target load primary key conflict
- Condition: insert violates PK on target
- Scope: batch-level decision, record-level resolution
- Action: switch to upsert/merge; if the conflict persists for the same key within the same run, mark it critical and fail to avoid retry loops
- Severity: error -> critical on repeat
Why this works
Upserts handle legitimate updates; repeated conflicts often indicate mapping bugs that should stop the pipeline.
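A small illustration using SQLite's ON CONFLICT clause as a stand-in for the MERGE/upsert statement your target warehouse provides; the table and column names are assumptions.

```python
import sqlite3

# SQLite's upsert syntax stands in for MERGE in a real warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, total_amount REAL, updated_at TEXT)")

def upsert_order(row: dict):
    """Insert the order, or update the existing row instead of raising a PK violation."""
    conn.execute(
        """INSERT INTO orders (order_id, total_amount, updated_at)
           VALUES (:order_id, :total_amount, :updated_at)
           ON CONFLICT(order_id) DO UPDATE SET
               total_amount = excluded.total_amount,
               updated_at   = excluded.updated_at""",
        row,
    )

upsert_order({"order_id": "ORD-1", "total_amount": 10.0, "updated_at": "2024-05-01T10:00:00Z"})
upsert_order({"order_id": "ORD-1", "total_amount": 12.5, "updated_at": "2024-05-01T11:00:00Z"})  # updates, no PK error
conn.commit()
```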
Step-by-step: define your rules
- List source contracts: required fields, types, keys, expected volumes.
- Enumerate failure modes: nulls, bad formats, duplicates, missing refs, schema changes, load conflicts, timeouts.
- Choose actions per failure mode with severities and thresholds.
- Decide retry and backoff for transient errors; define cutoffs.
- Specify logging, metrics, and PII masking.
- Document owners and alert paths.
- Pilot on a historical sample; tune thresholds; finalize.
Exercises
Do the exercise below, then compare with the sample solution.
Dataset: orders(order_id, customer_id, email, order_date, updated_at, total_amount)
- Define 6–8 error handling rules covering data quality, duplicates, ref integrity, and operational failures.
- Include condition, severity, action, threshold (if any), and alerting.
Checklist before you move on
- Every rule has a clear condition and action
- You set at least one batch-level threshold
- PII is masked in logs
- Idempotent behavior is described (dedupe or upsert)
- Owners and metrics are assigned
Common mistakes and how to self-check
- Vague conditions: replace "bad email" with an explicit regex or validator.
- Only record-level rules: add batch thresholds to catch feed-wide issues.
- No retry cutoffs: infinite retries hide real failures; set max attempts and escalate.
- Unmasked logs: ensure sensitive fields are hashed or partially masked.
- Missing owners: every actionable alert should route to a person/team.
- Non-idempotent retries: verify that reprocessing won’t duplicate records.
Self-check prompts
- Can a second run of the same batch produce duplicate outputs?
- What happens if 30% of records fail validation?
- Would an on-call engineer know what to do from the alert?
Practical projects
- Build a quarantine-and-report step for a sample CSV feed; generate a daily error summary with counts by rule and top offenders.
- Implement a suspense table for missing dimension keys with a scheduled retry; auto-close items that resolve.
- Create a rule configuration file (YAML/JSON) and a small validator that enforces those rules on a sample dataset.
Who this is for
- ETL Developers and Data Engineers defining pipeline quality gates
- Analysts/DBAs tasked with monitoring data quality
Prerequisites
- Basic SQL (joins, constraints, upsert/merge)
- Understanding of data types and common file formats (CSV/JSON/Parquet)
Learning path
- Start: Source profiling and schema validation basics
- Then: Define record-level rules and batch thresholds
- Next: Implement quarantine, suspense, and retry patterns
- Finally: Add metrics, dashboards, and alerts
Next steps
- Finish the exercise and take the quick test
- Add your rules to a config file and validate on a real sample
- Review rules with stakeholders and refine thresholds
Mini challenge
You ingest a daily customers feed. Today, 15% of rows are missing last_name, and a new optional column marketing_opt_in appeared. Draft two rules: one batch-level decision for the spike in missing last_name, and one schema rule for the new column. Specify severity, action, and alerts.