Why this matters
In real ETL pipelines, not every record is clean. Some violate schemas, fail validation rules, or arrive before the records they depend on. Reject handling and quarantine tables let you:
- Prevent bad data from contaminating production fact and dimension tables.
- Capture failed records with reasons for audit and reprocessing.
- Support SLAs: deliver what is valid on time, while triaging the rest.
- Enable targeted fixes: data stewards can correct bad records and replay them.
Typical ETL Developer tasks this enables: routing invalid rows, maintaining reason codes, building reprocessing jobs, and reporting on data quality trends.
Who this is for
- ETL Developers and Data Engineers building batch or streaming pipelines.
- Analytics engineers and DBAs responsible for data integrity.
- Data stewards who need traceable error capture and repair flows.
Prerequisites
- Comfort with SQL DDL/DML and basic data modeling.
- Understanding of ETL/ELT stages: landing, staging, transform, load.
- Familiarity with validation rules (nullability, referential integrity, type checks).
Concept explained simply
Think of your pipeline like airport security: valid passengers board; others are held for extra checks. A quarantine table is the holding area for records that fail validation. You store the record, the reason, and the context to fix it later.
- Reject: the act of withholding a record from target tables.
- Quarantine table: a table where rejected records are stored with metadata.
- Reprocess: after fixing either data or rules, move the record from quarantine to the target.
Mental model
Pipeline = conveyor belt. At each checkpoint, you stamp a pass or a reason_code. Passed items continue; failed items go into a labeled bin (quarantine) with a note. A later process reads the bins, fixes items, and puts them back on the belt at the right spot.
Core patterns you should know
- Row-level rejects vs. file-level quarantine: sometimes one row fails; sometimes the entire file is unreadable.
- Soft vs. hard validation: soft (e.g., unknown enum) may be fixable; hard (e.g., invalid JSON) requires remediation at source.
- Reason codes taxonomy: concise, stable codes like RI_MISSING_PARENT, TYPE_MISMATCH, BUSINESS_RULE_VIOLATION.
- Idempotent reprocessing: ensure a record loaded after a fix does not double-insert (see the MERGE sketch after this list).
- Traceability: always carry batch_id, load_ts, source_file, checksum, and a raw_payload snapshot.
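Example: idempotent reprocessing (generic SQL)
A minimal sketch of the idempotent-reprocessing pattern. The target fact_orders, the staging table stg_orders_fixed (holding re-validated rows parsed from quarantine), and the column names are assumptions; the dialect is assumed to support standard MERGE.
-- Natural-key match prevents double-inserts when fixed rows are replayed.
MERGE INTO fact_orders AS t
USING stg_orders_fixed AS s
  ON t.business_key = s.business_key
WHEN MATCHED THEN
  UPDATE SET order_amount = s.order_amount, load_ts = s.load_ts
WHEN NOT MATCHED THEN
  INSERT (business_key, order_amount, load_ts)
  VALUES (s.business_key, s.order_amount, s.load_ts);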
Designing quarantine tables (practical schema)
Include these columns to support audit, troubleshooting, and reprocessing:
- surrogate_id (auto), batch_id, load_ts
- source_system, source_entity, source_file, record_number
- reason_code, reason_details
- raw_payload (string/JSON) and/or normalized columns for key fields
- hash or natural keys (e.g., business_key, external_id)
- triage_status (NEW, FIXED, REPROCESSED, DISCARDED)
- triage_ts, triage_user
Example DDL (generic SQL)
CREATE TABLE dq_quarantine_orders (
surrogate_id BIGINT GENERATED ALWAYS AS IDENTITY,
batch_id VARCHAR(64) NOT NULL,
load_ts TIMESTAMP NOT NULL,
source_system VARCHAR(64),
source_entity VARCHAR(64),
source_file VARCHAR(512),
record_number BIGINT,
reason_code VARCHAR(64) NOT NULL,
reason_details VARCHAR(1000),
business_key VARCHAR(128),
external_id VARCHAR(128),
raw_payload TEXT,
triage_status VARCHAR(32) DEFAULT 'NEW',
triage_ts TIMESTAMP,
triage_user VARCHAR(128),
PRIMARY KEY (surrogate_id)
);
-- Recommended indexes
-- CREATE INDEX ix_q_orders_batch ON dq_quarantine_orders(batch_id);
-- CREATE INDEX ix_q_orders_reason ON dq_quarantine_orders(reason_code);
-- CREATE INDEX ix_q_orders_status ON dq_quarantine_orders(triage_status);
Partitioning tips
- Partition by load_date or load_ts for pruning.
- Optionally sub-partition by reason_code for fast triage queries.
- Keep raw_payload for full fidelity; compress if supported.
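Example: partitioned quarantine table (PostgreSQL-style)
A hedged sketch of range partitioning on load_ts using PostgreSQL-style declarative partitioning; the column list is abbreviated from the DDL above, and syntax varies by platform.
CREATE TABLE dq_quarantine_orders_p (
  surrogate_id BIGINT GENERATED ALWAYS AS IDENTITY,
  batch_id VARCHAR(64) NOT NULL,
  load_ts TIMESTAMP NOT NULL,
  reason_code VARCHAR(64) NOT NULL,
  raw_payload TEXT,
  triage_status VARCHAR(32) DEFAULT 'NEW'
) PARTITION BY RANGE (load_ts);
-- One partition per month; create ahead of time or automate creation.
CREATE TABLE dq_quarantine_orders_2024_06
  PARTITION OF dq_quarantine_orders_p
  FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');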
Worked examples
Example 1: Email validation for Customers
- Rule: email must match a simple pattern and be non-null if contact_opt_in = true.
- Process: SELECT valid rows into dim_customer; INSERT invalid rows into dq_quarantine_customers with reason_code EMAIL_INVALID or EMAIL_REQUIRED (routing SQL sketched below).
- Reprocess: corrected emails revalidated and moved into dim_customer.
What goes to quarantine?
- Null email with contact_opt_in = true → EMAIL_REQUIRED
- email without '@' or bad domain → EMAIL_INVALID
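Example routing SQL for Example 1 (illustrative)
A sketch of the split, assuming a staging table stg_customers, a boolean-style contact_opt_in column, and a deliberately simple LIKE pattern; real pipelines usually use a regex or a dedicated validation function.
-- Route failures to quarantine with a reason code.
INSERT INTO dq_quarantine_customers
  (batch_id, load_ts, source_entity, business_key, reason_code, reason_details, raw_payload)
SELECT
  s.batch_id,
  CURRENT_TIMESTAMP,
  'customer',
  s.customer_id,
  CASE
    WHEN s.email IS NULL THEN 'EMAIL_REQUIRED'
    ELSE 'EMAIL_INVALID'
  END,
  'email failed validation',
  s.raw_payload
FROM stg_customers s
WHERE (s.email IS NULL AND COALESCE(s.contact_opt_in, FALSE))
   OR (s.email IS NOT NULL AND s.email NOT LIKE '%_@_%._%');
-- Load only the rows that pass both rules.
INSERT INTO dim_customer (business_key, email, contact_opt_in)
SELECT s.customer_id, s.email, s.contact_opt_in
FROM stg_customers s
WHERE NOT ( (s.email IS NULL AND COALESCE(s.contact_opt_in, FALSE))
         OR (s.email IS NOT NULL AND s.email NOT LIKE '%_@_%._%') );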
Example 2: Orders missing Customer
- Rule: orders.customer_id must exist in dim_customer.business_key.
- Process: LEFT JOIN to dim_customer; where the match is null, route to dq_quarantine_orders with reason_code RI_MISSING_PARENT and include customer_id in raw_payload (sketched below).
- Reprocess: once the customer dimension is updated, retry those orders.
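Example routing SQL for Example 2 (illustrative)
A sketch of the referential-integrity split; stg_orders and the column names are assumptions.
-- Orders whose customer is not yet in the dimension go to quarantine.
INSERT INTO dq_quarantine_orders
  (batch_id, load_ts, source_entity, business_key, reason_code, reason_details, raw_payload)
SELECT
  o.batch_id,
  CURRENT_TIMESTAMP,
  'order',
  o.order_id,
  'RI_MISSING_PARENT',
  'customer_id not found in dim_customer',
  o.raw_payload
FROM stg_orders o
LEFT JOIN dim_customer c
  ON c.business_key = o.customer_id
WHERE c.business_key IS NULL;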
Example 3: Schema drift in event JSON
- Rule: the required field event_type is present and amount is numeric.
- Process: if the JSON fails to parse → JSON_INVALID; if a required key is missing → SCHEMA_REQUIRED_MISSING; if amount is not numeric → TYPE_MISMATCH.
- Reprocess: update parser or mapping; replay quarantined raw_payloads.
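Example routing SQL for Example 3 (illustrative)
A sketch using SQL Server-style JSON functions (ISJSON, JSON_VALUE, TRY_CAST); other platforms have equivalents, and raw_events plus its columns are assumptions. The CASE checks rules in order, so the first failure becomes the reason_code.
INSERT INTO dq_quarantine_events
  (batch_id, load_ts, source_entity, reason_code, raw_payload)
SELECT v.batch_id, CURRENT_TIMESTAMP, 'event', v.reason_code, v.payload
FROM (
  SELECT
    e.batch_id,
    e.payload,
    CASE
      WHEN ISJSON(e.payload) = 0 THEN 'JSON_INVALID'
      WHEN JSON_VALUE(e.payload, '$.event_type') IS NULL THEN 'SCHEMA_REQUIRED_MISSING'
      WHEN TRY_CAST(JSON_VALUE(e.payload, '$.amount') AS DECIMAL(18,2)) IS NULL THEN 'TYPE_MISMATCH'
    END AS reason_code
  FROM raw_events e
) v
WHERE v.reason_code IS NOT NULL;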
Step-by-step setup
- List validation rules: nullability, types, ranges, referential integrity, business constraints.
- Define reason codes: short codes + human-friendly reason_details.
- Create quarantine tables: include metadata and raw_payload.
- Route logic: in SQL, use CASE WHEN expressions or conditional inserts; in code, if/else branches; in streaming, dead-letter queues feeding quarantine sinks.
- Reprocessing job: reads NEW records, revalidates them, loads them to the target, and updates triage_status (see the sketch after this list).
- Monitoring: counts by reason_code and batch; alert on spikes.
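Example: closing out and monitoring quarantine (illustrative)
Two sketches, assuming stewards mark corrected rows FIXED and the reload itself reuses an idempotent MERGE like the one shown earlier.
-- After a successful reload, close out the reprocessed rows.
UPDATE dq_quarantine_orders
SET triage_status = 'REPROCESSED',
    triage_ts = CURRENT_TIMESTAMP
WHERE triage_status = 'FIXED';
-- Monitoring: quarantine volume by batch and reason for alerting.
SELECT batch_id, reason_code, COUNT(*) AS quarantined_rows
FROM dq_quarantine_orders
WHERE triage_status = 'NEW'
GROUP BY batch_id, reason_code
ORDER BY batch_id DESC, quarantined_rows DESC;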
Operations: triage and SLAs
- Daily review of NEW quarantine counts and top reason_codes.
- Document playbooks for each reason_code.
- Set auto-expiry/discard policy for records that cannot be fixed after X days (compliance permitting).
- Audit trail: keep who fixed what, when, and why.
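Example: auto-expiry job (illustrative)
A sketch of a 30-day expiry policy; the interval syntax and the service account name are assumptions, and the window should follow your retention and compliance rules.
-- Discard quarantine rows that have sat unfixed for more than 30 days.
UPDATE dq_quarantine_orders
SET triage_status = 'DISCARDED',
    triage_ts = CURRENT_TIMESTAMP,
    triage_user = 'auto_expiry_job'
WHERE triage_status = 'NEW'
  AND load_ts < CURRENT_TIMESTAMP - INTERVAL '30' DAY;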
Exercises
These are hands-on. You can complete them in any SQL environment or on paper.
Exercise 1: Design a quarantine table
See the Exercises section below for full instructions and a solution you can expand.
Exercise 2: Write routing logic
Implement how records are split between target and quarantine with reason codes.
Checklist
- I defined clear reason codes and mapped them to validation rules.
- My quarantine table captures batch_id, timestamps, and raw_payload.
- I planned a reprocessing workflow with idempotency.
Common mistakes and how to self-check
- Missing context: Only storing a couple of columns. Self-check: Can someone fix the record from quarantine alone?
- No reason taxonomy: Free-text reasons make reporting impossible. Self-check: Are reason codes standardized and documented?
- One-way street: No reprocessing path. Self-check: Is there a job to retry after fixes, marking records REPROCESSED?
- Duplicates on replay: Not using keys/hashes. Self-check: Are loads idempotent with natural keys or hashes?
- Over-quarantining: Failing records for soft rules that could be handled with defaults. Self-check: Could some rules apply a default value plus a warning flag instead of rejecting?
Practical projects
- Build a customer ingestion pipeline with email and country code validation, quarantine invalid rows, and a retry job.
- Create a dashboard showing daily quarantine volumes by reason_code and source_system.
- Implement a file-level quarantine for unreadable files with a manifest table and remediation notes.
Learning path
- Before: Data profiling and validation rules.
- Now: Reject handling and quarantine tables (this lesson).
- After: Automated reprocessing, alerting, and DQ SLAs. Then progress to data lineage and observability.
Next steps
- Finalize your reason code taxonomy and share with your team.
- Add partitioning and indexes to your quarantine tables.
- Schedule the reprocessing job and simulate a spike to test alerts.
Quick Test
Take the Quick Test below to check your understanding. Everyone can take it; only logged-in users will see saved progress.
Mini challenge
You ingest 2 million order lines daily. Today, 8% land in quarantine with reason_code TYPE_MISMATCH for amount. What do you do in the next 60 minutes? Write a short triage plan with: checks to run, who to notify, what to temporarily relax (if safe), and how to prevent duplicates on replay.