Why this matters
Referential integrity checks ensure every foreign key value in your data actually points to a valid parent record. As an ETL Developer, this prevents orphan records, broken joins, and misleading metrics when building dashboards or machine learning features.
- Prevent orphaned facts (e.g., a sale referencing a missing customer).
- Catch late-arriving dimensions and SCD2 gaps before downstream failures.
- Maintain trust in analytics by enforcing business relationships (e.g., active child must reference active parent).
Concept explained simply
Referential integrity means “all references are valid.” If a table (child) stores a key that points to another table (parent), that parent record must exist and match the business rules (like time validity or active status).
Mental model
Imagine sticky notes connected by strings. Each string (foreign key) must land on an existing note (parent). If the note is missing, the string dangles (orphan). For time-bound dimensions (SCD2), the string must land on the version valid at that time, not just any note.
Core rules for referential integrity checks
- Orphan check: Every child foreign key should match a parent key.
- Time-valid check (SCD2): The child event date must fall within the parent version’s valid_from to valid_to window.
- Active-state check: If parents can be soft-deleted or inactivated, active children should not reference inactive parents (unless explicitly allowed and flagged).
- Cardinality guardrails: Enforce expected one-to-many or one-to-one relationships via duplicate checks.
- Unknown/Default member policy: If your model uses an “Unknown” member (e.g., 0 or -1), track it; treat as warning or acceptable according to policy.
- Nullability: If the foreign key is nullable by design, exclude nulls from failure stats but monitor volumes.
Worked examples
Example 1: Orphan detection (simple FK)
Goal: Ensure every order references an existing customer.
-- Find orders with no matching customer
SELECT o.order_id, o.customer_id
FROM staging_orders o
LEFT JOIN dim_customer c
ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL; -- Orphans
Action: Fail or quarantine these rows; optionally map to an Unknown customer if policy allows.
Example 2: Late-arriving dimension (map and hold)
Scenario: fact_sales has natural_key customer_number, but dim_customer uses surrogate keys. The dimension row might arrive after the fact.
-- Detect facts whose customer_number is not yet in dim_customer
SELECT f.fact_id, f.customer_number
FROM fact_sales f
LEFT JOIN dim_customer d
ON f.customer_number = d.customer_number
WHERE d.customer_number IS NULL; -- Late/unknown
Action: Route facts to a holding area, load an Unknown member, or trigger a dimension backfill job.
Example 3: SCD2 time-validity check
Goal: Ensure a sale joins to the correct dim_customer version for the sale_date.
-- Valid join should be within the active window for the sale date
SELECT f.fact_id, f.sale_date, f.customer_number
FROM fact_sales f
LEFT JOIN dim_customer d
ON f.customer_number = d.customer_number
AND f.sale_date >= d.valid_from
AND f.sale_date < COALESCE(d.valid_to, '9999-12-31')
WHERE d.customer_sk IS NULL; -- No valid time-bound match
Action: Flag as data quality error; investigate gaps or incorrect SCD2 dates.
Example 4: Soft-deletes on parent
Goal: Prevent active subscriptions referencing inactive accounts.
SELECT s.subscription_id, s.account_id
FROM fact_subscription s
JOIN dim_account a
ON s.account_id = a.account_id
WHERE s.is_active = 1
AND a.is_active = 0; -- Violates active-state rule
Action: Fail or reassign based on business rules; ensure upstream logic updates parent states promptly.
How to implement in ETL pipelines
- Define rules: For each relationship, write plain-language rules (orphan, time-validity, active-state, unknown policy).
- Implement checks: Add SQL queries or transformations that output only violations.
- Decide actions: Fail the load, quarantine bad rows, or map to Unknown member per policy.
- Track metrics: Count violations, unknown usage, and trend them in data quality dashboards.
- Automate alerts: Threshold-based alerts when violations exceed acceptable limits.
Tip: Lightweight structure for checks
- A standardized audit table with columns: check_name, run_id, total_scanned, violations, severity, sample_rows.
- Write violations to a quarantine table for investigation and replay.
Exercises
These mirror the graded exercises below. Do them here first, then open the solutions there to compare.
Exercise 1 (ex1): Find orphan line items
Tables: line_items(line_item_id, product_id, qty), products(product_id).
Task: Write a query that lists line_item_id and product_id where the product does not exist in products.
Need a nudge?
- Use a LEFT JOIN from child to parent.
- Filter where parent key is NULL.
Exercise 2 (ex2): Validate SCD2 window
Tables: fact_sales(fact_id, product_code, sale_date), dim_product(product_sk, product_code, valid_from, valid_to).
Task: Return fact_id and product_code for facts that do not fall into any valid SCD2 window.
Hint
- Use NOT EXISTS or a LEFT JOIN with date-range conditions and NULL filter.
Exercise checklist
- I filtered on NULL from the parent side to catch orphans.
- I used inclusive start and exclusive end for date ranges to avoid overlaps.
- I considered how Unknown member policy affects pass/fail criteria.
Common mistakes and self-checks
- Mixing inclusive/exclusive date bounds: Ensure valid_to is exclusive (e.g., < valid_to). Self-check: Are there overlapping SCD2 rows?
- Ignoring soft-deletes: Child rows pointing to inactive parents pass simple FK checks but violate business rules. Self-check: Add active-state predicates.
- Assuming null FK is always bad: If nullable by design, treat separately. Self-check: Compare null FK volume against baseline.
- Not monitoring Unknown member growth: Self-check: Trend counts for key 0/-1 and alert on spikes.
- Checking after load only: Self-check: Run pre-load validations in staging to fail fast.
Mini challenge
You receive facts with customer_number present, but 8% fail the SCD2 time-valid match. Draft a 3-step remediation plan that addresses: (1) detection query, (2) temporary handling in the warehouse, (3) upstream fix to dimension versioning. Keep it concise and actionable.
Who this is for
- ETL Developers and Data Engineers building pipelines across staging, warehouse, and marts.
- Analytics Engineers validating joins and metrics.
- Data Stewards defining data quality policies.
Prerequisites
- Comfort with SQL joins and NULL semantics.
- Basic understanding of primary/foreign keys and SCD2 concepts.
- Familiarity with your company’s Unknown member policy (if any).
Learning path
- Review business relationships and define integrity rules per table pair.
- Write orphan and time-validity checks in SQL.
- Add soft-delete and cardinality checks.
- Automate metrics and thresholds in your orchestration.
- Run the Quick Test below and integrate checks into a small project.
Practical projects
- Build a referential integrity audit layer for a retail star schema (customers, products, orders, order_items).
- Implement SCD2 time-window validation and quarantine for late-arriving dimensions.
- Create a dashboard tracking orphan counts, Unknown usage, and active-state violations over time.
Next steps
- Do the exercises and compare with solutions.
- Take the Quick Test to confirm understanding. Anyone can take it; log in to save your progress and resume later.
- Apply these checks to one real pipeline at work or in a sandbox dataset.