Why this matters
As an ETL Developer, you move and transform data that powers dashboards, machine learning features, billing, and audits. Data checks help you detect bad data early, prevent incident cascades, and build trust with stakeholders.
- Stop bad loads: Fail or quarantine when critical checks fail.
- Protect SLAs: Alert when freshness or volume drifts.
- Support audits: Keep an inspection trail of checks and outcomes.
Concept explained simply
A data check is a small test that confirms an assumption about your data. You run many checks at the right points in your pipeline to catch issues fast.
Mental model
Think of your pipeline like a factory line. Each station (ingest, stage, transform, load) has specific quality gates. Each gate runs quick tests with clear pass/fail criteria and actions (warn, quarantine, stop the line).
Common check types
- Schema: expected columns, data types, constraints.
- Completeness: required fields are not null; row counts meet expectations.
- Uniqueness: primary keys have no duplicates.
- Validity: values in allowed sets or patterns (e.g., ISO dates, emails, country codes).
- Referential integrity: foreign keys have matching parent rows.
- Ranges & distribution: numbers within min/max or reasonable distribution.
- Consistency: derivations match (e.g., line total equals qty * price).
- Freshness & timeliness: data is updated on schedule; no long lags.
- Volume & drift: row counts are within expected bands; schema drift detection.
- Business rules: domain-specific constraints (e.g., order date <= ship date).
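Most of these map to small queries where any returned row is a failure. As a sketch of two types from the list above, assuming the staging.customers layout used in the exercises and a hypothetical staging.order_lines table:

```sql
-- Validity: country_code must be in the allowed set;
-- the OR IS NULL catches nulls that NOT IN would silently skip.
SELECT id, country_code
FROM staging.customers
WHERE country_code NOT IN ('US', 'CA', 'MX')
   OR country_code IS NULL;

-- Consistency: line total must equal qty * price
-- (a small tolerance guards against float rounding).
SELECT order_line_id, qty, price, line_total
FROM staging.order_lines
WHERE ABS(line_total - qty * price) > 0.01;
```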
How to define data checks
- Identify critical data elements: Columns that drive joins, aggregations, billing, or compliance.
- Map pipeline points: Source, staging, transformation, pre-load, post-load.
- Select check types per point: Lightweight at source; deeper checks pre/post load.
- Set thresholds and severity:
  - Error: stop the job or quarantine.
  - Warning: log and alert; continue if acceptable.
- Define sampling or full-scan strategy: Full scans for small tables; sampled for large with periodic full checks.
- Choose frequency: Every run, daily summary, or anomaly-only.
- Decide actions on failure: Fail, quarantine, auto-fix, or alert-and-continue.
- Document check spec: Name, objective, rule, threshold, severity, run point, owner, contact, and a runbook link (the runbook itself lives in your team docs).
Minimal spec template (copy/paste)
```
name: <unique_check_name>
objective: <what risk it prevents>
run_point: source|staging|transform|preload|postload
rule: <SQL/logic or pattern>
threshold: <condition, e.g., failures = 0 or freshness_hours <= 24>
severity: error|warning
action_on_fail: fail_job|quarantine|alert_only
owner: team-or-person
notes: <edge cases>
```
Implementing checks in ETL workflows
- Source ingest: schema conformity, basic null checks, volume sanity.
- Staging: deduplication, type coercion checks, pattern validation.
- Transform: business rules, consistency math, referential integrity.
- Pre-load: key uniqueness in target, row count reconciliation.
- Post-load: row deltas vs previous load, aggregation spot-checks, freshness metrics.
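For example, the pre-load row count reconciliation can be a single query. A minimal sketch, assuming a hypothetical audit.load_manifest table that records how many rows the source extracted per batch (:batch_id is a bind parameter):

```sql
-- Compare rows staged for load against the source manifest;
-- a non-zero delta fails the reconciliation check.
SELECT
  s.staged_rows,
  m.expected_rows,
  s.staged_rows - m.expected_rows AS delta
FROM (SELECT COUNT(*) AS staged_rows FROM staging.orders) s
CROSS JOIN (SELECT expected_rows
            FROM audit.load_manifest
            WHERE batch_id = :batch_id) m;
```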
Worked examples
Example 1: Uniqueness for orders
Objective: Ensure order_id is unique before loading the fact table.
Rule (SQL):

```sql
SELECT order_id, COUNT(*) AS c
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```

Threshold: zero rows
Severity: error
Action: quarantine offending rows and continue with the clean set
Maintenance tip
Track the duplicate rate. If an occasional duplicate or two arrives from a known upstream retry, keep severity at error but implement quarantine-and-continue to avoid full pipeline stops.
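A minimal quarantine-and-continue sketch, assuming an ingestion timestamp column load_ts and a quarantine.orders_duplicates table (both hypothetical; temp-table syntax varies by warehouse):

```sql
-- Rank rows per order_id by arrival; rn = 1 is the keeper.
CREATE TEMP TABLE ranked_orders AS
SELECT o.*,
       ROW_NUMBER() OVER (PARTITION BY order_id
                          ORDER BY load_ts) AS rn
FROM staging.orders o;

-- Duplicates beyond the first go to quarantine for review.
INSERT INTO quarantine.orders_duplicates
SELECT order_id, customer_id, amount, order_date, load_ts
FROM ranked_orders
WHERE rn > 1;

-- Only the clean set continues down the pipeline.
CREATE TABLE staging.orders_clean AS
SELECT order_id, customer_id, amount, order_date, load_ts
FROM ranked_orders
WHERE rn = 1;
```

The check still logs at error severity; the pipeline simply proceeds on staging.orders_clean instead of halting.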
Example 2: Referential integrity
Objective: Every staging.orders.customer_id must exist in dim.customers.
Rule (SQL):

```sql
SELECT o.customer_id
FROM staging.orders o
LEFT JOIN dim.customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

Threshold: zero rows
Severity: error
Action: fail job if orphan rate > 0.1% of batch; else quarantine and alert
Why two thresholds?
You can set a hard error for any orphans, but use a tiny tolerance (0.1%) for rare late-arriving dimensions to reduce false positives. Document rationale.
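A sketch of the rate computation (0.1% expressed as 0.001; numeric division syntax varies slightly by dialect):

```sql
-- Orphan rate for the batch; the orchestrator compares this
-- against the 0.001 tolerance to decide fail vs quarantine.
SELECT
  COUNT(*) AS batch_rows,
  SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END) AS orphan_rows,
  SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END) * 1.0
    / NULLIF(COUNT(*), 0) AS orphan_rate
FROM staging.orders o
LEFT JOIN dim.customers c
  ON o.customer_id = c.customer_id;
```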
Example 3: Freshness for daily file
Objective: Confirm the daily feed arrives within 24 hours.
Rule (logic): now() - max(event_time) <= 24 hours
Metric source: landing timestamp or latest record timestamp
Threshold: freshness_hours <= 24
Severity: warning at 20h, error at 24h
Action: alert at 20h of lag; fail downstream loads at 24h
Edge case
Holidays or planned downtime: add a maintenance window suppression to avoid noise, with explicit start/end and approval.
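Expressed as a query (Postgres-style interval math; landing.daily_feed is a placeholder for your feed table):

```sql
-- Hours since the newest record, bucketed into the two thresholds.
SELECT
  EXTRACT(EPOCH FROM (now() - MAX(event_time))) / 3600.0 AS freshness_hours,
  CASE
    WHEN now() - MAX(event_time) >= INTERVAL '24 hours' THEN 'error'
    WHEN now() - MAX(event_time) >= INTERVAL '20 hours' THEN 'warning'
    ELSE 'ok'
  END AS status
FROM landing.daily_feed;
```

Maintenance-window suppression is usually best handled in the alerting layer rather than baked into the query itself.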
Monitoring and maintenance
- Version checks: keep a version field for each check. Record changes and reasons.
- Tune thresholds: start strict on critical keys; for the rest, start loose and tighten gradually using observed distributions.
- Alert hygiene: route to the right channel; auto-close when recovered.
- Ownership: assign on-call rotation or an owner field per check.
- Runbooks: every error-severity check needs a short “how to investigate and fix” guide.
- Change management: when upstream schema changes, update check specs and coordinate rollouts.
Lightweight logging schema
```
check_definitions(check_id, name, run_point, rule_text, threshold_text, severity, owner, version, active_flag)
check_runs(run_id, started_at, finished_at, pipeline, dataset, status)
check_results(run_id, check_id, version, status, failures_count, sample_payload)
```
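A minimal DDL sketch of that schema (column types are assumptions; adjust to your warehouse):

```sql
CREATE TABLE check_definitions (
  check_id        INTEGER PRIMARY KEY,
  name            VARCHAR(200) NOT NULL,
  run_point       VARCHAR(20)  NOT NULL,  -- source|staging|transform|preload|postload
  rule_text       TEXT         NOT NULL,
  threshold_text  TEXT         NOT NULL,
  severity        VARCHAR(10)  NOT NULL,  -- error|warning
  owner           VARCHAR(100) NOT NULL,
  version         INTEGER      NOT NULL,
  active_flag     BOOLEAN      NOT NULL
);

CREATE TABLE check_runs (
  run_id       INTEGER PRIMARY KEY,
  started_at   TIMESTAMP NOT NULL,
  finished_at  TIMESTAMP,
  pipeline     VARCHAR(100) NOT NULL,
  dataset      VARCHAR(100) NOT NULL,
  status       VARCHAR(20)  NOT NULL
);

CREATE TABLE check_results (
  run_id          INTEGER NOT NULL,
  check_id        INTEGER NOT NULL,
  version         INTEGER NOT NULL,
  status          VARCHAR(20) NOT NULL,
  failures_count  INTEGER NOT NULL,
  sample_payload  TEXT                   -- small sample of offending rows
);
```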
Exercises & practice
Exercise 1: Design core checks
Given tables staging.customers(id, email, country_code), staging.orders(order_id, customer_id, amount, order_date), define checks for:
- Uniqueness: order_id
- Referential integrity: orders.customer_id in customers.id
- Validity: country_code in [US, CA, MX]
- Consistency: amount >= 0 and order_date not in the future
Deliver: spec snippets with thresholds and actions.
Exercise 2: Adjust freshness policy
Your partner team moves their job window; data may arrive up to 36 hours late for two weeks. Update your freshness check and alert plan without losing safety coverage.
- Keep early warning.
- Avoid false job failures.
- Document temporary window with an expiry date.
Checklist before you proceed
- Have you set severity based on business impact?
- Is there a clear action on failure?
- Are checks placed at the earliest sensible pipeline point?
- Did you avoid noisy alerts?
Common mistakes and self-check
- Too many noisy warnings: prune or raise severity where needed.
- Late checks: move them earlier to fail fast.
- No owner/runbook: assign and document.
- Hard thresholds without data: bootstrap with profiling, then tighten.
- Binary pass/fail only: add context metrics (counts, samples) for diagnosis.
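For example, a range check can emit both a count for the run log and a small sample payload for diagnosis (the LIMIT and column list are illustrative):

```sql
-- Context metric: how many rows fail the rule.
SELECT COUNT(*) AS failures_count
FROM staging.orders
WHERE amount < 0;

-- Sample payload: a handful of offenders for the check_results log.
SELECT order_id, customer_id, amount
FROM staging.orders
WHERE amount < 0
LIMIT 10;
```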
Self-check
- Can you explain why each check exists in one sentence?
- Do you know the action taken when it fails?
- Can you find the last 3 failures and what changed since?
Practical projects
- Build a check registry: a simple table-backed catalog with 10 checks across 3 datasets, plus a weekly summary report.
- Quarantine pipeline: implement a branch that isolates bad rows and produces a daily CSV of offenders.
- Freshness dashboard: compute freshness per dataset and highlight red/yellow/green states.
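For the freshness dashboard, a sketch assuming a hypothetical ops.freshness_metrics table populated by your checks, with per-dataset warn/error thresholds:

```sql
-- Red/yellow/green freshness status per dataset.
SELECT
  dataset,
  freshness_hours,
  CASE
    WHEN freshness_hours >= error_hours THEN 'red'
    WHEN freshness_hours >= warn_hours  THEN 'yellow'
    ELSE 'green'
  END AS status
FROM ops.freshness_metrics
ORDER BY freshness_hours DESC;
```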
Mini challenge
Design a minimum set of 6 checks for a product catalog (products, categories, prices) that would catch 80% of real-world issues. Include severity and actions. Keep it to 10 minutes.
Hint
- PK uniqueness, FK integrity
- Non-null names, positive prices
- Category in allowed set
- Freshness on price updates
Who this is for
- ETL Developers building batch or streaming pipelines
- Data Engineers responsible for data SLAs
- Analytics Engineers maintaining data models
Prerequisites
- Comfortable with SQL
- Basic ETL/ELT workflow understanding
- Familiarity with your team’s alerting and incident process
Learning path
- Start: Define critical data elements and map pipeline points.
- Next: Implement 5–10 high-impact checks with logging.
- Then: Add alerting, quarantine flows, and weekly summaries.
- Finally: Tune thresholds, add anomaly detection for drift.
Next steps
- Complete the exercises below and compare with solutions.
- Take the Quick Test to validate retention.
- Pick one production dataset this week and add at least 3 checks.