Why this matters
As an ETL Developer, you move and transform data that powers dashboards, machine learning features, billing, and audits. Data checks help you detect bad data early, prevent incident cascades, and build trust with stakeholders.
- Stop bad loads: Fail or quarantine when critical checks fail.
- Protect SLAs: Alert when freshness or volume drifts.
- Support audits: Keep an inspection trail of checks and outcomes.
Concept explained simply
A data check is a small test that confirms an assumption about your data. You run many checks at the right points in your pipeline to catch issues fast.
Mental model
Think of your pipeline like a factory line. Each station (ingest, stage, transform, load) has specific quality gates. Each gate runs quick tests with clear pass/fail criteria and actions (warn, quarantine, stop the line).
Common check types
- Schema: expected columns, data types, constraints.
- Completeness: required fields are not null; row counts meet expectations.
- Uniqueness: primary keys have no duplicates.
- Validity: values in allowed sets or patterns (e.g., ISO dates, emails, country codes).
- Referential integrity: foreign keys have matching parent rows.
- Ranges & distribution: numbers within min/max or reasonable distribution.
- Consistency: derivations match (e.g., line total equals qty * price).
- Freshness & timeliness: data is updated on schedule; no long lags.
- Volume & drift: row counts are within expected bands; schema drift detection.
- Business rules: domain-specific constraints (e.g., order date <= ship date).
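Most of these map to small queries where any returned row is a failure. As a sketch of two types from the list above, assuming the staging.customers layout used in the exercises and a hypothetical staging.order_lines table:

```sql
-- Validity: country_code must be in the allowed set;
-- the OR IS NULL catches nulls that NOT IN would silently skip.
SELECT id, country_code
FROM staging.customers
WHERE country_code NOT IN ('US', 'CA', 'MX')
   OR country_code IS NULL;

-- Consistency: line total must equal qty * price
-- (a small tolerance guards against float rounding).
SELECT order_line_id, qty, price, line_total
FROM staging.order_lines
WHERE ABS(line_total - qty * price) > 0.01;
```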
How to define data checks
- Identify critical data elements: Columns that drive joins, aggregations, billing, or compliance.
- Map pipeline points: Source, staging, transformation, pre-load, post-load.
- Select check types per point: Lightweight at source; deeper checks pre/post load.
- Set thresholds and severity:
  - Error: stop the job or quarantine.
  - Warning: log and alert; continue if acceptable.
- Define sampling or full-scan strategy: Full scans for small tables; sampled for large with periodic full checks.
- Choose frequency: Every run, daily summary, or anomaly-only.
- Decide actions on failure: Fail, quarantine, auto-fix, or alert-and-continue.
- Document check spec: Name, objective, rule, threshold, severity, run point, owner, contact, and a runbook link (the runbook itself lives in your team docs).
Minimal spec template (copy/paste)
```
name: <unique_check_name>
objective: <what risk it prevents>
run_point: source|staging|transform|preload|postload
rule: <SQL/logic or pattern>
threshold: <condition, e.g., failures = 0 or freshness_hours <= 24>
severity: error|warning
action_on_fail: fail_job|quarantine|alert_only
owner: team-or-person
notes: <edge cases>
```
Implementing checks in ETL workflows
- Source ingest: schema conformity, basic null checks, volume sanity.
- Staging: deduplication, type coercion checks, pattern validation.
- Transform: business rules, consistency math, referential integrity.
- Pre-load: key uniqueness in target, row count reconciliation.
- Post-load: row deltas vs previous load, aggregation spot-checks, freshness metrics.
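For example, the pre-load row count reconciliation can be a single query. A minimal sketch, assuming a hypothetical audit.load_manifest table that records how many rows the source extracted per batch (:batch_id is a bind parameter):

```sql
-- Compare rows staged for load against the source manifest;
-- a non-zero delta fails the reconciliation check.
SELECT
  s.staged_rows,
  m.expected_rows,
  s.staged_rows - m.expected_rows AS delta
FROM (SELECT COUNT(*) AS staged_rows FROM staging.orders) s
CROSS JOIN (SELECT expected_rows
            FROM audit.load_manifest
            WHERE batch_id = :batch_id) m;
```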
Worked examples
Example 1: Uniqueness for orders
Objective: Ensure order_id is unique before loading the fact table.
Rule (SQL):

```sql
SELECT order_id, COUNT(*) AS c
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```

Threshold: zero rows
Severity: error
Action: quarantine offending rows and continue with the clean set
Maintenance tip
Track the duplicate rate. If an occasional duplicate or two arrives from a known upstream retry, keep severity at error but implement quarantine-and-continue to avoid full pipeline stops.
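A minimal quarantine-and-continue sketch, assuming an ingestion timestamp column load_ts and a quarantine.orders_duplicates table (both hypothetical; temp-table syntax varies by warehouse):

```sql
-- Rank rows per order_id by arrival; rn = 1 is the keeper.
CREATE TEMP TABLE ranked_orders AS
SELECT o.*,
       ROW_NUMBER() OVER (PARTITION BY order_id
                          ORDER BY load_ts) AS rn
FROM staging.orders o;

-- Duplicates beyond the first go to quarantine for review.
INSERT INTO quarantine.orders_duplicates
SELECT order_id, customer_id, amount, order_date, load_ts
FROM ranked_orders
WHERE rn > 1;

-- Only the clean set continues down the pipeline.
CREATE TABLE staging.orders_clean AS
SELECT order_id, customer_id, amount, order_date, load_ts
FROM ranked_orders
WHERE rn = 1;
```

The check still logs at error severity; the pipeline simply proceeds on staging.orders_clean instead of halting.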
Example 2: Referential integrity
Objective: Every staging.orders.customer_id must exist in dim.customers.
Rule (SQL):

```sql
SELECT o.customer_id
FROM staging.orders o
LEFT JOIN dim.customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
```

Threshold: zero rows
Severity: error
Action: fail job if orphan rate > 0.1% of batch; else quarantine and alert
Why two thresholds?
You can set a hard error for any orphans, but use a tiny tolerance (0.1%) for rare late-arriving dimensions to reduce false positives. Document rationale.
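A sketch of the rate computation (0.1% expressed as 0.001; numeric division syntax varies slightly by dialect):

```sql
-- Orphan rate for the batch; the orchestrator compares this
-- against the 0.001 tolerance to decide fail vs quarantine.
SELECT
  COUNT(*) AS batch_rows,
  SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END) AS orphan_rows,
  SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END) * 1.0
    / NULLIF(COUNT(*), 0) AS orphan_rate
FROM staging.orders o
LEFT JOIN dim.customers c
  ON o.customer_id = c.customer_id;
```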
Example 3: Freshness for daily file
Objective: Confirm the daily feed arrives within 24 hours.
Rule (logic): now() - max(event_time) <= 24 hours
Metric source: landing timestamp or latest record timestamp
Threshold: freshness_hours <= 24
Severity: warning at 20h, error at 24h
Action: alert at 20h of lag; fail downstream loads at 24h
Edge case
Holidays or planned downtime: add a maintenance window suppression to avoid noise, with explicit start/end and approval.
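Expressed as a query (Postgres-style interval math; landing.daily_feed is a placeholder for your feed table):

```sql
-- Hours since the newest record, bucketed into the two thresholds.
SELECT
  EXTRACT(EPOCH FROM (now() - MAX(event_time))) / 3600.0 AS freshness_hours,
  CASE
    WHEN now() - MAX(event_time) >= INTERVAL '24 hours' THEN 'error'
    WHEN now() - MAX(event_time) >= INTERVAL '20 hours' THEN 'warning'
    ELSE 'ok'
  END AS status
FROM landing.daily_feed;
```

Maintenance-window suppression is usually best handled in the alerting layer rather than baked into the query itself.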
Monitoring and maintenance
- Version checks: keep a version field for each check. Record changes and reasons.
- Tune thresholds: start strict on critical keys; for the rest, start loose and tighten gradually using observed distributions.
- Alert hygiene: route to the right channel; auto-close when recovered.
- Ownership: assign on-call rotation or an owner field per check.
- Runbooks: every error-severity check needs a short “how to investigate and fix” guide.
- Change management: when upstream schema changes, update check specs and coordinate rollouts.
Lightweight logging schema
```
check_definitions(check_id, name, run_point, rule_text, threshold_text, severity, owner, version, active_flag)
check_runs(run_id, started_at, finished_at, pipeline, dataset, status)
check_results(run_id, check_id, version, status, failures_count, sample_payload)
```
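A minimal DDL sketch of that schema (column types are assumptions; adjust to your warehouse):

```sql
CREATE TABLE check_definitions (
  check_id        INTEGER PRIMARY KEY,
  name            VARCHAR(200) NOT NULL,
  run_point       VARCHAR(20)  NOT NULL,  -- source|staging|transform|preload|postload
  rule_text       TEXT         NOT NULL,
  threshold_text  TEXT         NOT NULL,
  severity        VARCHAR(10)  NOT NULL,  -- error|warning
  owner           VARCHAR(100) NOT NULL,
  version         INTEGER      NOT NULL,
  active_flag     BOOLEAN      NOT NULL
);

CREATE TABLE check_runs (
  run_id       INTEGER PRIMARY KEY,
  started_at   TIMESTAMP NOT NULL,
  finished_at  TIMESTAMP,
  pipeline     VARCHAR(100) NOT NULL,
  dataset      VARCHAR(100) NOT NULL,
  status       VARCHAR(20)  NOT NULL
);

CREATE TABLE check_results (
  run_id          INTEGER NOT NULL,
  check_id        INTEGER NOT NULL,
  version         INTEGER NOT NULL,
  status          VARCHAR(20) NOT NULL,
  failures_count  INTEGER NOT NULL,
  sample_payload  TEXT                   -- small sample of offending rows
);
```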
Exercises & practice
Exercise 1: Design core checks
Given tables staging.customers(id, email, country_code), staging.orders(order_id, customer_id, amount, order_date), define checks for:
- Uniqueness: order_id
- Referential integrity: orders.customer_id in customers.id
- Validity: country_code in [US, CA, MX]
- Consistency: amount >= 0 and order_date not in the future
Deliver: spec snippets with thresholds and actions.
Exercise 2: Adjust freshness policy
Your partner team moves their job window; data may arrive up to 36 hours late for two weeks. Update your freshness check and alert plan without losing safety coverage.
- Keep early warning.
- Avoid false job failures.
- Document temporary window with an expiry date.
Checklist before you proceed
- Have you set severity based on business impact?
- Is there a clear action on failure?
- Are checks placed at the earliest sensible pipeline point?
- Did you avoid noisy alerts?
Common mistakes and self-check
- Too many noisy warnings: prune or raise severity where needed.
- Late checks: move them earlier to fail fast.
- No owner/runbook: assign and document.
- Hard thresholds without data: bootstrap with profiling, then tighten.
- Binary pass/fail only: add context metrics (counts, samples) for diagnosis.
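For example, a range check can emit both a count for the run log and a small sample payload for diagnosis (the LIMIT and column list are illustrative):

```sql
-- Context metric: how many rows fail the rule.
SELECT COUNT(*) AS failures_count
FROM staging.orders
WHERE amount < 0;

-- Sample payload: a handful of offenders for the check_results log.
SELECT order_id, customer_id, amount
FROM staging.orders
WHERE amount < 0
LIMIT 10;
```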
Self-check
- Can you explain why each check exists in one sentence?
- Do you know the action taken when it fails?
- Can you find the last 3 failures and what changed since?
Practical projects
- Build a check registry: a simple table-backed catalog with 10 checks across 3 datasets, plus a weekly summary report.
- Quarantine pipeline: implement a branch that isolates bad rows and produces a daily CSV of offenders.
- Freshness dashboard: compute freshness per dataset and highlight red/yellow/green states.
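For the freshness dashboard, a sketch assuming a hypothetical ops.freshness_metrics table populated by your checks, with per-dataset warn/error thresholds:

```sql
-- Red/yellow/green freshness status per dataset.
SELECT
  dataset,
  freshness_hours,
  CASE
    WHEN freshness_hours >= error_hours THEN 'red'
    WHEN freshness_hours >= warn_hours  THEN 'yellow'
    ELSE 'green'
  END AS status
FROM ops.freshness_metrics
ORDER BY freshness_hours DESC;
```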
Mini challenge
Design a minimum set of 6 checks for a product catalog (products, categories, prices) that would catch 80% of real-world issues. Include severity and actions. Keep it to 10 minutes.
Hint
- PK uniqueness, FK integrity
- Non-null names, positive prices
- Category in allowed set
- Freshness on price updates
Who this is for
- ETL Developers building batch or streaming pipelines
- Data Engineers responsible for data SLAs
- Analytics Engineers maintaining data models
Prerequisites
- Comfortable with SQL
- Basic ETL/ELT workflow understanding
- Familiarity with your team’s alerting and incident process
Learning path
- Start: Define critical data elements and map pipeline points.
- Next: Implement 5–10 high-impact checks with logging.
- Then: Add alerting, quarantine flows, and weekly summaries.
- Finally: Tune thresholds, add anomaly detection for drift.
Next steps
- Complete the exercises below and compare with solutions.
- Take the Quick Test to validate retention.
- Pick one production dataset this week and add at least 3 checks.