
Data Quality Reporting

Learn Data Quality Reporting for free with explanations, worked examples, exercises, and a quick test (for ETL Developers).

Published: January 11, 2026 | Updated: January 11, 2026

Why this matters

As an ETL Developer, you are responsible for delivering trustworthy data. Data Quality Reporting turns raw validation checks into clear, decision-ready signals for engineers, analysts, and business owners. It answers: Is today’s data safe to use? What broke? How bad is it? Who’s impacted? Without reporting, even good checks go unnoticed and issues persist.

  • Daily ops: Verify pipeline health, detect anomalies, meet SLAs.
  • Stakeholder trust: Show completeness, accuracy, timeliness, and trends.
  • Root-cause speed: Pinpoint failing tables/columns and track mean time to resolution (MTTR).
  • Compliance: Provide audit-ready evidence of data controls.

Concept explained simply

Data Quality Reporting is the practice of summarizing data quality checks into human-friendly metrics, statuses, and trends. It turns many small checks into a concise picture of quality for each dataset and run.

Mental model

  • Dimensions: Completeness, Validity, Uniqueness, Consistency, Accuracy, Timeliness/Freshness.
  • Checks produce metrics (e.g., percent non-null). Thresholds convert metrics into statuses (Green/Yellow/Red); a minimal sketch follows this list.
  • Runs are snapshots. Trends show direction over time.
  • Reports answer: what failed, how severe, where, since when.
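
To make the metric-to-status step concrete, here is a minimal SQL sketch using the completeness figure and thresholds from Example 1 below; the literal 98.50 is illustrative, and a FROM-less SELECT may need a dummy table in some dialects.
-- Convert a metric value into a status via thresholds (higher is better here;
-- invert the comparisons for "lower is better" metrics like duplicate rate).
SELECT
  98.50 AS pct_not_null,
  CASE WHEN 98.50 >= 99 THEN 'Green'
       WHEN 98.50 >= 95 THEN 'Yellow'
       ELSE 'Red' END AS status;  -- yields 'Yellow'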

What a good data quality report includes

  • Scope: dataset/table, partition/date, environment.
  • Key metrics: per dimension (e.g., % non-null, duplicate rate, late-arrival minutes).
  • Thresholds and status: Green/Yellow/Red with clear definitions.
  • Counts: total rows, failed rows, number of failing checks.
  • Impact: critical columns flagged; downstream dataset notes.
  • Trend: last 7–30 runs with sparkline or counts by status.
  • Action: owner, next steps, ticket reference, time to resolution.

Worked examples

Example 1 — Completeness

Goal: Report email completeness for customer table.

Computation
-- Assuming table: mart.customer (id, email, signup_date)
SELECT
  'mart.customer' AS dataset,
  CURRENT_DATE AS run_date,
  'email' AS column_name,
  COUNT(*) AS row_count,
  SUM(CASE WHEN email IS NOT NULL AND email <> '' THEN 1 ELSE 0 END) AS not_null_count,
  ROUND(100.0 * SUM(CASE WHEN email IS NOT NULL AND email <> '' THEN 1 ELSE 0 END)/COUNT(*), 2) AS pct_not_null
FROM mart.customer;
Report snippet
Dataset: mart.customer | Run: 2026-01-11
Column  RowCount  NotNull  %NotNull  Thresholds        Status
email   120,000   118,200  98.50     G >= 99, Y >= 95  Yellow
Reason: Slightly below 99% target. Monitor and open task if trend worsens.

Example 2 — Uniqueness

Goal: Ensure order_id is unique in fact.orders.

Computation
SELECT
  'fact.orders' AS dataset,
  CURRENT_DATE AS run_date,
  COUNT(*) AS row_count,
  COUNT(DISTINCT order_id) AS distinct_orders,
  COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_count,
  ROUND(100.0 * (COUNT(*) - COUNT(DISTINCT order_id))/COUNT(*), 4) AS duplicate_rate_pct
FROM fact.orders;
Report snippet
Dataset: fact.orders | Run: 2026-01-11
Rows: 2,450,000  Duplicates: 150 (0.0061%)
Thresholds: Green <= 0.001%, Yellow <= 0.01%, else Red
Status: Yellow | Impact: Analysts using order-level KPIs may be affected.
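
When the duplicate count is non-zero, the natural drill-down is to list the offending keys. A minimal sketch against the same fact.orders table (LIMIT syntax varies by dialect):
-- Which order_ids are duplicated, and how many times each?
SELECT
  order_id,
  COUNT(*) AS occurrences
FROM fact.orders
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;  -- top offenders first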

Example 3 — Timeliness/Freshness

Goal: Confirm daily partition arrives before 06:00 UTC.

Computation
-- Assume we log partition arrival times in meta.partition_log
SELECT
  'raw.clicks' AS dataset,
  run_date,
  MIN(arrival_at_utc) AS first_arrival_utc,
  -- Assumes arrival_at_utc is a timestamp; cast to TIME for the SLA comparison
  CASE WHEN CAST(MIN(arrival_at_utc) AS TIME) <= TIME '06:00:00' THEN 'Green'
       WHEN CAST(MIN(arrival_at_utc) AS TIME) <= TIME '07:00:00' THEN 'Yellow'
       ELSE 'Red' END AS status
FROM meta.partition_log
WHERE dataset = 'raw.clicks' AND run_date = CURRENT_DATE
GROUP BY run_date;
Report snippet
Dataset: raw.clicks | Run: 2026-01-11
First partition arrival: 06:23 UTC
SLA: <= 06:00 Green, <= 07:00 Yellow, else Red
Status: Yellow | Action: Review upstream source delay.

Minimal reporting pipeline (step-by-step)

  1. Define KPIs: Pick 3–5 core metrics per critical dataset (e.g., completeness on key columns, duplicate rate, row_count delta).
  2. Set thresholds: Green/Yellow/Red per metric with business input.
  3. Compute metrics: SQL or Python. Store one row per metric per run.
  4. Store: Write to a narrow table meta.dq_metrics (dataset, run_id, metric, value, status, details).
  5. Summarize: Build daily summaries meta.dq_run_summary (counts by status, failed checks).
  6. Notify: If any Red on critical datasets, create alert/ticket (a query sketch follows this list).
  7. Trend: Keep 30-day history for each metric to spot drifts.
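
A sketch of the notify step (6), assuming the schemas suggested below; the critical-dataset list is hard-coded here for illustration and would normally live in a config table:
-- Find Red checks on critical datasets for today's run.
SELECT m.dataset, m.metric, m.value, m.details
FROM meta.dq_metrics m
JOIN meta.dq_run r ON r.run_id = m.run_id
WHERE r.run_date = CURRENT_DATE
  AND m.status = 'Red'
  AND m.dataset IN ('fact.orders', 'mart.customer');  -- illustrative critical list
Any rows returned would trigger an alert or ticket in your orchestrator.
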
Suggested schemas (copy/paste)
-- Per-run header
CREATE TABLE IF NOT EXISTS meta.dq_run (
  run_id STRING,
  run_date DATE,
  environment STRING,
  started_at TIMESTAMP,
  ended_at TIMESTAMP,
  status STRING
);

-- Long/narrow metrics
CREATE TABLE IF NOT EXISTS meta.dq_metrics (
  run_id STRING,
  dataset STRING,
  table_name STRING,
  column_name STRING,
  metric STRING,            -- e.g., pct_not_null, duplicate_rate_pct
  value DOUBLE,
  threshold_green DOUBLE,   -- nullable; interpret per metric
  threshold_yellow DOUBLE,  -- nullable
  status STRING,            -- Green/Yellow/Red
  details STRING
);

-- Summaries for quick dashboards
CREATE TABLE IF NOT EXISTS meta.dq_run_summary (
  run_id STRING,
  dataset STRING,
  green_count INT,
  yellow_count INT,
  red_count INT,
  failed_checks INT,
  notes STRING
);
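
With these schemas in place, steps 3–4 can be implemented as one INSERT ... SELECT per metric. Below is a minimal sketch for the email-completeness metric from Example 1; the run_id literal is a placeholder you would generate per run, and CAST(NULL AS STRING) may need adjusting per dialect.
-- Compute one metric and store it as one narrow row (steps 3-4).
INSERT INTO meta.dq_metrics
SELECT
  t.run_id, t.dataset, t.table_name, t.column_name, t.metric, t.value,
  99.0 AS threshold_green,
  95.0 AS threshold_yellow,
  CASE WHEN t.value >= 99.0 THEN 'Green'
       WHEN t.value >= 95.0 THEN 'Yellow'
       ELSE 'Red' END AS status,
  CAST(NULL AS STRING) AS details
FROM (
  SELECT
    '2026-01-11-run-001' AS run_id,  -- placeholder; generate per pipeline run
    'mart.customer' AS dataset,
    'customer' AS table_name,
    'email' AS column_name,
    'pct_not_null' AS metric,
    ROUND(100.0 * SUM(CASE WHEN email IS NOT NULL AND email <> '' THEN 1 ELSE 0 END) / COUNT(*), 2) AS value
  FROM mart.customer
) t;
Summaries for meta.dq_run_summary then follow the same pattern: group meta.dq_metrics by run_id and dataset and count rows per status.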

Practice & exercises


Exercise 1: Aggregate Completeness by Column

Compute percent non-null for email and signup_date in mart.customer and output one row per column with row_count, not_null_count, and pct_not_null (rounded to 2 decimals).

Expected output structure
column_name | row_count | not_null_count | pct_not_null
email       | 120000    | 118200         | 98.50
signup_date | 120000    | 120000         | 100.00
Hints
  • Use SUM(CASE WHEN col IS NOT NULL AND col <> '' THEN 1 ELSE 0 END).
  • GROUP BY column_name; you can UNION ALL two SELECTs if needed.

Exercise 2: Status Summary for Last 7 Runs

From meta.dq_metrics, build a summary for dataset fact.orders showing, per run_date, total checks, failed checks (status = 'Red'), and an overall run status: Green if all Green, Yellow if any Yellow and none Red, Red if any Red.

Expected output structure
run_date   | total_checks | failed_checks | overall_status
2026-01-05 | 12           | 0             | Green
2026-01-06 | 12           | 1             | Red
2026-01-07 | 12           | 0             | Yellow
Hints
  • Aggregate counts by run_id/run_date.
  • Derive overall_status with CASE on counts of Red and Yellow.


Common mistakes and how to self-check

  • Only row counts, no context. Fix: Always include thresholds and status per metric.
  • Hard-coded thresholds hidden in code. Fix: Store thresholds alongside metrics so reports are reproducible.
  • Reporting on averages only. Fix: Show distribution or separate metrics per column; averages can hide extremes.
  • No trend. Fix: Include last 7–30 runs; alert on sustained degradation.
  • Ignoring impact. Fix: Tag critical columns/datasets and highlight when they fail.
  • Not tracking fixes. Fix: Add ticket/owner and measure time to recovery.

Practical projects

  • Build a daily DQ dashboard: Green/Yellow/Red by dataset, with drill-down to failing metrics.
  • Auto-SLA report: Timeliness checks for all daily partitions, send a summary at 07:05 UTC.
  • Quality trend notebook: 30-day trend of completeness for top 5 critical columns, with rolling averages.
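
For the trend notebook idea, a window function over the stored metrics gets you most of the way. A sketch assuming the meta.dq_metrics and meta.dq_run schemas above (date-arithmetic syntax varies by dialect):
-- 30-day completeness trend with a 7-run rolling average per column.
SELECT
  r.run_date,
  m.dataset,
  m.column_name,
  m.value AS pct_not_null,
  AVG(m.value) OVER (
    PARTITION BY m.dataset, m.column_name
    ORDER BY r.run_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_7
FROM meta.dq_metrics m
JOIN meta.dq_run r ON r.run_id = m.run_id
WHERE m.metric = 'pct_not_null'
  AND r.run_date >= CURRENT_DATE - INTERVAL '30' DAY
ORDER BY m.dataset, m.column_name, r.run_date;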

Who this is for

  • ETL/ELT developers who run production pipelines.
  • Data analysts who need reliable datasets and quick validation signals.
  • Data leads who must evidence data controls to stakeholders or auditors.

Prerequisites

  • Basic SQL (GROUP BY, CASE, aggregation).
  • Understanding of your warehouse tables and daily run cadence.
  • Familiarity with common data quality dimensions.

Learning path

  1. List datasets and critical columns; agree on thresholds with stakeholders.
  2. Implement completeness, uniqueness, and timeliness checks.
  3. Store metrics in a narrow dq_metrics table per run.
  4. Publish daily run summaries and a 7-day trend.
  5. Add accuracy/consistency checks as you mature.

Next steps

  • Finish the exercises, then take the quick test below.
  • Expand coverage to your top 5 revenue-impacting datasets.
  • Introduce auto-ticketing for any Red on critical data.

Mini challenge

Your sales dashboard shows a sudden drop. Yesterday’s DQ report is Green overall, but duplicate_rate_pct is near the Yellow threshold and timeliness is Yellow. In one paragraph, write what you would communicate to stakeholders (include suspected cause, impact, and the next diagnostic you will run). Keep it concise and actionable.


Data Quality Reporting — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
