
Data Quality Checks

Learn Data Quality Checks for free with explanations, exercises, and a quick test (for Product Analysts).

Published: December 22, 2025 | Updated: December 22, 2025

Why this matters

As a Product Analyst, your charts, experiments, and business decisions are only as good as the data behind them. Data quality checks help you catch issues early: missing rows after a pipeline change, unexpected spikes of nulls, duplicate events inflating metrics, or broken joins causing orphaned records.

  • Daily dashboards: Confirm counts and key metrics haven’t silently drifted.
  • Experiment analysis: Ensure assignment and events are complete and deduplicated.
  • Revenue reporting: Validate totals, currencies, and outliers before leadership sees them.
Real-world examples you’ll face
  • New ETL step halves yesterday’s orders — detect with row-count reconciliation.
  • Client-side bug doubles event sends — catch via duplicate-key checks.
  • Tracking change introduces null user_id — spot via null-rate thresholds.

Concept explained simply

Data quality checks are short SQL queries that verify assumptions about your data: uniqueness, completeness, validity, and consistency. Think of them as small unit tests for your warehouse tables.

Mental model

  • Uniqueness: “Each primary key appears once.”
  • Completeness: “We didn’t lose rows vs. source.”
  • Validity: “Values are in expected ranges/enums.”
  • Consistency: “Foreign keys match parent tables.”
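
For the consistency bullet, here is a minimal referential-integrity sketch. It assumes an events table whose user_id should reference users(id); the table and column names are assumptions, so adapt them to your schema.

-- Orphaned events: user_id values with no matching row in users
-- (table/column names are assumptions; adjust to your schema)
SELECT COUNT(*) AS orphaned_events
FROM events e
LEFT JOIN users u ON e.user_id = u.id
WHERE e.user_id IS NOT NULL
  AND u.id IS NULL;

A non-zero count means child rows reference parents that do not exist (or have not arrived yet).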
Pro tip

Prefer stable checks (counts, distinct IDs, null rates) over fragile ones. Start small and automate the most business-critical checks first.

Worked examples

1) Duplicate primary keys

-- Find duplicate order IDs
SELECT id AS duplicate_id, COUNT(*) AS cnt
FROM orders_raw
GROUP BY id
HAVING COUNT(*) > 1;

Why: Duplicates can inflate metrics. This query lists any IDs appearing more than once.

2) Row-count reconciliation by day

WITH raw_daily AS (
  SELECT CAST(created_at AS DATE) AS d, COUNT(*) AS raw_count
  FROM orders_raw
  GROUP BY 1
), clean_daily AS (
  SELECT order_date AS d, COUNT(*) AS clean_count
  FROM orders_clean
  GROUP BY 1
)
SELECT COALESCE(r.d, c.d) AS d,
       COALESCE(r.raw_count, 0) AS raw_count,
       COALESCE(c.clean_count, 0) AS clean_count,
       COALESCE(c.clean_count, 0) - COALESCE(r.raw_count, 0) AS diff
FROM raw_daily r
FULL OUTER JOIN clean_daily c USING (d)
ORDER BY d;

Why: Ensures your clean table didn’t lose or gain rows compared to the source.

3) Null-rate and value-range checks

-- Null-rate check for user_id
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS null_user_id,
  100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*),0) AS null_pct
FROM events;

-- Value range check for order amount
SELECT COUNT(*) AS out_of_range
FROM orders_clean
WHERE amount_usd < 0 OR amount_usd > 100000; -- adjust expected range

Why: Sudden null spikes or impossible values often signal ingestion or transformation issues.

Who this is for

  • Product Analysts who own dashboards and experiments.
  • Anyone who writes SQL for reports or metrics and needs reliable data.

Prerequisites

  • Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
  • Comfort with reading table schemas and identifying keys.

Learning path

  1. Identify critical tables: orders, events, users.
  2. Define expectations: primary keys unique, no orphaned FKs, stable daily counts.
  3. Write checks: duplicates, nulls, ranges, referential integrity, reconciliations.
  4. Make thresholds explicit: e.g., user_id null rate < 1%.
  5. Automate: schedule daily queries; review failures in a simple report.
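
To make steps 4 and 5 concrete, here is one possible shape for an automatable, thresholded check; the 1% threshold, the events table, and its created_at column are illustrative assumptions. The query returns rows only for days that breach the threshold, so an empty result means "green".

-- Days where the user_id null rate exceeds 1% (threshold is illustrative)
SELECT
  CAST(created_at AS DATE) AS d,
  100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) AS null_pct
FROM events
GROUP BY 1
HAVING 100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0) > 1.0
ORDER BY 1;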

Practice (mirrors the exercises below)

Exercise 1 — Detect duplicates and nulls in a primary key column

Table: orders_raw(id, user_id, created_at, amount, status)

  1. Write a summary query returning total rows, distinct ids, and null ids.
  2. List duplicate ids with their counts.
Tip

Use COUNT(*), COUNT(DISTINCT id), and GROUP BY id HAVING COUNT(*) > 1.
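
If you get stuck, here is one possible shape for the summary query in step 1 (the column aliases are illustrative):

-- Primary-key health summary for orders_raw
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT id) AS distinct_ids,
  SUM(CASE WHEN id IS NULL THEN 1 ELSE 0 END) AS null_ids
FROM orders_raw;

For clean data you expect distinct_ids = total_rows and null_ids = 0; the duplicate listing in step 2 then mirrors worked example 1.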

Exercise 2 — Reconcile row counts by day between source and clean

Tables: orders_raw(id, created_at), orders_clean(order_id, order_date)

  1. Produce daily counts from both tables and the difference.
  2. Return only days where diff != 0.
Tip

Cast (or truncate) timestamps to a date, then FULL OUTER JOIN the daily counts and COALESCE missing values to 0.

Checklist before you say “done”

  • I can find duplicates with GROUP BY + HAVING COUNT(*) > 1.
  • I can reconcile daily row counts between two tables.
  • I can compute null rates and set thresholds.
  • I can detect orphaned records with LEFT JOIN ... IS NULL.
  • I can validate value ranges and enums.

Common mistakes and self-check

  • Only checking totals: Totals can hide problems. Add daily breakdowns and distinct-ID checks.
  • Forgetting COALESCE in reconciliations: Missing dates vanish. Always COALESCE counts to 0.
  • Comparing timestamps at full precision: Truncate to day when validating daily counts.
  • Not filtering soft-deleted rows: Apply business rules (e.g., status NOT IN ('canceled')).
  • Ignoring keys: Define PK/FK expectations before writing checks.
How to self-check fast
  • Run your checks for the last 7–30 days and scan for non-zero diffs.
  • Compare before/after a known pipeline change.
  • Validate one day manually in both source and target.

Practical projects

  • Daily quality dashboard: counts, distinct IDs, null rates, and diffs for top-5 tables.
  • Funnel integrity report: step-wise drop-offs plus duplicate and null checks per step.
  • Revenue sanity checks: sum(amount), negative value scan, extreme outlier detection.
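
For the revenue sanity checks, a minimal sketch, assuming the orders_clean(order_date, amount_usd) columns from the worked examples; the 100000 cutoff is an illustrative placeholder, not a recommended value.

-- Daily revenue with simple sanity flags
SELECT
  order_date,
  SUM(amount_usd) AS revenue_usd,
  SUM(CASE WHEN amount_usd < 0 THEN 1 ELSE 0 END) AS negative_rows,
  SUM(CASE WHEN amount_usd > 100000 THEN 1 ELSE 0 END) AS extreme_rows -- illustrative cutoff
FROM orders_clean
GROUP BY order_date
ORDER BY order_date;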

Mini challenge

Write a single query to find event days where the share of null user_id is above 2% and the total events increased by over 10% vs. the 7-day average (use a simple average of the previous 7 days). Return day, null_pct, total_events, and pct_change_vs_7d_avg.

Hint

Compute daily aggregates, then join to a 7-day rolling window average using a window function on date.
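
A partial skeleton for the rolling part only (not a full solution): it assumes events has a created_at timestamp column and leaves the null-rate and filtering logic to you. Note that ROWS BETWEEN counts rows, so the "previous 7 days" reading only holds when no days are missing from the daily aggregate.

-- Rolling baseline: average of the previous 7 rows (days), excluding the current one
WITH daily AS (
  SELECT CAST(created_at AS DATE) AS d, COUNT(*) AS total_events
  FROM events
  GROUP BY 1
)
SELECT
  d,
  total_events,
  AVG(total_events) OVER (
    ORDER BY d
    ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
  ) AS avg_events_prev_7d
FROM daily
ORDER BY d;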

Next steps

  • Extend checks to referential integrity (users to events, orders to users).
  • Add anomaly thresholds for key KPIs (e.g., conversion, AOV) alongside raw counts.
  • Automate: schedule queries and review a daily “green/red” status report.

Practice Exercises

2 exercises to complete

Instructions

Table: orders_raw(id, user_id, created_at, amount, status)

  1. Write a summary query returning: total_rows, distinct_ids, null_ids.
  2. Write a second query listing duplicate ids with their counts.
Expected Output

  1. One row with columns total_rows, distinct_ids, null_ids (for clean data: distinct_ids = total_rows and null_ids = 0).
  2. Zero rows if there are no duplicates; otherwise one row per duplicate id with its cnt.

Data Quality Checks — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

