
Data Quality Checks

Learn Data Quality Checks for BI Analysts for free: roadmap, worked examples, subskills, and a skill exam.

Published: December 22, 2025 | Updated: December 22, 2025

Why Data Quality Checks matter for BI Analysts

  • Catch issues early: identify missing events, stale data, broken joins, and unexpected nulls.
  • Explain variance: reconcile metrics with source systems to pinpoint where differences arise.
  • Safeguard consistency: keep the same metric equal across dashboards and teams.
  • Scale responsibly: automate checks and alerts as data volume and scope grow.

Who this is for

  • Aspiring and current BI Analysts who build or maintain dashboards and reports.
  • Data-savvy PMs or Operations Analysts who own KPIs and need reliable numbers.
  • Analytics Engineers who partner with BI to harden metric pipelines.

Prerequisites

  • SQL basics: SELECT, WHERE, GROUP BY, JOIN, and window functions (e.g., ROW_NUMBER, COUNT(*) OVER).
  • Familiarity with your warehouse and BI tool (e.g., dashboards, data refresh settings).
  • Basic statistics: median vs mean, percentiles, outliers.

Learning path (practical roadmap)

Milestone 1 — Map critical metrics and sources

List your top 5 business metrics (e.g., Orders, Revenue, Active Users). For each metric, note the source system, ingestion path, warehouse tables, and dashboard owners.

  • Deliverable: a one-page metric-to-source map.
  • Mini task: identify the primary keys and time columns per metric.

Milestone 2 — Profile and baseline

Establish normal behavior so you can detect anomalies later.

  • Compute daily row counts, null rates for important columns, and unique-key coverage (a starter query follows this list).
  • Deliverable: a “baseline sheet” with typical ranges (e.g., weekday vs weekend).
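A minimal profiling sketch, assuming a fact table named fact_orders with order_date, order_id, and customer_id columns (substitute your own table and columns):

-- Daily row count, null rate for a key column, and unique-key coverage
SELECT order_date::date AS d,
       COUNT(*) AS row_count,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS customer_id_null_pct,
       COUNT(DISTINCT order_id) AS distinct_order_ids,
       100.0 * COUNT(DISTINCT order_id) / COUNT(*) AS key_coverage_pct
FROM fact_orders
GROUP BY 1
ORDER BY 1;

Run it over the last 30–60 days and record the typical ranges in your baseline sheet.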

Milestone 3 — Freshness and completeness

Ensure data arrives on time and fully.

  • Freshness: compare the max event time against the current time or the expected load time.
  • Completeness: compare counts vs recent average or upstream logs.

Milestone 4 — Duplicates and outliers

Find and handle duplicates, and detect outliers without over-alerting.

  • Use window functions for duplicates by business key + time.
  • Use percentiles/IQR for robust outlier checks.

Milestone 5 — Join integrity and null monitoring

Identify broken joins and missing dimension coverage.

  • Monitor null rates for foreign keys and important attributes.
  • Check join cardinalities and orphaned records.

Milestone 6 — Reconcile across systems and reports

Compare warehouse facts to source-of-truth, and keep dashboards aligned.

  • Drill down by date, product, and channel to localize differences.
  • Escalate with clear evidence and reproducible queries.

Milestone 7 — Automate checks and alerts

Start with the essentials; tune thresholds to reduce noise.

  • Create scheduled queries that write check results to a checks table (see the sketch after this list).
  • Set conditional alerts: freshness delay, null spikes, count deltas.
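A minimal sketch of a checks table plus one scheduled check that writes into it; the table name, columns, and the 2-hour threshold are illustrative assumptions, and the interval/epoch syntax shown is Postgres-style (adjust for your warehouse):

-- Illustrative checks table
CREATE TABLE IF NOT EXISTS dq_checks (
  check_date   date,
  check_name   varchar(100),
  metric_value numeric,
  threshold    numeric,
  status       varchar(10),
  note         varchar(500)
);

-- Scheduled freshness check for fact_events
INSERT INTO dq_checks
SELECT CURRENT_DATE,
       'fact_events_freshness_hours',
       EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(event_time))) / 3600.0,
       2,
       CASE WHEN CURRENT_TIMESTAMP - MAX(event_time) <= INTERVAL '2 hours'
            THEN 'pass' ELSE 'fail' END,
       'Latest event must be no older than 2 hours'
FROM fact_events;

An alert can then simply watch for rows where status = 'fail' for today's check_date.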

Milestone 8 — Root cause analysis and documentation

Build a repeatable incident flow and document known data limitations.

  • RCA checklist: scope, reproduce, isolate layer (source, ETL, model, viz), fix, prevent.
  • Write a “data caveats” section per metric.

Worked examples

Example 1 — Metric reconciliation with source

Goal: Compare daily revenue between source and warehouse to localize differences.

-- Warehouse metric (fact_orders)
WITH wh AS (
  SELECT order_date::date AS d, SUM(total_amount) AS wh_rev
  FROM fact_orders
  GROUP BY 1
),
-- Source metric (raw.sales_orders)
src AS (
  SELECT created_at::date AS d, SUM(amount) AS src_rev
  FROM raw.sales_orders
  GROUP BY 1
)
SELECT COALESCE(wh.d, src.d) AS date,
       wh.wh_rev,
       src.src_rev,
       (wh.wh_rev - src.src_rev) AS diff
FROM wh
FULL OUTER JOIN src USING(d)
ORDER BY date;

How to use

If diff spikes on a specific date, drill by channel, product, or currency to isolate the scope. Check for late-arriving updates or currency conversion differences.
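One way to drill down, assuming both tables carry a channel column (substitute whatever dimension you slice by, and the suspect date):

-- Localize the difference by channel for one suspect date
WITH wh AS (
  SELECT channel, SUM(total_amount) AS wh_rev
  FROM fact_orders
  WHERE order_date::date = DATE '2025-12-01'  -- replace with the suspect date
  GROUP BY 1
),
src AS (
  SELECT channel, SUM(amount) AS src_rev
  FROM raw.sales_orders
  WHERE created_at::date = DATE '2025-12-01'
  GROUP BY 1
)
SELECT COALESCE(wh.channel, src.channel) AS channel,
       wh_rev, src_rev, (wh_rev - src_rev) AS diff
FROM wh FULL OUTER JOIN src USING (channel)
ORDER BY diff DESC NULLS LAST;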

Example 2 — Event completeness and freshness

Goal: Ensure events are up-to-date and daily counts look normal.

-- Freshness: latest event timestamp
SELECT MAX(event_time) AS latest_event_time
FROM fact_events;

-- Completeness: compare today vs the average of the prior 7 same-weekday counts
WITH d AS (
  SELECT event_time::date AS d, COUNT(*) AS cnt
  FROM fact_events
  GROUP BY 1
),
base AS (
  SELECT d,
         cnt,
         AVG(cnt) OVER (
           PARTITION BY EXTRACT(DOW FROM d)
           ORDER BY d ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
         ) AS dow_avg
  FROM d
)
SELECT * FROM base WHERE d = CURRENT_DATE;

How to use

Alert if latest_event_time is beyond your acceptable delay (e.g., > 2 hours). Alert if today’s cnt is < 70% of dow_avg after the expected load time.
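A sketch of both rules as alert queries that return rows only on failure; the 2-hour and 70% thresholds come from the example above, and daily_event_counts is an assumed view materializing the base CTE:

-- Freshness alert: returns a row only when data is stale
SELECT MAX(event_time) AS latest_event_time
FROM fact_events
HAVING MAX(event_time) < CURRENT_TIMESTAMP - INTERVAL '2 hours';

-- Completeness alert: returns a row only when today is below 70% of the same-weekday average
SELECT d, cnt, dow_avg
FROM daily_event_counts
WHERE d = CURRENT_DATE
  AND cnt < 0.7 * dow_avg;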

Example 3 — Detecting duplicates

Goal: Identify duplicate events by (event_id) or by a composite key.

-- Find duplicate event_ids
WITH marked AS (
  SELECT event_id,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) AS rn
  FROM fact_events
)
SELECT * FROM marked WHERE rn > 1;

-- For composite key (user_id, session_id, event_name, event_time::date)
WITH marked AS (
  SELECT user_id, session_id, event_name, event_time::date AS d,
         ROW_NUMBER() OVER (
           PARTITION BY user_id, session_id, event_name, event_time::date
           ORDER BY event_time
         ) AS rn
  FROM fact_events
)
SELECT * FROM marked WHERE rn > 1;

How to use

Decide the business key that should be unique. If duplicates arise from replays, deduplicate at the model stage using rn = 1, and coordinate with ingestion teams.
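A minimal dedup sketch at the model stage; the view name is illustrative, and keeping the earliest event_time is one possible tie-break rule:

-- Keep exactly one row per event_id
CREATE OR REPLACE VIEW fact_events_dedup AS
SELECT *
FROM (
  SELECT e.*,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) AS rn
  FROM fact_events e
) ranked
WHERE rn = 1;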

Example 4 — Null rates and broken joins

Goal: Monitor missing dimension coverage and join issues.

-- Foreign key null rate over time
SELECT order_date::date AS d,
       COUNT(*) AS row_count,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS nulls,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate_pct
FROM fact_orders
GROUP BY 1
ORDER BY 1;

-- Broken join check: missing dimension keys
WITH lefted AS (
  SELECT f.order_id, f.customer_id, d.customer_id AS dim_key
  FROM fact_orders f
  LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
)
SELECT COUNT(*) AS missing_customers
FROM lefted
WHERE dim_key IS NULL AND customer_id IS NOT NULL;

How to use

Alert on sudden spikes in null_rate_pct. Investigate: Did dimension keys change? Did the upstream stop populating customer_id?
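Orphans are one failure mode; the other is fan-out, where a fact row matches more than one dimension row and silently inflates metrics. A quick cardinality check, assuming customer_id should be unique in dim_customer:

-- Dimension keys that appear more than once (any join on them will fan out)
SELECT customer_id, COUNT(*) AS dup_rows
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY dup_rows DESC;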

Example 5 — Outlier detection with percentiles

Goal: Flag unusually large order amounts using robust thresholds.

-- Percentile-based cutoffs (requires percentile_cont or equivalent)
WITH stats AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS p25,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS p75
  FROM fact_orders
),
calc AS (
  SELECT f.*, s.p25, s.p75,
         (s.p75 - s.p25) AS iqr
  FROM fact_orders f CROSS JOIN stats s
)
SELECT *
FROM calc
WHERE total_amount > (p75 + 1.5 * iqr)
   OR total_amount < (p25 - 1.5 * iqr);

Notes

If your warehouse lacks PERCENTILE_CONT, use its equivalent (e.g., APPROX_PERCENTILE, or NTILE buckets filtered with QUALIFY). Tune the 1.5 multiplier to your business tolerance.
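For instance, BigQuery can approximate the same cutoffs with APPROX_QUANTILES (a sketch; function names and syntax differ per warehouse):

-- BigQuery-style approximation of p25 and p75
SELECT APPROX_QUANTILES(total_amount, 100)[OFFSET(25)] AS p25,
       APPROX_QUANTILES(total_amount, 100)[OFFSET(75)] AS p75
FROM fact_orders;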

Example 6 — Cross-report consistency

Goal: Ensure a metric matches across two curated datasets feeding different dashboards.

WITH a AS (
  SELECT order_date::date AS d, SUM(total_amount) AS rev_a
  FROM mart_finance_revenue
  GROUP BY 1
), b AS (
  SELECT order_date::date AS d, SUM(total_amount) AS rev_b
  FROM mart_ops_revenue
  GROUP BY 1
)
SELECT COALESCE(a.d, b.d) AS date,
       rev_a, rev_b,
       (rev_a - rev_b) AS delta
FROM a FULL OUTER JOIN b USING(d)
ORDER BY date;

How to use

If delta appears only for certain channels, align business rules (refund handling, FX rates, late corrections) and document the agreed definition.

Drills and micro-exercises

  • Write a query to compute daily null rates for two key columns in your main fact table.
  • Find duplicate rows by your business key and produce a deduped view with ROW_NUMBER.
  • Create a daily count trend for the last 30 days and annotate typical weekday/weekend ranges.
  • Compare two versions of the same metric and produce a table of deltas by date and channel.
  • Compute percentile cutoffs for a metric and list top 50 outliers.
  • Build a small checks table (date, check_name, status, value, threshold, note) and insert results from two checks.

Common mistakes and debugging tips

Mistake: Alerting on raw variance without context

Tip: Compare against rolling averages or same-weekday baselines, and wait for the expected load window before alerting.

Mistake: Ignoring surrogate key changes

Tip: Track business identifiers (e.g., external_id) and ensure consistent joins across slowly changing dimensions.

Mistake: Overreliance on mean + 3σ for skewed data

Tip: Use percentiles or IQR for heavy-tailed distributions to reduce false alarms.

Mistake: One-off fixes without documenting

Tip: Add a “Known Data Limitations” note with the scope, impact, and workaround. Keep it close to the metric definition.

Mistake: Not localizing discrepancies

Tip: Slice by date, product, and channel to pinpoint the segment causing gaps before escalating.

Mini project: Data Trust Pack for the Sales Funnel

  1. Define funnel metrics: sessions, adds-to-cart, checkouts, purchases.
  2. Create baseline queries: daily counts and null rates for key IDs (user_id, session_id, order_id).
  3. Freshness check: latest event_time and acceptable delay per table.
  4. Completeness: today’s counts vs rolling 7-day same-weekday baseline.
  5. Duplicates: find duplicates by (order_id) and dedupe view.
  6. Outliers: flag abnormally high order_amount using IQR.
  7. Join integrity: percent of orders missing customers in dim_customer.
  8. Reconciliation: compare purchase count with source system per day.
  9. Automation: write check outputs into a checks table and mark pass/fail.
  10. Documentation: a one-pager listing caveats and what each alert means.

Success criteria
  • At least 7 checks implemented with clear thresholds.
  • One page of caveats and owner contacts.
  • Evidence of a simulated incident and a short RCA note.

Practical projects

  • Marketing Pipeline Guardrails: build completeness and freshness checks for ad spend, clicks, and attributed conversions. Include a weekly reconciliation vs ad platform totals.
  • Finance Revenue Assurance: validate gross vs net revenue across refund logic; document currency conversion assumptions and test edge cases.
  • Product Events Quality Board: track event schema changes, required property coverage, and session stitching health; create an alert for sudden property null spikes.

Subskills

Master these focused subskills to solidify your data quality practice:

  • Metric Reconciliation With Source Systems — Confirm warehouse metrics match the system of record; localize gaps by date/channel.
  • Validating Event Completeness And Freshness — Ensure data arrives on time and fully; baseline expected volumes.
  • Checking Duplicates And Outliers — Keep unique records clean and flag abnormal values safely.
  • Monitoring Null Rates And Broken Joins — Surface missing keys and orphaned rows early.
  • Data Consistency Across Reports — Align definitions and verify equal numbers across dashboards.
  • Automated Checks And Alerts Basics — Schedule queries and trigger targeted, low-noise alerts.
  • Root Cause Analysis For Data Issues — Reproduce, isolate, fix, and prevent incidents methodically.
  • Documenting Known Data Limitations — Communicate caveats, scope, and workarounds to stakeholders.

Next steps

  • Practice: implement two checks in your environment and review results after a full day of data.
  • Take the skill exam below. Everyone can take it for free; if you log in, your progress is saved.
  • Pick one practical project and deliver a concise readme summarizing your checks and thresholds.

Data Quality Checks — Skill Exam

This exam tests your ability to design, run, and interpret data quality checks for BI work. Choose the best answer(s) per question. Some items are multi-select. Everyone can take this exam for free; if you are logged in, your progress and score will be saved. Target time: 20–30 minutes. Passing score: 70%.

12 questions | 70% to pass
