Why Data Quality Checks matter for BI Analysts
- Catch issues early: identify missing events, stale data, broken joins, and unexpected nulls.
- Explain variance: reconcile metrics with source systems to pinpoint where differences arise.
- Safeguard consistency: keep the same metric equal across dashboards and teams.
- Scale responsibly: automate checks and alerts as data volume and scope grow.
Who this is for
- Aspiring and current BI Analysts who build or maintain dashboards and reports.
- Data-savvy PMs or Operations Analysts who own KPIs and need reliable numbers.
- Analytics Engineers who partner with BI to harden metric pipelines.
Prerequisites
- SQL basics: SELECT, WHERE, GROUP BY, JOIN, and window functions (e.g., ROW_NUMBER, COUNT(*) OVER).
- Familiarity with your warehouse and BI tool (e.g., dashboards, data refresh settings).
- Basic statistics: median vs mean, percentiles, outliers.
Learning path (practical roadmap)
Milestone 1 — Map critical metrics and sources
List your top 5 business metrics (e.g., Orders, Revenue, Active Users). For each metric, note the source system, ingestion path, warehouse tables, and dashboard owners.
- Deliverable: a one-page metric-to-source map.
- Mini task: identify the primary keys and time columns per metric.
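For the mini task, one quick way to surface candidate keys and time columns is to scan the information schema. A minimal sketch, assuming a Postgres-style warehouse and the illustrative table names fact_orders and fact_events; adjust the filters to your naming conventions.
-- Sketch: list likely key and timestamp columns per fact table
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_name IN ('fact_orders', 'fact_events')
  AND (column_name LIKE '%_id'
       OR data_type IN ('date', 'timestamp without time zone', 'timestamp with time zone'))
ORDER BY table_name, column_name;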
Milestone 2 — Profile and baseline
Establish normal behavior so you can detect anomalies later.
- Compute daily row counts, null rates for important columns, and unique-key coverage.
- Deliverable: a “baseline sheet” with typical ranges (e.g., weekday vs weekend).
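A baseline sheet can be fed by one profiling query per fact table. The sketch below is one possible shape, assuming fact_orders with order_date, order_id, and customer_id columns; swap in your own table and column names.
-- Sketch: daily row counts, null rate, and unique-key coverage for the baseline sheet
SELECT order_date::date AS d,
       COUNT(*) AS row_count,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS customer_null_pct,
       COUNT(DISTINCT order_id) AS distinct_orders,
       100.0 * COUNT(DISTINCT order_id) / COUNT(*) AS key_uniqueness_pct
FROM fact_orders
GROUP BY 1
ORDER BY 1;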
Milestone 3 — Freshness and completeness
Ensure data arrives on time and fully.
- Freshness: compare max event time with current time or schedule time.
- Completeness: compare counts vs recent average or upstream logs.
Milestone 4 — Duplicates and outliers
Find and handle duplicates, and detect outliers without over-alerting.
- Use window functions for duplicates by business key + time.
- Use percentiles/IQR for robust outlier checks.
Milestone 5 — Join integrity and null monitoring
Identify broken joins and missing dimension coverage.
- Monitor null rates for foreign keys and important attributes.
- Check join cardinalities and orphaned records.
Milestone 6 — Reconcile across systems and reports
Compare warehouse facts to source-of-truth, and keep dashboards aligned.
- Drill down by date, product, and channel to localize differences.
- Escalate with clear evidence and reproducible queries.
Milestone 7 — Automate checks and alerts
Start with the essentials; tune thresholds to reduce noise.
- Create scheduled queries that write check results to a checks table (see the sketch after this list).
- Set conditional alerts: freshness delay, null spikes, count deltas.
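A minimal sketch of the checks-table pattern, assuming a Postgres-style warehouse; the dq_checks table, its columns, and the 2-hour threshold are illustrative, so adapt the layout to your scheduler and tooling.
-- Sketch: a simple checks table plus one scheduled freshness check writing into it
CREATE TABLE IF NOT EXISTS dq_checks (
  check_date date,
  check_name text,
  value numeric,
  threshold numeric,
  status text,
  note text
);
INSERT INTO dq_checks
SELECT CURRENT_DATE,
       'fact_events_freshness_hours',
       EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP - MAX(event_time))) / 3600,
       2,  -- placeholder: acceptable delay in hours
       CASE WHEN CURRENT_TIMESTAMP - MAX(event_time) <= INTERVAL '2 hours' THEN 'pass' ELSE 'fail' END,
       'latest event_time vs now'
FROM fact_events;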
Milestone 8 — Root cause analysis and documentation
Build a repeatable incident flow and document known data limitations.
- RCA checklist: scope, reproduce, isolate layer (source, ETL, model, viz), fix, prevent.
- Write a “data caveats” section per metric.
Worked examples
Example 1 — Metric reconciliation with source
Goal: Compare daily revenue between source and warehouse to localize differences.
-- Warehouse metric (fact_orders)
WITH wh AS (
SELECT order_date::date AS d, SUM(total_amount) AS wh_rev
FROM fact_orders
GROUP BY 1
),
-- Source metric (raw.sales_orders)
src AS (
SELECT created_at::date AS d, SUM(amount) AS src_rev
FROM raw.sales_orders
GROUP BY 1
)
SELECT COALESCE(wh.d, src.d) AS date,
wh.wh_rev,
src.src_rev,
(wh.wh_rev - src.src_rev) AS diff
FROM wh
FULL OUTER JOIN src USING(d)
ORDER BY date;
How to use
If diff spikes on a specific date, drill by channel, product, or currency to isolate the scope. Check for late-arriving updates or currency conversion differences.
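To drill down, one option is to repeat the reconciliation with an extra grouping column. A sketch, assuming both tables carry a channel column; swap in product or currency as needed.
-- Sketch: localize the revenue gap by date and channel
WITH wh AS (
  SELECT order_date::date AS d, channel, SUM(total_amount) AS wh_rev
  FROM fact_orders
  GROUP BY 1, 2
),
src AS (
  SELECT created_at::date AS d, channel, SUM(amount) AS src_rev
  FROM raw.sales_orders
  GROUP BY 1, 2
)
SELECT COALESCE(wh.d, src.d) AS date,
       COALESCE(wh.channel, src.channel) AS channel,
       wh.wh_rev,
       src.src_rev,
       (wh.wh_rev - src.src_rev) AS diff
FROM wh
FULL OUTER JOIN src
  ON wh.d = src.d AND wh.channel = src.channel
ORDER BY date, channel;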
Example 2 — Event completeness and freshness
Goal: Ensure events are up-to-date and daily counts look normal.
-- Freshness: latest event timestamp
SELECT MAX(event_time) AS latest_event_time
FROM fact_events;
-- Completeness: compare today vs the average of the same weekday over prior weeks
WITH daily AS (
  SELECT event_time::date AS d, COUNT(*) AS cnt
  FROM fact_events
  GROUP BY 1
),
base AS (
  SELECT d,
         cnt,
         AVG(cnt) OVER (
           PARTITION BY EXTRACT(DOW FROM d)
           ORDER BY d ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
         ) AS dow_avg
  FROM daily
)
SELECT * FROM base WHERE d = CURRENT_DATE;
How to use
Alert if latest_event_time is beyond your acceptable delay (e.g., > 2 hours). Alert if today’s cnt is < 70% of dow_avg after the expected load time.
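If you want a scheduler to evaluate these conditions directly, they can be expressed as boolean flags in one query. A sketch that reuses the completeness CTEs above, with the 2-hour delay and 70% threshold as placeholders to tune.
-- Sketch: freshness and completeness alert conditions as pass/fail flags
WITH daily AS (
  SELECT event_time::date AS d, COUNT(*) AS cnt
  FROM fact_events
  GROUP BY 1
),
base AS (
  SELECT d, cnt,
         AVG(cnt) OVER (
           PARTITION BY EXTRACT(DOW FROM d)
           ORDER BY d ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
         ) AS dow_avg
  FROM daily
)
SELECT (SELECT MAX(event_time) >= CURRENT_TIMESTAMP - INTERVAL '2 hours' FROM fact_events) AS freshness_ok,
       (SELECT cnt >= 0.7 * dow_avg FROM base WHERE d = CURRENT_DATE) AS completeness_ok;
       -- completeness_ok stays NULL until today's partition has rows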
Example 3 — Detecting duplicates
Goal: Identify duplicate events by (event_id) or by a composite key.
-- Find duplicate event_ids
WITH marked AS (
SELECT event_id,
ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) AS rn
FROM fact_events
)
SELECT * FROM marked WHERE rn > 1;
-- For composite key (user_id, session_id, event_name, event_time::date)
WITH marked AS (
SELECT user_id, session_id, event_name, event_time::date AS d,
ROW_NUMBER() OVER (
PARTITION BY user_id, session_id, event_name, event_time::date
ORDER BY event_time
) AS rn
FROM fact_events
)
SELECT * FROM marked WHERE rn > 1;
How to use
Decide the business key that should be unique. If duplicates arise from replays, deduplicate at the model stage using rn = 1, and coordinate with ingestion teams.
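A minimal sketch of the model-stage dedupe, assuming event_id is the key that should be unique and the earliest event_time should win; the view name fact_events_dedup is illustrative.
-- Sketch: deduped view keeping one row per event_id
CREATE OR REPLACE VIEW fact_events_dedup AS
SELECT event_id, user_id, session_id, event_name, event_time
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY event_time) AS rn
  FROM fact_events
) ranked
WHERE rn = 1;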
Example 4 — Null rates and broken joins
Goal: Monitor missing dimension coverage and join issues.
-- Foreign key null rate over time
SELECT order_date::date AS d,
       COUNT(*) AS row_count,
       SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_count,
       100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate_pct
FROM fact_orders
GROUP BY 1
ORDER BY 1;
-- Broken join check: missing dimension keys
WITH lefted AS (
SELECT f.order_id, f.customer_id, d.customer_id AS dim_key
FROM fact_orders f
LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
)
SELECT COUNT(*) AS missing_customers
FROM lefted
WHERE dim_key IS NULL AND customer_id IS NOT NULL;
How to use
Alert on sudden spikes in null_rate_pct. Investigate: Did dimension keys change? Did the upstream stop populating customer_id?
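To make "sudden spike" checkable, compare each day's rate to its trailing average. A sketch, with the 2x multiplier as a placeholder threshold to tune.
-- Sketch: flag days whose FK null rate is far above the trailing 7-day average
WITH daily AS (
  SELECT order_date::date AS d,
         100.0 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_rate_pct
  FROM fact_orders
  GROUP BY 1
),
trended AS (
  SELECT d,
         null_rate_pct,
         AVG(null_rate_pct) OVER (ORDER BY d ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING) AS trailing_avg
  FROM daily
)
SELECT *
FROM trended
WHERE null_rate_pct > 2 * trailing_avg;  -- 2x is a placeholder multiplier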
Example 5 — Outlier detection with percentiles
Goal: Flag unusually large order amounts using robust thresholds.
-- Percentile-based cutoffs (requires percentile_cont or equivalent)
WITH stats AS (
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS p25,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS p75
FROM fact_orders
),
calc AS (
SELECT f.*, s.p25, s.p75,
(s.p75 - s.p25) AS iqr
FROM fact_orders f CROSS JOIN stats s
)
SELECT *
FROM calc
WHERE total_amount > (p75 + 1.5 * iqr)
OR total_amount < (p25 - 1.5 * iqr);
Notes
If your warehouse lacks PERCENTILE_CONT, use its equivalent (e.g., APPROX_PERCENTILE) or approximate the cutoffs with NTILE buckets, as sketched below. Tune the 1.5 multiplier to your business tolerance.
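A bucket-based approximation works in most dialects; a sketch using NTILE(100), where the cutoffs land on bucket edges, which is usually close enough for alerting.
-- Sketch: approximate P25/P75 with NTILE buckets when exact percentile functions are unavailable
WITH buckets AS (
  SELECT total_amount,
         NTILE(100) OVER (ORDER BY total_amount) AS pct_bucket
  FROM fact_orders
)
SELECT MAX(CASE WHEN pct_bucket <= 25 THEN total_amount END) AS approx_p25,
       MAX(CASE WHEN pct_bucket <= 75 THEN total_amount END) AS approx_p75
FROM buckets;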
Example 6 — Cross-report consistency
Goal: Ensure a metric matches across two curated datasets feeding different dashboards.
WITH a AS (
SELECT order_date::date AS d, SUM(total_amount) AS rev_a
FROM mart_finance_revenue
GROUP BY 1
), b AS (
SELECT order_date::date AS d, SUM(total_amount) AS rev_b
FROM mart_ops_revenue
GROUP BY 1
)
SELECT COALESCE(a.d, b.d) AS date,
rev_a, rev_b,
(rev_a - rev_b) AS delta
FROM a FULL OUTER JOIN b USING(d)
ORDER BY date;
How to use
If delta appears only for certain channels, align business rules (refund handling, FX rates, late corrections) and document the agreed definition.
Drills and micro-exercises
- Write a query to compute daily null rates for two key columns in your main fact table.
- Find duplicate rows by your business key and produce a deduped view with ROW_NUMBER.
- Create a daily count trend for the last 30 days and annotate typical weekday/weekend ranges.
- Compare two versions of the same metric and produce a table of deltas by date and channel.
- Compute percentile cutoffs for a metric and list top 50 outliers.
- Build a small checks table (date, check_name, status, value, threshold, note) and insert results from two checks.
Common mistakes and debugging tips
Mistake: Alerting on raw variance without context
Tip: Compare vs rolling averages or same-weekday baselines. Include expected load windows before alerting.
Mistake: Ignoring surrogate key changes
Tip: Track business identifiers (e.g., external_id) and ensure consistent joins across slowly changing dimensions.
Mistake: Overreliance on mean + 3σ for skewed data
Tip: Use percentiles or IQR for heavy-tailed distributions to reduce false alarms.
Mistake: One-off fixes without documenting
Tip: Add a “Known Data Limitations” note with the scope, impact, and workaround. Keep it close to the metric definition.
Mistake: Not localizing discrepancies
Tip: Slice by date, product, and channel to pinpoint the segment causing gaps before escalating.
Mini project: Data Trust Pack for the Sales Funnel
- Define funnel metrics: sessions, adds-to-cart, checkouts, purchases.
- Create baseline queries: daily counts and null rates for key IDs (user_id, session_id, order_id).
- Freshness check: latest event_time and acceptable delay per table.
- Completeness: today’s counts vs rolling 7-day same-weekday baseline.
- Duplicates: find duplicates by (order_id) and dedupe view.
- Outliers: flag abnormally high order_amount using IQR.
- Join integrity: percent of orders missing customers in dim_customer.
- Reconciliation: compare purchase count with source system per day.
- Automation: write check outputs into a checks table and mark pass/fail.
- Documentation: a one-pager listing caveats and what each alert means.
Success criteria
- At least 7 checks implemented with clear thresholds.
- One page of caveats and owner contacts.
- Evidence of a simulated incident and a short RCA note.
Practical projects
- Marketing Pipeline Guardrails: build completeness and freshness checks for ad spend, clicks, and attributed conversions. Include a weekly reconciliation vs ad platform totals.
- Finance Revenue Assurance: validate gross vs net revenue across refund logic; document currency conversion assumptions and test edge cases.
- Product Events Quality Board: track event schema changes, required property coverage, and session stitching health; create an alert for sudden property null spikes.
Subskills
Master these focused subskills to solidify your data quality practice:
- Metric Reconciliation With Source Systems — Confirm warehouse metrics match the system of record; localize gaps by date/channel.
- Validating Event Completeness And Freshness — Ensure data arrives on time and fully; baseline expected volumes.
- Checking Duplicates And Outliers — Keep unique records clean and flag abnormal values safely.
- Monitoring Null Rates And Broken Joins — Surface missing keys and orphaned rows early.
- Data Consistency Across Reports — Align definitions and verify equal numbers across dashboards.
- Automated Checks And Alerts Basics — Schedule queries and trigger targeted, low-noise alerts.
- Root Cause Analysis For Data Issues — Reproduce, isolate, fix, and prevent incidents methodically.
- Documenting Known Data Limitations — Communicate caveats, scope, and workarounds to stakeholders.
Next steps
- Practice: implement two checks in your environment and review results after a full day of data.
- Take the skill exam below. Everyone can take it for free; if you log in, your progress is saved.
- Pick one practical project and deliver a concise readme summarizing your checks and thresholds.