
Data Quality Checks For Tracking

Learn Data Quality Checks For Tracking for free, with explanations, exercises, and a quick test (for Marketing Analysts).

Published: December 22, 2025 | Updated: December 22, 2025

Why this matters

As a Marketing Analyst, you rely on tracking data (page views, signups, purchases, campaign clicks) to measure performance and make decisions. If tracking is broken, late, duplicated, or missing fields, your metrics and experiments become unreliable. SQL data quality checks help you catch problems fast, protect KPIs, and maintain trust in your dashboards.

  • Real task: Validate yesterday's event volumes before sending a campaign performance report.
  • Real task: Confirm purchases aren't double-counted due to duplicate events.
  • Real task: Ensure required parameters (like order_id, campaign_id) are present and valid.
  • Real task: Verify that tracking timestamps make sense (no future dates, consistent timezone).

Concept explained simply

Data quality checks are quick SQL queries that test your event data for completeness, correctness, and coherence before you use it.

  • Completeness: Do we have all the events and required fields?
  • Correctness: Are values valid, deduplicated, and timestamps sane?
  • Coherence: Do related tables agree (events vs. orders, events vs. campaigns)?

Mental model: The 3C checklist

Think of a small gate your data passes through each morning:

  • Completeness: counts vs. typical trend, % missing required fields.
  • Correctness: duplicate rate, invalid values, time sanity.
  • Coherence: joins to related tables produce minimal orphans and consistent totals.

Core checks you should run

  • Volume and trend: Daily counts by event_name and by key segment (source/medium) to detect spikes/drops.
  • Required fields: % of events missing user/session/order IDs, revenue, or campaign parameters.
  • Duplicates: Same event_id or same user_id + order_id + close timestamps appearing more than once.
  • Timestamp sanity: No future timestamps, and dates within expected ranges.
  • Referential integrity: campaign_id in events exists in campaigns; order_id in events exists in orders.
  • Value validity: source/medium within allowed sets; revenue non-negative; currency in allowed list.
  • Funnel coherence: login → add_to_cart → purchase ratios stay near typical ranges (see the sketch after this list).
  • Identity stitching sanity: the fraction of post-login events that carry a user_id, where one is expected.
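
A minimal sketch of the funnel coherence check (the event names are illustrative; substitute your own funnel steps):
-- Funnel coherence: daily step counts and a key conversion ratio
SELECT
  event_date,
  SUM(CASE WHEN event_name = 'login' THEN 1 ELSE 0 END) AS logins,
  SUM(CASE WHEN event_name = 'add_to_cart' THEN 1 ELSE 0 END) AS add_to_carts,
  SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END) AS purchases,
  1.0 * SUM(CASE WHEN event_name = 'purchase' THEN 1 ELSE 0 END)
      / NULLIF(SUM(CASE WHEN event_name = 'add_to_cart' THEN 1 ELSE 0 END), 0) AS cart_to_purchase_rate
FROM events
WHERE event_name IN ('login', 'add_to_cart', 'purchase')
GROUP BY event_date
ORDER BY event_date;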

Worked examples (copy-paste and adapt)

Assume tables: events(event_date, event_time, event_name, event_id, user_id, anonymous_id, session_id, order_id, campaign_id, source, medium, value), orders(order_id, user_id, order_time, order_total), campaigns(campaign_id, status, start_date, end_date).

Example 1: Daily volumes and missing key fields
-- Check purchase volume, missing identity/session
SELECT
  event_date,
  COUNT(*) AS total_events,
  SUM(CASE WHEN COALESCE(user_id, anonymous_id) IS NULL THEN 1 ELSE 0 END) AS no_identity,
  SUM(CASE WHEN session_id IS NULL THEN 1 ELSE 0 END) AS no_session
FROM events
WHERE event_name = 'purchase'
GROUP BY event_date
ORDER BY event_date;

Scan for sudden drops/spikes and rising missing-field counts.
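
To make the trend scan concrete, here is a minimal sketch comparing each day's purchase volume to the trailing 7-day average (window syntax follows PostgreSQL; the 30% thresholds are illustrative):
-- Flag days that deviate sharply from the trailing 7-day average
WITH daily AS (
  SELECT event_date, COUNT(*) AS cnt
  FROM events
  WHERE event_name = 'purchase'
  GROUP BY event_date
)
SELECT
  event_date,
  cnt,
  AVG(cnt) OVER w AS avg_prev_7d,
  CASE
    WHEN cnt < 0.7 * AVG(cnt) OVER w THEN 'drop'
    WHEN cnt > 1.3 * AVG(cnt) OVER w THEN 'spike'
    ELSE 'ok'
  END AS flag
FROM daily
WINDOW w AS (ORDER BY event_date ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING)
ORDER BY event_date;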

Example 2: Duplicate detection (exact and near-duplicate)
-- Exact duplicates by event_id
SELECT event_id, COUNT(*) AS cnt
FROM events
WHERE event_name = 'purchase'
GROUP BY event_id
HAVING COUNT(*) > 1
ORDER BY cnt DESC;

-- Near-duplicates by business key (user + order) within 10 minutes
WITH d AS (
  SELECT
    user_id,
    order_id,
    event_time,
    LAG(event_time) OVER (PARTITION BY user_id, order_id ORDER BY event_time) AS prev_time
  FROM events
  WHERE event_name = 'purchase'
)
SELECT *
FROM d
WHERE prev_time IS NOT NULL
  AND event_time - prev_time <= INTERVAL '10 minutes';

Use business keys when event_id is unreliable or missing.
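
If you need to go beyond detection and actually deduplicate, a common pattern is keep-the-earliest-event-per-business-key; a sketch:
-- Keep only the earliest purchase per user + order; later rows are duplicates
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, order_id
      ORDER BY event_time
    ) AS rn
  FROM events
  WHERE event_name = 'purchase'
)
SELECT *
FROM ranked
WHERE rn = 1;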

Example 3: Required parameters and value validity
-- Required fields for purchase
SELECT
  COUNT(*) AS total,
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS missing_order_id,
  SUM(CASE WHEN value IS NULL OR value < 0 THEN 1 ELSE 0 END) AS invalid_value,
  SUM(CASE WHEN source IS NULL OR medium IS NULL THEN 1 ELSE 0 END) AS missing_attribution
FROM events
WHERE event_name = 'purchase';

-- Allowed set for medium
SELECT medium, COUNT(*) AS cnt
FROM events
WHERE event_name IN ('session_start','page_view','purchase')
GROUP BY medium
ORDER BY cnt DESC;
-- Inspect unexpected mediums and update validation rules if needed.
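
Rather than hard-coding allowed values in each query, you can keep them in a small lookup table and join against it. A minimal sketch, assuming a hypothetical one-column table allowed_mediums that you maintain separately:
-- Mediums present in events but absent from the allowed set
-- (allowed_mediums is an assumed lookup table: one row per valid medium)
SELECT e.medium, COUNT(*) AS cnt
FROM events e
LEFT JOIN allowed_mediums a
  ON a.medium = e.medium
WHERE e.medium IS NOT NULL
  AND a.medium IS NULL
GROUP BY e.medium
ORDER BY cnt DESC;
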
Example 4: Timestamp sanity
-- Future events (allowing 5-minute clock skew)
SELECT COUNT(*) AS future_events
FROM events
WHERE event_time > CURRENT_TIMESTAMP + INTERVAL '5 minutes';

-- Old events suddenly arriving (backfill anomaly)
-- Reliable late-arrival detection needs a load timestamp; loaded_at below
-- is an assumed pipeline column (substitute your warehouse's equivalent)
SELECT event_date, COUNT(*) AS late_arrivals
FROM events
WHERE loaded_at >= CURRENT_DATE - INTERVAL '1 day'
  AND event_time < CURRENT_DATE - INTERVAL '7 days'
GROUP BY event_date
ORDER BY event_date DESC;
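
The same loaded_at column (again, an assumption about your pipeline) lets you measure ingestion lag directly, which helps you distinguish slow loading from genuinely missing data:
-- Ingestion lag per day: how long after the event did rows arrive?
SELECT
  event_date,
  AVG(loaded_at - event_time) AS avg_lag,
  MAX(loaded_at - event_time) AS max_lag
FROM events
GROUP BY event_date
ORDER BY event_date DESC;
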
Example 5: Coherence with related tables
-- Orders present in events
SELECT
  COUNT(DISTINCT o.order_id) AS orders_table,
  COUNT(DISTINCT e.order_id) AS orders_in_events,
  COUNT(DISTINCT o.order_id) - COUNT(DISTINCT e.order_id) AS missing_in_events
FROM orders o
LEFT JOIN events e
  ON e.order_id = o.order_id
  AND e.event_name = 'purchase';

-- Orphan campaign IDs in events
SELECT e.campaign_id, COUNT(*) AS cnt
FROM events e
LEFT JOIN campaigns c
  ON c.campaign_id = e.campaign_id
WHERE e.campaign_id IS NOT NULL
  AND c.campaign_id IS NULL
GROUP BY e.campaign_id
ORDER BY cnt DESC;
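
Counts can match while totals drift, so it is worth reconciling revenue as well; a sketch comparing tracked purchase value to order totals by day (note that duplicate purchase events inflate tracked_revenue, which is itself a useful signal):
-- Revenue coherence: tracked purchase value vs. order totals
WITH tracked AS (
  SELECT order_id, SUM(value) AS tracked_revenue
  FROM events
  WHERE event_name = 'purchase'
  GROUP BY order_id
)
SELECT
  CAST(o.order_time AS DATE) AS order_date,
  SUM(o.order_total) AS orders_revenue,
  COALESCE(SUM(t.tracked_revenue), 0) AS tracked_revenue,
  SUM(o.order_total) - COALESCE(SUM(t.tracked_revenue), 0) AS gap
FROM orders o
LEFT JOIN tracked t
  ON t.order_id = o.order_id
GROUP BY CAST(o.order_time AS DATE)
ORDER BY order_date DESC;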

A simple daily routine (10–15 minutes)

  1. Scan volume trends: counts by event_name vs. 7-day median.
  2. Check required fields: missing user/session/order IDs.
  3. Detect duplicates: by event_id and by business keys.
  4. Validate timestamps: no future events, reasonable lag/backfill.
  5. Cross-check coherence: events vs orders; campaign orphans.
  6. Log findings and notify owners if thresholds are breached (a combined summary sketch follows this list).
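
One way to make step 6 actionable is to collapse several checks into a single summary you can log each day; a minimal sketch over yesterday's data (pick the checks that matter most to you):
-- Daily gate summary: one row per check
SELECT 'no_identity' AS check_name, COUNT(*) AS failures
FROM events
WHERE event_name = 'purchase'
  AND event_date = CURRENT_DATE - INTERVAL '1 day'
  AND COALESCE(user_id, anonymous_id) IS NULL
UNION ALL
SELECT 'orphan_campaign_id', COUNT(*)
FROM events e
LEFT JOIN campaigns c
  ON c.campaign_id = e.campaign_id
WHERE e.event_date = CURRENT_DATE - INTERVAL '1 day'
  AND e.campaign_id IS NOT NULL
  AND c.campaign_id IS NULL
UNION ALL
SELECT 'negative_value', COUNT(*)
FROM events
WHERE event_name = 'purchase'
  AND event_date = CURRENT_DATE - INTERVAL '1 day'
  AND value < 0;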

Exercises

These mirror the tasks in the Practice Exercises section below.

Exercise 1: Purchase data gate

Write one SQL query (or a short set of queries) to check yesterday's purchase tracking for:

  • Duplicate purchases by event_id.
  • Missing order_id.
  • Future timestamps (allow 5-minute skew).

Expected output: a small result set with counts for each issue and/or a list of offending IDs.

Hints
  • Filter WHERE event_name='purchase' AND event_date = CURRENT_DATE - INTERVAL '1 day'.
  • Use GROUP BY event_id HAVING COUNT(*) > 1 for duplicates.
  • Use SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) for missing fields.
  • Use event_time > CURRENT_TIMESTAMP + INTERVAL '5 minutes' for future time.

Self-checklist

  • ☑ I can produce daily event volumes and spot anomalies.
  • ☑ I can quantify missing required fields and explain impact on KPIs.
  • ☑ I can detect duplicates by event_id and by business keys.
  • ☑ I can flag future/old timestamps and understand ingestion lag.
  • ☑ I can reconcile events to orders and find orphans.

Common mistakes and how to self-check

  • Mistake: Only checking total counts. Fix: Segment by event_name and by source/medium.
  • Mistake: Trusting event_id only. Fix: Add business-key dedup rules.
  • Mistake: Ignoring timezones. Fix: Normalize to UTC in queries; define a small positive skew allowance.
  • Mistake: Comparing unmatched date fields (event_date vs order_time). Fix: Align on a common date grain.
  • Mistake: Hard-coding allowed values and forgetting to update. Fix: Store allowed sets in a small lookup table and join.

Practical projects

  • Build a data quality dashboard: daily charts for volumes, duplicate rates, missing-field rates, and orphan counts.
  • Create a reusable validation view: CASE-based flags per event row for missing/invalid fields.
  • Set up anomaly thresholds: alert when volume deviates > 3 standard deviations from the 14-day mean (see the sketch after this list).
  • Design a deduplication query: define primary/secondary keys and keep earliest event per key within 10 minutes.
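
For the anomaly-threshold project, a sketch of the 3-standard-deviation rule over a trailing 14-day window (window syntax follows PostgreSQL):
-- Flag days whose purchase volume deviates > 3 standard deviations
-- from the trailing 14-day mean
WITH daily AS (
  SELECT event_date, COUNT(*) AS cnt
  FROM events
  WHERE event_name = 'purchase'
  GROUP BY event_date
),
stats AS (
  SELECT
    event_date,
    cnt,
    AVG(cnt) OVER w AS mean_14d,
    STDDEV(cnt) OVER w AS stddev_14d
  FROM daily
  WINDOW w AS (ORDER BY event_date ROWS BETWEEN 14 PRECEDING AND 1 PRECEDING)
)
SELECT *
FROM stats
WHERE stddev_14d > 0
  AND ABS(cnt - mean_14d) > 3 * stddev_14d
ORDER BY event_date DESC;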

Who this is for

  • Marketing Analysts who report on campaigns, funnels, and revenue.
  • Anyone querying event data who needs trustworthy metrics.

Prerequisites

  • Basic SQL: SELECT, WHERE, GROUP BY, JOIN.
  • Comfort with dates/timestamps and CASE expressions.

Learning path

  1. Learn the 3C model: completeness, correctness, coherence.
  2. Run the five core checks on one critical event (purchase).
  3. Extend to session_start and signup events.
  4. Automate: turn checks into saved queries or views.
  5. Visualize trends and set thresholds.

Mini challenge

Your signup rate dropped 25% yesterday, but traffic is normal. In 10 minutes, use SQL checks to decide if it's a tracking issue or a real change. Which three checks do you run first, and what would each tell you?

Next steps

  • Generalize your checks into reusable snippets.
  • Add a small reference table of allowed values for source/medium/currency and validate against it.
  • Automate your daily routine and review alerts each morning.


Practice Exercises

1 exercise to complete

Instructions

Write SQL to validate yesterday's purchase tracking in the events table. Include three checks:

  • List duplicate purchases by event_id.
  • Return a single row with counts of records missing order_id and with invalid value (NULL or negative).
  • Return a count of events with event_time in the future, allowing 5-minute clock skew.

Assume columns: event_date, event_time, event_name, event_id, order_id, value.

Expected Output
  1. A short list of event_id with cnt > 1.
  2. One row of counts for missing/invalid fields.
  3. One scalar count of future events.

Data Quality Checks For Tracking — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

