Why this matters
BI Analysts rely on timely, complete events to power dashboards, experiments, and decisions. If events are missing or late, KPIs swing, A/B tests become unreliable, and stakeholders lose trust. Validating completeness and freshness is how you catch issues before they cause bad decisions.
- Daily ops: Verify all required events arrived for yesterday and today.
- Experiment support: Ensure exposure, conversion, and attribution events are complete and current.
- Incident triage: Detect late-arriving data and backfills; decide when dashboards are “green.”
Concept explained simply
Completeness answers: “Did everything we expected arrive?” Freshness answers: “How recent is the latest data?”
- Completeness = volume and coverage compared to a reasonable baseline (e.g., same weekday median of last 4 weeks).
- Freshness = delay between “now” and the most recent event timestamp (or 95th percentile lag between event time and ingestion time).
Mental model
Think of your data like a daily delivery truck:
- Completeness: Did all the packages arrive (counts per source/platform/entity)?
- Freshness: Did the truck arrive on time (latest event timestamp close to now)?
When freshness is off, completeness often looks low early in the day and recovers after a late delivery. When completeness is truly low, counts stay depressed even after freshness returns to normal.
Core metrics and formulas
- Completeness ratio (by dimension): current_volume / baseline_volume.
- Baseline examples:
- Median of same weekday over last 4–6 weeks.
- 7-day moving average for rolling checks.
- Freshness (simple): now() - max(event_ts).
- Freshness (robust): percentile_95(ingestion_ts - event_ts) by source/day.
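For example, if today's purchase count is 8,400 and the same-weekday median over the last 4 weeks is 10,000, the completeness ratio is 8,400 / 10,000 = 0.84; if the latest event arrived at 09:10 and it is now 09:55, simple freshness is 45 minutes.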
SQL templates
The templates below use BigQuery Standard SQL (SAFE_DIVIDE, TIMESTAMP_DIFF, PERCENTILE_CONT as a window function); adapt the function names to your warehouse's dialect.
-- 1) Completeness ratio by day and platform
WITH baseline AS (
  SELECT platform,
         DATE(event_ts) AS d,
         COUNT(*) AS cnt
  FROM events
  WHERE event_name = 'purchase'
  GROUP BY 1, 2
), ref AS (
  SELECT platform,
         d,
         PERCENTILE_CONT(cnt, 0.5) OVER (PARTITION BY platform, EXTRACT(DAYOFWEEK FROM d)) AS median_same_dow
  FROM baseline
)
SELECT b.platform,
       b.d,
       b.cnt AS current_cnt,
       r.median_same_dow AS baseline_cnt,
       SAFE_DIVIDE(b.cnt, r.median_same_dow) AS completeness_ratio
FROM baseline b
JOIN ref r ON r.platform = b.platform AND r.d = b.d;
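As written, the baseline CTE scans all available history and lets the day being evaluated contribute to its own median. A minimal refinement, assuming you want the trailing 4–6 week window described above, is to bound the baseline CTE, for example:
-- Optional: limit the baseline CTE to the trailing 6 weeks
WHERE event_name = 'purchase'
  AND DATE(event_ts) >= DATE_SUB(CURRENT_DATE, INTERVAL 42 DAY)
A stricter version would also exclude the evaluated day from its own baseline, e.g., by computing the median in a separate CTE that stops at yesterday.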
-- 2) Freshness: time since latest event
SELECT platform,
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_ts), MINUTE) AS freshness_minutes
FROM events
WHERE DATE(event_ts) = CURRENT_DATE
GROUP BY platform;
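Note that the template above only considers events whose event_ts falls on today, so a platform that has sent nothing at all today disappears from the result, which is exactly the case you most want to catch. A hedged variant, assuming a 60-minute SLA and a 48-hour lookback (both illustrative), keeps silent platforms visible:
-- Variant: 48-hour lookback so a silent platform still appears, plus an assumed 60-minute SLA flag
SELECT platform,
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_ts), MINUTE) AS freshness_minutes,
       TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(event_ts), MINUTE) <= 60 AS within_sla
FROM events
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 48 HOUR)
GROUP BY platform;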
-- 3) Freshness: event-to-ingestion lag percentiles
SELECT DISTINCT
       platform,
       DATE(event_ts) AS d,
       PERCENTILE_CONT(TIMESTAMP_DIFF(ingestion_ts, event_ts, MINUTE), 0.5)
         OVER (PARTITION BY platform, DATE(event_ts)) AS p50_lag_min,
       PERCENTILE_CONT(TIMESTAMP_DIFF(ingestion_ts, event_ts, MINUTE), 0.95)
         OVER (PARTITION BY platform, DATE(event_ts)) AS p95_lag_min
FROM events;
Worked examples
Example 1: Web events completeness dips, freshness delayed
At 09:00, sessions are 40% below baseline. Freshness shows 120 minutes. Interpretation: pipeline delay. Action: communicate temporary delay, hold dashboards until freshness <= SLA (e.g., 30–60 min). Expect counts to recover.
Example 2: Purchase events missing only on iOS
Android ratio ~1.0, iOS ratio 0.6 with normal freshness. Interpretation: iOS SDK misconfigured after release. Action: file incident to mobile team, backfill if available, annotate dashboards.
Example 3: Backfill spike
Completeness ratio > 1.5 for yesterday; p95 lag 300 minutes. Interpretation: late-arriving backfill. Action: mark as backfill window; remind consumers that day-level metrics stabilized only after backfill completion.
Step-by-step validation workflow
- Check global freshness: time since latest event by key sources/platforms.
- Check completeness ratios overall and by critical dimensions (platform, country, event_name).
- If the freshness lag is high, wait or page the ingestion owners, depending on your SLA. If freshness is normal but completeness is low, isolate the segment driving the drop (see the triage sketch after this list).
- Inspect recent deploys, ETL changes, or tracking changes.
- Document incident, expected impact, and ETA. Annotate dashboards if needed.
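To make the decision logic concrete, here is a minimal triage sketch. It assumes the outputs of templates 1 and 2 have been materialized as completeness_daily(platform, d, completeness_ratio) and freshness_now(platform, freshness_minutes), and it uses illustrative thresholds (60 minutes for freshness, 0.9 for low completeness, 1.5 for backfill spikes); adapt all names and numbers to your setup.
-- Hypothetical triage query over saved template outputs (names and thresholds are illustrative)
SELECT c.platform,
       c.completeness_ratio,
       f.freshness_minutes,
       CASE
         WHEN f.freshness_minutes > 60 THEN 'stale - wait or page ingestion owners'
         WHEN c.completeness_ratio < 0.9 THEN 'likely real loss - isolate the segment'
         WHEN c.completeness_ratio > 1.5 THEN 'possible backfill or duplication - investigate'
         ELSE 'ok'
       END AS triage_status
FROM completeness_daily c
JOIN freshness_now f ON f.platform = c.platform
WHERE c.d = CURRENT_DATE;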
Quick interactive checks
- [ ] I calculated freshness (minutes since latest event) for today.
- [ ] I computed completeness ratios versus a relevant baseline.
- [ ] I sliced by platform/source to identify localized issues.
- [ ] I differentiated real loss vs. late-arrival by comparing freshness vs. completeness.
- [ ] I noted the SLA and whether we are within tolerance.
Exercises
Exercise 1 — Compute completeness ratio by day
Table: daily_event_counts(date, event_name, platform, count, baseline_7d_median). Calculate completeness_ratio and a low_completeness flag that is true when the ratio is below 0.9.
SELECT date,
       event_name,
       platform,
       count,
       baseline_7d_median,
       SAFE_DIVIDE(count, baseline_7d_median) AS completeness_ratio,
       CASE WHEN SAFE_DIVIDE(count, baseline_7d_median) < 0.9 THEN true ELSE false END AS low_completeness
FROM daily_event_counts;
Exercise 2 — Freshness SLA with p95 lag
Table: events(event_ts, ingestion_ts, source). Compute p95 lag in minutes by source for today and decide if each source meets an SLA of 60 minutes.
SELECT source,
       p95_lag_min,
       CASE WHEN p95_lag_min <= 60 THEN 'OK' ELSE 'SLOW' END AS sla_status
FROM (
  SELECT DISTINCT
         source,
         PERCENTILE_CONT(TIMESTAMP_DIFF(ingestion_ts, event_ts, MINUTE), 0.95)
           OVER (PARTITION BY source) AS p95_lag_min
  FROM events
  WHERE DATE(event_ts) = CURRENT_DATE
) AS lags;
Common mistakes and self-check
- Using last week’s raw count as a baseline without accounting for weekday effects. Self-check: compare against the same-weekday median.
- Judging completeness before midday with batch pipelines. Self-check: confirm freshness within SLA first.
- Not segmenting by platform/source. Self-check: slice ratios; look for localized drops.
- Timezone mix-ups (UTC vs. local). Self-check: align event_ts and business day definition.
- Counting ingestion_ts as event time. Self-check: always compute lag from event_ts to ingestion_ts.
- Ignoring deduplication. Self-check: verify unique event_id where applicable.
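For the last self-check, a quick way to quantify duplication, assuming your events carry an event_id column (adjust the name if yours differs):
-- Duplicate share per day, assuming an event_id column exists
SELECT DATE(event_ts) AS d,
       COUNT(*) AS total_rows,
       COUNT(DISTINCT event_id) AS unique_events,
       1 - SAFE_DIVIDE(COUNT(DISTINCT event_id), COUNT(*)) AS duplicate_share
FROM events
GROUP BY 1;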
Practical projects
- Build a daily Completeness & Freshness report with thresholds and red/amber/green flags by platform and event_name.
- Create a baseline generator that computes same-weekday medians for the last 6 weeks per dimension (a possible shape is sketched after this list).
- Set up a dashboard tile that shows p95 lag trend and the most delayed sources for the past 7 days.
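For the baseline generator project, one possible shape, assuming it materializes a table keyed by platform, event_name, and day of week over the trailing 6 weeks (the table name analytics.event_baselines and the window are illustrative):
-- Hypothetical baseline table: same-weekday median of daily counts, trailing 6 weeks
CREATE OR REPLACE TABLE analytics.event_baselines AS
SELECT DISTINCT
       platform,
       event_name,
       EXTRACT(DAYOFWEEK FROM d) AS dow,
       PERCENTILE_CONT(cnt, 0.5) OVER (PARTITION BY platform, event_name, EXTRACT(DAYOFWEEK FROM d)) AS median_cnt
FROM (
  SELECT platform,
         event_name,
         DATE(event_ts) AS d,
         COUNT(*) AS cnt
  FROM events
  WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE, INTERVAL 42 DAY)
    AND DATE(event_ts) < CURRENT_DATE
  GROUP BY 1, 2, 3
) AS daily_counts;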
Who this is for, prerequisites, learning path
Who this is for
- BI Analysts and Analytics Engineers responsible for dashboards and data reliability.
- Data-savvy PMs validating experiment readiness.
Prerequisites
- Comfort with SQL (GROUP BY, window functions, percentiles).
- Basic understanding of ETL/ELT and event schemas.
Learning path
- Start with data timeliness concepts (this lesson).
- Learn anomaly detection basics (moving averages, percentiles).
- Implement monitoring queries and add thresholds.
- Automate alerts from your chosen BI tool or scheduler.
Next steps
- Finish the exercises and take the quick test below.
- Apply to one dataset you maintain; document SLA and baseline choices.
- Schedule your checks daily. Iterate thresholds after a week of observations.
Mini challenge
It’s 10:30. Global completeness for “signup” is 0.65 vs baseline. Freshness p95 lag is 12 minutes (SLA 45). Android completeness is 1.02; iOS is 0.48; Web is 0.99. What do you conclude and what is your first action?
Suggested approach
Conclusion: iOS-specific tracking issue (freshness normal, localized completeness drop). First action: open incident with iOS team, include example payloads and version, start scoping when the drop began; check for backfill capability.