Why this matters
Dashboards, forecasts, and product decisions depend on timely and complete data. Freshness checks catch stale pipelines before stakeholders do. Volume checks catch silent drops, duplicates, or runaway spikes that skew KPIs.
- Daily revenue dashboard not updating? A freshness check should alert you within minutes.
- Campaign events suddenly drop 40%? A volume check should flag the anomaly before reporting is impacted.
- Backfills and seasonality? Good checks prevent false alarms and unnecessary on-call noise.
Who this is for
- Analytics Engineers building reliable data models and dashboards.
- Data/BI Engineers maintaining ingestion and transformation jobs.
- Data Analysts who want confidence in metric recency and completeness.
Prerequisites
- Comfort with SQL (window functions and date/time operations).
- Basic understanding of your ELT schedule (when jobs run and where timestamps come from).
- Awareness of business SLAs (e.g., “orders model should be < 30 minutes old”).
Concept explained simply
Freshness check: measures how old your latest data is compared to now. If the newest timestamp is older than your SLA (e.g., 30 minutes), fail the check.
Volume check: measures how much data arrived over a window (e.g., last hour/day) and compares it to an expected baseline (e.g., historical average or median). If the change exceeds your tolerance (e.g., ±25%), fail the check.
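In SQL terms, each check reduces to a single expression. A minimal Postgres-style sketch, where my_schema.my_table, updated_at, and the baseline constant of 1000 are placeholders rather than real names:
-- Freshness: minutes since the newest row (placeholder table/column names)
SELECT (EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) - EXTRACT(EPOCH FROM MAX(updated_at))) / 60.0 AS lag_minutes
FROM my_schema.my_table;

-- Volume: today's row count vs an assumed baseline of 1000 rows
SELECT (COUNT(*) - 1000) / 1000.0 AS percent_diff
FROM my_schema.my_table
WHERE updated_at >= DATE_TRUNC('day', CURRENT_DATE);
The worked examples below replace the placeholder baseline with one computed from history.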
Mental model
- Freshness = how recently your dataset was updated (recency of max timestamp).
- Volume = how many rows you received (counts compared to history).
- Baseline = a stable reference (e.g., 7-day median; same weekday average for seasonality).
- Tolerance = acceptable deviation (e.g., ±20%); wider during known spikes/backfills.
Key decisions before you check
- Which timestamp? Prefer business event time when SLAs require business recency; use load/ingestion time when you only need pipeline timeliness.
- Windowing: compare last hour/day to historical windows (7–28 days). Consider seasonality (weekday/weekend).
- Grace periods: allow a small delay (e.g., 5–10 minutes) to avoid alerting before jobs finish.
- Alert policy: page only on repeated failures or large deviations; log warnings on mild deviations.
Implementation steps
- Select timestamp and partitions: pick the column that represents the check’s intent (event_time vs load_time).
- Define SLAs and tolerances: e.g., freshness <= 30 min; volume deviation <= 25%.
- Write SQL checks: compute lag minutes and row counts vs baseline.
- Schedule checks: run slightly after ingestion/model runs with a grace period.
- Record results: write pass/fail plus metrics (lag_minutes, observed_count, baseline_count, percent_diff); see the table sketch after this list.
- Alert and route: send to chat/on-call when thresholds are breached; suppress during planned backfills.
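To make the "record results" step concrete, here is one possible shape for the results table; quality_results is the name used in the practical projects below, but the column list is an assumption to adapt, not a standard.
-- Illustrative results table (column names/types are assumptions; adapt to your warehouse)
CREATE TABLE IF NOT EXISTS analytics.quality_results (
  checked_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  check_name      TEXT,
  status          TEXT,      -- 'pass' | 'warn' | 'fail'
  lag_minutes     NUMERIC,   -- NULL for volume checks
  observed_count  BIGINT,    -- NULL for freshness checks
  baseline_count  NUMERIC,
  percent_diff    NUMERIC
);

-- Each scheduled run appends one row per check, e.g. the freshness check from Example 1 below:
INSERT INTO analytics.quality_results (check_name, status, lag_minutes)
SELECT 'orders_freshness',
       CASE WHEN lag_minutes <= 20 THEN 'pass' ELSE 'fail' END,
       lag_minutes
FROM (
  SELECT (EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) - EXTRACT(EPOCH FROM MAX(updated_at))) / 60.0 AS lag_minutes
  FROM analytics.orders
) m;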
Worked examples
Example 1 — Freshness on orders model (SLA 20 minutes)
Goal: Alert if the newest order is more than 20 minutes old.
-- Postgres-style SQL (adapt functions for your warehouse)
WITH metrics AS (
SELECT
MAX(updated_at) AS max_ts,
(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) - EXTRACT(EPOCH FROM MAX(updated_at))) / 60.0 AS lag_minutes
FROM analytics.orders
)
SELECT
'orders_freshness' AS check_name,
lag_minutes,
CASE WHEN lag_minutes <= 20 THEN 'pass' ELSE 'fail' END AS status
FROM metrics;
Tip: If you use load_time, ensure it updates with each ingestion.
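The same pattern works for pipeline timeliness by swapping in a load/ingestion timestamp; in this sketch, raw.orders and its load_time column are assumed names for illustration.
-- Pipeline-timeliness variant (raw.orders and load_time are assumed names)
SELECT
  'orders_load_freshness' AS check_name,
  lag_minutes,
  CASE WHEN lag_minutes <= 20 THEN 'pass' ELSE 'fail' END AS status
FROM (
  SELECT (EXTRACT(EPOCH FROM CURRENT_TIMESTAMP) - EXTRACT(EPOCH FROM MAX(load_time))) / 60.0 AS lag_minutes
  FROM raw.orders
) m;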
Example 2 — Volume vs 7-day median (tolerance ±25%)
Goal: Compare today’s rows to a 7-day median.
WITH today AS (
SELECT DATE_TRUNC('day', event_time) AS d, COUNT(*) AS c
FROM raw.events
WHERE event_time >= DATE_TRUNC('day', CURRENT_DATE)
GROUP BY 1
), hist AS (
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c) AS median_c
FROM (
SELECT DATE_TRUNC('day', event_time) AS d, COUNT(*) AS c
FROM raw.events
WHERE event_time >= CURRENT_DATE - INTERVAL '7 days'
AND event_time < DATE_TRUNC('day', CURRENT_DATE)
GROUP BY 1
) x
)
SELECT
'events_volume_daily' AS check_name,
t.c AS observed_count,
h.median_c AS baseline_count,
CASE WHEN h.median_c = 0 THEN NULL ELSE (t.c - h.median_c) / h.median_c END AS percent_diff,
CASE WHEN h.median_c = 0 THEN 'warn'
WHEN ABS((t.c - h.median_c) / h.median_c) <= 0.25 THEN 'pass'
ELSE 'fail' END AS status
FROM today t CROSS JOIN hist h;
Use the median for robustness against spikes. Consider a weekday median for strong day-of-week seasonality (sketched below).
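For the weekday-median idea, one option is to take the median over the same weekday in recent history; in this sketch the 28-day lookback is an assumption.
-- Weekday-aware baseline: median of the same weekday over the past 28 days (lookback is an assumption)
WITH today AS (
  SELECT COUNT(*) AS c
  FROM raw.events
  WHERE event_time >= DATE_TRUNC('day', CURRENT_DATE)
), hist AS (
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY c) AS median_c
  FROM (
    SELECT DATE_TRUNC('day', event_time) AS d, COUNT(*) AS c
    FROM raw.events
    WHERE event_time >= CURRENT_DATE - INTERVAL '28 days'
      AND event_time < DATE_TRUNC('day', CURRENT_DATE)
      AND EXTRACT(DOW FROM event_time) = EXTRACT(DOW FROM CURRENT_DATE)
    GROUP BY 1
  ) x
)
SELECT
  'events_volume_weekday' AS check_name,
  t.c AS observed_count,
  h.median_c AS baseline_count,
  CASE WHEN h.median_c = 0 THEN 'warn'
       WHEN ABS((t.c - h.median_c) / h.median_c) <= 0.25 THEN 'pass'
       ELSE 'fail' END AS status
FROM today t CROSS JOIN hist h;
Because today's count comes from an unconditional COUNT(*), this variant still returns a row (with observed_count = 0) even when no data has arrived yet.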
Example 3 — Missing hourly partitions (last 24 hours)
Goal: Ensure each hour has at least one row.
WITH hours AS (
SELECT generate_series(
DATE_TRUNC('hour', CURRENT_TIMESTAMP - INTERVAL '24 hours'),
DATE_TRUNC('hour', CURRENT_TIMESTAMP - INTERVAL '1 hour'),
INTERVAL '1 hour'
) AS hr
), seen AS (
SELECT DATE_TRUNC('hour', event_time) AS hr, COUNT(*) c
FROM raw.events
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
GROUP BY 1
)
SELECT h.hr, COALESCE(s.c, 0) AS row_count,
CASE WHEN COALESCE(s.c, 0) > 0 THEN 'present' ELSE 'missing' END AS status
FROM hours h
LEFT JOIN seen s USING (hr)
ORDER BY h.hr;
Alert if any status = 'missing'. Add a 5–10 minute grace window for the most recent hour (one way to do this is sketched below).
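One way to apply that grace window is to label the most recently closed hour as pending until the grace period has elapsed; the 10-minute value below is an assumption.
-- Same check, but an hour stays 'pending' (not 'missing') for 10 minutes after it closes
WITH hours AS (
  SELECT generate_series(
    DATE_TRUNC('hour', CURRENT_TIMESTAMP - INTERVAL '24 hours'),
    DATE_TRUNC('hour', CURRENT_TIMESTAMP - INTERVAL '1 hour'),
    INTERVAL '1 hour'
  ) AS hr
), seen AS (
  SELECT DATE_TRUNC('hour', event_time) AS hr, COUNT(*) AS c
  FROM raw.events
  WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '24 hours'
  GROUP BY 1
)
SELECT h.hr,
       COALESCE(s.c, 0) AS row_count,
       CASE WHEN COALESCE(s.c, 0) > 0 THEN 'present'
            WHEN CURRENT_TIMESTAMP < h.hr + INTERVAL '1 hour' + INTERVAL '10 minutes' THEN 'pending'
            ELSE 'missing' END AS status
FROM hours h
LEFT JOIN seen s USING (hr)
ORDER BY h.hr;
Alert only on 'missing'; a 'pending' row resolves to 'present' or 'missing' on the next run.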
Exercises
Do these to lock in the skill. You can take the quick test afterward. The test is available to everyone; log in to save your progress.
- Exercise 1 — Freshness and daily volume checks on orders
Write SQL that returns two rows: one for freshness, one for volume deviation.
- Table: analytics.orders(id, created_at, updated_at, status)
- Freshness SLA: newest updated_at should be ≤ 75 minutes old
- Volume tolerance: today’s count vs 7-day median; fail if |diff| > 30%
Return columns: check_name, status, metric_value, details.
- Exercise 2 — Find missing hourly buckets
For raw.events(event_time), list hours in the last 12 hours that have zero rows. Add a grace period of 8 minutes for the latest hour so it never falsely alerts.
Exercise checklist
- Picked the correct timestamp (event vs load).
- Added a grace period for near-real-time windows.
- Used a robust baseline (median or weekday median).
- Returned clear pass/fail plus metrics.
Common mistakes and self-check
- Using the wrong timestamp column. Self-check: Does the SLA describe business event timeliness or pipeline timeliness? Align accordingly.
- No grace period. Self-check: Do checks run after ingestion completes? Add 5–10 minutes if not.
- Baseline too short. Self-check: Does your lookback cover enough history (7–28 days) to capture seasonality?
- Alerting on first failure. Self-check: For noisy sources, require two consecutive fails or larger threshold breaches (a query for this is sketched after this list).
- Zero-division/empty history. Self-check: Handle median=0 or missing days with a warning state.
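For the "two consecutive fails" policy, the stored check history can answer the question directly; this sketch assumes results land in the quality_results table sketched under Implementation steps.
-- Page only when the latest two runs of a check both failed (assumes the quality_results layout above)
WITH ranked AS (
  SELECT check_name,
         status,
         ROW_NUMBER() OVER (PARTITION BY check_name ORDER BY checked_at DESC) AS rn
  FROM analytics.quality_results
)
SELECT check_name
FROM ranked
WHERE rn <= 2
GROUP BY check_name
HAVING COUNT(*) FILTER (WHERE status = 'fail') = 2;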
Practical projects
- Add freshness and volume checks to your 3 most important models (facts used by exec dashboards). Store check results in a quality_results table.
- Build a “Data Health” dashboard showing current lag, counts vs baseline, and last failure time.
- Add suppression windows for planned backfills to reduce false alarms.
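For the suppression-window project, one lightweight approach is a small table of planned windows that the alerting query joins against; the table name and columns here are assumptions.
-- Assumed suppression table for planned backfills/campaigns
CREATE TABLE IF NOT EXISTS analytics.check_suppressions (
  check_name  TEXT,
  starts_at   TIMESTAMP,
  ends_at     TIMESTAMP,
  reason      TEXT
);

-- Only surface failures that do not fall inside an active suppression window
SELECT r.check_name, r.checked_at, r.status
FROM analytics.quality_results r
LEFT JOIN analytics.check_suppressions s
  ON s.check_name = r.check_name
 AND r.checked_at BETWEEN s.starts_at AND s.ends_at
WHERE r.status = 'fail'
  AND s.check_name IS NULL;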
Learning path
- Start with freshness on one critical model.
- Add daily volume checks with a robust baseline.
- Introduce hourly partition checks for event streams.
- Centralize results and add simple alerting.
- Tune thresholds using historical distributions.
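For the last step, the historical distribution of daily counts (or of past percent_diff values) suggests sensible tolerances; a sketch over raw daily counts, with an assumed 28-day lookback:
-- Spread of daily counts over the past 28 days (lookback is an assumption); use it to pick tolerances
SELECT PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY c) AS p05,
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY c) AS p50,
       PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY c) AS p95
FROM (
  SELECT DATE_TRUNC('day', event_time) AS d, COUNT(*) AS c
  FROM raw.events
  WHERE event_time >= CURRENT_DATE - INTERVAL '28 days'
    AND event_time < DATE_TRUNC('day', CURRENT_DATE)
  GROUP BY 1
) daily;
If p05 and p95 sit close to the median, a ±25% tolerance is comfortable; if they are wider, loosen the tolerance or switch to a weekday baseline.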
Mini challenge
Your product usage events spike by +120% today after a marketing campaign, and the volume check fails. How would you reduce false alarms without missing real issues?
Possible approach
- Use a weekday-median baseline plus a looser upper tolerance on campaign days.
- Allow one-time silence when a scheduled campaign is documented.
- Add a second check on bad rows (e.g., null user_id) to catch true data issues.
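For the last point, a small companion check on the share of bad rows helps separate a legitimate spike from a data issue; user_id comes from the challenge wording, and the 1% threshold is an assumption.
-- Share of today's events with a NULL user_id (1% threshold is an assumption)
SELECT
  'events_null_user_id' AS check_name,
  (COUNT(*) FILTER (WHERE user_id IS NULL))::numeric / NULLIF(COUNT(*), 0) AS null_rate,
  CASE WHEN COUNT(*) = 0 THEN 'warn'
       WHEN (COUNT(*) FILTER (WHERE user_id IS NULL))::numeric / NULLIF(COUNT(*), 0) <= 0.01 THEN 'pass'
       ELSE 'fail' END AS status
FROM raw.events
WHERE event_time >= DATE_TRUNC('day', CURRENT_DATE);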
Next steps
- Implement one freshness and one volume check today.
- Run for a week, collect metrics, then tune thresholds.
- Take the Quick Test below to validate understanding. Note: The test is available to everyone; only logged-in users get saved progress.