Why this matters
Freshness and completeness checks keep your pipelines trustworthy. Teams depend on timely data (freshness) and full coverage without gaps or excessive nulls (completeness) to make decisions, train models, close books, and trigger automations.
- Alert when the latest partition is late or missing.
- Block downstream jobs if required columns have too many nulls.
- Catch partial loads (e.g., only half of regions arrived).
- Detect regressions after schema or pipeline changes.
Concept explained simply
Freshness: Is the data recent enough for its promised schedule? You compare the latest available timestamp or partition with now (or the expected period).
Completeness: Do we have all the rows and fields we expect? You check record counts, required keys, and null rates for critical columns.
Mental model
Think of a train timetable:
- Freshness: Did today’s train arrive by the scheduled time?
- Completeness: Did all cars arrive and are all seats filled with valid tickets (no missing critical info)?
Define clear expectations
- Schedule & window: e.g., "Daily by 02:00 UTC, covering previous day" or "Streaming events within 10 minutes."
- Time source: Prefer an event timestamp (event_time) over ingestion time for freshness logic; use ingestion time for operational lag.
- Grain: The partition/grain (day/hour) used to verify, e.g., sale_date daily.
- Completeness targets: Expected rows (absolute or percent-of-baseline) and null thresholds per critical column.
- Late data policy: Accept small late trickle (watermark) or require strict cutoff.
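These expectations can live in documentation, but encoding them as data makes them easier to enforce and to generate checks from later. The sketch below assumes a PostgreSQL-like warehouse and invented names (audit.dq_expectations and its columns); treat it as one possible shape, not a standard.

-- Sketch: one possible expectations table (names and columns are illustrative).
CREATE TABLE IF NOT EXISTS audit.dq_expectations (
  table_name        text     NOT NULL,
  time_column       text     NOT NULL,   -- e.g. 'event_time'
  grain             text     NOT NULL,   -- e.g. 'daily' or 'hourly'
  freshness_cutoff  interval NOT NULL,   -- e.g. '3 hours' after the scheduled load
  min_row_pct       numeric,             -- e.g. 0.80 of the rolling baseline
  max_null_ratio    numeric,             -- default tolerance for required columns
  late_data_policy  text                 -- e.g. 'watermark 10 minutes' or 'strict cutoff'
);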
Step-by-step: Add checks to a pipeline
- Identify the authoritative time column (e.g., event_time).
- Write a query that returns the latest time seen and a boolean is_fresh.
- Write queries that compute:
  - Row count vs baseline
  - Null ratio per required column
  - Coverage of expected keys (e.g., regions)
- Store results in an audit table with checked_at, metric, value, and status (see the sketch after this list).
- Alert or fail the job if status is not OK. Allow manual override if policy permits.
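One possible shape for that audit table, plus an example of persisting a freshness metric, is sketched below. The schema and table name (audit.dq_results) and the status values are assumptions for illustration, not a standard.

-- Sketch: audit table for check results (names are illustrative).
CREATE TABLE IF NOT EXISTS audit.dq_results (
  checked_at  timestamptz NOT NULL,
  table_name  text        NOT NULL,
  metric      text        NOT NULL,   -- e.g. 'hours_since_update', 'null_ratio:order_id'
  value       numeric,
  status      text        NOT NULL    -- e.g. 'OK', 'WARN', 'FAIL'
);

-- Example: persist a freshness metric and verdict for one table.
INSERT INTO audit.dq_results (checked_at, table_name, metric, value, status)
SELECT NOW(),
       'analytics.sales_daily',
       'hours_since_update',
       EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0,
       CASE WHEN EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0 <= 3
            THEN 'OK' ELSE 'FAIL' END
FROM analytics.sales_daily;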
Worked examples
Example 1 — Daily batch table freshness
Expectation: Yesterday's data is available by 02:00 UTC. Fresh if latest sale_date = current_date - 1 and last update happened <= 3 hours ago.
-- Generic SQL (PostgreSQL-like)
SELECT
NOW() AS checked_at,
MAX(updated_at) AS latest_loaded_at,
DATE(MAX(sale_date)) AS latest_sale_date,
EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0 AS hours_since_update,
CASE
WHEN DATE(MAX(sale_date)) >= CURRENT_DATE - INTERVAL '1 day'
AND EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0 <= 3
THEN TRUE ELSE FALSE END AS is_fresh
FROM analytics.sales_daily;
Example 2 — Streaming freshness with watermark
Expectation: 95% of events arrive within 10 minutes of event time. Use a watermark and percent coverage.
WITH params AS (
SELECT NOW() AS checked_at,
NOW() - INTERVAL '10 minutes' AS watermark
), stats AS (
SELECT COUNT(*) AS total_recent,
SUM(CASE WHEN event_time >= (SELECT watermark FROM params) THEN 1 ELSE 0 END) AS within_window
FROM events.stream
WHERE event_time >= NOW() - INTERVAL '1 hour'
)
SELECT (within_window::float / NULLIF(total_recent, 0)) AS pct_within_window,
       -- COALESCE so that "no recent events at all" also counts as not fresh.
       COALESCE((within_window::float / NULLIF(total_recent, 0)) >= 0.95, FALSE) AS is_fresh
FROM stats;
Example 3 — Completeness by key coverage
Expectation: All 50 US states appear daily in a summary table.
WITH expected AS (
SELECT UNNEST(ARRAY['AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']) AS state
), latest AS (
SELECT state, sale_date
FROM analytics.daily_state_sales
WHERE sale_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT e.state,
CASE WHEN l.state IS NULL THEN 'MISSING' ELSE 'OK' END AS status
FROM expected e
LEFT JOIN latest l USING (state);
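If the orchestrator needs a single-row verdict rather than one row per state, the final SELECT of the query above could be swapped for something like this (a sketch that reuses the expected and latest CTEs from Example 3):

-- Sketch: single-row verdict; replaces the final SELECT of Example 3.
SELECT NOW() AS checked_at,
       COUNT(*) FILTER (WHERE l.state IS NULL) AS missing_states,
       COUNT(*) FILTER (WHERE l.state IS NULL) = 0 AS is_complete
FROM expected e
LEFT JOIN latest l USING (state);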
Example 4 — Null ratio for required columns
Expectation: order_id and customer_id have 0% nulls; order_total nulls <= 1%.
SELECT 'order_id' AS column_name,
AVG(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)::float AS null_ratio
FROM analytics.orders
UNION ALL
SELECT 'customer_id', AVG(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)::float FROM analytics.orders
UNION ALL
SELECT 'order_total', AVG(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END)::float FROM analytics.orders;
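To turn those ratios into pass/fail statuses against the thresholds stated above, one option is to carry the threshold alongside each ratio. This is a sketch with the thresholds hard-coded; in practice they could come from a metadata table.

-- Sketch: compare null ratios to per-column thresholds and emit a status.
WITH ratios AS (
  SELECT 'order_id' AS column_name, 0.0 AS max_null_ratio,
         AVG(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)::float AS null_ratio
  FROM analytics.orders
  UNION ALL
  SELECT 'customer_id', 0.0,
         AVG(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)::float
  FROM analytics.orders
  UNION ALL
  SELECT 'order_total', 0.01,
         AVG(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END)::float
  FROM analytics.orders
)
SELECT column_name, null_ratio,
       CASE WHEN null_ratio <= max_null_ratio THEN 'OK' ELSE 'FAIL' END AS status
FROM ratios;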
Choosing thresholds and alerts
- Daily batch: latest partition date equals yesterday; last load within N hours of schedule.
- Streaming: percent within watermark (e.g., 95% within 10 minutes).
- Row counts: within 80–120% of the rolling median of the last 7 comparable periods (handle weekday vs weekend separately); a baseline sketch follows this list.
- Nulls: 0% for identifiers and join keys; small non-zero tolerance for optional attributes.
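For the row-count rule, the baseline itself has to be computed. The sketch below assumes a daily table (analytics.sales_daily with a sale_date column) and compares yesterday's count to the median of recent same-weekday counts; the 60-day lookback and 80–120% band are illustrative policy choices, not fixed rules.

-- Sketch: compare yesterday's row count to the median of prior same-weekday counts.
WITH daily_counts AS (
  SELECT sale_date, COUNT(*) AS row_count
  FROM analytics.sales_daily
  WHERE sale_date >= CURRENT_DATE - INTERVAL '60 days'
  GROUP BY sale_date
), baseline AS (
  -- Median of earlier days that fall on the same weekday as yesterday.
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY row_count) AS median_count
  FROM daily_counts
  WHERE sale_date < CURRENT_DATE - INTERVAL '1 day'
    AND EXTRACT(DOW FROM sale_date) = EXTRACT(DOW FROM CURRENT_DATE - INTERVAL '1 day')
), latest AS (
  SELECT row_count
  FROM daily_counts
  WHERE sale_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT l.row_count,
       b.median_count,
       l.row_count BETWEEN 0.8 * b.median_count AND 1.2 * b.median_count AS is_within_baseline
FROM latest l CROSS JOIN baseline b;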
Implementation patterns
- Query-based checks: Run SQL checks post-load and persist results in an audit table.
- Pre-load guards: Validate input files (count, schema) before ingest.
- Gating: Fail downstream models if checks are not OK; allow controlled retries/backfills.
- Partition-aware: Check the latest partition only for speed; sample older partitions periodically.
Exercises
Do these hands-on tasks. Then compare with the solutions in each exercise card below.
- Exercise 1: Compute freshness lag and a boolean is_fresh for a daily table.
- Exercise 2: Calculate null ratios for required columns and flag any that exceed thresholds.
- [Checklist] Define the time column and schedule.
- [Checklist] Decide thresholds and late data policy.
- [Checklist] Write queries that return a single-row verdict.
- [Checklist] Save results to an audit table with timestamps.
Common mistakes and self-check
- Using ingestion time instead of business event time for freshness when the business cares about event recency. Self-check: Confirm which timestamp downstream users reference.
- Ignoring weekends/holidays. Self-check: Compare weekday vs weekend baselines separately.
- Not accounting for late-arriving data. Self-check: Choose a watermark and measure on-time percentage.
- Only checking total row count. Self-check: Add key coverage and column null ratios.
- Timezone drift. Self-check: Normalize to UTC in checks.
Practical projects
- Build a daily freshness monitor that writes a status row per table into an audit table and a lightweight dashboard that shows Red/Amber/Green.
- Create a completeness suite: counts vs baseline, null ratios, and key coverage for your top 5 critical tables.
- Implement gating in your orchestrator so downstream jobs stop when is_fresh = false or null ratios exceed thresholds (see the sketch after this list).
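A minimal gating query, assuming the audit.dq_results table sketched earlier, could look like this; the orchestrator runs it before downstream jobs and stops when should_proceed is false.

-- Sketch: single-row gate over the assumed audit.dq_results table.
-- COALESCE makes the gate fail-closed: no recorded checks also blocks downstream jobs.
SELECT COALESCE(BOOL_AND(status = 'OK'), FALSE) AS should_proceed
FROM audit.dq_results
WHERE table_name = 'analytics.sales_daily'
  AND checked_at >= NOW() - INTERVAL '1 day';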
Mini challenge
Your marketing events stream shows 90% within 10 minutes during weekdays but drops to 80% on weekends due to a partner system. Propose an updated policy that protects weekday SLAs without causing constant weekend alerts. Write the policy in 2–3 bullet points.
Who this is for
- Data Engineers owning batch or streaming pipelines.
- Analytics Engineers ensuring model reliability.
- Platform Engineers adding data quality guardrails.
Prerequisites
- Comfort with SQL (GROUP BY, window functions, conditional aggregation).
- Basic understanding of batch vs streaming workflows and partitions.
- Familiarity with timezones and UTC normalization.
Learning path
- Identify critical data products and write their freshness/completeness expectations.
- Implement SQL checks for one table and store results in an audit table.
- Add thresholds based on rolling medians per weekday.
- Scale to multiple tables and add gating/alerts.
- Review and adjust thresholds monthly.
Next steps
- Automate check generation from metadata (e.g., required columns list).
- Add anomaly detection for counts and null ratios.
- Expand checks to include uniqueness and referential integrity.
Quick Test
Take the quick test below to check your understanding.