
Freshness, Completeness, and Accuracy Checks

Learn freshness, completeness, and accuracy checks for free with explanations, exercises, and a quick test, designed for Data Architects.

Published: January 18, 2026 | Updated: January 18, 2026

Why this matters

As a Data Architect, you define data contracts and ensure downstream analytics, ML features, and dashboards are trustworthy. Freshness, completeness, and accuracy (F-C-A) are the core signals stakeholders rely on to decide: “Can I use this data right now?”

  • Set and enforce SLAs/SLOs for pipelines and tables.
  • Detect and triage incidents before business users do.
  • Design platform patterns so teams implement quality checks consistently.
  • Communicate data reliability in simple, objective terms.

Concept explained simply

  • Freshness: Is the data up-to-date? Measured as the delay between now (or expected arrival time) and the most recent data timestamp. Example metric: now() - max(event_time).
  • Completeness: Do we have all the data we expect? Measured as a percentage of expected records, partitions, or non-null fields that actually arrived. Example metric: received_rows / expected_rows.
  • Accuracy: Are the values correct? Checked against rules, trusted references, or reconciliations. Example: revenue_total in warehouse matches payments provider within 0.5% tolerance.
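The three example metrics above can be written as small functions; a minimal Python sketch (function names and sample values are illustrative, not tied to any specific pipeline):

```python
from datetime import datetime, timezone

def freshness_lag_minutes(max_event_time: datetime, now: datetime) -> float:
    """Freshness: delay between now and the most recent data timestamp."""
    return (now - max_event_time).total_seconds() / 60

def completeness_ratio(received_rows: int, expected_rows: int) -> float:
    """Completeness: fraction of expected records that actually arrived."""
    return received_rows / expected_rows if expected_rows else 0.0

def accuracy_within_tolerance(observed: float, reference: float, tol: float) -> bool:
    """Accuracy: relative difference against a trusted reference source."""
    return abs(observed - reference) / reference <= tol

now = datetime(2025, 4, 10, 2, 0, tzinfo=timezone.utc)
max_ts = datetime(2025, 4, 10, 1, 30, tzinfo=timezone.utc)
print(freshness_lag_minutes(max_ts, now))               # 30.0
print(completeness_ratio(98_000, 100_000))              # 0.98
print(accuracy_within_tolerance(995.0, 1000.0, 0.005))  # True
```

Note that the timestamps are timezone-aware UTC, in line with the "standardize on UTC" advice later in this section.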
Mental model

Think of a three-light status panel:

  • Freshness: green if data arrived on time.
  • Completeness: green if expected scope is present.
  • Accuracy: green if values conform to truth or rules.

All three should be green for decision-grade data. If one is red, communicate which and why.

Set targets and thresholds

Step 1 — Define use-case windows

Ask consumers how recent the data must be (freshness), how much of it they need to trust a decision (completeness), and what error tolerance is acceptable (accuracy).

Step 2 — Turn needs into SLOs
  • Freshness SLO example: 95% of days, table is updated by 02:00 UTC.
  • Completeness SLO example: 99% of required partitions present by 03:00 UTC.
  • Accuracy SLO example: warehouse revenue within 0.3% of payments ledger daily.
Step 3 — Choose alert thresholds
  • Warning near the boundary (e.g., at 80–90% of max delay).
  • Critical beyond the boundary (e.g., exceeds SLA).
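Steps 2 and 3 combine naturally into a single status function; a sketch, assuming the 80% warning fraction used in the example above:

```python
def classify_freshness(lag_minutes: float, sla_minutes: float,
                       warn_fraction: float = 0.8) -> str:
    """OK below the warning band, WARN from warn_fraction * SLA up to
    the SLA, CRIT once the SLA itself is exceeded."""
    if lag_minutes > sla_minutes:
        return "CRIT"
    if lag_minutes >= warn_fraction * sla_minutes:
        return "WARN"
    return "OK"

print(classify_freshness(30, 45))  # OK   (below the 36-minute warning band)
print(classify_freshness(40, 45))  # WARN (past 80% of the 45-minute SLA)
print(classify_freshness(47, 45))  # CRIT (exceeds the SLA)
```

The same OK/WARN/CRIT shape works for completeness ratios and accuracy differences; only the comparison direction changes.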
Step 4 — Document assumptions

Specify timezones, late-arriving data policies, expected partitions, and reference sources for accuracy.
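One lightweight way to document these assumptions is a versioned, machine-readable contract per table; a sketch (keys and values are illustrative, loosely based on the daily sales example later in this section):

```python
# Hypothetical per-table quality contract; field names are illustrative.
fact_sales_contract = {
    "timezone": "UTC",
    "freshness": {"deadline_utc": "02:00", "warn_minutes": 30, "crit_minutes": 60},
    "completeness": {"expectation_source": "pos_log", "warn_below": 0.99, "crit_below": 0.98},
    "accuracy": {"reference": "payments_ledger", "rel_tolerance": 0.005},
    "late_data": {"grace_minutes": 60, "backfill_allowed": True},
}
print(fact_sales_contract["accuracy"]["rel_tolerance"])  # 0.005
```

Keeping this next to the pipeline code means thresholds are reviewed in the same pull requests that change the pipeline.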

Edge cases to plan for
  • Late-arriving events: allow a grace window or run backfills.
  • Sparse tables: use arrival cadence (e.g., weekly) instead of minutes.
  • Clock skew/timezones: standardize on UTC and consistent timestamp columns.
  • Changing business volumes: define completeness as a ratio, not an absolute count.

Worked examples

Example 1 — Daily sales table

Context: Retail sales are ingested nightly to fact_sales.

  • Freshness: Expect update by 02:00 UTC. Metric: delay = actual load time - 02:00 deadline (minutes late). If delay > 30 min, warning; > 60 min, critical.
  • Completeness: Expected rows from POS logs vs rows loaded. Metric: loaded_rows / expected_rows. Warning at 98–99%; critical below 98%.
  • Accuracy: Reconcile revenue. Metric: |sum(fact_sales.amount) - payments_total| / payments_total. Critical if > 0.5%.
Sample SQL (conceptual)
-- Freshness (minutes since the most recent event)
SELECT EXTRACT(EPOCH FROM (NOW() - MAX(sale_date_time))) / 60 AS lag_min
FROM fact_sales
WHERE sale_date = CURRENT_DATE;
-- Completeness
WITH expected AS (
  SELECT COUNT(*) AS c FROM pos_log WHERE sale_date = CURRENT_DATE
), loaded AS (
  SELECT COUNT(*) AS c FROM fact_sales WHERE sale_date = CURRENT_DATE
)
SELECT loaded.c::float/NULLIF(expected.c,0) AS completeness_ratio FROM loaded, expected;
-- Accuracy
WITH w AS (SELECT SUM(amount) AS amt FROM fact_sales WHERE sale_date = CURRENT_DATE),
     p AS (SELECT SUM(total)  AS amt FROM payments_ledger WHERE sale_date = CURRENT_DATE)
SELECT ABS(w.amt - p.amt)/NULLIF(p.amt,0) AS rel_diff FROM w, p;

Example 2 — Streaming events topic

Context: Kafka topic for clickstream, consumed into a bronze table.

  • Freshness: last_event_ts within 5 minutes of now. If > 5 min, critical.
  • Completeness: Expect N partitions per hour; verify each partition contains at least one file/message batch.
  • Accuracy: Enforce schema rules (e.g., country_code in known list); drop or quarantine violations.
Notes
  • Use watermarks to handle out-of-order events.
  • Track both ingestion_time and event_time to diagnose source vs pipeline delay.
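The schema-rule-plus-quarantine pattern from this example can be sketched as follows (the reference list and event fields are hypothetical):

```python
# Hypothetical reference list for the accuracy rule.
KNOWN_COUNTRIES = {"US", "GB", "DE", "FR", "JP"}

def valid_event(event: dict) -> bool:
    """Accuracy rule: country_code must be in the known list."""
    return event.get("country_code") in KNOWN_COUNTRIES

events = [
    {"user": "a", "country_code": "US"},
    {"user": "b", "country_code": "XX"},  # violates the rule
]
# Route violations to a quarantine location instead of silently dropping them.
valid = [e for e in events if valid_event(e)]
quarantine = [e for e in events if not valid_event(e)]
print(len(valid), len(quarantine))  # 1 1
```

Quarantining (rather than dropping) preserves the evidence needed to debug the upstream source.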

Example 3 — Customer addresses

Context: Dimensions used by marketing and shipping.

  • Freshness: updated_at not older than 24 hours.
  • Completeness: required fields not null (street, city, postal_code, country). Threshold: 99.5% non-null.
  • Accuracy: postal_code format by country and city-country geocode consistency. Tolerance: reject mismatches; quarantine offending rows.
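The non-null completeness and postal-format accuracy checks for this example might look like the sketch below (the per-country regexes are simplified illustrations, not authoritative postal formats):

```python
import re

REQUIRED = ("street", "city", "postal_code", "country")
# Simplified, illustrative postal-code formats; real rules vary by country.
POSTAL_RE = {"US": r"^\d{5}(-\d{4})?$", "GB": r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$"}

def required_fields_present(row: dict) -> bool:
    """Completeness: all required address fields are non-null/non-empty."""
    return all(row.get(f) for f in REQUIRED)

def postal_code_valid(row: dict) -> bool:
    """Accuracy: postal_code matches the format for the row's country."""
    pattern = POSTAL_RE.get(row.get("country"))
    return bool(pattern and re.match(pattern, row.get("postal_code", "")))

rows = [
    {"street": "1 Main St", "city": "Springfield", "postal_code": "12345", "country": "US"},
    {"street": "2 Main St", "city": "Springfield", "postal_code": "1234", "country": "US"},
]
non_null_ratio = sum(required_fields_present(r) for r in rows) / len(rows)
print(non_null_ratio)                        # 1.0
print([postal_code_valid(r) for r in rows])  # [True, False]
```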
Practical tip

For accuracy without a single truth source, combine rule checks (formats, ranges) with reconciliation (counts, totals) and anomaly detection on distributions.
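The anomaly-detection leg of that combination can be as simple as a z-score over recent daily totals; a sketch (the 3-standard-deviation threshold is a common but arbitrary default):

```python
from statistics import mean, stdev

def zscore_anomaly(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Flag today's value if it deviates more than `threshold` standard
    deviations from the recent history (a simple distribution check)."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(today - mu) / sigma > threshold

history = [100.0, 102.0, 98.0, 101.0, 99.0]
print(zscore_anomaly(history, 100.5))  # False (within normal variation)
print(zscore_anomaly(history, 150.0))  # True  (far outside the distribution)
```

For seasonal data, compare against the same weekday or a rolling window rather than a flat history.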

How to implement checks consistently

  • Where: close to data boundaries (ingestion), at key transformations, and before serving layers.
  • When: on schedule (batch) and on arrival windows (streaming) with watermarks.
  • What to store: raw metrics (lag minutes, ratio), thresholds, final status (pass/warn/fail), and run metadata (dataset, partition, timestamp).
  • How to alert: route critical to on-call; warnings to owners or daily digest. Include context and suggested first steps.
Recommended metrics to log per run
  • Freshness: max_timestamp, lag_minutes, window_start/end.
  • Completeness: expected, observed, ratio, keys checked (e.g., dates/partitions).
  • Accuracy: rule name, sample size, violations count, relative difference when reconciling.
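A per-run record covering these fields might look like the sketch below (the schema is illustrative; the sample values echo the streaming example's 5-minute freshness rule):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class CheckResult:
    """One logged row per check run: raw metric, threshold, final status."""
    dataset: str
    check_type: str     # freshness | completeness | accuracy
    metric_value: float
    threshold: float
    status: str         # pass | warn | fail
    run_at: str         # ISO-8601 UTC timestamp

result = CheckResult(
    dataset="events_bronze",
    check_type="freshness",
    metric_value=7.5,   # observed lag in minutes
    threshold=5.0,      # critical beyond 5 minutes
    status="fail",
    run_at=datetime(2025, 5, 1, 3, 10, tzinfo=timezone.utc).isoformat(),
)
print(asdict(result)["status"])  # fail
```

Storing the raw metric alongside the status lets you re-evaluate history when thresholds change.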

Exercises

Try these before viewing solutions.

Exercise 1 — Compute freshness and classify

Given max(event_time) = 2025-04-10 01:13 UTC and current time = 2025-04-10 02:00 UTC. SLA: data updated within 45 minutes. Warning at 80% of SLA. Compute lag and status.

Exercise 2 — Partition completeness

You expect 24 hourly partitions for 2025-04-09 in table sensor_readings. Present partitions: 00–10 and 13–23 (missing 11,12). Set thresholds: warning if completeness < 99%, critical if < 95%. Compute completeness and status. Suggest one remediation step.

Checklist — Did you cover essentials?
  • Used UTC consistently.
  • Defined both SLO and alert thresholds.
  • Computed ratios, not just absolute counts.
  • Considered late data policy and grace windows.
  • Captured metrics for auditing and dashboards.

Common mistakes and self-check

  • Using load_time instead of business event_time for freshness. Self-check: compare both; track both to diagnose delay source.
  • Defining completeness as a fixed row count. Self-check: make it a ratio against an expectation source or partition list.
  • “Accuracy” equals “no nulls.” Self-check: add value rules and reconciliations against reference totals.
  • No tolerance bands. Self-check: define warn/crit to avoid alert fatigue.
  • Ignoring timezones. Self-check: normalize to UTC end-to-end.

Practical projects

  • Project A: Build a freshness dashboard for top 10 tables (lag minutes, trend, SLO attainment).
  • Project B: Implement partition completeness for one daily fact table, including expected partitions, ratio, and alerts.
  • Project C: Create an accuracy reconciliation job between warehouse revenue and an external ledger with 0.5% tolerance and nightly run.

Mini challenge

It is 03:10 UTC. For table orders_daily on 2025-05-01: max(order_ts)=02:35, expected rows=120k, loaded rows=116.4k, payments_total=$5,020,000, warehouse_total=$4,995,100. SLAs: freshness 45 min, completeness critical < 98%, accuracy critical > 0.5% diff. Classify each and name the likely cause for any failures.

Hint
  • Compute lag: now - max_ts.
  • Completeness: loaded/expected.
  • Accuracy: abs(diff)/reference.

Who this is for

  • Data Architects defining reliability standards.
  • Data Product Owners needing trust signals for stakeholders.

Prerequisites

  • Comfort with SQL and dataset schemas.
  • Basic understanding of batch and streaming pipelines.
  • Familiarity with SLAs/SLOs and monitoring concepts.

Learning path

  • Start: Freshness, Completeness, Accuracy checks (this page).
  • Next: Data contracts and schema evolution.
  • Then: Incident response playbooks and root cause analysis.
  • Finally: Reliability SLO dashboards and governance reviews.

Next steps

  • Implement one freshness and one completeness check on a critical table.
  • Add a single high-impact accuracy reconciliation.
  • Review thresholds after one week using observed distributions.

Quick test and progress

You can take the quick test below for free. Everyone can attempt it; only logged-in users will have their progress saved.

Practice Exercises

2 exercises to complete

Instructions (Exercise 1)

Given max(event_time) = 2025-04-10 01:13 UTC and current time = 2025-04-10 02:00 UTC. SLA: data updated within 45 minutes. Warning at 80% of SLA. Compute the freshness lag in minutes and the status (OK/WARN/CRIT).

Expected Output
Lag = 47 minutes; Status = CRIT (exceeds the 45-minute SLA).

Instructions (Exercise 2)

You expect 24 hourly partitions for 2025-04-09 in sensor_readings; partitions 11 and 12 are missing. Warning if completeness < 99%; critical if < 95%. Compute the completeness ratio and status, and suggest one remediation step.

Expected Output
Completeness = 22/24 ≈ 91.7%; Status = CRIT (below 95%). One remediation: backfill partitions 11 and 12 from the source.

Freshness, Completeness, and Accuracy Checks — Quick Test

Test your knowledge with 7 questions. Pass with 70% or higher.

