Why this matters
Freshness and completeness checks keep your pipelines trustworthy. Teams depend on timely data (freshness) and full coverage without gaps or excessive nulls (completeness) to make decisions, train models, close books, and trigger automations.
- Alert when the latest partition is late or missing.
- Block downstream jobs if required columns have too many nulls.
- Catch partial loads (e.g., only half of regions arrived).
- Detect regressions after schema or pipeline changes.
Concept explained simply
Freshness: Is the data recent enough for its promised schedule? You compare the latest available timestamp or partition with now (or the expected period).
Completeness: Do we have all the rows and fields we expect? You check record counts, required keys, and null rates for critical columns.
Mental model
Think of a train timetable:
- Freshness: Did today’s train arrive by the scheduled time?
- Completeness: Did all cars arrive and are all seats filled with valid tickets (no missing critical info)?
Define clear expectations
- Schedule & window: e.g., "Daily by 02:00 UTC, covering previous day" or "Streaming events within 10 minutes."
- Time source: Prefer an event timestamp (event_time) over ingestion time for freshness logic; use ingestion time for operational lag.
- Grain: The partition/grain (day/hour) used to verify, e.g., sale_date daily.
- Completeness targets: Expected rows (absolute or percent-of-baseline) and null thresholds per critical column.
- Late data policy: Accept small late trickle (watermark) or require strict cutoff.
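These expectations can live in documentation, but encoding them as data makes them easier to enforce and to generate checks from later. The sketch below assumes a PostgreSQL-like warehouse and invented names (audit.dq_expectations and its columns); treat it as one possible shape, not a standard.

-- Sketch: one possible expectations table (names and columns are illustrative).
CREATE TABLE IF NOT EXISTS audit.dq_expectations (
  table_name        text     NOT NULL,
  time_column       text     NOT NULL,   -- e.g. 'event_time'
  grain             text     NOT NULL,   -- e.g. 'daily' or 'hourly'
  freshness_cutoff  interval NOT NULL,   -- e.g. '3 hours' after the scheduled load
  min_row_pct       numeric,             -- e.g. 0.80 of the rolling baseline
  max_null_ratio    numeric,             -- default tolerance for required columns
  late_data_policy  text                 -- e.g. 'watermark 10 minutes' or 'strict cutoff'
);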
Step-by-step: Add checks to a pipeline
- Identify the authoritative time column (e.g., event_time).
- Write a query that returns the latest time seen and a boolean is_fresh.
- Write queries that compute:
  - Row count vs baseline
  - Null ratio per required column
  - Coverage of expected keys (e.g., regions)
- Store results in an audit table with checked_at, metric, value, and status (see the sketch after this list).
- Alert or fail the job if status is not OK. Allow manual override if policy permits.
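One possible shape for that audit table, plus an example of persisting a freshness metric, is sketched below. The schema and table name (audit.dq_results) and the status values are assumptions for illustration, not a standard.

-- Sketch: audit table for check results (names are illustrative).
CREATE TABLE IF NOT EXISTS audit.dq_results (
  checked_at  timestamptz NOT NULL,
  table_name  text        NOT NULL,
  metric      text        NOT NULL,   -- e.g. 'hours_since_update', 'null_ratio:order_id'
  value       numeric,
  status      text        NOT NULL    -- e.g. 'OK', 'WARN', 'FAIL'
);

-- Example: persist a freshness metric and verdict for one table.
INSERT INTO audit.dq_results (checked_at, table_name, metric, value, status)
SELECT NOW(),
       'analytics.sales_daily',
       'hours_since_update',
       EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0,
       CASE WHEN EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0 <= 3
            THEN 'OK' ELSE 'FAIL' END
FROM analytics.sales_daily;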
Worked examples
Example 1 — Daily batch table freshness
Expectation: Yesterday's data is available by 02:00 UTC. Fresh if latest sale_date = current_date - 1 and last update happened <= 3 hours ago.
-- Generic SQL (PostgreSQL-like)
SELECT
NOW() AS checked_at,
MAX(updated_at) AS latest_loaded_at,
DATE(MAX(sale_date)) AS latest_sale_date,
EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0 AS hours_since_update,
CASE
WHEN DATE(MAX(sale_date)) >= CURRENT_DATE - INTERVAL '1 day'
AND EXTRACT(EPOCH FROM (NOW() - MAX(updated_at)))/3600.0 <= 3
THEN TRUE ELSE FALSE END AS is_fresh
FROM analytics.sales_daily;
Example 2 — Streaming freshness with watermark
Expectation: 95% of events arrive within 10 minutes of event time. Use a watermark and percent coverage.
WITH params AS (
SELECT NOW() AS checked_at,
NOW() - INTERVAL '10 minutes' AS watermark
), stats AS (
SELECT COUNT(*) AS total_recent,
SUM(CASE WHEN event_time >= (SELECT watermark FROM params) THEN 1 ELSE 0 END) AS within_window
FROM events.stream
WHERE event_time >= NOW() - INTERVAL '1 hour'
)
SELECT (within_window::float / NULLIF(total_recent, 0)) AS pct_within_window,
       -- COALESCE so that "no recent events at all" also counts as not fresh.
       COALESCE((within_window::float / NULLIF(total_recent, 0)) >= 0.95, FALSE) AS is_fresh
FROM stats;
Example 3 — Completeness by key coverage
Expectation: All 50 US states appear daily in a summary table.
WITH expected AS (
SELECT UNNEST(ARRAY['AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY']) AS state
), latest AS (
SELECT state, sale_date
FROM analytics.daily_state_sales
WHERE sale_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT e.state,
CASE WHEN l.state IS NULL THEN 'MISSING' ELSE 'OK' END AS status
FROM expected e
LEFT JOIN latest l USING (state);
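If the orchestrator needs a single-row verdict rather than one row per state, the final SELECT of the query above could be swapped for something like this (a sketch that reuses the expected and latest CTEs from Example 3):

-- Sketch: single-row verdict; replaces the final SELECT of Example 3.
SELECT NOW() AS checked_at,
       COUNT(*) FILTER (WHERE l.state IS NULL) AS missing_states,
       COUNT(*) FILTER (WHERE l.state IS NULL) = 0 AS is_complete
FROM expected e
LEFT JOIN latest l USING (state);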
Example 4 — Null ratio for required columns
Expectation: order_id and customer_id have 0% nulls; order_total nulls <= 1%.
SELECT 'order_id' AS column_name,
AVG(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)::float AS null_ratio
FROM analytics.orders
UNION ALL
SELECT 'customer_id', AVG(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)::float FROM analytics.orders
UNION ALL
SELECT 'order_total', AVG(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END)::float FROM analytics.orders;
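To turn those ratios into pass/fail statuses against the thresholds stated above, one option is to carry the threshold alongside each ratio. This is a sketch with the thresholds hard-coded; in practice they could come from a metadata table.

-- Sketch: compare null ratios to per-column thresholds and emit a status.
WITH ratios AS (
  SELECT 'order_id' AS column_name, 0.0 AS max_null_ratio,
         AVG(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)::float AS null_ratio
  FROM analytics.orders
  UNION ALL
  SELECT 'customer_id', 0.0,
         AVG(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)::float
  FROM analytics.orders
  UNION ALL
  SELECT 'order_total', 0.01,
         AVG(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END)::float
  FROM analytics.orders
)
SELECT column_name, null_ratio,
       CASE WHEN null_ratio <= max_null_ratio THEN 'OK' ELSE 'FAIL' END AS status
FROM ratios;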
Choosing thresholds and alerts
- Daily batch: latest partition date equals yesterday; last load within N hours of schedule.
- Streaming: percent within watermark (e.g., 95% within 10 minutes).
- Row counts: within 80–120% of the rolling median of the last 7 comparable periods (handle weekday vs weekend separately); a baseline sketch follows this list.
- Nulls: 0% for identifiers and join keys; small non-zero tolerance for optional attributes.
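For the row-count rule, the baseline itself has to be computed. The sketch below assumes a daily table (analytics.sales_daily with a sale_date column) and compares yesterday's count to the median of recent same-weekday counts; the 60-day lookback and 80–120% band are illustrative policy choices, not fixed rules.

-- Sketch: compare yesterday's row count to the median of prior same-weekday counts.
WITH daily_counts AS (
  SELECT sale_date, COUNT(*) AS row_count
  FROM analytics.sales_daily
  WHERE sale_date >= CURRENT_DATE - INTERVAL '60 days'
  GROUP BY sale_date
), baseline AS (
  -- Median of earlier days that fall on the same weekday as yesterday.
  SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY row_count) AS median_count
  FROM daily_counts
  WHERE sale_date < CURRENT_DATE - INTERVAL '1 day'
    AND EXTRACT(DOW FROM sale_date) = EXTRACT(DOW FROM CURRENT_DATE - INTERVAL '1 day')
), latest AS (
  SELECT row_count
  FROM daily_counts
  WHERE sale_date = CURRENT_DATE - INTERVAL '1 day'
)
SELECT l.row_count,
       b.median_count,
       l.row_count BETWEEN 0.8 * b.median_count AND 1.2 * b.median_count AS is_within_baseline
FROM latest l CROSS JOIN baseline b;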
Implementation patterns
- Query-based checks: Run SQL checks post-load and persist results in an audit table.
- Pre-load guards: Validate input files (count, schema) before ingest.
- Gating: Fail downstream models if checks are not OK; allow controlled retries/backfills.
- Partition-aware: Check the latest partition only for speed; sample older partitions periodically.
Exercises
Do these hands-on tasks. Then compare with the solutions in each exercise card below.
- Exercise 1: Compute freshness lag and a boolean is_fresh for a daily table.
- Exercise 2: Calculate null ratios for required columns and flag any that exceed thresholds.
- [Checklist] Define the time column and schedule.
- [Checklist] Decide thresholds and late data policy.
- [Checklist] Write queries that return a single-row verdict.
- [Checklist] Save results to an audit table with timestamps.
Common mistakes and self-check
- Using ingestion time instead of business event time for freshness when the business cares about event recency. Self-check: Confirm which timestamp downstream users reference.
- Ignoring weekends/holidays. Self-check: Compare weekday vs weekend baselines separately.
- Not accounting for late-arriving data. Self-check: Choose a watermark and measure on-time percentage.
- Only checking total row count. Self-check: Add key coverage and column null ratios.
- Timezone drift. Self-check: Normalize to UTC in checks.
Practical projects
- Build a daily freshness monitor that writes a status row per table into an audit table and a lightweight dashboard that shows Red/Amber/Green.
- Create a completeness suite: counts vs baseline, null ratios, and key coverage for your top 5 critical tables.
- Implement gating in your orchestrator so downstream jobs stop when is_fresh = false or null ratios exceed thresholds (see the sketch after this list).
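A minimal gating query, assuming the audit.dq_results table sketched earlier, could look like this; the orchestrator runs it before downstream jobs and stops when should_proceed is false.

-- Sketch: single-row gate over the assumed audit.dq_results table.
-- COALESCE makes the gate fail-closed: no recorded checks also blocks downstream jobs.
SELECT COALESCE(BOOL_AND(status = 'OK'), FALSE) AS should_proceed
FROM audit.dq_results
WHERE table_name = 'analytics.sales_daily'
  AND checked_at >= NOW() - INTERVAL '1 day';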
Mini challenge
Your marketing events stream shows 90% within 10 minutes during weekdays but drops to 80% on weekends due to a partner system. Propose an updated policy that protects weekday SLAs without causing constant weekend alerts. Write the policy in 2–3 bullet points.
Who this is for
- Data Engineers owning batch or streaming pipelines.
- Analytics Engineers ensuring model reliability.
- Platform Engineers adding data quality guardrails.
Prerequisites
- Comfort with SQL (GROUP BY, window functions, conditional aggregation).
- Basic understanding of batch vs streaming workflows and partitions.
- Familiarity with timezones and UTC normalization.
Learning path
- Identify critical data products and write their freshness/completeness expectations.
- Implement SQL checks for one table and store results in an audit table.
- Add thresholds based on rolling medians per weekday.
- Scale to multiple tables and add gating/alerts.
- Review and adjust thresholds monthly.
Next steps
- Automate check generation from metadata (e.g., required columns list).
- Add anomaly detection for counts and null ratios.
- Expand checks to include uniqueness and referential integrity.
Quick Test
Take the quick test below to check your understanding.