
Freshness SLAs And Monitoring

Learn Freshness SLAs And Monitoring for free with explanations, exercises, and a quick test (for MLOps Engineers).

Published: January 4, 2026 | Updated: January 4, 2026

Why this matters

As an MLOps Engineer, you ensure models get fresh, reliable features. If features are stale, real-time predictions degrade, alerts fire late, and business decisions lag. You will define freshness SLAs, instrument pipelines, and monitor lag so issues are caught before customers notice.

  • Real tasks: set SLO/SLAs for feature currency, add metrics to batch/stream pipelines, build alerts and dashboards, and create runbooks for incidents.
  • Outcomes: fewer stale predictions, predictable latency, and clear accountability with data producers.

Who this is for

  • MLOps Engineers building/operating feature stores
  • Data Engineers owning pipelines feeding features
  • ML Engineers needing dependable online/offline features

Prerequisites

  • Basic SQL and scheduling (cron, Airflow, etc.)
  • Understanding of event_time vs processing_time
  • Familiarity with feature store concepts (online/offline, TTL)

Learning path

  • Feature Data Modeling → Feature Store Operations → Freshness SLAs and Monitoring → Data Quality and Drift Monitoring → Alerting and Runbooks
  • Then: Cost vs Freshness Tuning → Reliability Engineering for Pipelines

Concept explained simply

Freshness is how up-to-date a feature value is at the time your model uses it. A freshness SLA is the maximum allowed age or delay, agreed with producers and consumers.

Mental model

Imagine a newsroom ticker. If a story is older than your agreed threshold (e.g., 15 minutes), it's no longer actionable. Your job is to keep the ticker current and ring an alarm if it falls behind.

Key terms

  • SLI (Indicator): the measurement itself (e.g., a percentile of age = now - event_time observed at serving).
  • SLO (Objective): target for the SLI (e.g., p95 age ≤ 10 minutes over 30 days).
  • SLA (Agreement): formal commitment/contract (often includes consequences).
Common timestamps to compare
  • Event-based features: now/serve_time - event_time (preferred for real-world timeliness)
  • Snapshot/batch features: ready_time - scheduled_ready_time, or now - last_update_time
  • Online store freshness: now - last_write_time (per key or aggregate)
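
As a minimal sketch, here is how those age calculations look in Python, assuming timezone-aware UTC timestamps; the variable names and values below are illustrative stand-ins for your own pipeline metadata.

  from datetime import datetime, timedelta, timezone

  def age_minutes(newer: datetime, older: datetime) -> float:
      """Difference between two timezone-aware timestamps, in minutes."""
      return (newer - older).total_seconds() / 60.0

  now = datetime.now(timezone.utc)

  # Illustrative timestamps; in practice they come from your pipeline metadata.
  event_time = now - timedelta(minutes=4)                 # when the real-world event happened
  scheduled_ready = now.replace(hour=3, minute=0, second=0, microsecond=0)
  actual_ready = scheduled_ready + timedelta(minutes=12)  # when the batch output actually landed
  last_write_time = now - timedelta(hours=2)              # last write for one online-store key

  print("event age (min):     ", age_minutes(now, event_time))             # event-based freshness
  print("batch lateness (min):", age_minutes(actual_ready, scheduled_ready))
  print("online key age (min):", age_minutes(now, last_write_time))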

How to set a Freshness SLA (step-by-step)

  1. Map decisions to freshness: What is the latest acceptable age before predictions degrade?
  2. Choose SLIs: e.g., p95(now - event_time) at serving; batch lateness in minutes; % keys fresher than TTL.
  3. Pick SLOs: e.g., p95 age ≤ 5 min (stream), daily batch completed by 03:30 UTC, ≥ 99% keys updated within 24h.
  4. Instrument: add timestamps to outputs; emit metrics per run/partition and per key bucket; record distributions (p50, p95, p99).
  5. Alert: page on hard SLA breaches; ticket on SLO trend regression; annotate planned maintenance/backfills.
  6. Runbook: include triage steps, owners, rollback/fallback, and communication template.
Runbook starter (copy/paste)
Signal: Batch job X late by >30 min
Check: Scheduler status → upstream source lag → recent code changes → infrastructure errors
Mitigate: Retry pipeline; reprocess missing partition; activate fallback features; widen cache TTL temporarily
Comms: Post status, ETA, and impacted models; annotate dashboard
Postmortem: Root cause, prevention item, SLA impact window
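
To make steps 5 and 6 concrete, here is a small sketch of an alert-severity policy. The page/ticket split and the maintenance-window suppression are assumptions to adapt to your own SLAs, not a standard rule.

  from dataclasses import dataclass
  from enum import Enum

  class Severity(Enum):
      PAGE = "page"      # hard SLA breach on a user-facing dependency
      TICKET = "ticket"  # breach on a non-critical feature, or SLO trend regression
      NONE = "none"      # within SLA, or suppressed by a maintenance annotation

  @dataclass
  class FreshnessCheck:
      lateness_min: float    # actual_ready_time - scheduled_ready_time, in minutes
      sla_min: float         # agreed maximum lateness
      user_facing: bool      # do online predictions depend on this feature?
      in_maintenance: bool   # annotated backfill or maintenance window?

  def alert_severity(check: FreshnessCheck) -> Severity:
      """Page on hard SLA breaches for user-facing features, ticket otherwise,
      and stay quiet during annotated maintenance windows."""
      if check.in_maintenance or check.lateness_min <= check.sla_min:
          return Severity.NONE
      return Severity.PAGE if check.user_facing else Severity.TICKET

  # Example: a run 41 minutes late against a 30-minute SLA on a user-facing feature.
  print(alert_severity(FreshnessCheck(41, 30, user_facing=True, in_maintenance=False)))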

Worked examples

Example 1: Daily batch feature

Feature "user_daily_spend" should be ready by 03:00 UTC. SLA: completed by 03:30.

  • Yesterday completed 03:12 → on time
  • Today at 03:41 still no partition → SLA breach

SLI: ready_time - 03:00. SLO: ≤ 30 min. Action: alert at 03:31 if missing; runbook: check upstream partition for D-1.

Example 2: Streaming feature age

Feature "clicks_last_5m" from Kafka. SLO: p99(now - event_time) ≤ 5 min.

If a broker slowdown adds 4 minutes of delay, a baseline p99 of about 3 minutes jumps to 7 minutes, breaching the SLO. You alert even if the average still looks fine.

Example 3: Online store TTL

Feature "merchant_risk_score" refreshed hourly; TTL = 24h. Guardrail: <= 1% of keys older than 24h.

If 6% keys exceed TTL after an upstream outage, you trigger a rebuild of affected segments and enable a fallback model.

Monitoring blueprint: what to measure

  • Age distribution at serving: now - event_time (p50/p95/p99)
  • Batch timeliness: actual_ready - scheduled_ready; % partitions on time
  • Online store currency: % keys within TTL; max age; age histogram
  • Upstream lag: source watermark delay; queue lag
  • Coverage: % of predictions using fallback due to staleness
  • Annotations: deployments, backfills, maintenance windows
Self-serve checks
  • Can you compute age with only your written metadata?
  • Do you track per-key or per-segment, not just global averages?
  • Are timezones and clock sync documented?

Implementation patterns

Batch (daily/hourly)

  • Emit run metadata: scheduled_time, start_time, end_time, output_partition
  • Write a control record into a "runs" table and a final checkpoint file
  • Alert on missing partition or lateness threshold
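
Here is a rough sketch of the control-record pattern, using Python's built-in sqlite3 as a stand-in for whatever metadata store you actually use; the table and column names are illustrative.

  import sqlite3
  from datetime import datetime
  from typing import Optional

  conn = sqlite3.connect("runs.db")  # stand-in for your real metadata store
  conn.execute("""
      CREATE TABLE IF NOT EXISTS runs (
          job_name TEXT,
          output_partition TEXT,
          scheduled_time TEXT,
          start_time TEXT,
          end_time TEXT
      )
  """)

  def record_run(job_name: str, partition: str,
                 scheduled: datetime, started: datetime, ended: datetime) -> None:
      """Write one control record at the end of every batch run."""
      conn.execute(
          "INSERT INTO runs VALUES (?, ?, ?, ?, ?)",
          (job_name, partition, scheduled.isoformat(), started.isoformat(), ended.isoformat()),
      )
      conn.commit()

  def lateness_minutes(job_name: str, partition: str) -> Optional[float]:
      """Lateness of a partition in minutes; None means the partition is missing (alert!)."""
      row = conn.execute(
          "SELECT scheduled_time, end_time FROM runs "
          "WHERE job_name = ? AND output_partition = ?",
          (job_name, partition),
      ).fetchone()
      if row is None:
          return None
      scheduled, ended = datetime.fromisoformat(row[0]), datetime.fromisoformat(row[1])
      return (ended - scheduled).total_seconds() / 60.0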

Streaming

  • Carry event_time through the pipeline
  • At serving, log now - event_time for a sample of requests
  • Aggregate p95/p99 over 5–15 min windows
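
One possible wiring for the serving-side age log, assuming a Prometheus-style stack with the prometheus_client Python package; the metric name, buckets, and sample rate are illustrative, and your monitoring backend computes the p95/p99 over the windows you choose.

  import random
  from datetime import datetime, timezone
  from prometheus_client import Histogram

  # Histogram buckets in minutes; pick buckets around your SLO (here, 5 min).
  SERVING_AGE_MINUTES = Histogram(
      "feature_serving_age_minutes",
      "now - event_time observed at serving, in minutes (sampled requests)",
      buckets=(0.5, 1, 2, 3, 5, 10, 15, 30, 60),
  )

  SAMPLE_RATE = 0.05  # log ~5% of requests; adjust to your traffic volume

  def log_serving_age(event_time: datetime) -> None:
      """Call from the serving path with the event_time carried through the pipeline."""
      if random.random() > SAMPLE_RATE:
          return
      age_min = (datetime.now(timezone.utc) - event_time).total_seconds() / 60.0
      SERVING_AGE_MINUTES.observe(age_min)

  # Expose the metric via prometheus_client.start_http_server(...) or your
  # service's existing /metrics endpoint, then alert on p99 over 5-15 min windows.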

Online store

  • Store last_update_time per key or maintain a side index
  • Sample keys periodically to estimate age distribution
  • TTL guardrails + targeted backfills
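
A minimal TTL-scanner sketch. It assumes you can sample keys and read a last_update_time per key from your online store; the in-memory dict below is only a stand-in for that lookup.

  import random
  from datetime import datetime, timedelta, timezone
  from typing import Dict, List

  TTL_HOURS = 24
  GUARDRAIL_PCT = 1.0  # alert if more than 1% of sampled keys exceed the TTL

  def scan_ttl(last_update_by_key: Dict[str, datetime], sample_size: int = 1000) -> float:
      """Estimate the % of keys older than the TTL from a random sample of keys."""
      now = datetime.now(timezone.utc)
      keys = list(last_update_by_key)
      sample: List[str] = random.sample(keys, min(sample_size, len(keys)))
      if not sample:
          return 0.0
      stale = sum(
          1 for k in sample
          if now - last_update_by_key[k] > timedelta(hours=TTL_HOURS)
      )
      return 100.0 * stale / len(sample)

  # Illustrative data: 3 of 100 keys were last written 30 hours ago.
  now = datetime.now(timezone.utc)
  store = {f"key_{i}": now - timedelta(hours=30 if i < 3 else 2) for i in range(100)}

  stale_pct = scan_ttl(store)
  print(f"{stale_pct:.1f}% of sampled keys older than TTL")
  if stale_pct > GUARDRAIL_PCT:
      print("Guardrail breached: trigger targeted backfill / fallback")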

Exercises (do these now)

Try the tasks below. A checklist follows to confirm your work.

Exercise 1: Batch timeliness SLA

You have a daily job with scheduled_ready_time 03:00 UTC and SLA 30 minutes. Data for the last 3 days:

run_date      scheduled_ready_time   actual_ready_time
2026-01-01    03:00                  03:12
2026-01-02    03:00                  03:28
2026-01-03    03:00                  03:41

Task: compute lateness (minutes) and flag SLA breach per day. Decide alert severity for 2026-01-03.

Show solution

Lateness: 12, 28, 41 minutes. SLA 30 min → breaches on 2026-01-03 only. Severity: page if user-facing predictions depend on it; otherwise create an incident, trigger retry, and communicate ETA.
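
If you want to check the arithmetic in code, here is a quick sketch with the times from the table hard-coded.

  from datetime import datetime

  SLA_MIN = 30
  runs = {                      # run_date -> (scheduled_ready_time, actual_ready_time)
      "2026-01-01": ("03:00", "03:12"),
      "2026-01-02": ("03:00", "03:28"),
      "2026-01-03": ("03:00", "03:41"),
  }

  for day, (scheduled, actual) in runs.items():
      s = datetime.strptime(f"{day} {scheduled}", "%Y-%m-%d %H:%M")
      a = datetime.strptime(f"{day} {actual}", "%Y-%m-%d %H:%M")
      lateness = (a - s).total_seconds() / 60
      print(day, f"lateness={lateness:.0f} min", "BREACH" if lateness > SLA_MIN else "on time")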

Exercise 2: Streaming age percentiles

Sample serving logs (now - event_time in minutes): [1, 2, 2, 3, 3, 4, 5, 7, 8, 10]. SLO: p99 ≤ 5 min.

Task: estimate p95 and p99; determine SLO status.

Show solution

The values are already sorted. With 10 samples, p95 falls between the 9th and 10th values (8 and 10), so roughly 9-10 depending on the method; p99 is effectively the maximum, 10. p99 is well above the 5-minute target → SLO violated (p95 is far above it too, pointing to a systemic issue). Investigate source lag or consumer slowdown.
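
A quick sanity check using the nearest-rank method; note that different tools interpolate percentiles differently, so your numbers may vary slightly.

  import math

  ages = [1, 2, 2, 3, 3, 4, 5, 7, 8, 10]  # minutes, already sorted

  def nearest_rank_percentile(sorted_values, p):
      """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
      rank = math.ceil(p / 100 * len(sorted_values))
      return sorted_values[rank - 1]

  print("p95:", nearest_rank_percentile(ages, 95))                     # 10
  print("p99:", nearest_rank_percentile(ages, 99))                     # 10
  print("SLO (p99 <= 5) met:", nearest_rank_percentile(ages, 99) <= 5)  # False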

Self-check checklist

  • I can compute lateness from run metadata
  • I can estimate p95/p99 age from serving logs
  • I know when to page vs create a ticket
  • I can explain event_time vs processing_time to a teammate

Common mistakes and how to self-check

  • Using processing_time instead of event_time → underestimates age during upstream delays. Self-check: compare both on a sample; differences > 1–2 minutes in stream suggest issues.
  • Only tracking averages → hides tail problems. Self-check: add p95/p99 and max.
  • No per-key segmentation → hotspots hidden. Self-check: stratify by tenant/region/feature group.
  • Timezone/clock drift → false alerts. Self-check: UTC everywhere; NTP enabled; log clock skew.
  • Alert spam during planned backfills → add maintenance annotations and temporary threshold overrides.
  • Conflating training and serving freshness → measure both; don’t assume one implies the other.

Practical projects

  • Build a batch freshness monitor: a small table of runs with a daily email summary and a simple HTML dashboard.
  • Implement a streaming age logger at serving and export p95/p99 metrics.
  • Create an online-store TTL scanner that samples keys hourly and reports % older than TTL.
  • Write a one-page runbook for your most critical feature pipeline.

Quick Test

Take the quick test below to check mastery.

Next steps

  • Add freshness annotations to your dashboards (deploys, backfills)
  • Align SLOs with model owners and data producers
  • Automate postmortems and weekly SLO reviews

Mini challenge

Your streaming feature meets p95 ≤ 3 min but violates p99 during peak hours. In a short plan, propose: which metrics to add, what alert to change, and one mitigation to reduce tail latency. Keep it to 5 bullet points.

Practice Exercises

2 exercises to complete

Instructions

You run a daily batch job. SLA: completed within 30 minutes after 03:00 UTC. Given data:

run_date      scheduled_ready_time   actual_ready_time
2026-01-01    03:00                  03:12
2026-01-02    03:00                  03:28
2026-01-03    03:00                  03:41
  • Compute lateness in minutes per day
  • Flag SLA breach per day
  • Decide alert severity for 2026-01-03 and the first runbook step
Expected Output
Lateness: [12, 28, 41]. Breach on 2026-01-03 only. Severity: page if user-facing; first step: check upstream partition and scheduler logs.

Freshness SLAs And Monitoring — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

