Why this matters
As an MLOps Engineer, you ensure models get fresh, reliable features. If features are stale, real-time predictions degrade, alerts fire late, and business decisions lag. You will define freshness SLAs, instrument pipelines, and monitor lag so issues are caught before customers notice.
- Real tasks: set SLO/SLAs for feature currency, add metrics to batch/stream pipelines, build alerts and dashboards, and create runbooks for incidents.
- Outcomes: fewer stale predictions, predictable latency, and clear accountability with data producers.
Who this is for
- MLOps Engineers building/operating feature stores
- Data Engineers owning pipelines feeding features
- ML Engineers needing dependable online/offline features
Prerequisites
- Basic SQL and scheduling (cron, Airflow, etc.)
- Understanding of event_time vs processing_time
- Familiarity with feature store concepts (online/offline, TTL)
Learning path
- Feature Data Modeling → Feature Store Operations → Freshness SLAs and Monitoring → Data Quality and Drift Monitoring → Alerting and Runbooks
- Then: Cost vs Freshness Tuning → Reliability Engineering for Pipelines
Concept explained simply
Freshness is how up-to-date a feature value is at the time your model uses it. A freshness SLA is the maximum allowed age or delay, agreed with producers and consumers.
Mental model
Imagine a newsroom ticker. If a headline is older than your agreed threshold (e.g., 15 minutes), it's no longer actionable. Your job is to keep the ticker current and ring an alarm when it falls behind.
Key terms
- SLI (Indicator): the measurement itself (e.g., a percentile of age = now - event_time at serving).
- SLO (Objective): target for the SLI (e.g., p95 age ≤ 10 minutes over 30 days).
- SLA (Agreement): formal commitment/contract (often includes consequences).
Common timestamps to compare
- Event-based features: now/serve_time - event_time (preferred for real-world timeliness)
- Snapshot/batch features: ready_time - scheduled_ready_time, or now - last_update_time
- Online store freshness: now - last_write_time (per key or aggregate)
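A minimal sketch of these three measurements, using illustrative timestamps and helper names (nothing here is a fixed API):

```python
from datetime import datetime, timezone

def age_minutes(now: datetime, ts: datetime) -> float:
    """Age of a timestamp in minutes; keep everything UTC-aware to avoid timezone surprises."""
    return (now - ts).total_seconds() / 60.0

# Illustrative timestamps only; in practice these come from your pipeline metadata.
now = datetime(2026, 1, 3, 12, 0, tzinfo=timezone.utc)
event_time = datetime(2026, 1, 3, 11, 52, tzinfo=timezone.utc)           # event-based feature
scheduled_ready_time = datetime(2026, 1, 3, 3, 0, tzinfo=timezone.utc)   # batch schedule
ready_time = datetime(2026, 1, 3, 3, 41, tzinfo=timezone.utc)            # actual batch completion
last_write_time = datetime(2026, 1, 3, 10, 30, tzinfo=timezone.utc)      # online store write

print("event age (min):", age_minutes(now, event_time))                  # now - event_time
print("batch lateness (min):", age_minutes(ready_time, scheduled_ready_time))
print("online store age (min):", age_minutes(now, last_write_time))      # now - last_write_time
```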
How to set a Freshness SLA (step-by-step)
- Map decisions to freshness: What is the latest acceptable age before predictions degrade?
- Choose SLIs: e.g., p95(now - event_time) at serving; batch lateness in minutes; % keys fresher than TTL.
- Pick SLOs: e.g., p95 age ≤ 5 min (stream), daily batch completed by 03:30 UTC, ≥ 99% keys updated within 24h.
- Instrument: add timestamps to outputs; emit metrics per run/partition and per key bucket; record distributions (p50, p95, p99).
- Alert: page on hard SLA breaches; ticket on SLO trend regression; annotate planned maintenance/backfills (see the sketch after this list).
- Runbook: include triage steps, owners, rollback/fallback, and communication template.
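Here is a small sketch of how the "pick SLOs" and "alert" steps can be encoded. The thresholds, metric names, and page/ticket split are illustrative assumptions, not recommendations:

```python
# Illustrative SLO thresholds; values and metric names are examples only.
SLOS = {
    "stream_p95_age_min": 5.0,       # p95(now - event_time) at serving
    "batch_lateness_min": 30.0,      # actual_ready - scheduled_ready
    "keys_within_24h_pct": 99.0,     # % of online-store keys updated within 24h
}

def evaluate(slis: dict) -> list[tuple[str, str]]:
    """Return (metric, severity) pairs: 'page' on a hard breach, 'ticket' when trending toward one."""
    findings = []
    for metric, target in SLOS.items():
        value = slis.get(metric)
        if value is None:
            findings.append((metric, "ticket"))      # a missing signal is itself a problem
        elif metric.endswith("_pct"):
            if value < target:
                findings.append((metric, "page"))
        elif value > target:
            findings.append((metric, "page"))
        elif value > 0.8 * target:                   # approaching the threshold
            findings.append((metric, "ticket"))
    return findings

print(evaluate({"stream_p95_age_min": 7.2, "batch_lateness_min": 27, "keys_within_24h_pct": 99.4}))
# -> [('stream_p95_age_min', 'page'), ('batch_lateness_min', 'ticket')]
```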
Runbook starter (copy/paste)
- Signal: Batch job X late by >30 min
- Check: Scheduler status → upstream source lag → recent code changes → infrastructure errors
- Mitigate: Retry pipeline; reprocess missing partition; activate fallback features; widen cache TTL temporarily
- Comms: Post status, ETA, and impacted models; annotate dashboard
- Postmortem: Root cause, prevention item, SLA impact window
Worked examples
Example 1: Daily batch feature
Feature "user_daily_spend" should be ready by 03:00 UTC. SLA: completed by 03:30.
- Yesterday completed 03:12 → on time
- Today at 03:41 still no partition → SLA breach
SLI: ready_time - 03:00. SLO: ≤ 30 min. Action: alert at 03:31 if missing; runbook: check upstream partition for D-1.
Example 2: Streaming feature age
Feature "clicks_last_5m" from Kafka. SLO: p99(now - event_time) ≤ 5 min.
If a broker slowdown adds 4 minutes, the p99 jumps to 7 min. You alert even if the average looks fine.
Example 3: Online store TTL
Feature "merchant_risk_score" refreshed hourly; TTL = 24h. Guardrail: <= 1% of keys older than 24h.
If 6% keys exceed TTL after an upstream outage, you trigger a rebuild of affected segments and enable a fallback model.
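A quick guardrail check for this example; the sampled ages and thresholds below are made up for illustration:

```python
TTL_HOURS = 24
GUARDRAIL_PCT = 1.0   # at most 1% of keys may exceed the TTL

key_ages_hours = [0.5, 2, 3, 5, 12, 26, 30, 1, 7, 25]   # illustrative sample from the online store
stale_pct = 100.0 * sum(age > TTL_HOURS for age in key_ages_hours) / len(key_ages_hours)

if stale_pct > GUARDRAIL_PCT:
    print(f"TTL guardrail breached: {stale_pct:.1f}% of sampled keys older than {TTL_HOURS}h")
    # Next actions per the example: rebuild affected segments, enable the fallback model.
```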
Monitoring blueprint: what to measure
- Age distribution at serving: now - event_time (p50/p95/p99)
- Batch timeliness: actual_ready - scheduled_ready; % partitions on time
- Online store currency: % keys within TTL; max age; age histogram
- Upstream lag: source watermark delay; queue lag
- Coverage: % of predictions using fallback due to staleness
- Annotations: deployments, backfills, maintenance windows
Self-serve checks
- Can you compute age using only the metadata your pipelines already write?
- Do you track per-key or per-segment, not just global averages?
- Are timezones and clock sync documented?
Implementation patterns
Batch (daily/hourly)
- Emit run metadata: scheduled_time, start_time, end_time, output_partition
- Write a control record into a "runs" table and a final checkpoint file
- Alert on missing partition or lateness threshold
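One way to sketch the batch pattern: write a control record per run, then check the latest partition against the SLA. The storage (an in-memory list standing in for a runs table) and field names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RunRecord:
    pipeline: str
    scheduled_time: datetime
    start_time: datetime
    end_time: datetime
    output_partition: str

runs_table: list[RunRecord] = []   # in practice: a table in your warehouse plus a checkpoint file

def lateness_minutes(run: RunRecord) -> float:
    return (run.end_time - run.scheduled_time).total_seconds() / 60.0

def check_partition(pipeline: str, partition: str, sla_minutes: float) -> str:
    """Alert on a missing partition or on lateness beyond the SLA threshold."""
    run = next((r for r in runs_table
                if r.pipeline == pipeline and r.output_partition == partition), None)
    if run is None:
        return "ALERT: partition missing"
    if lateness_minutes(run) > sla_minutes:
        return f"ALERT: late by {lateness_minutes(run):.0f} min"
    return "OK"

runs_table.append(RunRecord(
    pipeline="user_daily_spend",
    scheduled_time=datetime(2026, 1, 3, 3, 0, tzinfo=timezone.utc),
    start_time=datetime(2026, 1, 3, 2, 55, tzinfo=timezone.utc),
    end_time=datetime(2026, 1, 3, 3, 41, tzinfo=timezone.utc),
    output_partition="2026-01-02",
))
print(check_partition("user_daily_spend", "2026-01-02", sla_minutes=30))   # -> ALERT: late by 41 min
```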
Streaming
- Carry event_time through the pipeline
- At serving, log now - event_time for a sample of requests
- Aggregate p95/p99 over 5–15 min windows
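A sketch of the streaming pattern: log the age of the event behind each served request, then summarize percentiles over a window. Request sampling, window resets, and export to a metrics backend are left out, and names are illustrative:

```python
from datetime import datetime, timedelta, timezone
from statistics import quantiles

window_ages_min: list[float] = []   # reset every 5-15 minutes in a real aggregator

def log_serving_age(event_time: datetime) -> None:
    """Record now - event_time (in minutes) for a served request."""
    now = datetime.now(timezone.utc)
    window_ages_min.append((now - event_time).total_seconds() / 60.0)

def window_percentiles() -> dict:
    """Summarize the current window; returns {} until there are enough samples."""
    if len(window_ages_min) < 2:
        return {}
    cuts = quantiles(window_ages_min, n=100, method="inclusive")   # cuts[i] ~ the (i+1)th percentile
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

for lag in (1, 2, 4, 7):   # illustrative event lags in minutes
    log_serving_age(datetime.now(timezone.utc) - timedelta(minutes=lag))
print(window_percentiles())
```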
Online store
- Store last_update_time per key or maintain a side index
- Sample keys periodically to estimate age distribution
- TTL guardrails + targeted backfills
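And a sketch of the online-store pattern: sample keys, estimate the age distribution, and compare against the TTL. The dict of key → last_update_time is a stand-in for whatever key-value store you actually use:

```python
import random
from datetime import datetime, timedelta, timezone

def scan_ttl(store: dict, ttl: timedelta, sample_size: int = 1000) -> dict:
    """Estimate online-store freshness from a random key sample."""
    if not store:
        return {}
    now = datetime.now(timezone.utc)
    keys = random.sample(list(store), min(sample_size, len(store)))
    ages_h = sorted((now - store[k]).total_seconds() / 3600.0 for k in keys)
    over_ttl = sum(a > ttl.total_seconds() / 3600.0 for a in ages_h)
    return {
        "pct_over_ttl": 100.0 * over_ttl / len(ages_h),
        "max_age_h": round(ages_h[-1], 1),
        "p95_age_h": round(ages_h[int(0.95 * (len(ages_h) - 1))], 1),
    }

# Illustrative store contents: key -> last_update_time
store = {f"merchant_{i}": datetime.now(timezone.utc) - timedelta(hours=i) for i in range(48)}
print(scan_ttl(store, ttl=timedelta(hours=24)))
```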
Exercises (do these now)
Try the tasks below. A checklist follows to confirm your work.
Exercise 1: Batch timeliness SLA
You have a daily job with scheduled_ready_time 03:00 UTC and SLA 30 minutes. Data for the last 3 days:
| run_date | scheduled_ready_time | actual_ready_time |
|---|---|---|
| 2026-01-01 | 03:00 | 03:12 |
| 2026-01-02 | 03:00 | 03:28 |
| 2026-01-03 | 03:00 | 03:41 |
Task: compute lateness (minutes) and flag SLA breach per day. Decide alert severity for 2026-01-03.
Solution
Lateness: 12, 28, 41 minutes. With a 30-minute SLA, only 2026-01-03 is a breach. Severity: page if user-facing predictions depend on it; otherwise open an incident, trigger a retry, and communicate an ETA.
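If you want to verify the arithmetic, a short sketch suffices (times are the ones from the table above):

```python
from datetime import datetime

SLA_MIN = 30
scheduled = "03:00"
runs = {"2026-01-01": "03:12", "2026-01-02": "03:28", "2026-01-03": "03:41"}

for day, actual in runs.items():
    lateness = (datetime.strptime(actual, "%H:%M") - datetime.strptime(scheduled, "%H:%M")).seconds // 60
    print(day, f"lateness={lateness} min,", "SLA breach" if lateness > SLA_MIN else "on time")
```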
Exercise 2: Streaming age percentiles
Sample serving logs (now - event_time in minutes): [1, 2, 2, 3, 3, 4, 5, 7, 8, 10]. SLO: p99 ≤ 5 min.
Task: estimate p95 and p99; determine SLO status.
Solution
With the 10 values already sorted, p95 falls between the 9th and 10th values (8 and 10 min), roughly 9 min; p99 sits near the max, roughly 10 min. Both far exceed 5 min, so the p99 SLO is violated. Investigate source lag or consumer slowdown.
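The same estimate with the standard library, using linear interpolation between the sorted values:

```python
from statistics import quantiles

ages_min = [1, 2, 2, 3, 3, 4, 5, 7, 8, 10]
cuts = quantiles(ages_min, n=100, method="inclusive")   # linear interpolation between sorted values
p95, p99 = cuts[94], cuts[98]
print(f"p95={p95:.1f} min, p99={p99:.1f} min, p99 SLO (<= 5 min) met: {p99 <= 5}")
# -> p95=9.1 min, p99=9.8 min, p99 SLO (<= 5 min) met: False
```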
Self-check checklist
- I can compute lateness from run metadata
- I can estimate p95/p99 age from serving logs
- I know when to page vs create a ticket
- I can explain event_time vs processing_time to a teammate
Common mistakes and how to self-check
- Using processing_time instead of event_time → underestimates age during upstream delays. Self-check: compare both on a sample; differences > 1–2 minutes in a streaming pipeline suggest issues.
- Only tracking averages → hides tail problems. Self-check: add p95/p99 and max.
- No per-key segmentation → hotspots hidden. Self-check: stratify by tenant/region/feature group.
- Timezone/clock drift → false alerts. Self-check: UTC everywhere; NTP enabled; log clock skew.
- Alert spam during planned backfills → add maintenance annotations and temporary threshold overrides.
- Conflating training and serving freshness → measure both; don’t assume one implies the other.
Practical projects
- Build a batch freshness monitor: a small table of runs with a daily email summary and a simple HTML dashboard.
- Implement a streaming age logger at serving and export p95/p99 metrics.
- Create an online-store TTL scanner that samples keys hourly and reports % older than TTL.
- Write a one-page runbook for your most critical feature pipeline.
Quick Test
Take the quick test below to check your mastery.
Next steps
- Add freshness annotations to your dashboards (deploys, backfills)
- Align SLOs with model owners and data producers
- Automate postmortems and weekly SLO reviews
Mini challenge
Your streaming feature meets p95 ≤ 3 min but violates p99 during peak hours. In a short plan, propose: which metrics to add, what alert to change, and one mitigation to reduce tail latency. Keep it to 5 bullet points.