Why this matters
In production ML, features lose value as they age. Feature stores help you compute and serve features, but you must define how fresh those features must be and guarantee it via SLAs (service level agreements). Poor freshness can tank model performance, cause bad user experiences, and break real-time decisions.
- Fraud detection: velocity features must reflect the last seconds of activity.
- Recommendations: inventory and user activity features must update within minutes.
- Forecasting: daily aggregates need consistent end-of-day arrival times.
- A/B tests: comparable freshness across variants avoids biased evaluations.
Concept explained simply
Feature freshness is how up-to-date a feature value is at the time you read it. A simple way to think about it:
Freshness at read time = now_at_read - event_timestamp_used_to_compute_feature.
SLA (Service Level Agreement) is the promise you make about freshness and availability. You usually track it through SLOs (objectives, the internal targets) and SLIs (indicators, the metrics you measure). Example: P99 freshness of 5 s for the transaction velocity feature during business hours.
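A minimal sketch of both definitions in code, assuming each feature value carries the event timestamp it was computed from (field names are illustrative, not a specific feature-store schema):

```python
from datetime import datetime, timezone

# Illustrative feature value as read from an online store (field names are assumptions).
feature = {
    "name": "txn_velocity_60s",
    "value": 7,
    "event_time": datetime(2024, 5, 1, 12, 0, 3, tzinfo=timezone.utc),
}

SLO_P99_FRESHNESS_S = 5.0  # target: 99% of reads see data no older than 5 seconds

# Freshness at read time = now_at_read - event_timestamp_used_to_compute_feature
age_s = (datetime.now(timezone.utc) - feature["event_time"]).total_seconds()
breach = age_s > SLO_P99_FRESHNESS_S  # a single read; the SLO itself is judged on the P99 over a window
```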
Mental model: the freshness pipeline
Imagine a conveyor belt with three delays:
- Compute delay: time to aggregate or transform events.
- Transport delay: time to write into the feature store/serve online.
- Read delay: caching/serving time until the model reads it.
Your freshness budget is how much total delay you can tolerate. If your decision must be made within 300 ms, model scoring takes 120 ms, and the network takes 80 ms, you have ~100 ms left for feature availability or a fallback strategy.
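The same budget arithmetic as a tiny sketch:

```python
# Freshness budget = end-to-end deadline minus everything else on the critical path.
decision_deadline_ms = 300
model_scoring_ms = 120
network_ms = 80

feature_budget_ms = decision_deadline_ms - model_scoring_ms - network_ms
print(feature_budget_ms)  # 100 -> roughly 100 ms left for feature availability or a fallback
```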
Key terms and practical definitions
- Event time: when something actually happened (e.g., transaction time). Use this for accuracy.
- Ingestion time: when data arrived in your system. Can be later than event time.
- Freshness window: maximum allowed age for feature values at read time.
- TTL (time-to-live): how long a feature value is served before it is considered stale and purged or replaced.
- Point-in-time correctness: training features must be built using only data available at that historical moment (no leakage); a minimal sketch follows this list.
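To make the last term concrete, here is a minimal point-in-time join sketch in pandas (column names and data are illustrative): each label row only picks up the newest feature value computed at or before its own event time, so no future data leaks into training.

```python
import pandas as pd

# Illustrative label events and feature snapshots, each with its own timestamp.
labels = pd.DataFrame({
    "card_id": ["A", "A", "B"],
    "event_time": pd.to_datetime(["2024-05-01 12:00:05", "2024-05-01 12:10:00", "2024-05-01 12:02:00"]),
    "is_fraud": [0, 1, 0],
}).sort_values("event_time")

features = pd.DataFrame({
    "card_id": ["A", "A", "B"],
    "feature_time": pd.to_datetime(["2024-05-01 11:59:00", "2024-05-01 12:05:00", "2024-05-01 12:01:00"]),
    "txn_count_60s": [2, 5, 1],
}).sort_values("feature_time")

# Point-in-time (as-of) join: for each label, take the newest feature row
# whose feature_time is <= the label's event_time. No future data leaks in.
training_set = pd.merge_asof(
    labels, features,
    left_on="event_time", right_on="feature_time",
    by="card_id", direction="backward",
)
```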
Worked examples
Example 1: Real-time fraud detection
Feature: number_of_transactions_last_60s per card.
- Requirement: decisions within 300 ms of the swipe.
- Freshness target: at read, feature reflects all events up to 2 seconds ago.
- SLA: 99% of reads see data no older than 2s; 99.9% no older than 5s. Availability 99.9%.
- Design: streaming aggregation with watermarks; online store TTL 2 minutes; fallback to last known value if freshness > 5s.
Why it works: streaming keeps latency low; TTL prevents stale buildup; fallback preserves continuity during spikes.
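A compact sketch of the streaming aggregation under these rules; the in-memory state and lateness handling are stand-ins for a real stream processor, not a specific feature-store API.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(seconds=60)               # number_of_transactions_last_60s
ALLOWED_LATENESS = timedelta(seconds=5)      # watermark: older events are left to batch backfill

events_by_card: dict = defaultdict(deque)    # per-card event times (assumes roughly in-order arrival)

def on_transaction(card_id: str, event_time: datetime) -> int:
    """Update the 60s transaction counter for one card and return the new value."""
    now = datetime.now(timezone.utc)
    if now - event_time > ALLOWED_LATENESS:
        return len(events_by_card[card_id])  # too late for the online path; handled by backfill

    window = events_by_card[card_id]
    window.append(event_time)
    cutoff = event_time - WINDOW
    while window and window[0] < cutoff:     # evict events that fell out of the 60s window
        window.popleft()
    return len(window)
```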
Example 2: Daily churn model
Feature: sessions_last_7_days, updated once per hour.
- Requirement: batch scoring nightly at 01:00.
- Freshness target: by 00:30, all of the previous day's events are included.
- SLA: by 00:30, P99 of features reflect data through 23:59:59 of the previous day; late events are backfilled by 04:00.
- Design: hourly micro-batches + late-arrival backfill; offline store partitions by date; training uses point-in-time semantics.
Outcome: Consistent training-serving behavior without strict sub-second needs.
Example 3: Recommendations CTR features
Feature: rolling_ctr_15m.
- Requirement: homepage loads in 200 ms.
- Freshness target: feature includes clicks/impressions up to 5 minutes ago at P95.
- SLA: P95 freshness 5m, P99 10m; availability 99.95%.
- Design: incremental window updates via stream; online store with per-key TTL 30m; serve cached features if newer than 10m; otherwise degrade to category-level CTR.
Trade-off: tighter freshness raises infra cost; fallback keeps UX stable during spikes.
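A hedged sketch of the serve-time degrade path from this design; the store layout and logging are assumptions, not a particular serving framework.

```python
from datetime import datetime, timedelta, timezone

MAX_ITEM_CTR_AGE = timedelta(minutes=10)     # beyond this, degrade to category-level CTR

def serve_ctr(item_store: dict, category_store: dict, item_id: str, category: str) -> float:
    """Serve rolling_ctr_15m when fresh enough; otherwise fall back to category_ctr and log the swap."""
    now = datetime.now(timezone.utc)
    record = item_store.get(item_id)         # e.g. {"ctr": 0.031, "event_time": ...}
    if record is not None and now - record["event_time"] <= MAX_ITEM_CTR_AGE:
        return record["ctr"]
    print(f"freshness breach for {item_id}: serving category-level CTR")  # feeds the staleness-rate SLI
    return category_store.get(category, {"ctr": 0.0})["ctr"]
```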
Designing SLAs in 5 steps
Step 1 Define decision need
What decision uses this feature and how quickly must it react?
Step 2 Set a freshness budget
Split your total latency among model scoring, networking, and feature readiness.
Step 3 Choose SLIs
Percentile freshness at read (P50/P95/P99), error rate, availability, and staleness rate (the share of reads exceeding the freshness window).
Step 4 Write SLOs
Example: P99 freshness 5s over rolling 30d; staleness rate < 0.5%.
Step 5 Plan fallbacks and alerts
Define TTLs, backfills, fallback features, and paging thresholds.
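The output of the five steps can live in a small, reviewable spec next to the feature definition; the schema below is illustrative, not a standard.

```python
# Illustrative SLA spec for one feature, mirroring steps 1-5 above.
txn_velocity_sla = {
    "feature": "number_of_transactions_last_60s",
    "decision": "approve or decline a card swipe within 300 ms",            # step 1
    "freshness_budget_ms": 100,                                             # step 2
    "slis": ["freshness_age_p50", "freshness_age_p95", "freshness_age_p99",
             "staleness_rate", "availability"],                             # step 3
    "slos": {                                                               # step 4, rolling 30d
        "freshness_age_p99_s": 5.0,
        "staleness_rate_max": 0.005,
        "availability_min": 0.999,
    },
    "operations": {                                                         # step 5
        "ttl": "2m",
        "fallback": "serve last known value",
        "page_if": "staleness_rate > 2% for 10 minutes",
    },
}
```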
Measuring freshness
Track the timestamp used to compute each feature value (event_time or watermark_time). On read, compute freshness_age = now() - feature_timestamp. Emit this as a metric with percentiles.
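A minimal sketch of that measurement; the metric sink here is just a list, standing in for whatever histogram or percentile metric your monitoring stack provides.

```python
import math
from datetime import datetime, timezone

freshness_samples: list[float] = []  # stand-in for a histogram metric

def record_freshness(feature_timestamp: datetime) -> float:
    """Compute freshness_age at read time and record it for percentile aggregation."""
    age_s = (datetime.now(timezone.utc) - feature_timestamp).total_seconds()
    freshness_samples.append(age_s)
    return age_s

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile over collected freshness ages (e.g. q=0.99 for P99)."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]

# Usage sketch: after enough reads, compare the SLI to the SLO.
# p99 = percentile(freshness_samples, 0.99); breach = p99 > 5.0
```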
Implementation tips (tech-agnostic)
- Include both event_time and produced_at in feature values for diagnostics.
- Use watermarks for late data; update features idempotently.
- Store last_update_time per key and expose it in online reads for optional client-side guards (see the sketch after this list).
- Backfill jobs should not break point-in-time correctness for training.
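A sketch of the write path these tips imply; the record shape and idempotency check are assumptions about your store, not a specific feature-store API.

```python
from datetime import datetime, timezone

def upsert_feature(store: dict, key: str, value: float, event_time: datetime) -> None:
    """Per-key upsert that keeps both timestamps and ignores out-of-order (older) writes."""
    now = datetime.now(timezone.utc)
    existing = store.get(key)
    if existing is not None and event_time < existing["event_time"]:
        return                                # late, older write: keep the newer value in place
    store[key] = {
        "value": value,
        "event_time": event_time,             # what freshness is measured against (business correctness)
        "produced_at": now,                   # pipeline diagnostics: compute/transport delay
        "last_update_time": now,              # exposed on reads for optional client-side guards
    }
```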
Common mistakes and self-check
- Mistake: Measuring freshness from ingestion_time only. Self-check: verify you use event_time for business correctness.
- Mistake: Using averages instead of percentiles. Self-check: ensure P95/P99 are tracked; spikes hide in averages.
- Mistake: No fallback when freshness breached. Self-check: define and test a degraded but safe feature set.
- Mistake: TTL too long. Self-check: simulate an incident and confirm stale data is not served indefinitely.
- Mistake: Training-serving skew from late arrivals. Self-check: enforce point-in-time joins for training and re-materialize training sets after backfills.
Practical projects
- Build a streaming counter feature with per-key freshness metrics and a dashboard showing P50/P95/P99.
- Create a batch aggregate (daily revenue by merchant) with a backfill job and a freshness SLA (delivery by 01:00 at P99).
- Implement a fallback: when rolling_ctr_15m is older than 10m, serve category_ctr_1h and log the swap.
Exercises
Practice setting and validating freshness SLAs on two features you own. Use the checklist below as you work.
- State the decision need and latency budget.
- Write SLIs (what you measure).
- Set SLO targets (percentiles and thresholds).
- Define TTL, backfill, and fallback.
- Describe your alerting and on-call boundaries.
Mini challenge
You manage a price_sensitivity_score feature updated from clickstream and purchases. Pages must load in 250 ms. Streaming updates add ~150 ms at P95; model scoring is 70 ms P95; network is 30 ms P95.
- What freshness window can you afford at P95?
- Draft a one-line SLO for P99 freshness.
- Suggest a fallback if the freshness window is breached.
Suggested direction (spoiler)
Budget ~250 - 70 - 30 = 150 ms for data readiness at P95, which the ~150 ms streaming update path just fits. Example SLO: P99 freshness of 500 ms over a rolling 30 days. Fallback: serve the previous score if it was updated within the last 10 minutes; otherwise fall back to a cohort average.
Who this is for
- Machine Learning Engineers owning online/offline features.
- Data/Platform Engineers supporting feature pipelines.
- Applied Scientists who need reliable real-time signals.
Prerequisites
- Basic ML pipeline knowledge (training vs serving).
- Comfort with streaming or batch data processing concepts.
- Understanding of percentiles and latency metrics.
Learning path
- Understand the feature lifecycle (ingest → compute → store → serve).
- Define SLIs/SLOs for freshness and availability.
- Implement measurement and dashboards.
- Add TTL, backfill, and fallback mechanisms.
- Run incident drills and tune thresholds.
Next steps
- Instrument your current top-3 features with freshness timestamps.
- Propose SLAs with P95/P99 targets and review with stakeholders.
- Add a simple fallback and test it in staging with injected delays.