Why this matters
Online feature serving is the heartbeat of real-time ML. If it is slow or flaky, your model predictions are delayed, timeouts rise, and user experience suffers. As an MLOps Engineer, you will be asked to keep latency low (especially p95/p99), keep throughput high, and ensure freshness of features. Daily tasks include:
- Designing and meeting latency SLOs for feature retrieval (for example, p95 < 20 ms).
- Choosing cache strategies (TTL, warmup) and connection pooling to avoid cold starts.
- Batching multi-key reads to prevent N+1 queries.
- Setting timeout budgets across services (auth, feature store, model) to limit retries and tail latency.
- Monitoring hit ratios, QPS, error rates, and autoscaling to handle traffic spikes.
Who this is for
- MLOps Engineers running or integrating an online feature store.
- Data/Platform Engineers helping serve features to model APIs.
- ML Engineers optimizing end-to-end inference latency.
Prerequisites
- Basic networking (latency, bandwidth, TCP connections).
- Understanding of key-value stores and caches.
- Familiarity with latency metrics (p50, p95, p99) and SLOs.
Concept explained simply
Online serving is a fast lookup: given entity keys (like user_id), return the latest features. Speed matters more than perfect completeness. Most systems use a layered approach:
- Client calls feature service with keys.
- Service checks in-memory/nearby cache. If hit: return quickly.
- If miss: fetch from online store (often a low-latency KV store), maybe apply light transformations, then return and backfill the cache.
Key goals: low tail latency, high cache hit ratio, stable throughput, and correct freshness.
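A minimal sketch of this read path, assuming generic `cache` and `store` clients with batch-read methods (`get_many`, `multi_get`, and `set_many` are illustrative names, not a specific feature-store API):

```python
# Cache-aside lookup for online features (illustrative sketch).
# `cache` and `store` are assumed to expose dict-returning batch reads;
# swap in your actual Redis / online-store clients.
from typing import Dict, Iterable

def get_online_features(keys: Iterable[str], cache, store, ttl_s: int = 60) -> Dict[str, dict]:
    keys = list(keys)
    features: Dict[str, dict] = {}

    # 1) Check the near cache first; hits return quickly.
    cached = cache.get_many(keys)              # assumed API: {key: value or None}
    features.update({k: v for k, v in cached.items() if v is not None})

    # 2) On miss, batch-read the online store and backfill the cache.
    misses = [k for k in keys if cached.get(k) is None]
    if misses:
        fetched = store.multi_get(misses)      # assumed batch API
        features.update(fetched)
        cache.set_many(fetched, ttl=ttl_s)     # backfill so the next request hits

    return features
```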
Mental model
Think in terms of a fixed latency budget. Every step spends part of it: network hops, cache lookups, store reads, serialization, and any transformation. You optimize by:
- Reducing hops (co-locate services, use multi-get instead of many single gets).
- Reducing work per hop (efficient serialization such as Protocol Buffers over gRPC, lean schemas).
- Avoiding redundant work (warm caches, batch requests).
- Protecting the system (timeouts, backoff, circuit breakers, fallbacks).
Key metrics and budgets
- Latency percentiles: p50 (typical), p95/p99 (tail). Optimize for tail.
- Throughput/QPS: sustained and peak requests per second.
- Error rate/timeouts: should be near zero at your SLO.
- Cache hit ratio: higher hit ratio reduces average and tail latency.
- Resource usage: CPU, memory, connection pool saturation.
- Budgeting: split end-to-end latency SLO across components. Keep a safety buffer (for example 10–20%).
Example latency budget template
- End-to-end SLO p95: 100 ms
- Client-network budget: 10–15 ms
- Auth/checks: 5–10 ms
- Feature service: 25–35 ms
- Model inference: 35–45 ms
- Buffer: 10–15 ms
Common bottlenecks
- N+1 queries: fetching each feature table separately per request. Fix with multi-get/batched lookups or pre-joined feature views.
- Cold cache: low hit ratio causes spikes in tail latency. Fix with warmup, TTL tuning, and hotspot protection.
- Serialization overhead: JSON encoding and decoding can be costly. gRPC with Protocol Buffers, or another compact binary format, reduces CPU and latency.
- Network fan-out: many downstream calls increase p95/p99. Consolidate or fetch in parallel with budgets.
- Connection thrashing: without pooling, every request pays for new TCP/TLS handshakes and can hit kernel connection limits. Use connection pools and keep-alives (see the pooled-client sketch below).
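As one hedged example of the pooling fix, here is a pooled redis-py client with explicit timeouts; the host name, pool size, and timeout values are placeholders to tune for your workload:

```python
# Reuse connections instead of opening one per request (redis-py sketch).
# Pool size and timeouts below are placeholders, not recommendations.
import redis

pool = redis.ConnectionPool(
    host="feature-store.internal",   # placeholder host
    port=6379,
    max_connections=50,              # bound the pool to protect the store
    socket_timeout=0.040,            # 40 ms read timeout
    socket_connect_timeout=0.010,    # fail fast if a new connection is needed
)
client = redis.Redis(connection_pool=pool)

# Batched read over an already-established, pooled connection.
values = client.mget(["user:42:ctr_7d", "user:42:purchases_30d"])
```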
Worked examples
Example 1 — Impact of cache hit ratio on average latency
Given: cache hit ratio 82%, cache hit = 3 ms, cache miss path = 40 ms. Expected average latency:
Average = 0.82 × 3 + 0.18 × 40 = 2.46 + 7.20 = 9.66 ms.
Observation: Improving the hit ratio from 82% to 90% drops the average to 0.90 × 3 + 0.10 × 40 = 2.7 + 4.0 = 6.7 ms.
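The same weighted-average arithmetic as a tiny script you can reuse with your own numbers:

```python
# Expected (average) latency as a weighted mix of hit and miss paths.
def expected_latency_ms(hit_ratio: float, hit_ms: float, miss_ms: float) -> float:
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

print(expected_latency_ms(0.82, 3, 40))  # ≈ 9.66 ms
print(expected_latency_ms(0.90, 3, 40))  # ≈ 6.7 ms
```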
Example 2 — Batch vs per-request fetching
Scenario: 10 single-key reads, each costing 4 ms RPC overhead + 1 ms store access ≈ 5 ms per key; executed sequentially, that is ~50 ms.
Batching all 10 keys in one multi-get costs 4 ms RPC overhead + 3 ms store access for the batch = ~7 ms total.
Result: Latency drops from ~50 ms to ~7 ms and reduces p95 tail risk.
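A back-of-the-envelope version of this comparison, using the same assumed costs (4 ms RPC overhead, 1 ms store access per single key, 3 ms store access for the whole batch):

```python
# Rough cost model: sequential single-key reads vs one multi-get.
RPC_OVERHEAD_MS = 4.0     # per round trip (assumed)
SINGLE_STORE_MS = 1.0     # store access per single-key read (assumed)
BATCH_STORE_MS = 3.0      # store access for the whole batch (assumed)
N_KEYS = 10

sequential_ms = N_KEYS * (RPC_OVERHEAD_MS + SINGLE_STORE_MS)  # 10 * 5 = 50 ms
batched_ms = RPC_OVERHEAD_MS + BATCH_STORE_MS                 # 4 + 3 = 7 ms
print(f"sequential: {sequential_ms:.0f} ms, batched: {batched_ms:.0f} ms")
```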
Example 3 — Timeout budget splitting
End-to-end SLO p95: 120 ms. Known components: Model = 50 ms p95, Auth = 10 ms p95, Network (both ways) = 20 ms p95. Remaining budget = 120 − 50 − 10 − 20 = 40 ms for feature retrieval. Keep a buffer of 10 ms.
Set the feature-retrieval p95 target to 30 ms. Configure the end-to-end client timeout slightly above the SLO (for example 140–150 ms) to absorb variance, but enforce tighter internal timeouts, such as ~40 ms on the feature-service client.
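The same budget split as a small helper; the component numbers below are the assumptions from this example:

```python
# Split an end-to-end p95 SLO across known components and a safety buffer.
def feature_budget_ms(slo_ms: float, components_ms: dict, buffer_ms: float) -> float:
    remaining = slo_ms - sum(components_ms.values()) - buffer_ms
    if remaining <= 0:
        raise ValueError("No budget left for feature retrieval; revisit the SLO or components.")
    return remaining

target = feature_budget_ms(
    slo_ms=120,
    components_ms={"model": 50, "auth": 10, "network_round_trip": 20},
    buffer_ms=10,
)
print(target)  # 30 ms p95 target for feature retrieval
```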
Example 4 — Avoiding N+1 queries
Problem: Request needs 3 feature tables and does 3 sequential calls, each p95 ~12 ms, total ~36 ms. With parallel fan-out, total ≈ max(12,12,12) + overhead ≈ 14–16 ms, but still 3 calls.
Better: Pre-join into a feature view or use a single multi-get across keys and tables to avoid multi-hop overhead, often bringing p95 to <10–12 ms.
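When pre-joining is not possible and three calls remain, a parallel fan-out keeps the total close to the slowest single call. A runnable sketch with `asyncio`, where `fetch_table` simulates the ~12 ms store reads from this example and the table names are illustrative:

```python
# Parallel fan-out across feature tables instead of sequential calls.
import asyncio

async def fetch_table(table: str, key: str) -> dict:
    # Stand-in for an async online-store read (~12 ms, mirroring the example).
    await asyncio.sleep(0.012)
    return {f"{table}:{key}": 1.0}

async def get_features(key: str) -> dict:
    tables = ["user_profile", "user_activity", "user_purchases"]  # illustrative names
    results = await asyncio.gather(*(fetch_table(t, key) for t in tables))
    merged: dict = {}
    for r in results:
        merged.update(r)
    return merged  # total ≈ max of the three calls + overhead, not their sum

print(asyncio.run(get_features("user_42")))
```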
Practical checklist
- Define p95 and p99 SLOs for end-to-end and for feature retrieval.
- Enable connection pooling and keep-alives.
- Use batching or multi-get for multiple keys or tables.
- Warm the cache on deploy; set realistic TTLs.
- Measure and alert on cache hit ratio and error rate.
- Set timeouts and circuit breakers; add sensible fallbacks.
- Track capacity headroom (keep at least 30% spare capacity above steady-state load so p95 holds during spikes).
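One way to keep these knobs explicit is a single serving config that both the service and its alerts read; every value below is an illustrative placeholder, not a recommendation:

```python
# Illustrative serving configuration tying the checklist to concrete knobs.
SERVING_CONFIG = {
    "slo_ms": {"end_to_end_p95": 100, "feature_retrieval_p95": 30},
    "timeouts_ms": {"feature_client": 40, "store_read": 25},      # downstream < upstream
    "connection_pool": {"max_connections": 50, "keepalive": True},
    "cache": {"ttl_s": 60, "warmup_on_deploy": True},
    "batching": {"max_keys_per_multiget": 100},
    "alerts": {"min_cache_hit_ratio": 0.85, "max_error_rate": 0.001},
    "capacity": {"min_headroom_fraction": 0.30},
}
```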
Common mistakes and self-check
- Mistake: Optimizing p50 only. Self-check: Track p95/p99 and compare against SLO weekly.
- Mistake: Ignoring connection pools. Self-check: Observe new connections/sec and TLS handshakes during load; should be stable and low.
- Mistake: Excessive JSON payloads. Self-check: Measure payload sizes; aim for compact schemas or gRPC.
- Mistake: Misaligned timeouts (downstream > upstream). Self-check: Ensure downstream timeouts are lower than the caller’s timeout.
- Mistake: Low cache hit ratio after deploy. Self-check: Validate warmup job and TTL fit access patterns.
Exercises
Do these to solidify concepts. Compare with the solutions after trying.
Exercise 1 — Cache math quick check (id: ex1)
You serve 1,000 requests. Hit ratio = 80%. Hit latency = 2 ms. Miss latency = 35 ms. Estimate the average latency and the total time spent serving all requests.
- Round to two decimals.
Exercise 2 — Timeout budget (id: ex2)
Your end-to-end latency SLO (p95) is 100 ms. Model p95 = 40 ms, auth p95 = 8 ms, network (both directions) p95 = 18 ms. Propose a p95 target for the feature retrieval and suggest client and feature-client timeouts that respect a 10 ms safety buffer.
Checklist before checking the solutions
- Did you compute weighted averages for mixed hit/miss cases?
- Did you keep a safety buffer when splitting latency budgets?
- Are downstream timeouts less than or equal to the caller’s timeout?
Practical projects
Build once, iterate fast.
- Minimal Online Feature Service
  - Use a local key-value store and a simple REST or gRPC layer.
  - Implement multi-get for a batch of keys.
  - Add an in-memory cache with TTL and a warmup routine.
  - Measure p50/p95 at 50, 200, and 500 RPS using a load tool; record hit ratio and error rate.
- Latency Budget Enforcer
  - Add per-hop timers (auth, cache, store, serialization); a timer sketch follows this list.
  - Expose metrics for budgets and alert if feature retrieval p95 exceeds its target for 5 minutes.
  - Implement a circuit breaker that returns fallback features if the store exceeds its timeout.
- Batching vs Single-Get Study
  - Compare sequential single-key reads vs multi-get for 10 keys.
  - Plot average and p95; write a short note on the savings and when batching helps or hurts.
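As a starting point for the Latency Budget Enforcer project, per-hop timing can be a small context manager that records milliseconds per stage (a sketch; the stage names are illustrative and you would export the numbers to your metrics backend):

```python
# Per-hop timing sketch: measure each stage so you can compare it to its budget.
import time
from contextlib import contextmanager

timings_ms: dict = {}

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0

# Usage inside a request handler (stage bodies elided):
with timed("auth"):
    pass   # authenticate the caller
with timed("cache"):
    pass   # near-cache lookup
with timed("store"):
    pass   # online-store multi-get on miss

print(timings_ms)  # export to your metrics system and alert on budget overruns
```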
Learning path
- Start here: Online Serving Performance Basics (this page).
- Next: Caching strategies and TTL tuning.
- Then: Materialization and feature freshness guarantees.
- Advanced: Autoscaling policies, load shedding, and circuit breakers.
- Expert: Cross-region active-active serving and disaster recovery.
Next steps
- Finish the exercises and mini challenge.
- Run a small load test locally or in a sandboxed environment.
- Tune one lever (batch size, connection pool, or serialization) and re-measure p95.
Mini challenge
Your p95 climbed from 12 ms to 25 ms after traffic doubled. CPU is fine, but new connections/sec spiked, and cache hit ratio dropped from 92% to 78%.
- List three concrete actions to recover p95 below 15 ms.
- Prioritize them and explain why.
Possible directions
- Increase connection pool and enable keep-alives to reduce handshakes.
- Warm the cache and adjust TTL to fit access patterns.
- Switch to multi-get or increase batch size for hot paths.
- Add autoscaling and a short-term traffic-shaping rule.