Who this is for
- MLOps Engineers managing model serving, batch pipelines, or feature stores.
- Data/ML engineers who need clear reliability targets and alerting rules.
- Teams defining reliability agreements with product teams or external customers.
Prerequisites
- Basic understanding of ML systems (online inference, batch jobs, data pipelines).
- Familiarity with metrics like latency, error rate, throughput, and model quality signals.
- Comfort reading simple percentages and time windows.
Why this matters
SLO/SLA and error-budget thinking keeps ML systems reliable without overreacting to every blip. As an MLOps Engineer, you will:
- Define Service Level Indicators (SLIs) such as p95 inference latency, successful request ratio, data freshness delay, or batch completion rate.
- Set realistic SLOs (targets) that align reliability with delivery speed.
- Translate SLOs into alert policies using error budgets and burn rates.
- Decide when to pause releases if the error budget is exhausted.
- Explain reliability trade-offs to product managers and other stakeholders.
Concept explained simply
- SLI: The measurement. Example: percentage of requests under 120 ms over the last 30 days.
- SLO: The target for the SLI. Example: 99.5% of requests under 120 ms in a 30-day window.
- SLA: A public or contractual promise. Missing it may trigger penalties or credits. It is usually looser than your internal SLO.
- Error Budget: How much unreliability you can afford while still meeting the SLO. If SLO is 99.5%, the error budget is 0.5% over the window.
- Burn Rate: How fast you are spending the error budget. If you burn faster than expected, alerts trigger before the window ends.
Quick formulas
- Error Budget (%) = 100% − SLO (%)
- Budget per window = (Error Budget %) × (Total events or total time in window)
- Burn Rate ≈ (Observed error rate) ÷ (Allowed error rate). Example: If SLO is 99.9% (0.1% budget), and current error rate is 0.3%, burn rate = 0.3 / 0.1 = 3×.
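These formulas translate directly into code. A minimal Python sketch, using the numbers from the bullet above; the function names are illustrative, not from any particular library:

```python
def error_budget_pct(slo_pct: float) -> float:
    """Error budget as a percentage: 100% minus the SLO target."""
    return 100.0 - slo_pct


def budget_for_window(slo_pct: float, total_events: int) -> float:
    """How many bad events (or bad minutes) the window can absorb while still meeting the SLO."""
    return total_events * error_budget_pct(slo_pct) / 100.0


def burn_rate(observed_error_pct: float, slo_pct: float) -> float:
    """How fast the budget is being spent: observed error rate over allowed error rate."""
    return observed_error_pct / error_budget_pct(slo_pct)


# Example from above: SLO 99.9% (0.1% budget), observed error rate 0.3%.
print(burn_rate(0.3, 99.9))  # ~3.0, i.e. burning the budget 3x faster than allowed
```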
Mental model
Think of your reliability as a battery. Each error or slow request drains the battery (error budget). If you drain it too fast (high burn rate), you slow down experiments and releases to protect users. If the battery is full, you can take more risks (ship faster).
Designing SLIs and SLOs for ML Systems
- Online inference:
  - SLIs: p95 latency, successful request ratio, timeouts, rate of 5xx responses, model freshness (age of the deployed model), feature store read success.
  - Typical SLOs: 99.9% of requests succeed over 30 days; 99% of requests complete in under 120 ms.
- Batch pipelines:
  - SLIs: job success ratio, data freshness delay, pipeline duration predictability.
  - Typical SLOs: 99% of daily jobs finish within 2 hours; 99% of features updated within 15 minutes of schedule.
- Model quality (use carefully):
  - SLIs: data drift within threshold, accuracy/AUC above threshold measured on recent labeled samples.
  - Typical SLOs: 95% of weekly checks show drift below threshold; AUC ≥ 0.85 in 4 of 5 recent evaluation windows.
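It helps to write targets like these down in one reviewable place. The structure below is only a sketch of such a spec; the field names are invented for illustration, not a standard format:

```python
# Illustrative SLO spec for the services discussed above; field names are made up for this sketch.
SLOS = {
    "inference-api": [
        {"sli": "request_success_ratio", "target": 0.999, "window_days": 30},
        {"sli": "requests_under_120ms_ratio", "target": 0.99, "window_days": 30},
    ],
    "feature-batch-pipeline": [
        {"sli": "jobs_on_time_ratio", "target": 0.99, "window_days": 30},
        {"sli": "features_fresh_within_15m_ratio", "target": 0.99, "window_days": 30},
    ],
}
```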
Window choices and alerting
- Rolling vs calendar windows: Rolling (last 30 days) reacts more smoothly; calendar months are simpler for reporting.
- Multi-window alerts: Pair a short window (fast detection) with a long window (avoid noise). Example: alert if burn rate > 14× over 5 minutes OR > 6× over 1 hour (see the sketch after this list).
- Error budget policy: If > 50% of budget is spent early in the window, slow down releases; if 100% spent, freeze risky changes.
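The multi-window rule above can be sketched as a small check. This assumes you already compute the error ratio over each window from your metrics backend; the thresholds mirror the example, and everything else is illustrative:

```python
def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Observed bad ratio divided by the ratio the SLO allows (1 - SLO)."""
    return observed_error_ratio / (1.0 - slo)


def should_alert(error_ratio_5m: float, error_ratio_1h: float, slo: float = 0.999) -> bool:
    """Fire when the budget burns fast on the short window OR the long window."""
    return burn_rate(error_ratio_5m, slo) > 14 or burn_rate(error_ratio_1h, slo) > 6


# A 2% error ratio over 5 minutes against a 99.9% SLO is a 20x burn -> alert.
print(should_alert(error_ratio_5m=0.02, error_ratio_1h=0.001))  # True
```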
Worked examples
1) Inference latency SLO
SLO: 99% of requests have latency < 120 ms in a 30-day window. Error budget = 1% of requests can be slower than 120 ms.
Example numbers
- Requests in 30 days: 100,000,000
- Allowed slow requests (budget): 1% = 1,000,000
- Yesterday slow requests: 50,000 → consumed 5% of the monthly budget in one day.
- Burn rate vs expected: Expected average per day = 1,000,000 / 30 ≈ 33,333. Yesterday = 50,000 → burn rate ≈ 50,000 / 33,333 ≈ 1.5×
- Action: Investigate cause; consider rate limiting or rollbacks if trend continues.
2) Batch pipeline on-time delivery
SLO: 99% of daily feature jobs complete within 2 hours. Window: 30 days.
Example numbers
- Total jobs in window: 3,000
- Error budget = 1% = 30 late/failed jobs allowed
- Last 48 hours late/failed: 12
- Expected late/failed jobs per 2 days to stay on track: (30 / 30) × 2 = 2
- Burn rate ≈ 12 / 2 = 6× → Trigger high-urgency alert.
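Both worked examples so far use the same count-based arithmetic: compare what a period actually burned against that period's even share of the budget. A short Python sketch with the numbers from above (the function name is illustrative):

```python
def count_burn_rate(bad_events: int, budget_events: float, window_days: float, period_days: float) -> float:
    """Bad events observed in a period divided by the budget's even share for that period."""
    expected_for_period = budget_events / window_days * period_days
    return bad_events / expected_for_period


# Example 1: 1,000,000 slow requests allowed per 30 days; 50,000 slow yesterday.
print(round(count_burn_rate(50_000, 1_000_000, 30, 1), 2))  # ~1.5

# Example 2: 30 late/failed jobs allowed per 30 days; 12 in the last 2 days.
print(round(count_burn_rate(12, 30, 30, 2), 2))  # 6.0
```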
3) Data freshness for feature store
SLO: 99.5% of feature reads return data newer than 15 minutes.
Example numbers
- Reads in 7 days: 10,000,000
- Error budget = 0.5% = 50,000 stale reads allowed
- Observed last hour: 12,000 stale reads out of 200,000 reads → 6% stale
- Allowed rate = 0.5% → Burn rate = 6 / 0.5 = 12× → Page on-call quickly; prioritize data ingestion fix.
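Example 3 takes a rate-based view instead: the observed stale ratio in a short window divided by the ratio the SLO allows. A sketch with the same numbers:

```python
def rate_burn_rate(bad: int, total: int, slo: float) -> float:
    """Observed bad ratio divided by the bad ratio the SLO allows (1 - SLO)."""
    return (bad / total) / (1.0 - slo)


# Example 3: 12,000 stale reads out of 200,000 in the last hour, against a 99.5% freshness SLO.
print(round(rate_burn_rate(12_000, 200_000, 0.995), 1))  # 12.0 -> page quickly
```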
Step-by-step: Apply error budget thinking
- Pick 1–2 critical SLIs per service (latency and success for inference; on-time completion for batch).
- Set SLOs that users feel. Start reasonable (e.g., 99% or 99.5%) and adjust using real data.
- Compute the error budget and define an error budget policy: what to do at 50%, 75%, and 100% burn (a sketch follows after this list).
- Create multi-window burn-rate alerts (fast + slow windows).
- Review incidents monthly; tune SLOs and alert thresholds.
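The error budget policy mentioned above is easiest to enforce when it is written down as data the whole team can see. The thresholds and actions below are one possible example, not a prescription:

```python
# Illustrative error-budget policy: fraction of the budget spent -> agreed action.
ERROR_BUDGET_POLICY = [
    (0.50, "Slow down: extra review for risky releases; prioritize reliability fixes."),
    (0.75, "Restrict: ship only low-risk changes and reliability work."),
    (1.00, "Freeze: stop risky changes until the budget recovers."),
]


def policy_action(budget_spent_fraction):
    """Return the strictest action whose threshold has been crossed, or None if under 50%."""
    action = None
    for threshold, text in ERROR_BUDGET_POLICY:
        if budget_spent_fraction >= threshold:
            action = text
    return action


print(policy_action(0.8))  # "Restrict: ..."
```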
Ready-to-use alert sketch
- Page if burn rate > 14× over 5 minutes (clear outage).
- Alert (non-page) if burn rate > 6× over 1 hour (ongoing degradation).
- Ticket if burn rate > 2× over 24 hours (chronic issue).
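The three tiers can be folded into a single routing function. The thresholds and window labels come straight from the sketch above; the function name and return values are illustrative:

```python
def alert_tier(burn_5m: float, burn_1h: float, burn_24h: float):
    """Map per-window burn rates to the severities above; None means the budget is healthy."""
    if burn_5m > 14:
        return "page"    # clear outage
    if burn_1h > 6:
        return "alert"   # ongoing degradation, no page
    if burn_24h > 2:
        return "ticket"  # chronic issue
    return None


print(alert_tier(burn_5m=3, burn_1h=7, burn_24h=2.5))  # "alert"
```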
Common mistakes and self-check
- Mistake: Too many SLIs. Self-check: Can you explain your service reliability with 1–2 charts?
- Mistake: Alerting on raw metrics only. Self-check: Do alerts reflect burn rate vs budget, not just spikes?
- Mistake: Unrealistic SLOs. Self-check: Would your SLO have been met by your last 90 days of history?
- Mistake: No policy tied to budget. Self-check: What happens when you hit 100% budget burn?
- Mistake: Using averages for latency. Self-check: Are you using percentiles (p95/p99) for user-perceived performance?
Exercises
Do these now. They mirror the tasks below the article and help you internalize burn-rate thinking.
Exercise 1 (matches ex1)
Compute error budget and burn rate for an inference API with SLO 99.9% success over 30 days. Yesterday: 2,500 failures out of 1,000,000 requests. Is this a high-burn day?
Exercise 2 (matches ex2)
Design a minimal SLO set for a batch pipeline that updates features hourly. Include SLIs, SLO targets, windows, and an error budget policy.
- Checklist:
  - Defined at least 1 availability SLI and 1 timeliness SLI.
  - Chose rolling or calendar windows and justified why.
  - Wrote actions for 50%, 75%, and 100% budget burn.
Practical projects
- Project A: Add SLOs to an inference service
  - Define SLIs: p95 latency and success ratio.
  - Set SLOs: 99.9% success; 99% of requests < 120 ms (30-day rolling window).
  - Implement burn-rate alerts (fast: 5 minutes, slow: 1 hour, plus a daily check).
  - Write the error budget policy and share it with the team.
- Project B: Batch reliability dashboard
  - Track job success, schedule delay, and data freshness.
  - Set SLOs: 99% of jobs on time; 99.5% of reads fresher than 15 minutes (30-day window).
  - Automate reports that show remaining budget and its trend.
Learning path
- Start with one service and two SLIs.
- Run for two weeks; collect data and adjust thresholds.
- Add multi-window burn-rate alerting.
- Introduce error budget policy in your deployment process.
- Expand to other services, and consider SLAs once external customers depend on them.
Next steps
- Finish the exercises and take the quick test.
- Apply one SLO in your sandbox or staging system today.
- Schedule a 30-minute review with your team to agree on an error budget policy.
Mini challenge
Pick a single user-visible failure mode (e.g., timeout errors). Define an SLI, an SLO, and a two-tier burn-rate alert. Write the exact paging rules you would use.
Quick Test