Why this matters
Reliability is the product feature customers notice most when it’s missing. As a Platform Engineer, you design shared systems (CI/CD, service mesh, observability, databases) that other teams rely on. Service Level Objectives (SLOs) turn vague reliability wishes into measurable goals, and error budgets translate those goals into controlled risk and deployment speed. Together they help you:
- Prioritize engineering work: use error budget burn to justify reliability work over new features.
- Create aligned alerts: page only for customer-impacting issues.
- Negotiate expectations: SLOs help product, support, and engineering agree on what “good” means.
Who this is for
- Platform and backend engineers building and running services.
- Team leads/SREs defining reliability guidelines.
- New engineers learning how to measure and defend reliability choices.
Prerequisites
- Basic understanding of HTTP services and request/response flow.
- Familiarity with metrics (counters, histograms), logging, and alerting concepts.
- Comfort with percentages and time windows.
Concept explained simply
Think of reliability as a promise to users. An SLO is the promise, measured by an SLI (what you actually measure). The SLA is a legal or external document and is not required for internal reliability practice.
- SLI (Service Level Indicator): a precise measurement. Example: percentage of requests under 300 ms and successful.
- SLO (Service Level Objective): the target for the SLI. Example: 99.9% of requests under 300 ms and successful over 30 days.
- Error budget: how much unreliability is allowed. If your SLO is 99.9%, your error budget is 0.1% of requests or time (see the quick sketch below).
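A quick sketch of that arithmetic in Python; the request volume is an assumed example, not a benchmark:

```python
# Error budget = 1 - SLO target, expressed in minutes and in requests.
slo_target = 0.999                 # 99.9% over a 30-day window
window_days = 30

budget_fraction = 1 - slo_target                           # 0.001, i.e. 0.1%
budget_minutes = budget_fraction * window_days * 24 * 60   # 43.2 minutes

monthly_requests = 10_000_000                          # assumed traffic volume
budget_requests = budget_fraction * monthly_requests   # ~10,000 failed requests allowed

print(f"Budget: {budget_minutes:.1f} min or {budget_requests:,.0f} requests per {window_days} days")
```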
Mental model
Reliability trades off with velocity. Your SLO sets the maximum acceptable risk. If you burn error budget too fast, you slow changes and focus on fixes. If you have budget left, you can ship faster.
Quick intuition: time-based vs request-based SLOs
Time-based (availability per month): great for infrastructure like control planes. Request-based (success/latency per request): ideal for APIs and user-facing endpoints. Choose what best reflects user experience.
SLI, SLO, SLA — practical differences
- SLI: formula only. Example: successful_requests_under_300ms / total_requests (rolling 30 days).
- SLO: target + window. Example: SLI ≥ 99.9% over 30 days.
- SLA: external commitment with consequences. Often derived from SLOs but not needed for internal practice.
Setting SLOs step-by-step
- Define the user journey. What is the critical path? (e.g., Checkout POST).
- Choose SLIs that reflect user happiness. Success + latency percentiles.
- Pick a sensible window. A 28–30 day window balances stability and responsiveness; use shorter windows for volatile services.
- Set the target. Start with what you can meet. 99–99.9% is common for single services; 99.99% requires strong redundancy.
- Compute error budget. Budget = 1 − SLO target.
- Design alerts on burn rate. Alert only when the budget is burning fast enough to threaten the SLO.
- Agree on policies. If the budget is exhausted: pause risky deploys and focus on reliability work (see the sketch after this list).
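If it helps to see the outcome of these steps in one place, here is a minimal sketch that records an SLO as data. The field names and example values are illustrative assumptions, not a required schema or tool.

```python
from dataclasses import dataclass, field

@dataclass
class BurnRateAlert:
    window_hours: float    # evaluation window for the alert
    burn_rate: float       # multiple of the budgeted error rate that triggers a page

@dataclass
class SLO:
    journey: str           # the user journey this protects
    sli: str               # the measurement, kept human-readable here
    target: float          # 0.999 means 99.9%
    window_days: int
    alerts: list[BurnRateAlert] = field(default_factory=list)

    @property
    def error_budget(self) -> float:
        return 1 - self.target   # step: budget = 1 - SLO target

# Example outcome of the steps above (values are assumptions).
checkout = SLO(
    journey="Checkout POST",
    sli="successful_requests_under_300ms / total_requests",
    target=0.999,
    window_days=30,
    alerts=[BurnRateAlert(window_hours=1, burn_rate=14.4),
            BurnRateAlert(window_hours=6, burn_rate=6.0)],
)
print(f"{checkout.journey}: budget = {checkout.error_budget:.3%} of requests")
```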
Tip: pick availability OR latency first
New teams often start with availability success rate (simple) before adding latency SLOs (harder to measure and tune).
Worked examples
Example 1: Monthly availability SLO
Target: 99.9% availability over 30 days for the API gateway.
- Total minutes in 30 days = 30 × 24 × 60 = 43,200.
- Error budget = 0.1% = 43.2 minutes of allowed unavailability.
- Alerting: use burn rate alerts.
- Fast page: 1-hour window, burn rate ≥ 14.4. (Consumes ~2% of the monthly budget in 1 hour.)
- Slow page: 6-hour window, burn rate ≥ 6. (Consumes ~5% of the monthly budget in 6 hours.)
Why these numbers?
Burn rate is the observed error rate divided by the budgeted error rate (1 − SLO target). A burn rate of 1 consumes exactly the whole budget by the end of the 30-day window; higher rates exhaust it proportionally faster. The thresholds above page early enough that you don’t sleepwalk into missing the SLO.
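A small Python sketch of that relationship; it reproduces the thresholds above for a 99.9% SLO over a 30-day window:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    return observed_error_rate / (1 - slo_target)

def budget_consumed(burn: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the total error budget used at this burn rate for this long."""
    return burn * window_hours / (slo_window_days * 24)

print(f"{budget_consumed(14.4, window_hours=1):.0%} of the budget in 1 hour")   # 2%
print(f"{budget_consumed(6.0, window_hours=6):.0%} of the budget in 6 hours")   # 5%
print(f"burn rate at a 1.44% error rate: {burn_rate(0.0144, 0.999):.1f}")       # 14.4
```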
Example 2: Latency SLO for a POST /checkout
SLI: fraction of successful requests under 300 ms (exclude 4xx client errors).
SLO: 95% of successful requests under 300 ms over 7 days.
Why 7 days? Checkout traffic is spiky; a shorter window reacts faster to regressions.
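A minimal sketch of this SLI over a batch of requests. It assumes each record carries an HTTP status code and a latency in milliseconds; the sample data is made up.

```python
# Latency SLI: fraction of successful (2xx) requests under 300 ms.
# Client errors (4xx) are excluded entirely; 5xx are not "successful".
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 420},
    {"status": 404, "latency_ms": 15},
    {"status": 500, "latency_ms": 90},
    {"status": 200, "latency_ms": 280},
]

successful = [r for r in requests if 200 <= r["status"] < 300]
fast_enough = [r for r in successful if r["latency_ms"] < 300]

latency_sli = len(fast_enough) / len(successful) if successful else 1.0
print(f"Latency SLI: {latency_sli:.1%} (target: 95% under 300 ms over 7 days)")
```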
Edge cases
- Retry storms: cap client retries to avoid artificial failure inflation.
- Cold starts/warm-up: count cold-start latency against the SLI rather than excluding it, unless an exclusion is strictly necessary and documented.
Example 3: Background jobs throughput SLO
SLI: fraction of jobs completed within 5 minutes of enqueue.
SLO: 99% of jobs complete within 5 minutes over 30 days.
Alerting: if the 99th percentile age of jobs exceeds 5 minutes for 30 minutes, open an incident; if sustained for 2 hours, escalate.
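A sketch of this job SLI, assuming each job record has an enqueue time and an optional completion time; the sample jobs are illustrative, not from a real queue.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 1, 12, 0)
jobs = [
    {"enqueued": now - timedelta(minutes=7),  "completed": now - timedelta(minutes=3)},  # took 4 min
    {"enqueued": now - timedelta(minutes=20), "completed": now - timedelta(minutes=2)},  # took 18 min
    {"enqueued": now - timedelta(minutes=6),  "completed": None},                        # still queued
]

def within_deadline(job, deadline=timedelta(minutes=5)):
    # Unfinished jobs count against the SLI once they are past the deadline.
    end = job["completed"] or now
    return end - job["enqueued"] <= deadline

job_sli = sum(within_deadline(j) for j in jobs) / len(jobs)
print(f"Job SLI: {job_sli:.1%} (target: 99% within 5 minutes over 30 days)")
```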
Capacity planning insight
When the SLO fails during peak load, you likely need queue autoscaling or rate limiting on producers.
Error budgets: turning targets into decisions
- Budget left: deploy at normal speed; try experiments.
- Budget burning fast: tighten change control, prioritize rollbacks and fixes.
- Budget exhausted: freeze risky changes; focus on reliability work until budget recovers.
Simple policy template
- 25% of the budget consumed in the first week: review on-call toil and top incidents.
- 50% by mid-cycle: restrict deploys to fixes.
- 75%: freeze feature work.
- 100%: escalate to leadership and prioritize the reliability roadmap.
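One way to encode that policy so it is applied consistently; the thresholds mirror the template above and are a starting point to adapt, not a standard.

```python
def budget_policy(budget_used: float, day_of_cycle: int) -> str:
    """Map error-budget consumption to an action (thresholds from the template above)."""
    if budget_used >= 1.00:
        return "Escalate to leadership; prioritize the reliability roadmap."
    if budget_used >= 0.75:
        return "Freeze feature deploys."
    if budget_used >= 0.50:
        return "Restrict deploys to fixes only."
    if budget_used >= 0.25 and day_of_cycle <= 7:
        return "Review on-call toil and top incidents."
    return "Deploy at normal speed."

print(budget_policy(0.30, day_of_cycle=5))    # early fast burn -> review toil/incidents
print(budget_policy(0.80, day_of_cycle=20))   # late in the cycle -> freeze features
```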
Measurement and alerts
- Metrics needed: total requests, successful requests, latency buckets (histograms), queue age/length, job completion counts.
- Rollups: compute SLIs over sliding windows (e.g., 30 days) and shorter alert windows (1h, 6h).
- Exclude only what users don’t experience (e.g., maintenance pages during agreed downtime) and document exclusions clearly.
Sample SLI expressions (illustrative; see the sketch after the list)
- Availability SLI: successful_requests / total_requests (rolling 30d).
- Latency SLI: requests_under_300ms / successful_requests (rolling 7d).
- Job SLI: jobs_completed_within_5m / jobs_enqueued (rolling 30d).
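The same expressions as plain ratios; the counter names are placeholders, and in practice the values come from your metrics system summed over the relevant rolling window.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    return successful_requests / total_requests if total_requests else 1.0

def latency_sli(requests_under_300ms: int, successful_requests: int) -> float:
    return requests_under_300ms / successful_requests if successful_requests else 1.0

def job_sli(jobs_completed_within_5m: int, jobs_enqueued: int) -> float:
    return jobs_completed_within_5m / jobs_enqueued if jobs_enqueued else 1.0

# Example: 30 days of counters summed (assumed numbers).
print(f"Availability: {availability_sli(998_700, 1_000_000):.3%}")  # 99.870%, below a 99.9% SLO
```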
Common mistakes and self-check
- Too many SLOs per service. Pick 1–3 that reflect user happiness. Self-check: can you explain each SLO’s user impact in one sentence?
- Paging on raw errors. Page on budget burn, not every blip. Self-check: does this alert imply we’ll miss the SLO soon?
- Hiding behind exclusions. Excluding every tough case gives fake reliability. Self-check: would users agree this time shouldn’t count?
- Unowned SLOs. Without owners, no one reacts. Self-check: who reviews SLOs weekly? Is a policy documented?
- Targets too high, too soon. 99.99% without redundancy is wishful. Self-check: do we have N+1 and failover tested?
Exercises
Try each exercise on your own first; check the hint or solution only to verify.
Exercise 1: Compute error budget and burn alerts
You run an API with SLO: 99.9% availability over 30 days.
- Task A: How many minutes of downtime are allowed?
- Task B: Propose two burn-rate alerts that would catch fast and slow burns.
Hint
30 days = 43,200 minutes. Error budget is 0.1% of that. For alerts, think 1-hour and 6-hour windows with burn rates around 14.4 and 6.
Solution
Allowed downtime: 43.2 minutes per 30-day window. Alerts: page if the 1-hour burn rate ≥ 14.4 (about 2% of the monthly budget in 1 hour); page if the 6-hour burn rate ≥ 6 (about 5% in 6 hours).
Exercise 2: Define a latency SLI/SLO for POST /checkout
- Task A: Write an SLI that measures user-perceived latency and ignores client mistakes.
- Task B: Propose an SLO target and window.
- Task C: Suggest a paging condition linked to budget burn or sustained p95 latency.
Hint
Filter out 4xx, include only successful responses, use a percentile threshold like 300 ms.
Solution
SLI: requests_under_300ms_success / successful_requests. SLO: 95% under 300 ms over 7 days. Paging: page when short-window burn (e.g., 30-minute and 2-hour windows) is high enough to threaten the 7-day SLO, or when p95 latency stays above 300 ms for 30 minutes.
Practical projects
- Instrument one service: add counters for total/successful requests and a latency histogram; publish a 30-day availability and a 7-day latency SLI panel.
- Define SLOs for a queue worker: 99% jobs complete within 5 minutes; add age and success metrics; configure two burn-rate alerts.
- Run a game day: simulate a dependency outage; observe burn rate, alerts, and document policy decisions.
Learning path
- Start: availability SLO for your most critical endpoint.
- Next: latency SLO on success-only requests.
- Then: background processing SLO (throughput/age).
- Finally: introduce burn-rate alerting and a written error budget policy.
Mini challenge
Your service depends on an upstream that randomly adds 100 ms latency 5% of the time.
- Propose an SLI and SLO that keep user experience good without over-alerting.
- Decide whether to include upstream-caused latency in your SLI.
- Draft one short policy line triggered when 25% of budget is gone in the first week.
One possible direction
Include upstream-caused latency (users feel it). Choose 95% under 350 ms over 7 days; page when p95 exceeds 350 ms for 30 minutes and the burn rate crosses your threshold; policy: restrict deploys to fixes until the burn rate stabilizes.
Next steps
- Write down 1–2 SLOs for your service and share with your team.
- Implement one burn-rate alert and track it for a week.
- Schedule a 30-minute review to adjust thresholds based on real behavior.
Quick test
Take the quick test below to check your understanding.