Why this matters
As a Platform Engineer, you balance feature velocity with reliability. SLOs (Service Level Objectives) translate user expectations into measurable targets; SLIs (Service Level Indicators) provide the measurements; error budgets define how much unreliability you can afford before reliability must take priority. This helps you:
- Decide when to ship versus when to stabilize.
- Design pragmatic alerts that page only when users are impacted.
- Report reliability clearly to product and leadership.
- Drive incident response and post-incident improvements with clear, shared goals.
Concepts explained simply
Core definitions
- SLI (Service Level Indicator): A measurement of something users care about, such as availability, latency, or correctness. Example: the percentage of queries that return in under 1.5s.
- SLO (Service Level Objective): The target for an SLI over a time window. Example: Over 30 days, 99.9% availability.
- Error budget: The permissible amount of unreliability. Error budget = 1 − SLO. If SLO is 99.9%, the error budget is 0.1%.
- SLA (Service Level Agreement): An optional legal/business commitment. Keep it separate from your technical SLOs; you do not need an SLA to use SLOs.
Mental model
Imagine a bucket filled with your monthly “allowed errors.” Each incident drains the bucket. If it empties, you pause risky changes and focus on reliability. Burn rate tells you how fast the bucket is draining: burn rate = current error rate ÷ error budget fraction.
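A tiny Python sketch of that formula; the numbers are illustrative, not tied to any monitoring tool:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Burn rate = observed error rate / allowed error rate (1 - SLO)."""
    budget_fraction = 1.0 - slo       # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget_fraction

# A 2% error rate against a 99.9% SLO drains the bucket 20x faster
# than the sustainable pace.
print(round(burn_rate(error_rate=0.02, slo=0.999), 1))  # 20.0
```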
Choosing good SLIs
Good SLIs are user-centric, measurable, and controllable by your team.
- Availability: Successful requests ÷ total requests (from user perspective).
- Latency: Percent of requests under a threshold (use percentiles, not averages).
- Correctness: Percent of responses that are correct/complete.
- Freshness/Throughput (pipelines): Percent of items processed within a time budget.
For an observability platform, consider SLIs like the following (a short computation sketch follows this list):
- Percent of items ingested successfully within 2 minutes.
- Query 95th-percentile latency under a threshold.
- Percent of alerts delivered within 60 seconds.
- Sampling/retention policy adherence (no unexpected drops).
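Here is what the first two of those SLIs might look like when computed from raw counters; the counts and variable names are hypothetical:

```python
# Hypothetical raw counts for one day of an observability platform.
total_log_items = 50_000_000
items_ingested_within_2m = 49_905_000

total_queries = 800_000
queries_under_1_5s = 768_000

ingestion_sli = items_ingested_within_2m / total_log_items
query_latency_sli = queries_under_1_5s / total_queries

print(f"ingestion SLI:     {ingestion_sli:.3%}")      # 99.810%
print(f"query latency SLI: {query_latency_sli:.3%}")  # 96.000%
```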
Worked examples
Example 1: API availability SLO
SLO: 99.9% availability over 30 days. Error budget = 0.1% of 30 days.
30 days = 43,200 minutes. Error budget minutes = 0.001 × 43,200 = 43.2 minutes ≈ 43m 12s.
SLI: successful_requests ÷ total_requests (from client perspective).
Burn-rate alerting (multi-window):
- Critical: burn rate ≥ 14.4 in both 5m and 1h windows.
- Warning: burn rate ≥ 6 in both 30m and 6h windows.
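A short sketch that reproduces this example's arithmetic and shows what the critical threshold implies; the values mirror the example and nothing here is tool-specific:

```python
SLO = 0.999
WINDOW_DAYS = 30

budget_fraction = 1 - SLO                    # 0.001
budget_minutes = budget_fraction * WINDOW_DAYS * 24 * 60
print(round(budget_minutes, 1))              # 43.2 minutes = 43m 12s

# A sustained burn rate of 14.4 spends the whole 30-day budget in
# 30 / 14.4 ≈ 2.08 days, i.e. roughly 2% of the budget per hour.
critical_burn = 14.4
print(round(WINDOW_DAYS / critical_burn, 2))         # 2.08 days
print(round(critical_burn / (WINDOW_DAYS * 24), 3))  # 0.02 of the budget per hour
```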
Example 2: Log ingestion pipeline
SLI: items processed within 2 minutes ÷ total items.
SLO: 99.5% over 7 days. Error budget = 0.5%.
If you process 200 million items per week, budgeted late/dropped items = 0.005 × 200,000,000 = 1,000,000.
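The same budget arithmetic as a quick sketch, using the weekly volume assumed above:

```python
slo = 0.995
weekly_items = 200_000_000

# Items allowed to be late or dropped per week without violating the SLO.
print(round((1 - slo) * weekly_items))  # 1000000
```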
Example 3: Alert delivery
SLI: percent of alerts delivered within 60s from rule evaluation to delivery acknowledgment.
SLO: 99% over 30 days. Error budget = 1%.
Alert when burn rate ≥ 12 (5m and 1h) and warn at burn rate ≥ 4 (30m and 6h).
Example 4: Query latency
SLI: percent of queries that complete in under 1.5s, measured from user request to last byte.
SLO: 95% of queries under 1.5s over 28 days. Error budget = 5% of queries may exceed 1.5s.
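"95% of queries under 1.5s" is the same target as "p95 latency under 1.5s"; this sketch, using made-up latencies for one window, shows both views:

```python
import math

# Hypothetical query latencies (seconds) for one evaluation window.
latencies = sorted([0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2,
                    1.2, 1.3, 1.3, 1.4, 1.4, 1.4, 1.5, 1.6, 2.0, 3.5])

threshold = 1.5
fraction_under = sum(x < threshold for x in latencies) / len(latencies)
print(f"{fraction_under:.0%} of queries finished under {threshold}s")  # 80%

# Equivalent percentile view (nearest-rank p95).
p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
print(f"p95 latency: {p95}s")  # 2.0s -- this window misses the 95% target
```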
Setting and monitoring error budgets
- Window choice: Calendar windows (e.g., monthly) are simple to explain but reset abruptly; rolling windows (e.g., last 30 days) better reflect recent user experience. Many teams track both.
- Burn rate: burn_rate = current_error_rate ÷ error_budget_fraction. If SLO is 99.9%, budget fraction is 0.001. At 2% error rate, burn rate = 2% ÷ 0.1% = 20.
- Multi-window alerting: Use a short and a long window together to avoid noisy pages and slow drift. Example pairs: (5m & 1h), (30m & 6h); see the sketch after this list.
- Policy when budget is exhausted: Pause risky launches, prioritize reliability, add safeguards (circuit breakers, autoscaling), and review incidents.
- Communicate: Share SLO status weekly; show budget remaining and top drivers of burn.
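A minimal sketch of the two-window check; the thresholds and error rates are illustrative, and a real system would evaluate this against windowed metrics:

```python
def should_page(short_error_rate: float, long_error_rate: float,
                slo: float, threshold: float) -> bool:
    """Page only if BOTH windows exceed the burn-rate threshold.

    The short window (e.g. 5m) shows the problem is happening right now;
    the long window (e.g. 1h) shows it is not a momentary blip.
    """
    budget_fraction = 1 - slo
    short_burn = short_error_rate / budget_fraction
    long_burn = long_error_rate / budget_fraction
    return short_burn >= threshold and long_burn >= threshold

# 99.9% SLO: page when both the 5m and 1h burn rates reach 14.4.
print(should_page(0.02, 0.016, slo=0.999, threshold=14.4))  # True  (burn 20 and 16)
print(should_page(0.02, 0.004, slo=0.999, threshold=14.4))  # False (long window shows a blip)
```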
Implementation steps
- List user journeys: read logs, run queries, receive alerts.
- Pick SLIs per journey: availability, latency thresholds, delivery time.
- Set SLOs: start with an achievable target (e.g., 99%); tighten with data.
- Instrument: ensure metrics exist from user perspective (client or edge).
- Create dashboards: SLI percent, SLO trend, budget remaining, burn rate (a minimal data-shape sketch follows this list).
- Alert rules: multi-window burn rate; paging only for user impact.
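One possible shape for the numbers a dashboard needs, sketched as a small dataclass; the field names are assumptions, not any particular tool's schema:

```python
from dataclasses import dataclass

@dataclass
class SloStatus:
    name: str
    slo: float            # target, e.g. 0.999
    good_events: int      # events that met the SLI over the window
    total_events: int

    @property
    def sli(self) -> float:
        return self.good_events / self.total_events

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget still unspent (can go negative)."""
        allowed_bad = (1 - self.slo) * self.total_events
        actual_bad = self.total_events - self.good_events
        return 1 - actual_bad / allowed_bad

status = SloStatus("query-latency", slo=0.95,
                   good_events=970_000, total_events=1_000_000)
print(f"SLI {status.sli:.2%}, budget remaining {status.budget_remaining:.0%}")
# SLI 97.00%, budget remaining 40%
```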
Exercises (do these before the test)
Work through these with pen and paper or a notes app.
- Exercise 1: Calculate the error budget for a 99.9% monthly SLO and propose two burn-rate alert pairs. Then write one action you will take when budget is at 50% remaining and at 0% remaining.
- Exercise 2: Define SLIs and SLOs for an observability platform covering ingestion, query, and alerting. Include measurement definitions and an error budget policy summary.
- Checklist: Did you show the math? Did you use percentiles, not averages? Do your alerts use two windows? Are actions specific and time-bound?
Common mistakes and how to self-check
- Using averages: Averages hide spikes. Use percentiles (p90/p95/p99). Self-check: Can a short outage be invisible in your metric? If yes, fix it.
- Too many SLIs: Focus on 3–5 that reflect key journeys. Self-check: Can you explain each SLI’s user impact in one sentence?
- Server-only view: Measure from client perspective when possible. Self-check: Do client failures count against your SLI?
- No error budget policy: Define what to do when budget is low or exhausted. Self-check: Is there a playbook step that changes team behavior?
- Single-window alerts: Leads to noise or blind spots. Self-check: Do you have both short and long windows?
- Ignoring dependencies: Third-party outages affect users. Self-check: Do upstream failures reduce your SLI?
Practical projects
- Build an SLO dashboard that shows SLI %, SLO target, error budget remaining, and burn rate for two windows.
- Write a one-page error budget policy that defines actions at 75%, 50%, 25%, and 0% budget remaining.
- Add a synthetic check that measures end-to-end user latency and feeds your SLI.
Mini challenge
Your alerting pipeline evaluates rules every minute. Propose one SLI and one SLO for alert delivery, compute the monthly error budget, and pick critical/warning burn-rate thresholds. Write two lines for the policy when budget reaches 25% remaining and when it is exhausted.
Sample answer
SLI: percent of alerts delivered within 60s of rule evaluation to delivery acknowledgment.
SLO: 99.5% over 30 days. Error budget = 0.5% of alerts.
Critical: burn rate ≥ 12 on 5m and 1h. Warning: burn rate ≥ 4 on 30m and 6h.
Policy: at 25% remaining, freeze non-critical launches and run a reliability review. At 0%, pause all launches and staff mitigation until burn rate is below 1.
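A quick sketch that checks the sample answer's numbers; the monthly alert volume is a made-up figure:

```python
slo = 0.995
window_days = 30
budget_fraction = 1 - slo                     # 0.5% of alerts

assumed_monthly_alerts = 500_000              # hypothetical volume
print(round(budget_fraction * assumed_monthly_alerts))  # 2500 alerts may miss 60s

# How long each threshold could be sustained before the budget is gone.
for burn in (12, 4):
    print(f"burn rate {burn}: budget exhausted in {window_days / burn:.1f} days")
# burn rate 12: budget exhausted in 2.5 days
# burn rate 4:  budget exhausted in 7.5 days
```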
Learning path
- Metrics and alerting basics.
- Define SLIs from user journeys.
- Set SLOs and compute error budgets.
- Implement burn-rate alerts and dashboards.
- Run error budget reviews and adjust targets.
Who this is for
- Platform Engineers and SREs who operate shared services.
- Backend Engineers owning production services and on-call.
- Engineering Managers who need reliability reporting.
Prerequisites
- Basic understanding of HTTP services and pipelines.
- Familiarity with metrics (counters, histograms) and percentiles.
- Experience with alerting systems.
Next steps
- Integrate SLO status into deployment gates.
- Add client-side or synthetic SLIs to capture true user experience.
- Adopt monthly reliability reviews using error budget data.
Quick Test
Take the quick test to check your understanding.