Why this matters
As an API Engineer, you must protect user experience while avoiding alert fatigue. Alerting on Service Level Objective (SLO) breaches focuses your attention on user impact, not noisy infrastructure metrics.
- Real tasks you will do: define SLIs/SLOs, set burn-rate alerts, route pages vs. warnings, and write runbooks for fast recovery.
- Results: fewer false alarms, faster incident response, and a clear trade-off between feature velocity and reliability.
Concept explained simply
- SLI (Service Level Indicator): how you measure user experience (e.g., percent of requests that succeed under 300 ms).
- SLO: your target for that SLI (e.g., 99.9% per 30 days).
- Error budget: the allowed “bad” portion. For 99.9%, budget = 0.1% bad events in the window.
- Burn rate: how fast you are spending the error budget. Formula: observed_bad_rate / allowed_bad_rate.
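To make the formula concrete, here is a minimal sketch in Python; the function name and the 99.9% example values are illustrative, not tied to any monitoring tool:

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Burn rate = observed bad rate / allowed bad rate.

    slo_target is a fraction, e.g. 0.999 for a 99.9% SLO.
    """
    allowed_bad_rate = 1.0 - slo_target            # error budget as a rate, e.g. 0.001
    observed_bad_rate = bad_events / total_events
    return observed_bad_rate / allowed_bad_rate

# 800 bad out of 200,000 requests against a 99.9% SLO -> burn rate of about 4
print(round(burn_rate(800, 200_000, 0.999), 2))    # 4.0
```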
Mental model: a bucket of budget
Imagine your error budget as a bucket slowly leaking. Small drips are fine. A big hole means you’ll run out quickly. Burn rate tells you how big the hole is right now. Multi-window alerts check for both a sudden big hole (fast burn) and a steady leak (slow burn) so you only page when the bucket is actually at risk.
What counts as a “bad” event?
- Availability SLO: HTTP 5xx or timeouts.
- Latency SLO: response time above threshold (e.g., >300 ms).
- Quality SLO: business or correctness failures (e.g., whether a 409 caused by quota exhaustion counts as bad is a per-policy decision).
Define it clearly so your SLI is stable and meaningful.
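One way to keep the definition stable is to pin it down in code. The sketch below assumes a hypothetical request record with a status code and a latency field; the field names and predicates are illustrative, so adapt them to your own policy:

```python
from dataclasses import dataclass

# Hypothetical request record; field names are illustrative, not from a real framework.
@dataclass
class Request:
    status_code: int
    latency_ms: float

LATENCY_THRESHOLD_MS = 300  # matches the latency SLO example above

def is_bad_for_availability(req: Request) -> bool:
    # Availability SLI: 5xx responses (and timeouts surfaced as 5xx) count as bad.
    return req.status_code >= 500

def is_bad_for_latency(req: Request) -> bool:
    # Latency SLI: anything slower than the threshold counts as bad.
    return req.latency_ms > LATENCY_THRESHOLD_MS

# Example: a 504 that took 1200 ms is bad for both SLIs.
slow_timeout = Request(status_code=504, latency_ms=1200)
print(is_bad_for_availability(slow_timeout), is_bad_for_latency(slow_timeout))
```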
Worked examples
Example 1: Availability SLO (99.9% monthly)
- Allowed bad rate = 0.1%.
- In the last 1 hour: 200,000 requests; 800 bad (0.4% bad). Burn rate BR1h = 0.4 / 0.1 = 4.
- In the last 6 hours: 1,200,000 requests; 2,400 bad (0.2% bad). Burn rate BR6h = 0.2 / 0.1 = 2.
- Policy: Page when BR1h ≥ 14 AND BR6h ≥ 7. Warn when BR2h ≥ 2 AND BR24h ≥ 1.
- Outcome: No page. Likely a warning if longer windows confirm.
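The same arithmetic, written out as a runnable check; the helper is a minimal sketch, not a monitoring-tool API:

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    return (bad / total) / (1.0 - slo_target)

SLO = 0.999  # 99.9% monthly availability target

br_1h = burn_rate(800, 200_000, SLO)       # 0.4% bad / 0.1% allowed -> 4
br_6h = burn_rate(2_400, 1_200_000, SLO)   # 0.2% bad / 0.1% allowed -> 2

page = br_1h >= 14 and br_6h >= 7          # both must hold to page -> False
print(round(br_1h, 2), round(br_6h, 2), page)  # 4.0 2.0 False
```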
Example 2: Latency SLO (99% of requests ≤ 300 ms)
- Allowed bad rate = 1%.
- Last 15 min: 96% good → 4% bad. BR15m = 4 / 1 = 4.
- Last 6 hours: 99.2% good → 0.8% bad. BR6h = 0.8 / 1 = 0.8.
- With the page policy (BR30m ≥ 14 AND BR6h ≥ 7), no page: even the short-window burn rate of 4 is far below 14. The data suggests a brief spike; investigate without waking someone at 3 a.m.
Example 3: Quality SLO (successful order placements ≥ 99.5% weekly)
- Allowed bad rate = 0.5%.
- Last 1 hour: 50,000 orders; 600 failures (1.2% bad). BR1h = 1.2 / 0.5 = 2.4.
- Last 12 hours: 600,000 orders; 2,400 failures (0.4% bad). BR12h = 0.4 / 0.5 = 0.8.
- Action: No page. A slow trend might appear—open an investigation ticket and watch longer windows.
How to implement (step-by-step)
- Choose an SLI
  - Availability: ratio of non-5xx requests.
  - Latency: ratio of requests ≤ threshold.
  - Quality: ratio of business-success outcomes.
- Set the SLO target and window
  - Common: 99.9% per 30 days for critical APIs; 99% for internal services.
- Compute error budget
  - Allowed_bad_rate = 1 − SLO_target. Example: 99.9% → 0.1%.
- Define burn-rate alerts (multi-window)
  - Page: fast + sustained burn, e.g., BR1h ≥ 14 AND BR6h ≥ 7.
  - Warn: slower but real, e.g., BR2h ≥ 2 AND BR24h ≥ 1.
- Route and label
  - Page to on-call with clear severity and service tags.
  - Warnings to chat/email during business hours.
- Attach a runbook
  - Include: quick checks, rollbacks, feature-flag toggles, owner contacts, and “when to close.”
- Review monthly
  - Audit false positives/negatives. Adjust thresholds, windows, and SLI definitions.
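A minimal sketch of the page/warn decision, assuming your monitoring system already produces a burn rate per window; the window labels and thresholds mirror the example policy above and should be tuned to your own SLO and traffic:

```python
from typing import Dict

# Example policy from the steps above; tune thresholds to your own SLO and traffic.
PAGE_POLICY = {"1h": 14, "6h": 7}
WARN_POLICY = {"2h": 2, "24h": 1}

def policy_fires(burn_rates: Dict[str, float], policy: Dict[str, float]) -> bool:
    # Multi-window: every window in the policy must exceed its threshold.
    return all(burn_rates.get(window, 0.0) >= threshold
               for window, threshold in policy.items())

def route(burn_rates: Dict[str, float]) -> str:
    if policy_fires(burn_rates, PAGE_POLICY):
        return "page"   # wake on-call
    if policy_fires(burn_rates, WARN_POLICY):
        return "warn"   # ticket or chat during business hours
    return "none"

# Example 1 numbers (BR1h = 4, BR6h = 2) with assumed 2h/24h values for illustration.
print(route({"1h": 4.0, "6h": 2.0, "2h": 3.0, "24h": 1.5}))  # -> warn
```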
Choosing windows that work
- High-traffic, low-latency APIs: short fast-window (5–30 min) plus mid-window (1–6 h).
- Low-traffic services: use longer windows to avoid noise (1–6 h fast, 1–3 d slow).
- Compute the SLI as a ratio of rates (bad rate divided by total rate) or over rolling windows to smooth sudden batch effects; a rolling-window sketch follows.
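As a rough sketch of the rolling-window idea, the class below smooths the bad rate over a fixed number of buckets; the bucket size and counts are illustrative assumptions:

```python
from collections import deque

class RollingBadRate:
    """Track bad/total counts over the last `window_buckets` buckets
    (e.g. one bucket per 10 minutes) and report a smoothed bad rate."""

    def __init__(self, window_buckets: int):
        self.buckets = deque(maxlen=window_buckets)  # (bad, total) per bucket

    def add_bucket(self, bad: int, total: int) -> None:
        self.buckets.append((bad, total))

    def bad_rate(self) -> float:
        bad = sum(b for b, _ in self.buckets)
        total = sum(t for _, t in self.buckets)
        return bad / total if total else 0.0

# Six 10-minute buckets approximate a 1-hour rolling window for a low-traffic service.
window = RollingBadRate(window_buckets=6)
for bad, total in [(2, 500), (0, 480), (5, 510), (1, 495), (0, 505), (3, 490)]:
    window.add_bucket(bad, total)
print(round(window.bad_rate(), 4))  # smoothed bad rate over the whole hour
```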
Runbook starter template
- Context: What SLO and which windows fired?
- Impact: Which endpoints and users?
- Checks: recent deploys, error spikes by code, dependency health.
- Mitigation: rollback SHA, disable feature flag, increase capacity.
- Escalate: service owner, database on-call, SRE.
- Exit: burn rates back below the alert thresholds for 1–2 evaluation cycles.
Exercises (hands-on)
Work through each exercise, then check the solution that follows it.
Exercise 1 (Availability SLO)
SLO: 99.9% monthly. In last 1h: 200,000 requests; 800 bad. In last 6h: 1,200,000 requests; 2,400 bad. Page policy: BR1h ≥ 14 AND BR6h ≥ 7. Warning policy: BR2h ≥ 2 AND BR24h ≥ 1. Which alert, if any, fires?
Solution
Allowed bad rate = 0.1%. BR1h = 0.4/0.1 = 4. BR6h = 0.2/0.1 = 2. No paging alert. If longer windows confirm, this could generate a warning, not a page.
Exercise 2 (Latency SLO)
SLO: 99% requests ≤ 300 ms. Last 15m: 96% good. Last 6h: 99.2% good. Page policy: BR30m ≥ 14 AND BR6h ≥ 7. What happens, and what’s your next step?
Solution
Allowed bad rate = 1%. BR15m = 4/1 = 4. BR6h = 0.8/1 = 0.8. Neither page condition is met, so no alert fires. Next step: check recent deploys and dependency latency; consider a non-paging investigation if it persists.
Self-check checklist
- You computed allowed bad rate from the SLO target.
- You used burn rate = observed_bad_rate / allowed_bad_rate.
- You confirmed both windows for paging alerts.
- Your next action depends on alert severity and runbook steps.
Common mistakes and self-check
- Alerting on raw error counts instead of SLO burn rates. Self-check: Are alerts proportional to user-impact budget?
- Single-window pages that either miss fast burns or spam on noise. Self-check: Do you have a fast and a slower window?
- Vague SLI definitions that change over time. Self-check: Is “bad” precisely defined and version-controlled?
- No runbook. Self-check: Can on-call mitigate in 10 minutes with your playbook?
- Ignoring traffic patterns. Self-check: Do windows reflect service QPS and batch jobs?
Practical projects
- Implement a ratio-based SLI for your top 3 endpoints (availability and latency). Validate with real traffic.
- Create multi-window burn-rate alerts (page and warn). Include routing and a concise runbook.
- Run a game day: simulate 5xx spike and latency regression. Verify only the intended alerts fire and on-call can mitigate quickly.
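A rough game-day harness sketch, using synthetic traffic rather than a real load generator; the traffic volumes, error rates, and seed are illustrative assumptions. It injects a 5xx spike into the last hour of fake traffic and checks which alert the example policy would raise:

```python
import random

SLO = 0.999                   # 99.9% availability target
PAGE = {"1h": 14, "6h": 7}    # example page policy from above
WARN = {"2h": 2, "24h": 1}    # example warn policy from above

def burn_rate(bad: int, total: int) -> float:
    return (bad / total) / (1.0 - SLO)

def simulate_bucket(requests: int, error_rate: float) -> tuple:
    # Count synthetic 5xx responses for one hourly bucket of traffic.
    bad = sum(1 for _ in range(requests) if random.random() < error_rate)
    return bad, requests

random.seed(42)  # deterministic "game day"

# 23 hours of healthy baseline traffic, then a 2% 5xx spike in the last hour.
buckets = [simulate_bucket(10_000, 0.0005) for _ in range(23)]
buckets.append(simulate_bucket(10_000, 0.02))

def window_burn(hours: int) -> float:
    recent = buckets[-hours:]
    return burn_rate(sum(b for b, _ in recent), sum(t for _, t in recent))

rates = {"1h": window_burn(1), "2h": window_burn(2),
         "6h": window_burn(6), "24h": window_burn(24)}

def fires(policy: dict) -> bool:
    return all(rates[w] >= threshold for w, threshold in policy.items())

print({w: round(r, 1) for w, r in rates.items()})
# A one-hour 2% spike burns fast (BR1h around 20) but has not yet confirmed over 6 h,
# so expect a warn rather than a page from this policy.
print("page" if fires(PAGE) else "warn" if fires(WARN) else "none")
```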
Who this is for, prerequisites, and next steps
Who this is for
- API Engineers owning production services.
- Developers establishing on-call for the first time.
Prerequisites
- Basic metrics and logging familiarity.
- Ability to instrument code for success/latency counters or histograms.
Learning path
- Define clear SLIs per endpoint.
- Pick SLO targets and windows that match user expectations.
- Configure multi-window burn-rate alerts and routing.
- Write and test runbooks with the on-call team.
Next steps
- Automate SLO reports for weekly review.
- Tie error-budget policies to release gates (slow down when budget is low).
Mini challenge
Your service has 99.9% SLO (0.1% budget). Over 30 minutes, bad rate is 2%. Over 3 hours, bad rate is 0.9%. Do you page with policy BR30m ≥ 14 AND BR3h ≥ 7? Show your math, then write a one-paragraph mitigation plan.
Hint
Compute both burn rates: BR = observed_bad_rate / allowed_bad_rate.
Quick Test
Ready? Take the Quick Test below.