Why this matters
Example 3: Queue backlog
Batch processor target: messages processed within 5 minutes. Alerting:
- Page: age of the oldest message > 20 minutes, sustained for 15 minutes (user-facing delays are accumulating).
- Ticket: backlog count > 5x daily average for 2h (capacity trend).
Runbook steps: add consumers; if an upstream spike is the cause, enable rate limiting; if consumers are erroring, roll back the consumer version.
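The two backlog conditions above can be sketched as a small evaluator. This is a minimal illustration, not a real monitoring rule: it assumes one sample per minute, and the names Sample, sustained, and evaluate are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int            # minutes since observation started (assumed: one sample per minute)
    oldest_age_min: float  # age of the oldest unprocessed message, in minutes
    backlog_count: int     # number of messages waiting in the queue


def sustained(samples, predicate, duration_min):
    """Return True if predicate held for every sample in the trailing duration_min minutes."""
    if not samples:
        return False
    latest = samples[-1].minute
    window = [s for s in samples if s.minute > latest - duration_min]
    # Require history that spans the whole window, so a short history cannot fire.
    covered = samples[0].minute <= latest - duration_min + 1
    return covered and all(predicate(s) for s in window)


def evaluate(samples, daily_avg_count):
    """Apply the page and ticket conditions from Example 3."""
    page = sustained(samples, lambda s: s.oldest_age_min > 20, 15)
    ticket = sustained(samples, lambda s: s.backlog_count > 5 * daily_avg_count, 120)
    return {"page": page, "ticket": ticket}


healthy = [Sample(m, 2.0, 100) for m in range(30)]
stuck = [Sample(m, 25.0, 100) for m in range(30)]
print(evaluate(healthy, daily_avg_count=100))
print(evaluate(stuck, daily_avg_count=100))
```

Requiring the condition to hold across the whole window is what keeps a single slow batch from paging anyone.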
Templates you can adapt
- Symptom-first paging: Page when [user-facing metric] breaches [SLO-driven threshold] over both [short window] and [long window].
- Capacity ticket: Create ticket when [resource metric] exceeds [threshold] for [long window] without SLO impact.
- Batch delay page: Page when [age-of-oldest] exceeds [acceptable delay] for [sustained period].
Practical setup steps
- Define SLOs: choose success criteria (e.g., non-5xx responses) and set a target (e.g., 99.9%).
- Pick paging conditions: multi-window burn rates for request paths that matter.
- Create runbooks: one-page actions for each page type. Include rollback, feature flag, and contact lists.
- Design escalation: primary on-call, secondary, and duty manager. Set acknowledgment timeouts (e.g., 5 minutes).
- Set silences: use them during planned maintenance or noisy deploys. Make each one time-bound and record a reason.
- Test pages: fire a test alert during working hours; verify phone/app delivery, volume, and routing.
- Review weekly: remove noisy alerts, adjust thresholds, update runbooks with real incident notes.
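To make the first step concrete, here is a minimal sketch of how an availability SLI and the remaining error budget could be computed from request counts. It assumes the success criterion is "non-5xx response" and a 99.9% target, as in the example above; the function names are hypothetical.

```python
def availability(total_requests, server_errors):
    """Fraction of requests that did not return a 5xx response."""
    if total_requests == 0:
        return 1.0  # no traffic: treat as fully available
    return (total_requests - server_errors) / total_requests


def error_budget_remaining(total_requests, server_errors, slo_target=0.999):
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_errors = total_requests * (1 - slo_target)
    if allowed_errors == 0:
        return 0.0
    return 1 - server_errors / allowed_errors


# Illustrative numbers: 1,000,000 requests in the SLO window, 250 of them 5xx.
print(f"SLI: {availability(1_000_000, 250):.5f}")
print(f"Budget remaining: {error_budget_remaining(1_000_000, 250):.0%}")
```

At a 99.9% target, one million requests allow about 1,000 failures, so 250 errors spend roughly a quarter of the budget.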
Runbook skeleton
- What the alert means (symptom and likely causes)
- Immediate actions (roll back? scale? pause traffic?)
- Verification steps (dashboards to watch)
- Escalation contacts (DBA, networking, SRE)
- Mitigations vs. fixes (short-term vs. root cause)
- When to close (criteria)
Common mistakes and how to self-check
- Too many cause-based alerts: you page for CPU spikes with no user impact. Self-check: for every page, can you point to an SLO at risk?
- Noisy thresholds: flapping alerts. Self-check: does your alert require two windows or a sustained period?
- Ambiguous pages: the on-call does not know what to do. Self-check: does the alert link to a runbook that lists the first three actions?
- Weak escalation: pages sit unacknowledged. Self-check: simulate a no-ack scenario; does it auto-escalate within 5–10 minutes?
- Forgotten silences: alerts muted too long. Self-check: all silences have expiry and a clear reason.
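The silence self-check can be automated. The sketch below is a hypothetical audit over an in-memory list; the Silence record, the audit_silences function, and the 24-hour maximum are all assumptions, and a real setup would read silences from whatever alerting tool is in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Silence:
    matcher: str                    # which alerts the silence mutes
    reason: str                     # why it was created
    expires_at: Optional[datetime]  # None means it never expires


def audit_silences(silences, now, max_duration=timedelta(hours=24)):
    """Return (matcher, problem) pairs for silences that fail the self-check."""
    problems = []
    for s in silences:
        if not s.reason.strip():
            problems.append((s.matcher, "missing reason"))
        if s.expires_at is None:
            problems.append((s.matcher, "no expiry"))
        elif s.expires_at - now > max_duration:
            problems.append((s.matcher, "expires too far in the future"))
    return problems


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
silences = [
    Silence("service=checkout", "planned DB maintenance", now + timedelta(hours=2)),
    Silence("service=search", "", None),
]
for matcher, problem in audit_silences(silences, now):
    print(f"{matcher}: {problem}")
```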
Exercises
These mirror the graded exercises below so you can draft answers here first.
Exercise 1: Design an SLO-based page for 5xx errors
- Assume a 99.9% SLO measured over a 28-day window. Define one critical page and one warning alert using multi-window burn rates.
- Write the first three runbook steps the on-call should take.
- Checklist: uses two windows per alert, references SLO/budget, clearly separates page vs. ticket, includes immediate rollback check.
Hint
Use a fast window (5–30 min) and a slower window (1–6 h). The critical page should catch a burn rate that would exhaust the budget quickly; the warning should catch a slower, sustained burn.
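The arithmetic behind a multi-window burn-rate check can be sketched as follows. Burn rate here means the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO period. The threshold of 14.4 in the usage lines is an illustrative assumption, not a prescribed answer to the exercise, and the function names are hypothetical.

```python
SLO_TARGET = 0.999
PERIOD_HOURS = 28 * 24  # 28-day SLO window from the exercise


def burn_rate(error_ratio, slo_target=SLO_TARGET):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def should_fire(long_window_ratio, short_window_ratio, threshold, slo_target=SLO_TARGET):
    """Fire only when both windows exceed the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, which lets the alert reset quickly.
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


def budget_fraction_consumed(rate, window_hours, period_hours=PERIOD_HOURS):
    """Share of the whole period's budget spent at this burn rate over the window."""
    return rate * window_hours / period_hours


# Illustrative: 2% of requests failing in both the 1 h and 5 min windows.
print(round(burn_rate(0.02), 2))
print(should_fire(0.02, 0.02, threshold=14.4))
print(f"{budget_fraction_consumed(14.4, 1):.1%} of the 28-day budget per hour")
```

If the short window has already recovered, should_fire returns False even when the long window is still elevated, which avoids paging for an incident that has ended.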
Exercise 2: On-call rotation and escalation
- Create a weekly rotation with primary and secondary engineers.
- Set acknowledgment and escalation timings.
- Describe a quiet-hours policy and how to handle non-urgent alerts overnight.
- Checklist: hours defined, handoff time set, ack timeout, escalation path, overnight policy.
Hint
Typical ack timeout is 5 minutes; escalate at 10–15 minutes to secondary; duty manager after 20–30 minutes.
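A rotation and escalation policy like the one in this exercise can be expressed in a few lines. The sketch below is illustrative: the names are hypothetical, the engineer list is made up, and the 10- and 20-minute escalation points are the lower ends of the ranges in the hint.

```python
def on_call_pair(engineers, week_index):
    """Pick primary and secondary for a week; next week's primary is this week's secondary."""
    n = len(engineers)
    if n < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    primary = engineers[week_index % n]
    secondary = engineers[(week_index + 1) % n]
    return primary, secondary


def notified_roles(minutes_unacked, secondary_after=10, manager_after=20):
    """Roles that have been paged after a page has gone unacknowledged this long."""
    roles = ["primary"]
    if minutes_unacked >= secondary_after:
        roles.append("secondary")
    if minutes_unacked >= manager_after:
        roles.append("duty manager")
    return roles


team = ["ana", "bo", "chen"]
print(on_call_pair(team, week_index=0))
print(on_call_pair(team, week_index=2))
print(notified_roles(3))
print(notified_roles(12))
print(notified_roles(25))
```

Rotating so that the secondary becomes the next primary gives each engineer a week of context before they take the pager.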
Practical projects
- Project 1: Convert one noisy CPU alert into a symptom-based alert tied to latency or SLO. Measure page volume before/after.
- Project 2: Write or update three runbooks (errors, saturation, backlog). Have a teammate dry-run them.
- Project 3: Implement a weekly on-call handoff checklist and a retro to review last week’s pages and fixes.
Learning path
- Start: SLOs and error budgets for your service.
- Next: Alert design (multi-window burn rates, symptom-first).
- Then: On-call processes (rotation, escalation, silences, runbooks).
- Advance: Incident command and post-incident reviews.
Mini challenge
Choose one service you own. In one hour, draft: one paging alert, one ticket-only alert, and a 10-line runbook for each. Share with a teammate for review and do a dry-run by simulating each alert.
Next steps
- Finish the exercises below, then take the quick test.
- Schedule a test page to verify your escalation path works end-to-end.
- Add a recurring calendar reminder to review alerts weekly.