Why this matters
As a Platform Engineer, you will be paged when production is unstable. Clear on-call practices reduce downtime, protect customer trust, and prevent burnout. You will rotate with your team, acknowledge alerts quickly, triage using runbooks, communicate status, and restore service safely.
- Real tasks: respond to pages, run incident bridges, coordinate rollback, escalate to specialists, and write post-incident reviews.
- Outcomes: lower MTTA/MTTR, fewer false pages, predictable rotations, and reliable handoffs.
Concept explained simply
On-call is a shared, time-boxed responsibility to restore service when things break. It combines people (roles), process (how we respond), and tooling (alerts, runbooks, dashboards).
Mental model
Think of on-call like a fire brigade: one person leads (Incident Commander), responders follow playbooks, the alarm only triggers for real fires (symptoms), and after each event you make the station safer (post-incident actions).
Core components you should master
Rotations and coverage
- Clear schedule with primary, secondary, and manager-on-duty (optional).
- Fair load: avoid long stretches; publish swap rules in advance.
- Handoffs: daily or weekly with a structured checklist.
Severity levels and roles
- Sev1: critical, widespread impact; Sev2: significant degradation; Sev3: minor or edge-case issues.
- Roles: Incident Commander (IC), Communications Lead, Subject Matter Experts (SMEs), Scribe.
Alerting principles
- Page on symptoms customers feel (e.g., elevated 5xx, latency SLO breach), not on low-level causes.
- Rate-limit and group alerts; set quiet hours for non-urgent notifications.
- Base paging on SLO/error-budget burn to avoid noise (a minimal sketch follows this list).
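To make symptom-based paging concrete, here is a minimal sketch in Python, assuming hypothetical metric values pulled from your monitoring system; the thresholds are illustrative, not prescribed.

```python
from dataclasses import dataclass

# Hypothetical metric snapshot; in practice these values come from your monitoring system.
@dataclass
class WindowedMetrics:
    error_rate_5xx: float   # fraction of requests returning 5xx over the window
    p95_latency_ms: float   # 95th-percentile latency over the window
    window_minutes: int     # how long the condition has held

LATENCY_SLO_MS = 700          # example SLO threshold (assumption)
ERROR_RATE_THRESHOLD = 0.02   # 2% 5xx rate (assumption)
SUSTAINED_MINUTES = 10        # require the symptom to persist before paging

def should_page(m: WindowedMetrics) -> bool:
    """Page only on customer-visible symptoms sustained long enough to matter."""
    symptomatic = (
        m.error_rate_5xx > ERROR_RATE_THRESHOLD
        or m.p95_latency_ms > LATENCY_SLO_MS
    )
    return symptomatic and m.window_minutes >= SUSTAINED_MINUTES

# A sustained latency SLO breach pages; healthy symptoms never do, regardless of internal causes.
print(should_page(WindowedMetrics(error_rate_5xx=0.005, p95_latency_ms=950, window_minutes=12)))  # True
print(should_page(WindowedMetrics(error_rate_5xx=0.001, p95_latency_ms=300, window_minutes=30)))  # False
```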
Runbooks and playbooks
- 1-page, actionable steps with rollback, known issues, and validation checks.
- Copy-pastable commands, safe-default flags, and annotated screenshots.
Escalation and communication
- Hard timeboxes: acknowledge within 5 minutes (TTA); escalate if there is no progress after 15–20 minutes (see the sketch after this list).
- Single source of truth: incident channel + status note updated on a cadence (e.g., every 15 min for Sev1).
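A minimal sketch of the timebox check, assuming you record when the page fired, when it was acknowledged, and when the last meaningful progress happened; the 5- and 20-minute limits are the example values above.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_TIMEBOX = timedelta(minutes=5)        # page must be acknowledged within 5 minutes
PROGRESS_TIMEBOX = timedelta(minutes=20)  # escalate if no progress within 20 minutes

def escalation_needed(paged_at: datetime,
                      acked_at: Optional[datetime],
                      last_progress_at: Optional[datetime],
                      now: datetime) -> bool:
    """Return True if the incident should be escalated to the secondary or an SME."""
    # No acknowledgement inside the ack timebox -> escalate.
    if acked_at is None:
        return now - paged_at > ACK_TIMEBOX
    # Acknowledged, but no recorded progress inside the progress timebox -> escalate.
    reference = last_progress_at or acked_at
    return now - reference > PROGRESS_TIMEBOX

# Example: acknowledged quickly, but 25 minutes with no progress -> escalate.
t0 = datetime(2024, 1, 1, 2, 13)
print(escalation_needed(t0, t0 + timedelta(minutes=2), None, t0 + timedelta(minutes=27)))  # True
```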
Metrics that matter
- MTTA (mean time to acknowledge), MTTR (mean time to restore), incident rate, alert noise (pages per shift), false-positive rate.
- Error-budget burn rate as the paging trigger (see the sketch after this list).
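A minimal sketch of computing these metrics from incident timestamps, plus an error-budget burn-rate check; the field names and the 99.9% SLO are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    paged_at: datetime
    acked_at: datetime
    restored_at: datetime

def mtta_minutes(incidents: list[Incident]) -> float:
    """Mean time to acknowledge, in minutes."""
    return mean((i.acked_at - i.paged_at).total_seconds() / 60 for i in incidents)

def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to restore, in minutes."""
    return mean((i.restored_at - i.paged_at).total_seconds() / 60 for i in incidents)

def burn_rate(observed_error_ratio: float, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed relative to plan.
    A burn rate of 1.0 spends the budget exactly over the SLO window;
    a common practice is to page only on fast, sustained burn."""
    budget = 1.0 - slo_target
    return observed_error_ratio / budget

# Example: one incident paged at 02:13, acked at 02:16, restored at 02:58.
incidents = [Incident(datetime(2024, 1, 1, 2, 13), datetime(2024, 1, 1, 2, 16), datetime(2024, 1, 1, 2, 58))]
print(mtta_minutes(incidents), mttr_minutes(incidents))  # 3.0 45.0

# 0.5% errors against a 99.9% SLO burns the budget 5x faster than planned.
print(round(burn_rate(0.005), 1))  # 5.0
```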
Worked examples
Example 1: Latency spike at 02:13
- Acknowledge page immediately.
- Open the service runbook "High latency". Check dashboards: p95 latency, 5xx rate, saturation.
- Roll back last deploy if it happened < 1 hour ago and correlates with the spike.
- If no improvement in 15 min, escalate to database SME. IC posts updates every 15 min.
- After resolution: note the suspected root cause and create a ticket to add cache metrics to the runbook.
Example 2: Flapping CPU alert (noise)
- Alert fires at 11:00 and clears repeatedly; no customer impact.
- Classify as non-paging: convert to informational dashboard alert.
- Add a multi-condition rule: page only if CPU > 85% AND p95 latency exceeds the SLO for 10 min.
- Add a 15-minute alert suppression window after deploys to avoid transient spikes (a sketch of both rules follows this example).
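A minimal sketch of the tuned rule from Example 2, assuming you can query current CPU, p95 latency, and the time of the last deploy; the thresholds and the 15-minute suppression window are the ones proposed above.

```python
from datetime import datetime, timedelta

CPU_THRESHOLD = 0.85              # 85% CPU
LATENCY_SLO_MS = 700              # p95 latency SLO (assumed value)
SUSTAINED = timedelta(minutes=10)
DEPLOY_SUPPRESSION = timedelta(minutes=15)

def should_page(cpu: float,
                p95_ms: float,
                condition_held_for: timedelta,
                last_deploy_at: datetime,
                now: datetime) -> bool:
    """Page only when CPU AND latency are both bad, sustained, and not right after a deploy."""
    # Suppress pages in the window immediately after a deploy to ignore transient spikes.
    if now - last_deploy_at < DEPLOY_SUPPRESSION:
        return False
    both_bad = cpu > CPU_THRESHOLD and p95_ms > LATENCY_SLO_MS
    return both_bad and condition_held_for >= SUSTAINED

# Example: high CPU alone (the old flapping alert) no longer pages.
now = datetime(2024, 1, 1, 11, 0)
print(should_page(0.92, 450, timedelta(minutes=12), now - timedelta(hours=2), now))  # False
```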
Example 3: Partial outage in EU region
- Declare Sev2; assign IC and Comms Lead.
- Fail traffic over to another region using documented playbook.
- Coordinate with network SME to validate health checks and BGP routes.
- Post customer-facing update via status template (Comms Lead).
- After: hold a post-incident review; one tracked action is to automate a weekly regional failover test.
Step-by-step: set up your on-call practice
- Define severities: Agree on Sev1–Sev3 definitions and paging expectations.
- Create roles: IC, Comms, SMEs; publish responsibilities.
- Catalog alerts: For each alert, state the signal, the customer impact, who is paged, and the runbook link (see the catalog sketch after this list).
- Write runbooks: Start with your top 5 incidents. Keep them short and actionable.
- Schedule rotation: Primary/secondary coverage with fair shifts and a handoff routine.
- Drill: Run a 30-minute game day monthly using a past incident.
- Improve: Track MTTA/MTTR and noise; tune alerts every sprint.
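A minimal sketch of an alert catalog entry kept as structured data the team can review; every name and URL here is an illustrative placeholder.

```python
from dataclasses import dataclass

@dataclass
class AlertCatalogEntry:
    name: str             # alert identifier
    signal: str           # what is measured
    customer_impact: str  # why a human should care at 2am
    pages: str            # who gets paged, and who is next in the escalation
    runbook_url: str      # one-page runbook for this alert

CATALOG = [
    AlertCatalogEntry(
        name="checkout-latency-slo-breach",
        signal="p95 latency > 800ms AND 5xx > 2% for 10 min",
        customer_impact="Checkout is slow or failing; revenue at risk",
        pages="checkout-primary (escalate: checkout-secondary, DB-SME)",
        runbook_url="https://runbooks.example.internal/checkout/high-latency",  # placeholder URL
    ),
]

# Review pass: every paging alert must state its impact and link a runbook.
for entry in CATALOG:
    assert entry.customer_impact and entry.runbook_url, f"{entry.name} is missing impact or runbook"
```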
Exercises
Do these now; they mirror the Quick Test and real work.
Exercise 1: Draft a 1-page runbook
Service: Checkout API. Alert: "p95 latency > 800ms for 10 min and error rate > 2%." Create a concise runbook.
- Sections to include: Trigger, Quick checks, Rollback, Mitigations, Validation, Escalation, Notes.
- Keep copy-paste commands and clear stop conditions.
Solution
Title: Checkout API — High Latency & Error Rate
Trigger: p95 > 800ms for 10m AND 5xx > 2%
TTA target: <5m | Escalate if no improvement in 20m
Quick checks:
- Dashboard: p95, 5xx, QPS, DB CPU, queue depth
- Recent deploy? ops/deploy history last 60m
- Dependency health: payments, user, DB
Mitigations (pick one, then validate):
- Roll back last deploy: deployctl rollback checkout@prev
- Scale out by two replicas: kubectl scale deploy/checkout --replicas=<current+2>
- Enable read cache: feature flag cache_reads=true
- Throttle background jobs: jobs set --queue low --rate 50%
Validation:
- p95 < 600ms for 10m, 5xx < 1%
Escalation:
- If DB CPU > 85%: page DB-SME
- If payments dependency errors: page Payments-SME
Notes:
- Suppress deploys during incident: change-freeze apply --sev2
- Post-incident: add DB query plan to runbook
Exercise 2: Tune a noisy alert
Current alert: "CPU > 75% for 2 min" paging 30 times/week. No demonstrated customer impact.
- Propose a new alert condition using a symptom and a duration that reduces noise.
- Add a deploy-suppression rule and a rate limit.
Solution
New paging rule:
- Page if: CPU > 85% AND p95 latency > SLO (700ms) for 10 min
- Otherwise: send as info to channel, no page
Deploy suppression:
- Mute CPU/latency alerts for 15 min after a deploy to checkout
Rate limit:
- Max 1 page per incident key per 30 min (group by service+region; see the sketch after this solution)
Additional:
- Add runbook link: "CPU/Latency correlation checks"
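A minimal sketch of the rate limit, assuming pages are grouped by an incident key built from service and region; the 30-minute window matches the solution above.

```python
from datetime import datetime, timedelta

RATE_LIMIT_WINDOW = timedelta(minutes=30)
_last_page_at: dict[str, datetime] = {}  # incident key -> when we last paged for it

def incident_key(service: str, region: str) -> str:
    """Group related pages so one ongoing problem produces one page stream."""
    return f"{service}:{region}"

def allow_page(service: str, region: str, now: datetime) -> bool:
    """Allow at most one page per incident key per 30 minutes."""
    key = incident_key(service, region)
    last = _last_page_at.get(key)
    if last is not None and now - last < RATE_LIMIT_WINDOW:
        return False
    _last_page_at[key] = now
    return True

# Example: the second page for the same service+region within 30 minutes is dropped.
t0 = datetime(2024, 1, 1, 11, 0)
print(allow_page("checkout", "eu-west-1", t0))                          # True
print(allow_page("checkout", "eu-west-1", t0 + timedelta(minutes=5)))   # False
print(allow_page("checkout", "eu-west-1", t0 + timedelta(minutes=40)))  # True
```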
On-call readiness checklist
- Rotation documented with primary/secondary and escalation tree.
- Top 5 incidents have 1-page runbooks.
- Alerts page on symptoms and are grouped and rate-limited.
- Handoff template used at shift start and end.
- IC playbook with comms cadence is published.
- Post-incident reviews produce tracked actions.
- Noise budget tracked (pages per shift) and reviewed monthly.
Common mistakes and self-check
- Mistake: Paging on CPU/disk without user impact. Self-check: Does alert map to SLO or user symptom?
- Mistake: Long, outdated runbooks. Self-check: Can a newcomer act within 2 minutes?
- Mistake: No single leader during incidents. Self-check: Is IC role assigned every time?
- Mistake: Silent escalations. Self-check: Are timeboxes and next steps explicit?
- Mistake: Burnout from excessive pages. Self-check: Track pages/shift and rotate fairly; enforce recovery time.
Practical projects
- Runbook sprint: create/update the top 5 incident runbooks to 1-page actionable format.
- Alert audit: convert 3 cause-based pages to symptom-based conditions with grouping.
- Game day: simulate a Sev2 latency incident, measure MTTA/MTTR, and capture improvements.
- Handoff revamp: adopt a structured handoff template and run a dry run.
Who this is for
- Platform/Backend Engineers joining or improving an on-call rotation.
- SREs formalizing incident processes.
Prerequisites
- Basic service observability (metrics, logs, traces).
- Familiarity with your deployment and rollback process.
Learning path
- Learn SLOs and symptom-based alerting.
- Write 1-page runbooks with safe mitigations.
- Practice roles: IC, Comms, SME in a mock incident.
- Measure and reduce MTTA/MTTR; tune alert noise.
- Automate: templates, suppression, rate limits, and dashboards.
Next steps
- Adopt the readiness checklist for your team this week.
- Schedule a 30-minute game day within 2 weeks.
- Take the Quick Test below to validate understanding.
Mini challenge
In 30 minutes, convert any one cause-based page into a symptom-based rule with a runbook link and a rate limit. Share the before/after and measure pages/shift for the next week.