
On Call Practices

Learn On Call Practices for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Note: The quick test is available to everyone. Only logged-in users will have their progress saved.

Why this matters

As a Platform Engineer, you will be paged when production is unstable. Clear on-call practices reduce downtime, protect customer trust, and prevent burnout. You will rotate with your team, acknowledge alerts quickly, triage using runbooks, communicate status, and restore service safely.

  • Real tasks: respond to pages, run incident bridges, coordinate rollback, escalate to specialists, and write post-incident reviews.
  • Outcomes: lower MTTA/MTTR, fewer false pages, predictable rotations, and reliable handoffs.

Concept explained simply

On-call is a shared, time-boxed responsibility to restore service when things break. It combines people (roles), process (how we respond), and tooling (alerts, runbooks, dashboards).

Mental model

Think of on-call like a fire brigade: one person leads (Incident Commander), responders follow playbooks, the alarm only triggers for real fires (symptoms), and after each event you make the station safer (post-incident actions).

Core components you should master

Rotations and coverage
  • Clear schedule with primary, secondary, and manager-on-duty (optional).
  • Fair load: avoid long stretches; publish swap rules in advance.
  • Handoffs: daily or weekly with a structured checklist.
Severity levels and roles
  • Sev1: critical, widespread impact; Sev2: degraded service; Sev3: minor or edge-case impact.
  • Roles: Incident Commander (IC), Communications Lead, Subject Matter Experts (SMEs), Scribe.
Alerting principles
  • Page on symptoms customers feel (e.g., elevated 5xx, latency SLO breach), not on low-level causes.
  • Rate-limit and group alerts; set quiet hours for non-urgent notifications.
  • Base paging on SLO error-budget burn rate to avoid noise (see the burn-rate sketch after this list).
Runbooks and playbooks
  • 1-page, actionable steps with rollback, known issues, and validation checks.
  • Copy-pastable commands, safe-default flags, and annotated screenshots.
Escalation and communication
  • Hard timeboxes: acknowledge within 5 min (TTA); escalate if no progress in 15–20 min.
  • Single source of truth: incident channel + status note updated on a cadence (e.g., every 15 min for Sev1).
Metrics that matter
  • MTTA (acknowledge), MTTR (restore), incident rate, alert noise (pages per shift), false-positive rate.
  • Error budget burn rate to trigger paging.
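
To make the burn-rate idea concrete, here is a minimal Python sketch of multi-window burn-rate paging. The thresholds (14.4 and 3) are common example values for this pattern, the SLO target is an example, and the function names are illustrative rather than a specific monitoring tool's API.

# Sketch: SLO error-budget burn-rate paging (illustrative names, not a real tool's API).

SLO_TARGET = 0.999                      # example: 99.9% availability SLO
ALLOWED_ERROR_RATIO = 1 - SLO_TARGET    # the error budget, as a ratio

def burn_rate(error_ratio: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    return error_ratio / ALLOWED_ERROR_RATIO

def paging_decision(short_window_ratio: float, long_window_ratio: float) -> str:
    """Multi-window check: require BOTH a short and a long window to burn fast,
    which filters brief blips while still catching sustained incidents."""
    if burn_rate(short_window_ratio) > 14.4 and burn_rate(long_window_ratio) > 14.4:
        return "page"       # fast burn: the budget would be gone within days
    if burn_rate(short_window_ratio) > 3 and burn_rate(long_window_ratio) > 3:
        return "notify"     # slow burn: ticket or channel message, no page
    return "ignore"

# Example: 1.5% errors over both a short and a long window against a 99.9% SLO -> page.
print(paging_decision(short_window_ratio=0.015, long_window_ratio=0.015))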

Worked examples

Example 1: Latency spike at 02:13

  1. Acknowledge page immediately.
  2. Open the service runbook "High latency". Check dashboards: p95 latency, 5xx rate, saturation.
  3. Roll back the last deploy if it happened less than 1 hour ago and correlates with the spike (see the decision sketch after this example).
  4. If there is no improvement within 15 min, escalate to the database SME. The IC posts updates every 15 min.
  5. After resolution: note the suspected root cause and create a ticket to add cache metrics to the runbook.
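
The rollback-or-escalate decision in steps 3–4 can be expressed as a small helper. This is a sketch with assumed inputs (deploy and spike timestamps), not a real tool:

# Sketch of the rollback-or-escalate decision from steps 3-4 (assumed inputs, not a real tool).
from datetime import datetime, timedelta, timezone

ROLLBACK_WINDOW = timedelta(hours=1)
ESCALATE_AFTER = timedelta(minutes=15)

def triage(last_deploy_at: datetime, spike_started_at: datetime, now: datetime) -> str:
    """Roll back only when the last deploy is recent AND precedes the spike;
    otherwise keep working and escalate once the timebox expires."""
    deploy_is_recent = now - last_deploy_at < ROLLBACK_WINDOW
    deploy_correlates = last_deploy_at <= spike_started_at
    if deploy_is_recent and deploy_correlates:
        return "roll back last deploy"
    if now - spike_started_at > ESCALATE_AFTER:
        return "escalate to database SME"
    return "keep investigating within the timebox"

now = datetime(2026, 1, 23, 2, 30, tzinfo=timezone.utc)
print(triage(last_deploy_at=now - timedelta(minutes=25),
             spike_started_at=now - timedelta(minutes=17),
             now=now))                      # -> roll back last deploy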

Example 2: Flapping CPU alert (noise)

  1. Alert fires at 11:00 and clears repeatedly; no customer impact.
  2. Classify as non-paging: convert to informational dashboard alert.
  3. Add a multi-condition rule: page only if CPU > 85% AND p95 latency > SLO for 10 min.
  4. Implement a 15 min alert suppression window after deploys to avoid transient spikes (both rules are sketched below).
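
A minimal sketch of the multi-condition rule and deploy suppression from steps 3–4. The thresholds follow this example, and the 700 ms p95 SLO value is an assumption (it matches the Exercise 2 solution later on):

# Sketch of the noise-reducing rule from steps 3-4 (illustrative, not a monitoring tool's config).
CPU_THRESHOLD = 0.85          # page threshold from step 3
LATENCY_SLO_MS = 700          # assumed p95 SLO value
SUSTAIN_MINUTES = 10
DEPLOY_MUTE_MINUTES = 15

def should_page(cpu: float, p95_ms: float, breach_minutes: float,
                minutes_since_deploy: float | None) -> bool:
    """Page only when BOTH symptoms breach (CPU AND latency), the breach has been
    sustained for 10 min, and we are outside the 15 min post-deploy mute window."""
    if minutes_since_deploy is not None and minutes_since_deploy < DEPLOY_MUTE_MINUTES:
        return False                                   # deploy suppression
    both_breaching = cpu > CPU_THRESHOLD and p95_ms > LATENCY_SLO_MS
    return both_breaching and breach_minutes >= SUSTAIN_MINUTES

# The 11:00 flapping alert: CPU is high but latency is healthy, so nobody is paged.
print(should_page(cpu=0.91, p95_ms=320, breach_minutes=12, minutes_since_deploy=None))  # False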

Example 3: Partial outage in EU region

  1. Declare Sev2; assign IC and Comms Lead.
  2. Fail traffic over to another region using the documented playbook (a staged-shift sketch follows this example).
  3. Coordinate with network SME to validate health checks and BGP routes.
  4. Post customer-facing update via status template (Comms Lead).
  5. Afterwards: run a post-incident review with actions, e.g., automate a weekly regional failover test.
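
One way to think about steps 2–3 is a gate plus a staged shift: only move traffic once the target region is verified healthy, then move it in increments rather than all at once. The sketch below uses assumed thresholds and no real tooling:

# Sketch of a failover gate and staged traffic shift for steps 2-3 (assumed thresholds, no real tooling).

def safe_to_fail_over(target_healthy: bool, target_error_rate: float, target_saturation: float) -> bool:
    """Only shift EU traffic when the target region passes health checks and has headroom."""
    return target_healthy and target_error_rate < 0.01 and target_saturation < 0.60

def traffic_shift_plan(total_share: int = 100, step: int = 25) -> list[int]:
    """Move traffic in steps, validating dashboards between each, instead of flipping 100% at once."""
    return list(range(step, total_share + 1, step))

if safe_to_fail_over(target_healthy=True, target_error_rate=0.004, target_saturation=0.45):
    print(traffic_shift_plan())             # -> [25, 50, 75, 100]
else:
    print("Hold: escalate to the network SME before moving traffic")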

Step-by-step: set up your on-call practice

  1. Define severities: Agree on Sev1–Sev3 definitions and paging expectations.
  2. Create roles: IC, Comms, SMEs; publish responsibilities.
  3. Catalog alerts: for each alert, state the signal, customer impact, who is paged, and the runbook link (an example entry follows this list).
  4. Write runbooks: Start with your top 5 incidents. Keep them short and actionable.
  5. Schedule rotation: Primary/secondary coverage with fair shifts and a handoff routine.
  6. Drill: Run a 30-minute game day monthly using a past incident.
  7. Improve: Track MTTA/MTTR and noise; tune alerts every sprint.
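
For step 3, the catalog is easiest to keep honest when each alert is a structured record stored next to the alert definitions. The schema below is illustrative, not a specific tool's format, and the runbook URL is a placeholder:

# Sketch of one alert-catalog entry for step 3 (illustrative schema, placeholder URL).
from dataclasses import dataclass

@dataclass
class AlertCatalogEntry:
    name: str               # alert identifier as it appears on the pager
    signal: str             # what is measured
    customer_impact: str    # why a human should care at 02:00
    severity: str           # Sev1-Sev3 per the team's definitions
    pages: str              # which rotation gets paged
    runbook_url: str        # the 1-page runbook the responder opens first

checkout_latency = AlertCatalogEntry(
    name="checkout-p95-latency-slo-breach",
    signal="p95 latency > 800ms for 10 min AND error rate > 2%",
    customer_impact="Customers see slow or failed checkouts",
    severity="Sev2",
    pages="platform-primary",
    runbook_url="https://runbooks.example.internal/checkout/high-latency",  # placeholder
)
print(checkout_latency.runbook_url)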

Exercises

Do these now; they mirror the Quick Test and real work.

Exercise 1: Draft a 1-page runbook

Service: Checkout API. Alert: "p95 latency > 800ms for 10 min and error rate > 2%." Create a concise runbook.

  • Sections to include: Trigger, Quick checks, Rollback, Mitigations, Validation, Escalation, Notes.
  • Keep copy-paste commands and clear stop conditions.
Solution
Title: Checkout API — High Latency & Error Rate
Trigger: p95 > 800ms for 10m AND 5xx > 2%
TTA target: <5m | Escalate if no improvement in 20m
Quick checks:
 - Dashboard: p95, 5xx, QPS, DB CPU, queue depth
 - Recent deploy? ops/deploy history last 60m
 - Dependency health: payments, user, DB
Mitigations (pick one, then validate):
 - Roll back last deploy: deployctl rollback checkout@prev
 - Scale out: kubectl scale deployment/checkout --replicas=<current+2>
 - Enable read cache: feature flag cache_reads=true
 - Throttle background jobs: jobs set --queue low --rate 50%
Validation:
 - p95 < 600ms for 10m, 5xx < 1%
Escalation:
 - If DB CPU > 85%: page DB-SME
 - If payments dependency errors: page Payments-SME
Notes:
 - Suppress deploys during incident: change-freeze apply --sev2
 - Post-incident: add DB query plan to runbook
    

Exercise 2: Tune a noisy alert

Current alert: "CPU > 75% for 2 min" paging 30 times/week. No demonstrated customer impact.

  • Propose a new alert condition using a symptom and a duration that reduces noise.
  • Add a deploy-suppression rule and a rate limit.
Solution
New paging rule:
 - Page if: CPU > 85% AND p95 latency > SLO (700ms) for 10 min
 - Otherwise: send as info to channel, no page
Deploy suppression:
 - Mute CPU/latency alerts for 15 min after a deploy to checkout
Rate limit:
 - Max 1 page per incident key per 30 min (group by service+region); see the rate-limit sketch after this solution
Additional:
 - Add runbook link: "CPU/Latency correlation checks"
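
A minimal sketch of the grouping and rate-limit rule above, using an in-memory map keyed by service+region. Real paging tools express this as configuration; the code only shows the logic:

# Sketch: at most 1 page per incident key per 30 min, grouped by service+region (in-memory, illustrative).
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=30)
_last_page: dict[str, datetime] = {}

def maybe_page(service: str, region: str, now: datetime) -> bool:
    """Group alerts under one incident key and allow at most one page per window."""
    key = f"{service}:{region}"
    last = _last_page.get(key)
    if last is not None and now - last < WINDOW:
        return False            # already paged for this incident key recently
    _last_page[key] = now
    return True

now = datetime(2026, 1, 23, 11, 0, tzinfo=timezone.utc)
print(maybe_page("checkout", "eu-west-1", now))                          # True: first page
print(maybe_page("checkout", "eu-west-1", now + timedelta(minutes=5)))   # False: rate-limited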
    

On-call readiness checklist

  • Rotation documented with primary/secondary and escalation tree.
  • Top 5 incidents have 1-page runbooks.
  • Alerts page on symptoms and are grouped and rate-limited.
  • Handoff template used at shift start and end.
  • IC playbook with comms cadence is published.
  • Post-incident reviews produce tracked actions.
  • Noise budget tracked (pages per shift) and reviewed monthly.

Common mistakes and self-check

  • Mistake: Paging on CPU/disk without user impact. Self-check: Does alert map to SLO or user symptom?
  • Mistake: Long, outdated runbooks. Self-check: Can a newcomer act within 2 minutes?
  • Mistake: No single leader during incidents. Self-check: Is IC role assigned every time?
  • Mistake: Silent escalations. Self-check: Are timeboxes and next steps explicit?
  • Mistake: Burnout from excessive pages. Self-check: Track pages/shift and rotate fairly; enforce recovery time.

Practical projects

  1. Runbook sprint: create/update the top 5 incident runbooks to 1-page actionable format.
  2. Alert audit: convert 3 cause-based pages to symptom-based conditions with grouping.
  3. Game day: simulate a Sev2 latency incident, measure MTTA/MTTR (see the computation sketch after this list), and capture improvements.
  4. Handoff revamp: adopt a structured handoff template and run a dry run.
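
For project 3, MTTA and MTTR fall straight out of incident timestamps. A sketch with a made-up record shape and sample data:

# Sketch: compute MTTA/MTTR from incident timestamps (made-up record shape and sample data).
from datetime import datetime, timedelta

incidents = [
    # (paged_at, acknowledged_at, restored_at)
    (datetime(2026, 1, 23, 2, 13), datetime(2026, 1, 23, 2, 16), datetime(2026, 1, 23, 2, 47)),
    (datetime(2026, 1, 24, 14, 5), datetime(2026, 1, 24, 14, 7), datetime(2026, 1, 24, 14, 32)),
]

def mean_minutes(deltas: list[timedelta]) -> float:
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mtta = mean_minutes([ack - paged for paged, ack, _ in incidents])            # mean time to acknowledge
mttr = mean_minutes([restored - paged for paged, _, restored in incidents])  # mean time to restore
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")                         # MTTA: 2.5 min, MTTR: 30.5 min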

Who this is for

  • Platform/Backend Engineers joining or improving an on-call rotation.
  • SREs formalizing incident processes.

Prerequisites

  • Basic service observability (metrics, logs, traces).
  • Familiarity with your deployment and rollback process.

Learning path

  1. Learn SLOs and symptom-based alerting.
  2. Write 1-page runbooks with safe mitigations.
  3. Practice roles: IC, Comms, SME in a mock incident.
  4. Measure and reduce MTTA/MTTR; tune alert noise.
  5. Automate: templates, suppression, rate limits, and dashboards.

Next steps

  • Adopt the readiness checklist for your team this week.
  • Schedule a 30-minute game day within 2 weeks.
  • Take the Quick Test below to validate understanding.

Mini challenge

In 30 minutes, convert any one cause-based page into a symptom-based rule with a runbook link and a rate limit. Share the before/after and measure pages/shift for the next week.

Practice Exercises

2 exercises to complete

Instructions

Create a 1-page runbook for the alert: "p95 latency > 800ms for 10 min and error rate > 2%" for Checkout API.

  • Include: Trigger, Quick checks, Mitigations (rollback/scale/flag), Validation, Escalation, Notes.
  • Use copy-paste commands and state clear stop conditions.
Expected Output
A concise, actionable runbook with step-by-step mitigations and validation criteria.

On Call Practices — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

