Why this matters
As a Platform Engineer, you will be paged when production is unstable. Clear on-call practices reduce downtime, protect customer trust, and prevent burnout. You will rotate with your team, acknowledge alerts quickly, triage using runbooks, communicate status, and restore service safely.
- Real tasks: respond to pages, run incident bridges, coordinate rollback, escalate to specialists, and write post-incident reviews.
- Outcomes: lower MTTA/MTTR, fewer false pages, predictable rotations, and reliable handoffs.
Concept explained simply
On-call is a shared, time-boxed responsibility to restore service when things break. It combines people (roles), process (how we respond), and tooling (alerts, runbooks, dashboards).
Mental model
Think of on-call like a fire brigade: one person leads (Incident Commander), responders follow playbooks, the alarm only triggers for real fires (symptoms), and after each event you make the station safer (post-incident actions).
Core components you should master
Rotations and coverage
- Clear schedule with primary, secondary, and manager-on-duty (optional).
- Fair load: avoid long stretches; publish swap rules in advance.
- Handoffs: daily or weekly with a structured checklist.
Severity levels and roles
- Sev1: critical, widespread impact; Sev2: significant degradation; Sev3: minor or edge-case issues.
- Roles: Incident Commander (IC), Communications Lead, Subject Matter Experts (SMEs), Scribe.
Alerting principles
- Page on symptoms customers feel (e.g., elevated 5xx, latency SLO breach), not on low-level causes.
- Rate-limit and group alerts; set quiet hours for non-urgent notifications.
- Base paging on SLO/error-budget burn to avoid noise (a minimal sketch follows this list).
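To make symptom-based paging concrete, here is a minimal sketch in Python, assuming hypothetical metric values pulled from your monitoring system; the thresholds are illustrative, not prescribed.

```python
from dataclasses import dataclass

# Hypothetical metric snapshot; in practice these values come from your monitoring system.
@dataclass
class WindowedMetrics:
    error_rate_5xx: float   # fraction of requests returning 5xx over the window
    p95_latency_ms: float   # 95th-percentile latency over the window
    window_minutes: int     # how long the condition has held

LATENCY_SLO_MS = 700          # example SLO threshold (assumption)
ERROR_RATE_THRESHOLD = 0.02   # 2% 5xx rate (assumption)
SUSTAINED_MINUTES = 10        # require the symptom to persist before paging

def should_page(m: WindowedMetrics) -> bool:
    """Page only on customer-visible symptoms sustained long enough to matter."""
    symptomatic = (
        m.error_rate_5xx > ERROR_RATE_THRESHOLD
        or m.p95_latency_ms > LATENCY_SLO_MS
    )
    return symptomatic and m.window_minutes >= SUSTAINED_MINUTES

# A sustained latency SLO breach pages; healthy symptoms never do, regardless of internal causes.
print(should_page(WindowedMetrics(error_rate_5xx=0.005, p95_latency_ms=950, window_minutes=12)))  # True
print(should_page(WindowedMetrics(error_rate_5xx=0.001, p95_latency_ms=300, window_minutes=30)))  # False
```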
Runbooks and playbooks
- 1-page, actionable steps with rollback, known issues, and validation checks.
- Copy-pastable commands, safe-default flags, and annotated screenshots.
Escalation and communication
- Hard timeboxes: acknowledge within 5 minutes (TTA); escalate if there is no progress after 15–20 minutes (see the sketch after this list).
- Single source of truth: incident channel + status note updated on a cadence (e.g., every 15 min for Sev1).
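A minimal sketch of the timebox check, assuming you record when the page fired, when it was acknowledged, and when the last meaningful progress happened; the 5- and 20-minute limits are the example values above.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_TIMEBOX = timedelta(minutes=5)        # page must be acknowledged within 5 minutes
PROGRESS_TIMEBOX = timedelta(minutes=20)  # escalate if no progress within 20 minutes

def escalation_needed(paged_at: datetime,
                      acked_at: Optional[datetime],
                      last_progress_at: Optional[datetime],
                      now: datetime) -> bool:
    """Return True if the incident should be escalated to the secondary or an SME."""
    # No acknowledgement inside the ack timebox -> escalate.
    if acked_at is None:
        return now - paged_at > ACK_TIMEBOX
    # Acknowledged, but no recorded progress inside the progress timebox -> escalate.
    reference = last_progress_at or acked_at
    return now - reference > PROGRESS_TIMEBOX

# Example: acknowledged quickly, but 25 minutes with no progress -> escalate.
t0 = datetime(2024, 1, 1, 2, 13)
print(escalation_needed(t0, t0 + timedelta(minutes=2), None, t0 + timedelta(minutes=27)))  # True
```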
Metrics that matter
- MTTA (mean time to acknowledge), MTTR (mean time to restore), incident rate, alert noise (pages per shift), false-positive rate.
- Error-budget burn rate as the paging trigger (see the sketch after this list).
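A minimal sketch of computing these metrics from incident timestamps, plus an error-budget burn-rate check; the field names and the 99.9% SLO are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    paged_at: datetime
    acked_at: datetime
    restored_at: datetime

def mtta_minutes(incidents: list[Incident]) -> float:
    """Mean time to acknowledge, in minutes."""
    return mean((i.acked_at - i.paged_at).total_seconds() / 60 for i in incidents)

def mttr_minutes(incidents: list[Incident]) -> float:
    """Mean time to restore, in minutes."""
    return mean((i.restored_at - i.paged_at).total_seconds() / 60 for i in incidents)

def burn_rate(observed_error_ratio: float, slo_target: float = 0.999) -> float:
    """How fast the error budget is being consumed relative to plan.
    A burn rate of 1.0 spends the budget exactly over the SLO window;
    a common practice is to page only on fast, sustained burn."""
    budget = 1.0 - slo_target
    return observed_error_ratio / budget

# Example: one incident paged at 02:13, acked at 02:16, restored at 02:58.
incidents = [Incident(datetime(2024, 1, 1, 2, 13), datetime(2024, 1, 1, 2, 16), datetime(2024, 1, 1, 2, 58))]
print(mtta_minutes(incidents), mttr_minutes(incidents))  # 3.0 45.0

# 0.5% errors against a 99.9% SLO burns the budget 5x faster than planned.
print(round(burn_rate(0.005), 1))  # 5.0
```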
Worked examples
Example 1: Latency spike at 02:13
- Acknowledge page immediately.
- Open the service runbook "High latency". Check dashboards: p95 latency, 5xx rate, saturation.
- Roll back last deploy if it happened < 1 hour ago and correlates with the spike.
- If no improvement in 15 min, escalate to database SME. IC posts updates every 15 min.
- After resolution: note the suspected root cause and create a ticket to add cache metrics to the runbook.
Example 2: Flapping CPU alert (noise)
- Alert fires at 11:00 and clears repeatedly; no customer impact.
- Classify as non-paging: convert to informational dashboard alert.
- Add a multi-condition rule: page only if CPU > 85% AND p95 latency exceeds the SLO for 10 min.
- Add a 15-minute alert suppression window after deploys to avoid transient spikes (a sketch of both rules follows this example).
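A minimal sketch of the tuned rule from Example 2, assuming you can query current CPU, p95 latency, and the time of the last deploy; the thresholds and the 15-minute suppression window are the ones proposed above.

```python
from datetime import datetime, timedelta

CPU_THRESHOLD = 0.85              # 85% CPU
LATENCY_SLO_MS = 700              # p95 latency SLO (assumed value)
SUSTAINED = timedelta(minutes=10)
DEPLOY_SUPPRESSION = timedelta(minutes=15)

def should_page(cpu: float,
                p95_ms: float,
                condition_held_for: timedelta,
                last_deploy_at: datetime,
                now: datetime) -> bool:
    """Page only when CPU AND latency are both bad, sustained, and not right after a deploy."""
    # Suppress pages in the window immediately after a deploy to ignore transient spikes.
    if now - last_deploy_at < DEPLOY_SUPPRESSION:
        return False
    both_bad = cpu > CPU_THRESHOLD and p95_ms > LATENCY_SLO_MS
    return both_bad and condition_held_for >= SUSTAINED

# Example: high CPU alone (the old flapping alert) no longer pages.
now = datetime(2024, 1, 1, 11, 0)
print(should_page(0.92, 450, timedelta(minutes=12), now - timedelta(hours=2), now))  # False
```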
Example 3: Partial outage in EU region
- Declare Sev2; assign IC and Comms Lead.
- Fail traffic over to another region using documented playbook.
- Coordinate with network SME to validate health checks and BGP routes.
- Post customer-facing update via status template (Comms Lead).
- After: hold a post-incident review; one tracked action is to automate a weekly regional failover test.
Step-by-step: set up your on-call practice
- Define severities: Agree on Sev1–Sev3 definitions and paging expectations.
- Create roles: IC, Comms, SMEs; publish responsibilities.
- Catalog alerts: For each alert, state the signal, the customer impact, who is paged, and the runbook link (see the catalog sketch after this list).
- Write runbooks: Start with your top 5 incidents. Keep them short and actionable.
- Schedule rotation: Primary/secondary coverage with fair shifts and a handoff routine.
- Drill: Run a 30-minute game day monthly using a past incident.
- Improve: Track MTTA/MTTR and noise; tune alerts every sprint.
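A minimal sketch of an alert catalog entry kept as structured data the team can review; every name and URL here is an illustrative placeholder.

```python
from dataclasses import dataclass

@dataclass
class AlertCatalogEntry:
    name: str             # alert identifier
    signal: str           # what is measured
    customer_impact: str  # why a human should care at 2am
    pages: str            # who gets paged, and who is next in the escalation
    runbook_url: str      # one-page runbook for this alert

CATALOG = [
    AlertCatalogEntry(
        name="checkout-latency-slo-breach",
        signal="p95 latency > 800ms AND 5xx > 2% for 10 min",
        customer_impact="Checkout is slow or failing; revenue at risk",
        pages="checkout-primary (escalate: checkout-secondary, DB-SME)",
        runbook_url="https://runbooks.example.internal/checkout/high-latency",  # placeholder URL
    ),
]

# Review pass: every paging alert must state its impact and link a runbook.
for entry in CATALOG:
    assert entry.customer_impact and entry.runbook_url, f"{entry.name} is missing impact or runbook"
```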
Exercises
Do these now; they mirror the Quick Test and real work.
Exercise 1: Draft a 1-page runbook
Service: Checkout API. Alert: "p95 latency > 800ms for 10 min and error rate > 2%." Create a concise runbook.
- Sections to include: Trigger, Quick checks, Rollback, Mitigations, Validation, Escalation, Notes.
- Keep copy-paste commands and clear stop conditions.
Solution
Title: Checkout API — High Latency & Error Rate
Trigger: p95 > 800ms for 10m AND 5xx > 2%
TTA target: <5m | Escalate if no improvement in 20m
Quick checks:
- Dashboard: p95, 5xx, QPS, DB CPU, queue depth
- Recent deploy? ops/deploy history last 60m
- Dependency health: payments, user, DB
Mitigations (pick one, then validate):
- Roll back last deploy: deployctl rollback checkout@prev
- Scale out by two replicas: kubectl scale deploy/checkout --replicas=<current+2>
- Enable read cache: feature flag cache_reads=true
- Throttle background jobs: jobs set --queue low --rate 50%
Validation:
- p95 < 600ms for 10m, 5xx < 1%
Escalation:
- If DB CPU > 85%: page DB-SME
- If payments dependency errors: page Payments-SME
Notes:
- Suppress deploys during incident: change-freeze apply --sev2
- Post-incident: add DB query plan to runbook
Exercise 2: Tune a noisy alert
Current alert: "CPU > 75% for 2 min" paging 30 times/week. No demonstrated customer impact.
- Propose a new alert condition using a symptom and a duration that reduces noise.
- Add a deploy-suppression rule and a rate limit.
Solution
New paging rule:
- Page if: CPU > 85% AND p95 latency > SLO (700ms) for 10 min
- Otherwise: send as info to channel, no page
Deploy suppression:
- Mute CPU/latency alerts for 15 min after a deploy to checkout
Rate limit:
- Max 1 page per incident key per 30 min (group by service+region; see the sketch after this solution)
Additional:
- Add runbook link: "CPU/Latency correlation checks"
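A minimal sketch of the rate limit, assuming pages are grouped by an incident key built from service and region; the 30-minute window matches the solution above.

```python
from datetime import datetime, timedelta

RATE_LIMIT_WINDOW = timedelta(minutes=30)
_last_page_at: dict[str, datetime] = {}  # incident key -> when we last paged for it

def incident_key(service: str, region: str) -> str:
    """Group related pages so one ongoing problem produces one page stream."""
    return f"{service}:{region}"

def allow_page(service: str, region: str, now: datetime) -> bool:
    """Allow at most one page per incident key per 30 minutes."""
    key = incident_key(service, region)
    last = _last_page_at.get(key)
    if last is not None and now - last < RATE_LIMIT_WINDOW:
        return False
    _last_page_at[key] = now
    return True

# Example: the second page for the same service+region within 30 minutes is dropped.
t0 = datetime(2024, 1, 1, 11, 0)
print(allow_page("checkout", "eu-west-1", t0))                          # True
print(allow_page("checkout", "eu-west-1", t0 + timedelta(minutes=5)))   # False
print(allow_page("checkout", "eu-west-1", t0 + timedelta(minutes=40)))  # True
```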
On-call readiness checklist
- Rotation documented with primary/secondary and escalation tree.
- Top 5 incidents have 1-page runbooks.
- Alerts page on symptoms and are grouped and rate-limited.
- Handoff template used at shift start and end.
- IC playbook with comms cadence is published.
- Post-incident reviews produce tracked actions.
- Noise budget tracked (pages per shift) and reviewed monthly.
Common mistakes and self-check
- Mistake: Paging on CPU/disk without user impact. Self-check: Does alert map to SLO or user symptom?
- Mistake: Long, outdated runbooks. Self-check: Can a newcomer act within 2 minutes?
- Mistake: No single leader during incidents. Self-check: Is IC role assigned every time?
- Mistake: Silent escalations. Self-check: Are timeboxes and next steps explicit?
- Mistake: Burnout from excessive pages. Self-check: Track pages/shift and rotate fairly; enforce recovery time.
Practical projects
- Runbook sprint: create/update the top 5 incident runbooks to 1-page actionable format.
- Alert audit: convert 3 cause-based pages to symptom-based conditions with grouping.
- Game day: simulate a Sev2 latency incident, measure MTTA/MTTR, and capture improvements.
- Handoff revamp: adopt a structured handoff template and run a dry run.
Who this is for
- Platform/Backend Engineers joining or improving an on-call rotation.
- SREs formalizing incident processes.
Prerequisites
- Basic service observability (metrics, logs, traces).
- Familiarity with your deployment and rollback process.
Learning path
- Learn SLOs and symptom-based alerting.
- Write 1-page runbooks with safe mitigations.
- Practice roles: IC, Comms, SME in a mock incident.
- Measure and reduce MTTA/MTTR; tune alert noise.
- Automate: templates, suppression, rate limits, and dashboards.
Next steps
- Adopt the readiness checklist for your team this week.
- Schedule a 30-minute game day within 2 weeks.
- Take the Quick Test below to validate understanding.
Mini challenge
In 30 minutes, convert any one cause-based page into a symptom-based rule with a runbook link and a rate limit. Share the before/after and measure pages/shift for the next week.