Alerting and On-Call Basics: Observability and Operations for Backend Engineers

Why this matters

Example 3: Queue backlog

Batch processor target: messages processed within 5 minutes. Alerting:

Page: backlog age (age of the oldest unprocessed message) > 20 minutes for 15 minutes (user-facing delays are accumulating).
Ticket: backlog count > 5x daily average for 2h (capacity trend).

Runbook steps: add consumers; if an upstream spike is the cause, enable rate limiting; if consumers are failing with errors, roll back the consumer version.
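The "for 15 minutes" part of the page condition is what keeps a brief spike from paging anyone. A minimal sketch of that sustained-breach check follows, assuming one sample per scrape in the form (minute, age of oldest message in minutes). The function name and the sample format are hypothetical, not taken from any particular monitoring tool.

```python
def backlog_page_fires(samples, age_limit_min=20, sustain_min=15):
    """Return True when the backlog age has stayed above the limit
    for at least `sustain_min` minutes, ending at the latest sample.

    samples: list of (minute, oldest_age_minutes) tuples, sorted by minute.
    This input shape is an assumption made for the sketch.
    """
    if not samples:
        return False
    latest_minute = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the breach is interrupted.
    for minute, age in reversed(samples):
        if age <= age_limit_min:
            break
        breach_start = minute
    if breach_start is None:
        return False
    return latest_minute - breach_start >= sustain_min


steady_breach = [(m, 25) for m in range(0, 16)]
short_spike = [(m, 30) for m in range(0, 5)]
print(backlog_page_fires(steady_breach))  # sustained for 15 minutes
print(backlog_page_fires(short_spike))    # only 4 minutes of breach
```

A single dip below the limit resets the streak, which is the behaviour that stops flapping alerts from paging.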

Templates you can adapt

Symptom-first paging: Page when [user-facing metric] breaches [SLO-driven threshold] for [short window] and [long window].
Capacity ticket: Create ticket when [resource metric] exceeds [threshold] for [long window] without SLO impact.
Batch delay page: Page when [age-of-oldest] exceeds [acceptable delay] for [sustained period].
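One way to keep these templates honest is to store them as format strings, so a missing field fails loudly instead of producing a vague alert description. The sketch below is illustrative only: the dictionary keys, field names, and example values are assumptions.

```python
# Template text follows the three templates above; the placeholder names are hypothetical.
TEMPLATES = {
    "symptom_page": "Page when {metric} breaches {threshold} for {short_window} and {long_window}.",
    "capacity_ticket": "Create ticket when {metric} exceeds {threshold} for {long_window} without SLO impact.",
    "batch_delay_page": "Page when {metric} exceeds {threshold} for {period}.",
}


def render(name, **fields):
    """Fill in a template; raises KeyError if a required field is missing."""
    return TEMPLATES[name].format(**fields)


print(render(
    "batch_delay_page",
    metric="age of oldest message",
    threshold="20 minutes",
    period="15 minutes",
))
```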

Practical setup steps

Define SLOs: choose success criteria (e.g., non-5xx responses). Set target (e.g., 99.9%).
Pick paging conditions: multi-window burn rates for request paths that matter.
Create runbooks: one-page actions for each page type. Include rollback steps, feature flags, and contact lists.
Design escalation: primary on-call, secondary, and duty manager. Set acknowledgment timeouts (e.g., 5 minutes).
Set silences: use them during planned maintenance or noisy deploys, always time-bound and with a stated reason.
Test pages: fire a test alert during working hours; verify phone/app, volume, and routing.
Review weekly: remove noisy alerts, adjust thresholds, update runbooks with real incident notes.
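The multi-window paging condition from the steps above can be sketched in a few lines. Burn rate here means the observed error ratio divided by the error ratio the SLO allows, and the page fires only when both the short and the long window exceed the threshold. The function names, the (errors, requests) input shape, and the threshold of 10 are assumptions for illustration.

```python
def burn_rate(errors, requests, slo_target=0.999):
    """Observed error ratio divided by the error ratio the SLO allows."""
    if requests == 0:
        return 0.0  # no traffic, nothing is burning the budget
    allowed_error_ratio = 1.0 - slo_target
    return (errors / requests) / allowed_error_ratio


def should_page(short_window, long_window, threshold, slo_target=0.999):
    """Each window is an (errors, requests) tuple.

    Requiring both windows avoids paging on a short blip (long window stays
    low) and on a problem that has already recovered (short window stays low).
    """
    return (
        burn_rate(*short_window, slo_target) >= threshold
        and burn_rate(*long_window, slo_target) >= threshold
    )


# Hypothetical counts: 1.5% errors in both windows against a 0.1% allowance.
print(should_page((150, 10_000), (1_500, 100_000), threshold=10))
# Short spike only: the long window is at roughly 2x burn, so no page.
print(should_page((150, 10_000), (200, 100_000), threshold=10))
```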

Runbook skeleton

What the alert means (symptom and likely causes)
Immediate actions (roll back? scale? pause traffic?)
Verification steps (dashboards to watch)
Escalation contacts (DBA, networking, SRE)
Mitigations vs. fixes (short-term vs. root cause)
When to close (criteria)

Common mistakes and how to self-check

Too many cause-based alerts: you page for CPU spikes with no user impact. Self-check: for every page, can you point to an SLO at risk?
Noisy thresholds: flapping alerts. Self-check: does your alert require two windows or a sustained period?
Ambiguous pages: the on-call does not know what to do. Self-check: does the alert link to a runbook with the first three actions?
Weak escalation: pages sit unacknowledged. Self-check: simulate a no-ack scenario; does it auto-escalate within 5–10 minutes?
Forgotten silences: alerts muted too long. Self-check: do all silences have an expiry and a clear reason?
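Several of these self-checks can be automated as a small lint over your alert and silence definitions. The sketch below assumes alerts and silences are plain dictionaries; every key name and the 24-hour maximum silence length are hypothetical choices, not the schema of any real alerting system.

```python
from datetime import datetime, timedelta, timezone


def lint_alert(alert):
    """Return a list of problems found in one alert definition (hypothetical schema)."""
    problems = []
    if alert.get("severity") == "page":
        if not alert.get("slo"):
            problems.append("page is not tied to an SLO")
        if not alert.get("runbook_url"):
            problems.append("page has no runbook link")
    # Flapping check: need a sustained period or at least two windows.
    if not alert.get("sustain_minutes") and len(alert.get("windows", [])) < 2:
        problems.append("no sustained period or second window")
    return problems


def lint_silence(silence, now, max_duration=timedelta(hours=24)):
    """Return a list of problems found in one silence (hypothetical schema)."""
    problems = []
    if not silence.get("reason"):
        problems.append("silence has no reason")
    expires_at = silence.get("expires_at")
    if expires_at is None:
        problems.append("silence has no expiry")
    elif expires_at - now > max_duration:
        problems.append("silence lasts longer than the allowed maximum")
    return problems


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(lint_alert({"name": "cpu_high", "severity": "page"}))
print(lint_silence({"reason": "", "expires_at": None}, now))
```

Running a check like this in the weekly review makes noisy or orphaned alerts visible before they page someone.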

Exercises

These mirror the graded exercises below so you can draft answers here first.

Exercise 1: Design an SLO-based page for 5xx errors

Assume an SLO of 99.9% over 28 days. Define one critical page and one warning alert using multi-window burn rates.
Write the first three runbook steps the on-call should take.

Checklist: uses two windows per alert, references SLO/budget, clearly separates page vs. ticket, includes immediate rollback check.

Hint

Use a fast window (5–30 min) and a slower window (1–6 h). Critical should deplete budget quickly; warning is slower burn.
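To turn "deplete budget quickly" into numbers, it helps to relate a burn rate to the share of the error budget it consumes in a given window. The helpers below are a sketch of that arithmetic for a 28-day SLO period. The function names are hypothetical, and the 2%-in-one-hour figure is only an example input, not a recommended threshold.

```python
def burn_rate_for(budget_fraction, window_hours, slo_period_days=28):
    """Burn rate that consumes `budget_fraction` of the error budget
    within `window_hours`. A burn rate of 1 uses the whole budget
    exactly over the SLO period."""
    period_hours = slo_period_days * 24
    return budget_fraction * period_hours / window_hours


def hours_to_exhaust(burn_rate, slo_period_days=28):
    """Hours until the full error budget is gone at a constant burn rate."""
    return slo_period_days * 24 / burn_rate


# Example input: what burn rate eats 2% of a 28-day budget in one hour?
print(round(burn_rate_for(0.02, window_hours=1), 2))
# And how long would the budget last at a burn rate of 14?
print(hours_to_exhaust(14))
```

Because the period is 28 days (672 hours), thresholds quoted elsewhere for 30-day SLOs do not carry over unchanged; recompute them for your own period.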

Exercise 2: On-call rotation and escalation

Create a weekly rotation with primary and secondary engineers.
Set acknowledgment and escalation timings.
Describe a quiet-hours policy and how to handle non-urgent alerts overnight.

Checklist: hours defined, handoff time set, ack timeout, escalation path, overnight policy.

Hint

Typical ack timeout is 5 minutes; escalate at 10–15 minutes to secondary; duty manager after 20–30 minutes.
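An escalation policy is easier to review and test when it is written down as data. The sketch below picks 10 and 20 minutes from the ranges in the hint; the policy structure and the function name are assumptions made for illustration.

```python
# (minutes without acknowledgment, role to notify); values are illustrative.
ESCALATION_POLICY = [
    (0, "primary"),
    (10, "secondary"),
    (20, "duty manager"),
]


def notified_roles(minutes_without_ack, policy=ESCALATION_POLICY):
    """Return every role that should have been notified by now
    for a page that is still unacknowledged."""
    return [role for after_min, role in policy if minutes_without_ack >= after_min]


for minutes in (0, 12, 25):
    print(minutes, notified_roles(minutes))
```

A table like this also doubles as the expected result when you simulate a no-ack scenario.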

Practical projects

Project 1: Convert one noisy CPU alert into a symptom-based alert tied to latency or an SLO. Measure page volume before and after.
Project 2: Write or update three runbooks (errors, saturation, backlog). Have a teammate dry-run them.
Project 3: Implement a weekly on-call handoff checklist and a retro to review last week’s pages and fixes.

Learning path

Start: SLOs and error budgets for your service.
Next: Alert design (multi-window burn rates, symptom-first).
Then: On-call processes (rotation, escalation, silences, runbooks).
Advance: Incident command and post-incident reviews.

Mini challenge

Choose one service you own. In one hour, draft: one paging alert, one ticket-only alert, and a 10-line runbook for each. Share with a teammate for review and do a dry-run by simulating each alert.

Next steps

Finish the exercises below, then take the quick test.
Schedule a test page to verify your escalation path works end-to-end.
Add a recurring calendar reminder to review alerts weekly.

Quick test


Instructions

Assume your API has an SLO of 99.9% success over 28 days. Create two alerts:

Critical page: multi-window burn rate that would exhaust the error budget within a few hours.
Warning ticket: multi-window burn rate that indicates a sustained problem but not immediate exhaustion.

Write the first three steps in the runbook for the critical page.

