Why this matters
As a Platform Engineer building an observability platform, you translate noisy alerts into clear, reliable incident workflows that reduce downtime. Your work directly affects MTTA (mean time to acknowledge), MTTR (mean time to resolve), customer trust, and developer productivity.
- Real tasks you will do: define severity levels, route alerts into incidents, automate escalations, run incident war rooms, and capture learnings in blameless reviews.
- Outcome: fewer paging storms, faster mitigation, consistent communication, and measurable reliability improvements.
Concept explained simply
An incident management workflow is the repeatable path an alert follows from detection to learning. Think of it as a relay race: each stage hands off to the next with clear roles, timers, and checklists.
Mental model
Use the 8-step loop:
- Detect
- Acknowledge
- Triage
- Mitigate
- Verify
- Resolve
- Communicate
- Review & Learn
Attach a timer and an owner to each step. Metrics to watch: MTTD for Detect, MTTA for Acknowledge, MTTR for Resolve (a minimal sketch of how these fall out of the incident timeline follows).
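To make the metrics concrete, here is a minimal Python sketch, assuming you record one timestamp per step; the field and method names are illustrative rather than any specific tool's schema. Averaged over many incidents, these per-incident durations become MTTD, MTTA, and MTTR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class IncidentTimeline:
    """Metric-bearing timestamps from the 8-step loop (illustrative fields)."""
    impact_started_at: datetime   # when the fault began affecting users
    detected_at: datetime         # Detect: first alert fired
    acknowledged_at: datetime     # Acknowledge: a human took the page
    resolved_at: datetime         # Resolve: verified back to normal

    def time_to_detect(self) -> timedelta:       # feeds MTTD
        return self.detected_at - self.impact_started_at

    def time_to_acknowledge(self) -> timedelta:  # feeds MTTA
        return self.acknowledged_at - self.detected_at

    def time_to_resolve(self) -> timedelta:      # feeds MTTR
        return self.resolved_at - self.detected_at

t = IncidentTimeline(
    impact_started_at=datetime(2024, 5, 1, 10, 0),
    detected_at=datetime(2024, 5, 1, 10, 4),
    acknowledged_at=datetime(2024, 5, 1, 10, 7),
    resolved_at=datetime(2024, 5, 1, 10, 34),
)
print(t.time_to_detect(), t.time_to_acknowledge(), t.time_to_resolve())
# 0:04:00 0:03:00 0:30:00
```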
Lifecycle and roles
- Severity: P1 (critical), P2 (major), P3 (minor), P4 (trivial). Tie to customer impact and SLO breach risk.
- Roles: Incident Commander (IC), Subject Matter Expert (SME), Communications Lead, Scribe.
- Escalation policy: who gets paged, auto-escalation timers, and a maximum hop count (a policy sketch follows the role cheat sheet).
- Runbooks: step-by-step actions, prechecks, safe rollbacks, verification steps.
- Comms: when and how to update internal stakeholders and customers.
Role cheat sheet
- IC: coordinates, decides, keeps people focused.
- SME: investigates and executes technical steps.
- Comms: posts status updates on agreed cadence.
- Scribe: records timeline, decisions, and outcomes.
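As a sketch of what an escalation policy looks like in code, here is a small Python example that decides who should currently hold the page given the minutes elapsed since the first page. The role names and timers mirror the P1 example used later in this section and are illustrative, not a paging tool's real configuration.

```python
from typing import Optional

# Per-level acknowledgement timeouts in minutes (illustrative P1 policy).
P1_ESCALATION = [
    ("primary on-call", 5),
    ("secondary on-call", 10),
    ("duty manager", 15),
    ("executive", 30),
]

def who_holds_the_page(minutes_since_first_page: float,
                       acknowledged: bool) -> Optional[str]:
    """Return the level that should be paged right now, or None if the
    incident was acknowledged or every level has timed out (max hops)."""
    if acknowledged:
        return None
    deadline = 0
    for level, timeout_min in P1_ESCALATION:
        deadline += timeout_min
        if minutes_since_first_page < deadline:
            return level
    return None  # policy exhausted; handle out-of-band

print(who_holds_the_page(3, acknowledged=False))   # primary on-call
print(who_holds_the_page(12, acknowledged=False))  # secondary on-call
print(who_holds_the_page(20, acknowledged=False))  # duty manager
```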
Core building blocks you will design
- Alert-to-incident rules: deduplicate, correlate by service/region/cause, and create a single incident with related alerts attached (a correlation sketch follows this list).
- Severity matrix: for example, P1 (full outage), P2 (partial regional impact), P3 (degraded but functional), P4 (cosmetic or planned risk).
- Escalations: for a P1, primary on-call (escalate after 5 min without ack), then secondary (10 min), duty manager (15 min), executive (30 min).
- ChatOps templates: standard commands to declare, assign roles, and post updates.
- Dashboards: per-incident views preloaded with key SLOs, recent deploys, error spikes, saturation.
- Post-incident review: blameless summary, timeline, contributing factors, what helped or hindered detection and mitigation, and action items with owners and due dates.
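The alert-to-incident rule above can be sketched in a few lines of Python: pick a correlation key and group incoming alerts under one incident per key. The alert fields and the key choice (service plus region) are assumptions for illustration; real correlation would also consider cause, deploy SHA, or time windows.

```python
from collections import defaultdict

# Incoming alerts as plain dicts; field names are illustrative.
alerts = [
    {"id": "a1", "service": "checkout-api", "region": "eu-west-1", "signal": "cpu"},
    {"id": "a2", "service": "checkout-api", "region": "eu-west-1", "signal": "latency"},
    {"id": "a3", "service": "payments", "region": "us-east-1", "signal": "errors"},
]

def correlation_key(alert: dict) -> tuple:
    """Alerts sharing this key are folded into a single incident."""
    return (alert["service"], alert["region"])

def group_into_incidents(alerts: list) -> dict:
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[correlation_key(alert)].append(alert)
    return incidents

for key, related in group_into_incidents(alerts).items():
    print(f"one incident for {key} with {len(related)} related alert(s)")
```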
Worked examples
Example 1: Alert flood from 50 pods
Situation: 50 CPU alerts trigger within 2 minutes on the same service.
- Correlation rule groups alerts by service and cluster, creating one P2 incident with related alerts attached.
- IC assigned automatically; SME is service owner on-call.
- Runbook directs the responder to check recent deploys and autoscaling health; mitigation is to roll back the last deploy.
- MTTA target: 5 min; MTTR target: 30 min. Verification: error rate and latency return to baseline (a check like the one sketched below).
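The verification step in this example can be an objective check rather than a judgment call. A minimal sketch, assuming you can read current and baseline values for error rate and p95 latency (the metric names and the 10% tolerance are illustrative):

```python
def back_to_baseline(current: dict, baseline: dict, tolerance: float = 0.10) -> bool:
    """True only if every tracked metric is within `tolerance` of its
    pre-incident baseline; use this before moving from Verify to Resolve."""
    return all(
        current[name] <= baseline[name] * (1 + tolerance)
        for name in baseline
    )

baseline = {"error_rate": 0.002, "p95_latency_ms": 180}
current = {"error_rate": 0.0021, "p95_latency_ms": 185}
print(back_to_baseline(current, baseline))  # True -> safe to resolve
```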
Example 2: Partial outage in EU region
Situation: 30% of requests fail in the EU region; other regions are healthy.
- Severity: P2 (regional customer impact, SLO risk).
- Comms: internal update within 5 min, first customer-facing update within 15 min, then customer updates every 30 min (cadence sketched below).
- Mitigation: drain traffic from faulty EU AZ, scale healthy AZs, investigate network dependency.
- Resolution: routing stabilized; the root cause is later identified as a misconfigured firewall.
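The comms timers in this example translate directly into a schedule. A tiny sketch, assuming the cadence above (internal note at +5 min, first customer update at +15 min, then every 30 min):

```python
from datetime import datetime, timedelta

def comms_schedule(detected_at: datetime, customer_updates: int = 3) -> list:
    """Build the update schedule for this example's cadence."""
    schedule = [("internal update", detected_at + timedelta(minutes=5))]
    for n in range(customer_updates):
        schedule.append((f"customer update #{n + 1}",
                         detected_at + timedelta(minutes=15 + 30 * n)))
    return schedule

for label, due in comms_schedule(datetime(2024, 5, 1, 9, 0)):
    print(label, due.strftime("%H:%M"))
# internal update 09:05, then customer updates at 09:15, 09:45, 10:15
```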
Example 3: Disk almost full on primary DB node
Situation: 88% disk usage on primary DB; rising.
- Severity: P2 if risk of write failures in next hour.
- Runbook: rotate logs, purge old backups, and expand the volume while usage is still below 90% and the change is safe; fail over if exceeding 95% looks imminent (decision logic sketched below).
- Automation: pre-approved runbook steps execute via ChatOps with confirmation prompts; IC monitors.
- Verification: write latency normal, headroom > 20%.
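The runbook's thresholds can be encoded so the pre-approved automation always picks the same next step. A minimal sketch, assuming the 90%/95% thresholds above and a simple linear growth estimate; the action strings are placeholders for your own pre-approved ChatOps steps:

```python
def next_disk_action(disk_used_pct: float, growth_pct_per_hour: float) -> str:
    """Choose the next pre-approved step from current usage and growth rate."""
    projected_in_one_hour = disk_used_pct + growth_pct_per_hour
    if disk_used_pct > 95 or projected_in_one_hour > 95:
        return "fail over to replica"  # crossing 95% is imminent
    if disk_used_pct < 90:
        return "rotate logs, purge old backups, expand volume"
    return "expand volume only if the change is confirmed safe"

print(next_disk_action(88, growth_pct_per_hour=3))  # still below 90%
print(next_disk_action(93, growth_pct_per_hour=4))  # 95% imminent -> fail over
```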
Hands-on exercises
Do these to practice, then compare your work with the solutions.
Exercise 1: Design a Severity Matrix and Escalation Policy
ID: ex1
- Define P1–P4 with clear impact statements (a starter sketch follows this exercise).
- Set MTTA/MTTR targets per severity.
- Create an escalation tree for P1 and P2 with timers.
- Write a 3-line declaration template for P1.
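If you want a starting shape for this exercise, here is one possible skeleton; every impact statement, target, and template field is a placeholder to replace with your own definitions:

```python
# Starter skeleton for Exercise 1 -- all values are placeholders.
SEVERITY_MATRIX = {
    "P1": {"impact": "full outage or a critical user flow broken for most users",
           "mtta_min": 5, "mttr_min": 60},
    "P2": {"impact": "partial or regional impact; elevated SLO burn rate",
           "mtta_min": 10, "mttr_min": 240},
    "P3": {"impact": "degraded but functional; workaround exists",
           "mtta_min": 30, "mttr_min": 1440},
    "P4": {"impact": "cosmetic issue or accepted planned risk",
           "mtta_min": 120, "mttr_min": 4320},
}

# A 3-line P1 declaration template to fill in when declaring.
P1_DECLARATION = ("DECLARING P1: {service} - {summary}\n"
                  "Impact: {impact}\n"
                  "IC: {ic} | next update in {cadence_min} min")
```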
Exercise 2: Draft an End-to-End Incident Workflow with ChatOps Prompts
ID: ex2
- Map Detect → Acknowledge → Triage → Mitigate → Verify → Resolve → Review.
- Create ChatOps snippets for: declare, assign roles, request SME, set update cadence, and close (one declare snippet is sketched after this exercise).
- Add the specific dashboards and logs to open automatically.
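As one way to approach the declare snippet, here is a small Python sketch that parses a hypothetical `/incident declare` message and returns the announcement text. It is not tied to any real chat platform's API; the command syntax is an assumption.

```python
import shlex

def handle_declare(command: str) -> str:
    """Parse '/incident declare <P1-P4> <service> <summary...>' (hypothetical
    syntax) and return the channel announcement to post."""
    parts = shlex.split(command)
    if len(parts) < 4 or parts[:2] != ["/incident", "declare"]:
        return "usage: /incident declare <P1-P4> <service> <summary>"
    sev, service = parts[2], parts[3]
    summary = " ".join(parts[4:]) or "(no summary provided)"
    return (f"{sev} declared for {service}: {summary}\n"
            f"Roles needed: IC, SME, Comms, Scribe. Reply to claim one.")

print(handle_declare("/incident declare P2 checkout-api 'EU error spike'"))
```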
Self-check checklist
Common mistakes and how to self-check
- Vague severities: Fix by using measurable impact (e.g., % errors, region scope, user flows).
- Alert noise → multiple incidents: Add deduplication and correlation keys (service, region, deploy SHA).
- Slow acks: Enforce auto-escalation timers and ensure on-call contact methods are reliable.
- Investigating instead of stabilizing: Prioritize safe mitigation first; do root-cause analysis (RCA) later.
- No verification: Require objective success checks before resolving.
- Action items without owners: Assign and track dates; review completion weekly.
Quick self-audit mini-list
- Do you have clear P1/P2 triggers?
- Is there a single command to declare an incident?
- Can you page the SME group in under 60 seconds?
- Are comms templates one-click ready?
Practical projects
- Project A: Build an incident declaration template and make it auto-populate service, cluster, region, and recent deploy info from alerts (a sketch follows this list).
- Project B: Create a severity matrix for three sample services (API, Payments, Data Pipeline) and test it with mock incidents.
- Project C: Write a blameless review template and run a mock post-incident review using a past SEV-2 scenario.
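For Project A, the auto-population step can start as a simple template fill, assuming the alert payload already carries labels such as service, cluster, region, and the last deploy SHA (the field names here are hypothetical):

```python
# Hypothetical alert payload; in practice these labels come from your
# monitoring system and deploy metadata rather than being hard-coded.
alert = {
    "service": "checkout-api",
    "cluster": "prod-eu-1",
    "region": "eu-west-1",
    "last_deploy_sha": "ab12cd3",
    "summary": "error rate above 5% for 10 minutes",
}

DECLARATION_TEMPLATE = """\
Incident declared for {service} ({cluster}, {region})
Trigger: {summary}
Most recent deploy: {last_deploy_sha}
Roles: IC=?, SME=?, Comms=?, Scribe=?"""

print(DECLARATION_TEMPLATE.format(**alert))
```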
Learning path
- Start: Document the 8-step lifecycle and roles for your org.
- Define: Severity matrix and escalation timers.
- Automate: ChatOps commands for declare/assign/update/close.
- Integrate: Alert correlation and incident creation rules.
- Practice: Run monthly game days; measure MTTA/MTTR.
- Improve: Add dashboards and post-incident review workflow.
Who this is for
- Platform and SRE engineers owning observability and on-call.
- Backend engineers rotating on-call who need consistent workflows.
- Team leads who coordinate incident response.
Prerequisites
- Basic monitoring/alerting concepts (metrics, logs, traces).
- Familiarity with on-call and paging tools.
- Comfort with runbooks and service ownership.
Next steps
- Finalize your severity matrix and publish it to the team.
- Implement one ChatOps command to declare incidents.
- Schedule a 30-minute tabletop drill this week.
Mini challenge
In 10 minutes, write a one-page playbook for a P1 API outage including: declaration command, initial roles, first three mitigation actions, and first customer update text. Keep it concise and testable.
Progress & Quick Test
Take the quick test below to check your understanding. Anyone can take it for free; only logged-in users will have their progress saved.