Why this matters
In API engineering, incidents happen: error spikes, latency regressions, dependency outages, bad deploys. Fast, calm triage protects customers and buys time to fix. Strong triage reduces downtime, confusion, and duplicate work.
- Real tasks: classify severity, stabilize impact, route work to the right owners, communicate clear updates, and gather the minimum data to decide the next step.
- You will use runbooks, alerts, dashboards, logs, traces, rollbacks, feature flags, rate limits, and incident roles.
Who this is for
- API Engineers and on-call developers who respond to production issues.
- Platform/SRE teammates who coordinate incident response.
- Team leads who need consistent, low-drama incident handling.
Prerequisites
- Basic understanding of your service topology and dependencies (API, DB, cache, message bus).
- Access to metrics, logs, and traces in your observability stack.
- Familiarity with your alerting, paging, and rollback/feature-flag tooling.
Concept explained simply
Triage is the first 15–30 minutes of an incident. Your goal is not to find the root cause yet. Your goal is to stop the bleeding, size the problem, and point the right people in the right direction with clear updates.
Mental model: STABLE
- S – Stabilize: Reduce impact quickly (rollback, disable feature, rate limit, fail open/closed).
- T – Triage: Set severity, declare roles, and open an incident channel.
- A – Assess: Collect key signals (metrics/logs/traces) to form a first hypothesis. Timebox to 10 minutes.
- B – Bound: Narrow the blast radius (is it one endpoint, one region, one rollout group?).
- L – Loop: Communicate status and ETAs at a steady cadence (e.g., every 15 minutes).
- E – Escalate/Execute: Pull in owners and run runbooks or mitigations.
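A lightweight incident record helps keep the STABLE steps and the action timeline in one place. Here is a minimal Python sketch; the fields and helper are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """Minimal record kept during triage (illustrative fields, not a standard schema)."""
    title: str
    severity: str                       # e.g. "SEV2"
    lead: str = "unassigned"            # Incident Lead
    comms: str = "unassigned"           # Comms owner
    timeline: list = field(default_factory=list)

    def log(self, action: str) -> None:
        """Append a timestamped entry so handoffs and the post-incident review have a timeline."""
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.timeline.append(f"{stamp} {action}")

# Usage in the first minutes of an incident:
incident = IncidentRecord(title="5xx spike on /v1/payments", severity="SEV2", lead="you")
incident.log("Declared SEV2, opened incident channel")
incident.log("Started rollback of latest deploy")
```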
Severity ladder (quick reference)
- SEV1: Critical, widespread impact (major outage, data loss risk, legal/compliance risk). All hands.
- SEV2: Significant impact (core API degraded, many customers affected). On-call + owners.
- SEV3: Limited impact (specific feature/region/segment). On-call owns.
- SEV4: Minor, workaround exists. Track and schedule fix.
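Severity is ultimately a judgment call, but encoding the ladder as a small helper can make the decision explicit for on-call engineers. The thresholds below are placeholders to replace with your own policy; this is a sketch, not a standard.

```python
def classify_severity(core_api_down: bool, data_loss_risk: bool,
                      customers_affected_pct: float, workaround_exists: bool) -> str:
    """Map rough impact signals onto the SEV ladder. Thresholds are illustrative placeholders."""
    if core_api_down or data_loss_risk:
        return "SEV1"   # critical, widespread impact: all hands
    if customers_affected_pct >= 20:
        return "SEV2"   # core API degraded for many customers
    if not workaround_exists:
        return "SEV3"   # limited scope, on-call owns it
    return "SEV4"       # minor, workaround exists: track and schedule the fix

# e.g. classify_severity(False, False, customers_affected_pct=40, workaround_exists=False) -> "SEV2"
```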
Common signal sources to check first
- Golden signals: error rate, latency, traffic, saturation (CPU/memory/IO/queue depth).
- Recent changes: deploys, config/flag flips, migrations, infra changes.
- Dependencies: DB health, cache hit rates, upstream/downstream status.
- Scope: regions, tenants, endpoints, versions, rollout rings.
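If you need to sanity-check the golden signals by hand, for example from a sampled request log, the arithmetic is simple. The sample data below is made up; in practice your dashboards compute this for you.

```python
# Hypothetical sampled requests as (status_code, latency_ms) pairs.
samples = [(200, 42), (200, 55), (500, 1200), (200, 61), (503, 900), (200, 47)]

error_rate = sum(1 for status, _ in samples if status >= 500) / len(samples)

latencies = sorted(latency for _, latency in samples)
p95_latency = latencies[int(0.95 * (len(latencies) - 1))]   # rough nearest-rank percentile

print(f"error rate: {error_rate:.1%}, p95 latency: {p95_latency} ms")
```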
Step-by-step triage flow (15–30 minutes)
- Declare: Name the incident, set severity, create a shared channel, and assign roles (Incident Lead, Comms, Ops). If you are solo, you hold all the roles.
- Stabilize now: If a recent change correlates, consider rolling back or disabling the flag. If traffic overwhelms capacity, apply rate limits or temporary degradation, e.g., serving cached responses (see the fallback sketch after this list).
- Timebox data gathering: 10 minutes to check dashboards, top logs, and traces. Capture 3–5 facts (metrics trend, timeline, change list, scope).
- Bound blast radius: Identify which endpoints/regions/tenants are affected. Redirect traffic if possible.
- Update cadence: Post a brief update: What we know, What we’re doing, Next update time.
- Call owners: Page service owners of the suspected component. Share timeline and current mitigation.
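The “serve cached responses” degradation from the stabilize step can look like this in application code. A minimal sketch assuming an in-process cache; the cache, TTL, and handler names are hypothetical.

```python
import time

# Hypothetical last-known-good cache keyed by request path.
_stale_cache: dict[str, tuple[float, dict]] = {}
STALE_TTL_SECONDS = 300   # how long a stale response stays acceptable during an incident

def get_with_degradation(path: str, fetch_fresh) -> dict:
    """Try the normal path; if it fails, fall back to a recent cached response."""
    try:
        response = fetch_fresh(path)
        _stale_cache[path] = (time.time(), response)
        return response
    except Exception:
        cached = _stale_cache.get(path)
        if cached and time.time() - cached[0] < STALE_TTL_SECONDS:
            return cached[1]   # degraded but serving, bounded by the stale TTL
        raise                  # nothing safe to serve; surface the error
```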
What to communicate (template)
- Impact: Who/what is affected (numbers if possible).
- Status: Stabilized, investigating, or recovering.
- Actions: Current mitigation and next step.
- ETA: Next update in X minutes; precise ETAs only if confident.
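Generating the update from a few structured fields keeps the format consistent under pressure. A small sketch:

```python
def status_update(impact: str, status: str, actions: str, next_update_min: int) -> str:
    """Render the four-part update in a fixed order so nothing gets dropped."""
    return (
        f"Impact: {impact}\n"
        f"Status: {status}\n"
        f"Actions: {actions}\n"
        f"ETA: next update in {next_update_min} minutes"
    )

print(status_update(
    impact="~7% of /v1/payments requests failing (5xx)",
    status="Stabilizing (rollback in progress)",
    actions="Rolling back deploy; payments owner paged",
    next_update_min=10,
))
```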
Worked examples
Example 1: Error spike after deploy
- Signals: 5 minutes after deploy, HTTP 5xx rises from 0.2% to 7% for /v1/payments. Latency OK. Logs show validation failure exceptions.
- STABLE actions:
- S: Roll back the last deploy or disable the new validation flag (a kill-switch sketch follows this example).
- T: Classify SEV2 (many customers impacted on a core endpoint).
- A: Confirm correlation on deployment timeline and check traces for failing code path.
- B: Scope limited to /v1/payments; all regions affected.
- L: Communicate rollback in progress; next update in 10 minutes.
- E: Notify payments service owner and QA for hotfix follow-up.
- Outcome: Rollback drops errors to baseline; incident downgrades and closes after verification.
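Disabling the new validation flag in Example 1 assumes the new code path sits behind a kill switch. A hedged sketch of that pattern; the flag store, flag name, and validators are hypothetical.

```python
# Hypothetical flag lookup; in practice this reads your feature-flag service or config.
FLAGS = {"payments_new_validation": False}   # flipped off during the incident

def strict_validate(payload: dict) -> None:
    """New, suspect validation introduced by the deploy (stub for illustration)."""

def legacy_validate(payload: dict) -> None:
    """Known-good validation to fall back to (stub for illustration)."""

def validate_payment(payload: dict) -> None:
    if FLAGS.get("payments_new_validation", False):
        strict_validate(payload)   # suspect code path behind the flag
    else:
        legacy_validate(payload)   # reversible mitigation: known-good path
```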
Example 2: Regional latency regression
- Signals: p95 latency up 3x in eu-west only. Traffic normal. Cache hit rate dropped; DB CPU high in eu-west.
- STABLE actions:
- S: Shift a portion of eu-west traffic to eu-central; warm the cache by prefetching hot keys (see the prefetch sketch after this example).
- T: SEV3 (regional, degraded, workarounds exist).
- A: Compare DB metrics and cache metrics pre/post regression; check for regional config drift.
- B: Impact limited to eu-west; non-critical endpoints are hit hardest.
- L: Update stakeholders; next update in 15 minutes.
- E: Page DB on-call to investigate index regression or hot partition.
- Outcome: Hot index recreated; cache warmed; latency normalizes.
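The cache-warming step in Example 2 is essentially a prefetch loop over the hottest keys. A minimal sketch; the cache and source clients are stand-ins for whatever your stack uses.

```python
def warm_cache(hot_keys: list[str], cache, source, ttl_seconds: int = 600) -> int:
    """Prefetch hot keys so reads stop falling through to the overloaded database."""
    warmed = 0
    for key in hot_keys:
        if cache.get(key) is None:        # only fill misses; don't evict good entries
            value = source.get(key)       # single read from the primary store
            cache.set(key, value, ttl_seconds)
            warmed += 1
    return warmed
```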
Example 3: Partial outage due to dependency
- Signals: Upstream identity provider returns 503 intermittently. Error rate on your login/auth endpoints is at 10%; retries are ineffective.
- STABLE actions:
- S: Fail open for idempotent reads where safe, serve cached tokens within a short TTL, and lengthen retry backoff.
- T: SEV2 if many users cannot log in; SEV3 if a subset.
- A: Check dependency status and your retry/circuit-breaker behavior.
- B: Limit to auth flows; keep other APIs healthy.
- L: Communicate external dependency issue and your mitigations.
- E: Engage the partner’s status/ops channel; consider a feature toggle to reduce auth requests.
- Outcome: Circuit breaker + caching caps impact until upstream recovers.
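Example 3 leans on a circuit breaker in front of the identity provider. A minimal sketch of the idea; the threshold and cool-off are illustrative, and production libraries add richer half-open handling, metrics, and per-endpoint state.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cool-off period, then probe again."""

    def __init__(self, failure_threshold: int = 5, cool_off_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cool_off_seconds = cool_off_seconds
        self.failures = 0
        self.opened_at = None   # timestamp when the circuit opened, or None if closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cool_off_seconds:
                raise RuntimeError("circuit open: dependency call skipped")
            self.opened_at = None   # half-open: let one probe through
        try:
            result = fn(*args, **kwargs)
            self.failures = 0       # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
```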
Exercises
These mirror the exercises below. Try first, then open the solution.
Exercise 1: Classify and stabilize an error spike
Scenario: In the last 7 minutes, 5xx errors rose from 0.1% to 6% on /v1/orders. Deploy happened 9 minutes ago. Traces show most failures at a new pricing check; DB metrics stable; latency unchanged. 40% of customers call this endpoint hourly.
- Decide severity and why.
- Pick the first mitigation.
- List 3 data points to confirm the hypothesis.
- Write a two-line status update.
Hints
- Look for correlation with recent changes.
- Favor reversible mitigations (rollback/flag off).
- Scope the impact: single endpoint vs whole API.
Exercise 2: Regional saturation
Scenario: In us-east, CPU is 90%+, queue depth growing, p95 latency doubled. Other regions are normal. No deploys in the last hour. Traffic +20% due to a partner campaign.
- Choose severity and immediate mitigation.
- Bound the blast radius.
- List follow-up actions and owners.
Hints
- Think rate limits, autoscaling, and traffic shifting.
- Communicate cadence and ETA for the next update.
Self-check checklist
- You can state a severity with a short, clear justification.
- You propose a mitigation that reduces impact within 10 minutes.
- You identify whether it is global or scoped (endpoint/region/tenant).
- Your status update mentions impact, action, and next update time.
Common mistakes and how to self-check
- Hunting root cause too early: Ask, "Does this reduce impact now?" If not, defer.
- Unclear ownership: Assign a lead and specific owners for each action.
- Silent investigation: Post updates on a cadence even if there’s no news.
- Over-mitigating: Prefer reversible, scoped mitigations; avoid platform-wide changes unless necessary.
- Missing the timeline: Keep a lightweight log of actions to avoid loops and aid handoffs.
Practical projects
- Create a one-page runbook template: severity, roles, first 10-minute checks, rollback steps, status update snippet (one possible shape is sketched after this list).
- Build a dashboard for the golden signals of one service (error rate, p95 latency, traffic, saturation) with deploy markers.
- Set up a safe game day: trigger a controlled failure in staging, practice STABLE, and time your first mitigation.
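For the runbook project, here is one possible starting shape, expressed as a Python dict you could serialize to wherever on-call documentation lives; the fields mirror the sections above and are suggestions, not a standard.

```python
# Illustrative one-page runbook template; adapt the fields to your own process.
RUNBOOK_TEMPLATE = {
    "service": "<service name>",
    "severity_guide": "SEV1 widespread/critical ... SEV4 minor with workaround",
    "roles": ["Incident Lead", "Comms", "Ops"],
    "first_10_minute_checks": [
        "Error rate and p95 latency dashboards (with deploy markers)",
        "Recent deploys, flag flips, config and infra changes",
        "Dependency health: DB, cache, upstream status pages",
    ],
    "rollback_steps": ["<command or console link>", "<verification dashboard or query>"],
    "status_update_snippet": "Impact: ... Status: ... Actions: ... ETA: next update in N minutes",
}
```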
Learning path
- Start: Incident triage (this page) – stabilize first, then investigate.
- Next: Alert design and SLOs – alert on symptoms, not just causes; define acceptable error/latency.
- Then: Post-incident reviews – capture learnings and improve runbooks, alerts, and tests.
- Later: Resilience patterns – circuit breakers, backpressure, bulkheads, and graceful degradation.
Mini challenge
In the last 4 minutes, error rate is normal, but p95 latency doubled for write endpoints only. Cache hit rate is fine; DB write IOPS maxed. No deploys in 30 minutes. What is your STABLE plan in 5 bullet points?
Sample approach
- S: Temporarily reduce write QPS (rate limit; see the token-bucket sketch after this list) and enable write-queue smoothing.
- T: SEV2 if many writes blocked; SEV3 if a segment.
- A: Confirm DB write contention (locks/hot partitions) and top write queries.
- B: Limit impact to write endpoints; keep reads fully available.
- L/E: Update, then page DB owners to apply targeted mitigation (index/partitioning/throughput scaling).
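“Reduce write QPS” in the sample approach usually means temporary admission control in front of the write path. A minimal token-bucket sketch; the rate and burst numbers are placeholders for whatever incident-time setting you choose.

```python
import time

class TokenBucket:
    """Admit a write only if a token is available; refill tokens at a fixed rate."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # shed or queue the write instead of hitting the saturated DB

# Temporary incident setting: cap writes at 200/s with a small burst (placeholder numbers).
write_limiter = TokenBucket(rate_per_second=200, burst=50)
```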
Next steps
- Save your runbook template where on-call can find it.
- Practice with a teammate: 10-minute drills using the examples above.
- Take the Quick Test below. Everyone can take it for free; if you’re logged in, your progress will be saved.