Change Risk Management

Learn Change Risk Management for free with explanations, exercises, and a quick test (for Platform Engineers).

Published: January 23, 2026 | Updated: January 23, 2026

Why this matters

As a Platform Engineer, you change production systems: deploy services, alter infrastructure, run migrations, and apply patches. Every change carries risk: user impact, outages, data loss, or security regressions. Good change risk management helps you ship faster with confidence, reduce incidents, and protect SLOs/error budgets.

  • Real tasks: plan rollouts, choose canary vs. blue/green, define rollback criteria, get approvals, and monitor impact.
  • Outcomes: fewer incident calls, predictable releases, and solid auditability for compliance.

Who this is for

  • Platform/DevOps/SRE/Backend engineers who deploy or manage infrastructure.
  • Tech leads who approve or coordinate changes.
  • Engineers building internal tooling for releases and rollbacks.

Prerequisites

  • Basic Git and CI/CD familiarity.
  • Understanding of service health: latency, errors, saturation, and SLOs.
  • Comfort with reading dashboards and alerts.

Concept explained simply

Change risk management is a lightweight system to judge how dangerous a change is, reduce that risk, communicate it, and recover fast if needed.

Mental model

Think of each change as a rocket launch:

  • Pre-flight: verify checklists, simulate, get a go/no-go.
  • Launch: start small (canary), watch telemetry, be ready to abort.
  • After: confirm success, log decisions, learn for next time.

Core components of change risk

  • Change type: standard (pre-approved), normal (needs review), emergency (expedited with safeguards).
  • Blast radius: how many users/systems can be affected.
  • Impact × Likelihood: risk score = impact (1–5) × likelihood (1–5). Categorize: Low (1–5), Medium (6–10), High (11–15), Critical (16–25). A scoring sketch follows this list.
  • Timing: maintenance windows, freezes, on-call coverage.
  • Guardrails: canary, feature flags, throttles, rate limits, circuit breakers.
  • Approvals: peer review, change approver, CAB for high risk (keep it fast and practical).
  • Rollback plan: explicit trigger metrics, steps, expected time to restore.
  • Observability gates: pre-change health, during-change SLO guardrails, post-change validation.
  • Communication: who needs to know before/during/after; single source of truth change record.
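
The impact × likelihood scoring above is simple enough to automate in your change tooling. Below is a minimal sketch in Python; the band boundaries mirror the categories listed above, and the function name is illustrative rather than any standard API.

```python
def risk_score(impact: int, likelihood: int) -> tuple[int, str]:
    """Score a change as impact (1-5) x likelihood (1-5) and map it to a band."""
    if not (1 <= impact <= 5 and 1 <= likelihood <= 5):
        raise ValueError("impact and likelihood must each be between 1 and 5")
    score = impact * likelihood
    if score <= 5:
        band = "Low"
    elif score <= 10:
        band = "Medium"
    elif score <= 15:
        band = "High"
    else:
        band = "Critical"
    return score, band

# Example: the database migration in Worked example 1 below (impact 3, likelihood 3).
print(risk_score(3, 3))  # -> (9, 'Medium')
```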

Worked examples

Example 1: Database migration adding a non-blocking index

  • Context: adds an index to a large table; could cause increased IO and lock contention.
  • Assessment: Impact 3 (performance risk), Likelihood 3 → Risk 9 (Medium).
  • Mitigations: use concurrent index creation, off-peak window, throttle IO, canary on a replica first.
  • Observability: DB CPU/IO, lock waits, API latency p95/p99, error rate.
  • Rollback: drop index if regressions exceed thresholds for 10 minutes; restore snapshot if needed.
  • Plan: dry run in staging → index on replica → promote replica if stable → apply to primary.
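
A minimal sketch of the index step above, assuming PostgreSQL and the psycopg2 driver; the connection string, table, and index names are hypothetical placeholders.

```python
import psycopg2

# CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
# so the connection must be in autocommit mode.
conn = psycopg2.connect("dbname=app host=db.internal")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Fail fast instead of queueing behind long-running locks.
    cur.execute("SET lock_timeout = '5s'")
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_created_at "
        "ON orders (created_at)"
    )

# Rollback path if thresholds are breached for 10 minutes:
#   DROP INDEX CONCURRENTLY IF EXISTS idx_orders_created_at;
```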

Example 2: Hotfix for a critical auth bug

  • Context: production bug blocks logins for some users.
  • Assessment: Impact 4, Likelihood 4 → Risk 16 (Critical) but urgent to fix.
  • Mitigations: emergency change with pair review, canary to 5% traffic, feature flag to disable instantly.
  • Observability: login success rate, error codes, latency.
  • Rollback: toggle flag off, roll back canary, redeploy previous version within 5 minutes.
  • Plan: prepare artifact and flag → canary 5% (10 min) → 25% (10 min) → 100% if stable.
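
The plan above reduces to a loop over rollout stages, each with a hold time and an abort trigger. A minimal sketch, assuming your stack exposes a feature-flag client and a metrics query; set_rollout_percent and login_success_rate are hypothetical stubs, not a real library API.

```python
import time

STAGES = [5, 25, 100]      # percent of traffic routed to the hotfix
HOLD_SECONDS = 10 * 60     # observe each stage for 10 minutes
MIN_LOGIN_SUCCESS = 0.99   # abort trigger from the observability gates

def set_rollout_percent(percent: int) -> None:
    """Hypothetical feature-flag call; swap in your flag provider's client."""
    print(f"routing {percent}% of traffic to the hotfix")

def login_success_rate() -> float:
    """Hypothetical metrics query; swap in a real query over the last 5 minutes."""
    return 0.995  # stand-in value for illustration

def rollout() -> bool:
    for percent in STAGES:
        set_rollout_percent(percent)
        deadline = time.monotonic() + HOLD_SECONDS
        while time.monotonic() < deadline:
            if login_success_rate() < MIN_LOGIN_SUCCESS:
                # Abort: flag back to 0%, then redeploy the previous version.
                set_rollout_percent(0)
                return False
            time.sleep(30)  # poll every 30 seconds during the hold
    return True
```

Keeping the stages, hold time, and trigger as data makes the plan reviewable before the change and auditable after it.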

Example 3: Kubernetes node pool upgrade

  • Context: upgrade nodes from v1.25 to v1.27.
  • Assessment: Impact 4 (cluster-wide), Likelihood 2 → Risk 8 (Medium).
  • Mitigations: surge capacity, PodDisruptionBudgets, drain one node at a time, readiness/liveness verified.
  • Observability: request error rate, HPA behavior, pod restarts, node NotReady events.
  • Rollback: stop the upgrade, recreate nodes from the previous image, cordon and evacuate problematic nodes.
  • Plan: create new pool → move 10% workloads → validate → migrate remaining in batches.
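
A minimal sketch of the batched cordon-and-drain step, assuming kubectl is already configured against the cluster; node names and batch size are illustrative, and the node pool itself is normally replaced through your cloud provider's tooling.

```python
import subprocess

def kubectl(*args: str) -> str:
    """Run a kubectl command and return its stdout, raising if it fails."""
    return subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    ).stdout

def drain_in_batches(nodes: list[str], batch_size: int = 1) -> None:
    """Cordon and drain old-pool nodes a batch at a time; PDBs gate eviction speed."""
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i : i + batch_size]
        for node in batch:
            kubectl("cordon", node)
            kubectl("drain", node,
                    "--ignore-daemonsets", "--delete-emptydir-data",
                    "--timeout=10m")
        # Hold between batches: validate dashboards before continuing.
        input(f"Drained {batch}; check error rate and pod restarts, then press Enter...")
```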

Checklists you can reuse

Pre-change checklist

  • Defined objective and scope (what will and won't change).
  • Risk score computed (impact × likelihood) and category set.
  • Backout plan with explicit triggers and time-to-restore target.
  • Health checks and dashboards identified; alerts tuned.
  • Change window confirmed; on-call informed.
  • Approvals recorded; change note created with owner and timeline (see the change record sketch after this checklist).
  • Data backups or snapshots verified (for data-affecting changes).
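
One way to make this checklist hard to skip is to encode the change note as a structured record that reviewers and tooling can validate. A minimal sketch with illustrative field names; adapt it to whatever your change system actually stores.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRecord:
    """One-page change note covering the pre-change checklist fields."""
    title: str
    owner: str                       # single owner with decision authority
    scope: str                       # what will and won't change
    impact: int                      # 1-5
    likelihood: int                  # 1-5
    rollout_plan: str                # phases and hold times
    rollback_plan: str               # triggers, steps, time-to-restore target
    metrics: list[str] = field(default_factory=list)
    approvals: list[str] = field(default_factory=list)
    backups_verified: bool = False

    @property
    def risk(self) -> int:
        return self.impact * self.likelihood
```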

During-change checklist

  • Start small: canary/partial rollout.
  • Watch key metrics continuously (latency, errors, saturation, business KPIs).
  • Hold time between steps to observe (e.g., 10–15 minutes).
  • Abort if thresholds exceeded; execute rollback immediately.
  • Keep a live log of actions and timestamps.

Post-change checklist

  • Validate success criteria; compare before/after metrics.
  • Update runbooks and change record with outcomes.
  • Communicate completion and any follow-ups.
  • Schedule a brief retro if issues occurred.

Exercises

These mirror the exercises below. Do them in writing, then compare with the solutions. Tip: clarity beats length.

  1. Exercise 1: Score and plan a feature-flagged rollout

    Scenario: You will enable server-side rendering (SSR) for a high-traffic page. It changes caching behavior and increases CPU usage.

    Tasks:

    • Assign impact and likelihood, compute risk, and justify it.
    • Propose a rollout plan (phased percentages, hold times, metrics, abort triggers).
    • Write a 3-step rollback plan with target restore time.
  2. Exercise 2: Emergency patch with guardrails

    Scenario: Apply a kernel security patch on a subset of nodes today.

    Tasks:

    • Define which workloads to move first and why (blast radius control).
    • List observability gates to pass before moving to the next batch.
    • Document approval and communication steps that keep this safe and fast.

Self-check checklist

  • Your risk score includes both impact and likelihood with reasoning.
  • Rollout plan starts small, includes hold times, and names concrete metrics.
  • Rollback has clear triggers, steps, and a time-to-restore target.
  • Stakeholders and on-call are identified with when/how to update them.

Common mistakes and how to self-check

  • Vague rollback: fix by listing exact commands or steps and a timer to abort.
  • No metrics: fix by naming 3–5 metrics plus thresholds (e.g., error rate > 2% for 5 min); a trigger sketch follows this list.
  • Skipping hold times: fix by enforcing a minimum observation window per step.
  • All-at-once rollouts: fix by canarying to a small percent or isolated segment first.
  • Unclear ownership: fix by naming a single change owner with decision authority.
  • Silent changes: fix by pre/post announcements and a single change record.
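
The "no metrics" and "skipping hold times" mistakes are easiest to avoid when abort triggers are written down as data rather than left to judgment in the moment. A minimal sketch of a sustained-threshold trigger, assuming the metric is sampled about once per minute; the 2%-for-5-minutes figures are the example above.

```python
from collections import deque

class AbortTrigger:
    """Fires only when a metric stays above its threshold for `window` consecutive samples."""

    def __init__(self, threshold: float, window: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window)

    def breached(self, value: float) -> bool:
        self.samples.append(value)
        return (len(self.samples) == self.samples.maxlen
                and all(v > self.threshold for v in self.samples))

# Example: abort if the error rate exceeds 2% for 5 consecutive one-minute samples.
error_rate_trigger = AbortTrigger(threshold=0.02, window=5)
```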

Practical projects

  • Create a risk template: a one-page form with fields for scope, risk score, rollout, rollback, metrics, approvals.
  • Build a canary checklist specific to one service (metrics, thresholds, hold times).
  • Write a runbook to roll back your main app within 10 minutes, then test it in staging.
  • Define observability gates in your CI/CD (e.g., block promotion if error rate exceeds threshold during smoke test).
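
A minimal sketch of such a promotion gate, assuming Prometheus as the metrics backend; the server URL and PromQL query are placeholders to adapt, and the non-zero exit code is what lets the CI job block promotion.

```python
import json
import sys
import urllib.parse
import urllib.request

PROMETHEUS = "http://prometheus.internal:9090"  # placeholder URL
QUERY = 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
MAX_ERROR_RATE = 0.02

url = f"{PROMETHEUS}/api/v1/query?{urllib.parse.urlencode({'query': QUERY})}"
with urllib.request.urlopen(url, timeout=10) as resp:
    result = json.load(resp)["data"]["result"]

error_rate = float(result[0]["value"][1]) if result else 0.0
if error_rate > MAX_ERROR_RATE:
    print(f"Gate failed: error rate {error_rate:.2%} exceeds {MAX_ERROR_RATE:.2%}")
    sys.exit(1)
print(f"Gate passed: error rate {error_rate:.2%}")
```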

Learning path

  1. Start: Learn risk scoring and blast radius. Practice on recent changes.
  2. Intermediate: Add canary/blue-green, feature flags, and rollback automation.
  3. Advanced: Introduce change freeze rules, SLO-based approvals, and automated verification.
  4. Mastery: Measure change failure rate, mean time to restore (MTTR), and continuously improve.
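
For the mastery step, both numbers are simple ratios over your change log. A minimal sketch, assuming each change is recorded with its outcome and, when it failed, the time to restore; the sample data is made up for illustration.

```python
from datetime import timedelta

# Hypothetical change log entries: (change id, caused a failure?, time to restore).
changes = [
    ("chg-101", False, None),
    ("chg-102", True,  timedelta(minutes=18)),
    ("chg-103", False, None),
    ("chg-104", True,  timedelta(minutes=42)),
]

restore_times = [ttr for _, failed, ttr in changes if failed]
change_failure_rate = len(restore_times) / len(changes)          # 50% in this sample
mttr = sum(restore_times, timedelta()) / len(restore_times)      # 30 minutes here

print(f"Change failure rate: {change_failure_rate:.0%}, MTTR: {mttr}")
```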

Quick Test

Take the Quick Test below to check your understanding. Everyone can take it for free. Note: only logged-in users will see saved progress and history.

Mini challenge

Pick a real change you plan this week. In 20 minutes, draft: risk score with reasoning, a 3-step canary plan, 3 observability gates, and an explicit rollback with a 10-minute restore target. Share with a teammate for review before executing.

Next steps

  • Use the checklists on your next two changes.
  • Automate one guardrail (e.g., feature flag kill switch or a canary deploy step).
  • After each change, record outcomes and refine thresholds and hold times.

Practice Exercises

2 exercises to complete

Instructions

Scenario: Enable server-side rendering (SSR) for a high-traffic page. CPU usage may increase; caching behavior changes.

  1. Assign impact and likelihood (1–5 each), compute risk, and justify it in 2–3 sentences.
  2. Create a phased rollout: percentages, hold times, metrics, and abort triggers.
  3. Write a 3-step rollback plan and a target time to restore.

Expected Output

A concise change note containing: risk score with reasoning; a phased plan (e.g., 5% → 25% → 100% with 10–15 min holds); named metrics with thresholds; rollback steps with a 10-minute restore target.

Change Risk Management: Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
