Why this matters
As a Platform Engineer, you enable teams to ship safely and fast. Automated promotions move changes forward when health signals look good. Automated rollbacks protect users when things go wrong. Together, they reduce toil, mean time to recovery (MTTR), and release anxiety.
- Real task: Design a pipeline that canary releases a new service version, auto-promotes if metrics stay healthy, and auto-rolls back if error rate spikes.
- Real task: Define promotion gates (latency, error %, saturation) and bake them into CI/CD so decisions are data-driven, not manual.
- Real task: Ensure rollbacks are fast, reversible, and regularly tested, including database and config changes.
Concept explained simply
Automated promotion moves a release from one stage to the next (e.g., canary → 25% → 50% → 100%) when health checks meet agreed thresholds for a set duration.
Automated rollback quickly returns traffic to the last known good version when health checks fail beyond thresholds, without waiting for manual approval.
Mental model
Imagine an airlock with sensors. Each door opens (promotion) only if sensors stay green long enough. If a sensor turns red, doors snap shut and you step back (rollback). You predefine:
- Signals: What you measure (e.g., HTTP 5xx %, p95 latency, readiness, key business success rate).
- Thresholds: Acceptable limits (e.g., 5xx < 1% over 10 min).
- Windows: How long the metrics must hold.
- Actions: Promote, pause, or roll back (a minimal gate-policy sketch follows this list).
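Concretely, a gate policy might look like this. The schema is illustrative, not tied to any specific tool:

```yaml
# Illustrative gate policy; field names are hypothetical.
signals:
  http_5xx_pct:      "< 1"      # percent of requests returning 5xx
  p95_latency:       "< +15%"   # relative to the current baseline
  login_success_pct: "> 99"     # one user-facing business metric
window: 10m                     # how long signals must hold green
on_pass: promote                # open the next door
on_breach: rollback             # snap the doors shut
```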
Key building blocks
- Immutable artifacts: The same build goes from staging to production.
- Progressive delivery: Canary, blue/green, or feature flags to control blast radius.
- Health signals: SLO-based metrics (latency, errors, availability), resource health, and a small set of critical business metrics.
- Gates: Policy that promotes, pauses, or rolls back based on signals and time windows.
- Rollback plans: Pre-defined actions for app, config, and data (including backwards-compatible DB changes).
- Observability: Logs, metrics, traces to detect regressions quickly.
- Runbooks: Documented steps for edge cases and verifying post-rollback state.
Worked examples
Example 1: Kubernetes canary with metric gates
Scenario: Deploy v1.3 of a service. Start the canary at 10% traffic. Promote to 50%, then 100%, if healthy. Roll back if the 5xx rate or latency breaches its gate.
{"strategy":"canary","steps":[{"setWeight":10,"hold":"10m","gates":{"http_5xx_pct":"<1%","p95_latency_ms":"<+15% vs baseline"}},{"setWeight":50,"hold":"15m","gates":{"http_5xx_pct":"<0.8%","p95_latency_ms":"<+10%"}},{"setWeight":100}] ,"rollback":{"toWeight":0,"reason":"gate_breach"}}
What happens: The controller compares canary metrics to baseline (current production). If any gate breaches during the hold, it reverts traffic to 0% and marks the rollout failed.
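The same policy could be expressed in a real controller. Below is a rough sketch using Argo Rollouts with a Prometheus-backed gate; the service name, image, Prometheus address, and query labels are placeholders, not values from this lesson:

```yaml
# Canary rollout with a background metric gate (Argo Rollouts).
# my-service, the image, and the Prometheus details are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-service
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
      - name: my-service
        image: registry.example.com/my-service:v1.3
  strategy:
    canary:
      analysis:                  # runs continuously during the rollout
        templates:
        - templateName: http-5xx-gate
      steps:
      - setWeight: 10
      - pause: {duration: 10m}   # hold at 10%
      - setWeight: 50
      - pause: {duration: 15m}   # hold at 50%
      - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: http-5xx-gate
spec:
  metrics:
  - name: http-5xx-pct
    interval: 1m
    failureLimit: 2              # tolerate isolated spikes before failing
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(rate(http_requests_total{service="my-service",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="my-service"}[5m]))
```

If the background analysis fails, the controller aborts the rollout and shifts traffic back to the stable version without manual intervention.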
Example 2: Blue/green with synthetic checks and auto-promotion
Scenario: Spin up a green stack alongside blue. Run synthetic checks and a small shadow traffic mirror. If checks pass for 15 minutes, switch the router to green. If checks fail, tear down green and keep blue.
- Phase A: Provision green. Check health endpoints and run migrations in safe, backwards-compatible mode.
- Phase B: Mirror 5% of read-only traffic to green (no user impact).
- Gate: 0 test failures AND p95 latency within +10% of blue for 15 minutes.
- Promote: Flip router from blue → green atomically.
- Rollback: Flip back to blue. Green remains for debugging, then decommission. (One common flip mechanism is sketched below.)
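On Kubernetes, one common way to get an atomic flip is a Service whose selector targets a track label; here is a minimal sketch under that assumption (names and ports are placeholders):

```yaml
# The "router" is a Service; changing the track selector moves
# 100% of traffic atomically. my-app and the ports are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
    track: blue    # set to "green" to promote; back to "blue" to roll back
  ports:
  - port: 80
    targetPort: 8080
```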
Example 3: Feature-flagged risky code path
Scenario: Release code dark with flag off. Gradually increase flag exposure 1% → 10% → 50% → 100% based on metrics. Auto rollback means flipping the flag off, not redeploying.
- Gate: 5xx < 0.7%, p95 latency < +8%, checkout success > 98% over 10 minutes.
- Promotion: Increase exposure to next stage.
- Rollback: Immediately set exposure to 0% if a gate breaches (see the sketch below).
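As configuration, the exposure ladder might look like this. The schema is hypothetical, for illustration only; real flag providers each have their own format:

```yaml
# Hypothetical flag-rollout config; field names are illustrative.
flag: risky-code-path
stages:
  - exposure_pct: 1
    hold: 10m
  - exposure_pct: 10
    hold: 10m
  - exposure_pct: 50
    hold: 10m
  - exposure_pct: 100
gates:
  http_5xx_pct:         "< 0.7"
  p95_latency_delta:    "< +8%"
  checkout_success_pct: "> 98"
on_breach:
  set_exposure_pct: 0   # flipping the flag off is the rollback; no redeploy
```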
Quality gates and practical thresholds
- Error rate (HTTP 5xx %): breach if > 1% for 5 consecutive minutes.
- Latency (p95): breach if > +15% compared to baseline for 10 minutes.
- Availability: breach if < 99.5% during the hold window.
- Resource health: breach if OOM kills detected or restart rate > 3 per pod in 10 minutes.
- Business guardrail: sign-in or checkout success below target for 2 consecutive check intervals.
Use short windows for fast feedback, but add anti-flapping: require multiple consecutive breaches before rollback.
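An anti-flapping rule could be encoded like this (illustrative schema, not a specific tool's API):

```yaml
# A single bad sample pauses; only sustained breaches roll back.
check:
  metric: http_5xx_pct
  threshold: "> 1"
  interval: 1m
on_single_breach: pause     # hold the rollout and alert
on_consecutive_breaches:
  count: 3                  # three bad samples in a row
  action: rollback
```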
Designing rollbacks that actually work
- DB migrations: Prefer backwards-compatible, two-step migrations (expand → contract; see the sketch after this list). Rollback should not require dropping data.
- Config changes: Version configs and be able to revert independently of code.
- Stateful services: Ensure schema and serialization are compatible across versions.
- Idempotency: Deploy and rollback steps should be safe to re-run.
- Dry runs: Test rollback paths in a staging or ephemeral env on every PR when feasible.
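A sketch of an expand → contract plan, with made-up step and column names to show the shape:

```yaml
# Expand/contract across two releases so either version can run,
# and rolling back never requires dropping data.
expand:    # ships with vN; backwards-compatible and rollback-safe
  - add_column: orders.customer_ref    # nullable, ignored by old code
  - dual_write: [orders.customer_id, orders.customer_ref]
contract:  # ships only after vN+1 is at 100% and stable
  - backfill: orders.customer_ref
  - stop_writing: orders.customer_id
  - drop_column: orders.customer_id    # nothing can need it back now
```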
Runbook template (copy/paste)
Title: Service X rollout with auto-promotion and auto-rollback
- Signals: 5xx %, p95 latency, pod restarts, key business KPI.
- Thresholds: define per release (see release notes).
- Promotion steps: 10% (10m) → 50% (15m) → 100%.
- Rollback trigger: any breach for 2 consecutive check intervals.
- Rollback action: traffic to previous stable, disable flag if applicable, revert config version.
- Post-rollback checks: SLOs back to normal, error budgets unchanged, user impact resolved.
Implementation steps (from zero to safe rollouts)
1. Instrument the service: expose latency, error rate, and one or two business metrics.
2. Agree on signals, thresholds, and hold windows with the team, and write them down.
3. Pick a progressive-delivery strategy (canary, blue/green, or feature flags) that fits your platform.
4. Encode the gates in CI/CD so promotion and rollback happen without manual approval.
5. Write and version the rollback plan, including config and DB steps.
6. Rehearse: trigger a deliberate gate breach in staging and confirm rollback completes within your target time.
7. Tune thresholds after each release to reduce flapping and close blind spots.
Exercises
These mirror the interactive tasks below. Try them before opening the solutions.
Exercise 1: Define promotion gates for a new API (ex1)
Create promotion gates for a canary rollout: 10% → 50% → 100%. Include thresholds for 5xx %, p95 latency, and a simple business metric (e.g., login success).
- Output format: a small JSON or YAML block with gates and holds.
- Constraint: canary must hold at least 10 minutes at 10% and 15 minutes at 50%.
Exercise 2: Write a rollback plan for Kubernetes (ex2)
Draft a step-by-step rollback plan that reverts traffic to the last stable version within 2 minutes. Include verification steps and what to do if DB changes were applied.
Exercise 3: Decide promote/rollback from signals (ex3)
Given: After 12 minutes at 10%, metrics show 5xx = 1.4% (baseline 0.3%), p95 latency +22% vs baseline, restarts = 0, business KPI unchanged. Decide whether to promote, pause, or roll back, and justify.
Self-check checklist
- Did you set both absolute and relative thresholds?
- Do gates include at least one user-facing business metric?
- Is rollback idempotent and fast (under 2 minutes)?
- Did you include post-rollback verification steps?
Common mistakes and how to self-check
- Gating on the wrong metrics: watching only CPU/memory. Fix: add latency, error %, and one business KPI.
- Over-sensitive gates (flapping): a single spike triggers rollback. Fix: require consecutive breaches or use a rolling window.
- Unrehearsed rollbacks: Plan exists but never tested. Fix: Regularly simulate failures.
- DB coupling: Non-backwards-compatible migrations. Fix: expand/contract and feature flags.
- Skipping baseline comparison: New version judged in isolation. Fix: compare against current production.
- Holding only at 100%: issues that would have surfaced at partial traffic instead hit all users. Fix: meaningful holds at each partial-traffic step.
Practical projects
- Build a canary pipeline for a sample service with 10% → 50% → 100% and metric gates. Include auto-rollback.
- Implement blue/green for a web app, with synthetic checks and a one-click flip plus instant rollback.
- Add a feature flag to a risky endpoint and automate exposure stages tied to SLOs.
Learning path
- Before: CI basics, containerization, Kubernetes or your runtime platform, observability fundamentals.
- This topic: Automated gates, promotions, and rollbacks.
- After: Policy as code, GitOps workflows, error budgets and SLO-based release policies, disaster recovery drills.
Who this is for
- Platform Engineers and SREs building release platforms.
- Backend engineers owning services and on-call.
- Tech leads seeking safer, faster delivery practices.
Prerequisites
- Comfort with CI/CD pipelines and YAML-based configs.
- Basic Kubernetes (or your deployment platform) knowledge.
- Understanding of service metrics: latency, errors, saturation.
Next steps
- Complete the Quick Test to validate your understanding.
- Pick one Practical project and implement it this week.
- Schedule a recurring rollback drill with your team.
Mini challenge
Your service adds an indexing layer that may increase write latency. Design a rollout plan with automated promotions and rollbacks. Include:
- Promotion ladder and hold times.
- Metrics and thresholds (relative to baseline).
- Rollback plan, including DB considerations.
- How you will test the rollback path before production.