Rollback And Rollforward Strategy

Learn Rollback And Rollforward Strategy for free with explanations, exercises, and a quick test (for API Engineer).

Published: January 21, 2026 | Updated: January 21, 2026

Why this matters

Canary Releases

Release to a small percentage first (e.g., 1%-5%). Watch error rate, latency, saturation, and business KPIs. Promote or halt based on thresholds.
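
One way to pick the canary slice is to hash a stable request key into a bucket, so the same caller stays on the same version for the whole stage. The sketch below is illustrative only: the function name `in_canary` and the choice of a client id as the key are assumptions, not something this lesson prescribes.

```python
import hashlib


def in_canary(request_key: str, canary_percent: float) -> bool:
    """Return True if this key falls inside the canary slice.

    request_key is a hypothetical stable identifier (for example a client
    or account id). Hashing it keeps routing sticky across requests.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the first 4 bytes to a bucket in [0, 10000), i.e. 0.01% steps.
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < canary_percent * 100


if __name__ == "__main__":
    keys = [f"client-{i}" for i in range(100_000)]
    share = sum(in_canary(k, 5) for k in keys) / len(keys)
    print(f"observed canary share at 5%: {share:.3f}")
```

Because the bucket for a key never changes, raising the percentage only adds callers to the canary; nobody who was already on the new version is moved back during promotion.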

Feature Flags (Kill Switches)

Wrap behavior in remote-config flags. Turn off risky code paths instantly without redeploying. Keep flags short-lived: add, verify, remove.
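
A minimal kill-switch sketch, assuming flags arrive as a simple mapping that a remote-config client keeps up to date. The helper names `flag_enabled` and `with_kill_switch` and the flag name used in the demo are hypothetical.

```python
from typing import Any, Callable, Mapping


def flag_enabled(flags: Mapping[str, bool], name: str) -> bool:
    """Treat a missing flag as off, so the safe default is the old path."""
    return bool(flags.get(name, False))


def with_kill_switch(
    flags: Mapping[str, bool],
    name: str,
    new_path: Callable[..., Any],
    old_path: Callable[..., Any],
) -> Callable[..., Any]:
    """Wrap two code paths; the flag is re-read on every call."""

    def handler(*args: Any, **kwargs: Any) -> Any:
        if flag_enabled(flags, name):
            return new_path(*args, **kwargs)
        return old_path(*args, **kwargs)

    return handler


if __name__ == "__main__":
    # "feature.new_pricing" is a made-up flag name for illustration.
    flags = {"feature.new_pricing": True}
    price = with_kill_switch(
        flags, "feature.new_pricing", lambda cents: cents - 100, lambda cents: cents
    )
    print(price(1000))                     # new path
    flags["feature.new_pricing"] = False   # flip the kill switch
    print(price(1000))                     # old path, no redeploy
```

Reading the flag at call time rather than at startup is what makes the switch effective without a redeploy.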

Backward/Forward Compatible APIs

Never break existing clients. Add fields (don't remove them), accept both old and new request shapes, and version endpoints if necessary. Use contract tests to catch accidental breaks.
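
To make "accept old and new shapes" concrete, here is a small tolerant-reader sketch. The `amount` field, its two shapes, and the default currency are invented for illustration; they are not part of any API described in this lesson.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any

DEFAULT_CURRENCY = "USD"  # assumption: what old clients implicitly meant


@dataclass(frozen=True)
class Money:
    value_minor: int  # amount in minor units, e.g. cents
    currency: str


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def parse_amount(payload: Mapping) -> Money:
    """Accept the old shape (bare integer) and the new shape (object)."""
    raw = payload.get("amount")
    if _is_int(raw):
        # Old shape: {"amount": 1250}
        return Money(raw, DEFAULT_CURRENCY)
    if isinstance(raw, Mapping) and _is_int(raw.get("value")):
        # New shape: {"amount": {"value": 1250, "currency": "EUR"}}
        return Money(raw["value"], str(raw.get("currency", DEFAULT_CURRENCY)))
    raise ValueError("amount must be an integer or an object with an integer 'value'")
```

Unknown top-level fields are simply ignored, so newer clients can send extra data without breaking an older server. A contract test can then assert that both shapes keep parsing for as long as old clients exist.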

Expand-and-Contract DB Migrations

1) Expand schema (add columns/tables). 2) Dual read/write. 3) Backfill data. 4) Switch reads. 5) Contract (remove old). Rollback is cheap during the expand and dual-write stages and becomes hard once the contract step has removed the old schema.

Observability-driven Gates

Promotion/rollback is automatic when metrics cross thresholds: error rate, p99 latency, saturation (CPU/memory), and domain KPIs (e.g., checkout success).
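
A gate like this can be sketched as a pure function that compares canary metrics with a baseline and returns a decision plus the reasons. The threshold values below are placeholders, and all class and field names are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Metrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    cpu_utilization: float   # 0.0 to 1.0
    checkout_success: float  # domain KPI, 0.0 to 1.0


@dataclass(frozen=True)
class Thresholds:
    # Placeholder limits; tune them per service.
    max_error_rate: float = 0.01
    max_p99_ratio: float = 1.4
    max_cpu_utilization: float = 0.85
    max_kpi_drop: float = 0.02


def evaluate_gate(baseline: Metrics, canary: Metrics,
                  limits: Thresholds = Thresholds()):
    """Return (Decision, list of breached thresholds)."""
    breaches = []
    if canary.error_rate > limits.max_error_rate:
        breaches.append("error_rate")
    if canary.p99_latency_ms > baseline.p99_latency_ms * limits.max_p99_ratio:
        breaches.append("p99_latency")
    if canary.cpu_utilization > limits.max_cpu_utilization:
        breaches.append("saturation")
    if baseline.checkout_success - canary.checkout_success > limits.max_kpi_drop:
        breaches.append("checkout_success")
    decision = Decision.ROLLBACK if breaches else Decision.PROMOTE
    return decision, breaches
```

Returning the list of breaches alongside the decision gives the on-call engineer the reason for an automatic rollback without digging through dashboards first.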

Pre-deployment decision checklist

  • [ ] Is the change behind a feature flag with a kill switch?
  • [ ] Is the database migration backward-compatible?
  • [ ] Do you have a verified rollback artifact (previous image/tag) ready?
  • [ ] Canary plan with clear promotion/abort thresholds written down?
  • [ ] Runbook with Rollback and Rollforward steps and owners?
  • [ ] Monitoring dashboards and alerts linked to the change?
  • [ ] Data backfill strategy and validation queries prepared?
  • [ ] Communication plan (who to notify on rollback/rollforward)?

Worked examples

Example 1: Adding a new optional API field

Change: Add response field discountApplied to GET /v1/orders.

  • Plan: Backward compatible (clients ignore unknown fields). Ship behind flag feature.discount_field.
  • Rollout: Canary 5% → 25% → 100%.
  • Rollback: Turn flag off. If still failing, revert to previous image.
  • Rollforward: If a formatting bug is found, ship a small hotfix and continue the rollout.
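
A rough sketch of how the flagged field in this example might look in a serializer. Only the flag name `feature.discount_field` and the field name `discountApplied` come from the example; the boolean type, the internal `discount_applied` key, and the helper names are assumptions.

```python
def serialize_order(order: dict, flags: dict) -> dict:
    """Build the GET /v1/orders item body; the new field is additive."""
    body = {"id": order["id"], "total": order["total"]}
    if flags.get("feature.discount_field", False):
        # Assumption: the field is a boolean sourced from an internal key.
        body["discountApplied"] = bool(order.get("discount_applied", False))
    return body


def read_order_v1(body: dict) -> tuple:
    """An existing client that only reads the fields it already knows."""
    return body["id"], body["total"]


if __name__ == "__main__":
    order = {"id": "o-1", "total": 4200, "discount_applied": True}
    with_flag = serialize_order(order, {"feature.discount_field": True})
    without_flag = serialize_order(order, {})
    # The old client gets the same result with the flag on or off.
    print(read_order_v1(with_flag) == read_order_v1(without_flag))
```

Turning the flag off removes only the new key, which is why the flag is the first rollback lever and the image revert is the second.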

Example 2: Renaming a DB column with zero downtime

Change: Rename status → order_status.

  1. Expand: Add new column order_status nullable.
  2. Dual write: App writes to both columns.
  3. Backfill: Batch copy old → new.
  4. Switch reads: App prefers order_status with fallback to status.
  5. Contract: Remove old column later.

Rollback window: Steps 1-4 allow rollback by switching reads back. After step 5, rollback requires data restore.
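
Steps 1 to 4 can be rehearsed end to end with an in-memory SQLite database. This is a toy sketch, not a production migration: the helper names are invented, the batch size is tiny, and a real PostgreSQL rollout would need its own locking and batching considerations.

```python
import sqlite3


def setup_orders(rows: int = 10) -> sqlite3.Connection:
    """Create an in-memory orders table that only has the old column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (id, status) VALUES (?, ?)",
        [(i, "PENDING") for i in range(1, rows + 1)],
    )
    conn.commit()
    return conn


def expand(conn: sqlite3.Connection) -> None:
    # Step 1: additive, nullable column; old code keeps working untouched.
    conn.execute("ALTER TABLE orders ADD COLUMN order_status TEXT")
    conn.commit()


def set_status(conn: sqlite3.Connection, order_id: int, value: str) -> None:
    # Step 2: dual write keeps both columns in sync for new changes.
    conn.execute(
        "UPDATE orders SET status = ?, order_status = ? WHERE id = ?",
        (value, value, order_id),
    )
    conn.commit()


def backfill(conn: sqlite3.Connection, batch_size: int = 4) -> int:
    # Step 3: copy old to new in small batches; returns rows copied.
    copied = 0
    while True:
        cur = conn.execute(
            "UPDATE orders SET order_status = status WHERE id IN ("
            "SELECT id FROM orders "
            "WHERE order_status IS NULL AND status IS NOT NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return copied
        copied += cur.rowcount


def get_status(conn: sqlite3.Connection, order_id: int):
    # Step 4: prefer the new column, fall back to the old one.
    row = conn.execute(
        "SELECT COALESCE(order_status, status) FROM orders WHERE id = ?",
        (order_id,),
    ).fetchone()
    return row[0] if row else None
```

Because the old column is still written and still readable at every point here, switching reads back is a code or flag change only. That stops being true once the old column is dropped.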

Example 3: Hotfix vs rollback decision

Change: New release increases p99 latency by 40% in the canary slice.

  • Decision gates: If p99 > 1.4x baseline for 10 minutes, auto-rollback the canary (stop promotion) and alert on-call.
  • If the root cause is identified quickly and the patch is low-risk, ship a hotfix (rollforward). Otherwise, roll back fully and schedule a fix.
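
The "1.4x baseline for 10 minutes" gate implies a sustained breach rather than a single spike. A minimal sketch of that check, assuming p99 samples arrive as time-ordered `(timestamp_seconds, p99_ms)` pairs (the function name and input format are hypothetical):

```python
from typing import Iterable, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    baseline_p99_ms: float,
    ratio: float = 1.4,
    window_s: float = 600.0,
) -> bool:
    """True once p99 has stayed above ratio * baseline for window_s seconds.

    Any sample at or below the limit resets the timer, so short spikes
    do not trigger a rollback.
    """
    limit = baseline_p99_ms * ratio
    breach_start = None
    for ts, p99_ms in samples:
        if p99_ms > limit:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None
    return False
```

Requiring the breach to be continuous trades a slower reaction for fewer false rollbacks; a shorter window or a higher ratio shifts that balance.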

Step-by-step playbooks

Rollback playbook

  1. Freeze: Stop promotions; set traffic to last good version.
  2. Kill switches: Disable related feature flags immediately.
  3. Revert: Deploy previous image tag to affected services.
  4. DB considerations: If in expand/dual-write stage, switch reads/writes to old path. If after contract, consider restore-from-backup or targeted data repair.
  5. Verify: Check errors, latency, and key business metrics return to baseline.
  6. Communicate: Post-mortem notes, impact window, next steps.

Rollforward playbook (hotfix)

  1. Stabilize: Keep canary small. Apply minimal-risk patch.
  2. Test: Run smoke and contract tests.
  3. Deploy: Ship hotfix to canary → progressive rollout.
  4. Observe: Gate on metrics; abort if thresholds breach.
  5. Clean up: Remove temporary toggles and any debug code.

Observability: what to watch

  • Error budget: Is the change consuming budget too quickly?
  • Golden signals: error rate, p95/p99 latency, saturation, traffic.
  • Business KPIs: conversions, signups, order placements.
  • Data health: backfill lag, duplicate writes, constraint violations.

Self-check questions

  • Do you have clear abort thresholds?
  • Can you prove the previous version works now?
  • Is data consistent after rollback or hotfix?
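
The error-budget question above can be turned into a number with a burn-rate calculation: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a request-based SLO and a 30-day budget window; both function names are illustrative.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly over the SLO window;
    higher values spend it proportionally faster.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the current rate continues."""
    if rate <= 0:
        return float("inf")
    return window_days / rate
```

For instance, 50 failures in 10,000 requests against a 99.9% SLO is a burn rate of about 5, which would use up a 30-day budget in roughly 6 days if it persisted.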

Common mistakes and how to self-check

  • Mistake: Assuming rollback is always possible after data migrations. Self-check: Identify the last safe rollback point before contract steps.
  • Mistake: No kill switches. Self-check: Ensure a flag can instantly disable new logic.
  • Mistake: Contracting too early. Self-check: Wait until reads are exclusively from the new path and metrics are stable over time.
  • Mistake: Canary without gates. Self-check: Write numeric thresholds and timers in the runbook.
  • Mistake: No validation queries for backfills. Self-check: Prepare idempotent validation SQL and sampling checks.
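
A validation check for a backfill can be as small as a mismatch count plus a sample of offending rows. The sketch below reuses the `orders.status` and `orders.order_status` columns from Example 2 and runs against SQLite; the function name is hypothetical. SQLite's `IS NOT` is NULL-safe, and the PostgreSQL equivalent would be `IS DISTINCT FROM`.

```python
import sqlite3

# Read-only queries, so they are safe to rerun at any time.
MISMATCH_COUNT_SQL = (
    "SELECT COUNT(*) FROM orders WHERE order_status IS NOT status"
)
MISMATCH_SAMPLE_SQL = (
    "SELECT id, status, order_status FROM orders "
    "WHERE order_status IS NOT status ORDER BY id LIMIT ?"
)


def validate_backfill(conn: sqlite3.Connection, sample_size: int = 5):
    """Return (mismatch_count, sample_rows) for the two status columns."""
    mismatches = conn.execute(MISMATCH_COUNT_SQL).fetchone()[0]
    sample = conn.execute(MISMATCH_SAMPLE_SQL, (sample_size,)).fetchall()
    return mismatches, sample


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, order_status TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "PENDING", "PENDING"), (2, "SHIPPED", None), (3, "PAID", "PENDING")],
    )
    print(validate_backfill(conn))
```

A mismatch count of zero, held stable over time while dual writes are on, is one reasonable signal that it is safe to move toward the contract step.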

Exercises

These mirror the tasks below. Draft your answers, then compare with the solutions.

Exercise 1: Plan a safe rollout for a new required field

Change: You need to introduce a required request field customerType to POST /v1/orders. Some clients may not send it yet.

  • Design a plan to ship without breaking clients.
  • Provide both rollback and rollforward steps.
  • Include database and API compatibility tactics.
  • Write promotion/abort thresholds.

Exercise 2: Zero-downtime rename with rollback path

Change: PostgreSQL table orders, rename column status to order_status on a 200M-row table.

  • List exact expand/dual/backfill/switch/contract steps.
  • Show validation queries and batch strategy.
  • Document the last safe rollback point and how to execute it.

  • [ ] I defined flags and safe defaults.
  • [ ] I documented the last safe rollback point.
  • [ ] I wrote concrete metrics and time windows for gates.
  • [ ] I included data validation and backfill strategies.

Practical projects

  • Create a sample service with a feature-flagged endpoint. Implement a canary rollout script and a rollback runbook. Prove it by flipping the flag and observing metrics.
  • Build a migration tool that performs batched backfills with retries and progress logging. Add a dry-run mode and validation queries.
  • Write a "deployment decision" checklist template your team can reuse. Include thresholds, escalation contacts, and communication steps.

Learning path

  1. Week 1: Practice canary rollouts with flags on a non-critical endpoint.
  2. Week 2: Implement expand-and-contract on a staging DB with synthetic load.
  3. Week 3: Add automated gates in CI/CD (pre-deploy checks and post-deploy monitors).
  4. Week 4: Run a game day: simulate an incident, perform rollback, then rollforward.

Next steps

  • Turn these patterns into templates and automation (pipelines, scripts, dashboards).
  • Remove long-lived flags and dead code after stabilization.
  • Share a short post-mortem after each rollback/rollforward to spread learning.

Mini challenge

Your service needs to change an enum value used by clients (e.g., PENDING → QUEUED). In 5-7 bullet points, outline a plan that avoids breaking existing clients and includes the last safe rollback point.

Hint

Consider versioning, dual-acceptance in request parsing, mapping both values in reads, and a gradual deprecation window.

Practice Exercises

Instructions

Introduce a required request field customerType to POST /v1/orders without breaking existing clients.

  • Design an expand-and-contract plan using feature flags, validation defaults, and clear gates.
  • Write concrete steps for rollback and rollforward.
  • Define metrics thresholds and decision timers.

Expected Output

A written runbook with phases, flag names, rollout stages (5%/25%/50%/100%), abort thresholds, and explicit rollback and rollforward commands or steps.

Rollback And Rollforward Strategy — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.