Why this matters
Canary Releases
Release to a small percentage first (e.g., 1%-5%). Watch error rate, latency, saturation, and business KPIs. Promote or halt based on thresholds.
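A minimal sketch of a staged promotion loop, assuming hypothetical set_traffic_split() and check_gates() helpers wired to your router and metrics backend:

```python
import time

# Hypothetical stages and soak time; tune to your own risk tolerance.
CANARY_STAGES = [0.01, 0.05, 0.25, 1.00]   # fraction of traffic sent to the canary
SOAK_SECONDS = 600                          # observe each stage for 10 minutes

def run_canary(set_traffic_split, check_gates):
    """Promote stage by stage; halt and shed canary traffic if any gate fails."""
    for stage in CANARY_STAGES:
        set_traffic_split(canary=stage)
        time.sleep(SOAK_SECONDS)
        if not check_gates():                # error rate, latency, saturation, business KPIs
            set_traffic_split(canary=0.0)    # halt: send all traffic back to the stable version
            raise RuntimeError(f"Canary aborted at {stage:.0%}")
    print("Canary promoted to 100%")
```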
Feature Flags (Kill Switches)
Wrap behavior in remote-config flags. Turn off risky code paths instantly without redeploying. Keep flags short-lived: add, verify, remove.
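A minimal kill-switch sketch, assuming a hypothetical remote-config client exposed as flags with a get(name, default) method; the flag name and discount functions are illustrative:

```python
def legacy_discount(order: dict) -> dict:
    # Existing, known-good behavior.
    return order

def new_discount_engine(order: dict) -> dict:
    # New, riskier code path under test.
    return {**order, "discount_applied": True}

def apply_discount(order: dict, flags) -> dict:
    # The flag acts as a kill switch: flipping it off routes traffic back to the
    # legacy path instantly, without a redeploy.
    if flags.get("feature.new_discount_engine", default=False):
        return new_discount_engine(order)
    return legacy_discount(order)
```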
Backward/Forward Compatible APIs
Never break existing clients. Add fields (don't remove them), accept both old and new shapes, and version endpoints if necessary. Use contract tests.
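A small sketch of accepting both the old and the new request shape during a transition window; the field names and default are hypothetical:

```python
def parse_order(payload: dict) -> dict:
    # New clients send "customer_type"; older clients may still send "customerType"
    # or omit the field entirely, so fall back to a safe default.
    customer_type = (
        payload.get("customer_type")
        or payload.get("customerType")
        or "unknown"
    )
    return {"order_id": payload["order_id"], "customer_type": customer_type}
```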
Expand-and-Contract DB Migrations
1) Expand schema (add columns/tables). 2) Dual read/write. 3) Backfill data. 4) Switch reads. 5) Contract (remove old). Rollback is possible during the expand/dual-write stages; it becomes hard once data changes are finalized.
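A sketch of the dual-write (step 2) and read-with-fallback (step 4) stages, assuming a psycopg-style cursor and the status → order_status rename used in Example 2 below:

```python
def save_status(cur, order_id, value):
    # Step 2 (dual write): keep the old and new columns in sync until contract.
    cur.execute(
        "UPDATE orders SET status = %s, order_status = %s WHERE id = %s",
        (value, value, order_id),
    )

def read_status(cur, order_id):
    # Step 4 (switch reads): prefer the new column, fall back to the old one.
    cur.execute("SELECT order_status, status FROM orders WHERE id = %s", (order_id,))
    new_value, old_value = cur.fetchone()
    return new_value if new_value is not None else old_value
```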
Observability-driven Gates
Promotion/rollback is automatic when metrics cross thresholds: error rate, p99 latency, saturation (CPU/memory), and domain KPIs (e.g., checkout success).
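A minimal gate-evaluation sketch; the thresholds are illustrative, and the metrics dict is assumed to come from your monitoring backend:

```python
THRESHOLDS = {
    "error_rate": 0.01,           # abort above 1% errors
    "p99_latency_ratio": 1.4,     # abort above 1.4x the stable baseline
    "cpu_saturation": 0.85,       # abort above 85% CPU
    "checkout_success": 0.98,     # abort below 98% checkout success (domain KPI)
}

def gates_pass(metrics: dict) -> bool:
    """Return True only if every gate is within its threshold."""
    return (
        metrics["error_rate"] <= THRESHOLDS["error_rate"]
        and metrics["p99_latency_ratio"] <= THRESHOLDS["p99_latency_ratio"]
        and metrics["cpu_saturation"] <= THRESHOLDS["cpu_saturation"]
        and metrics["checkout_success"] >= THRESHOLDS["checkout_success"]
    )
```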
Pre-deployment decision checklist
- [ ] Is the change behind a feature flag with a kill switch?
- [ ] Is the database migration backward-compatible?
- [ ] Do you have a verified rollback artifact (previous image/tag) ready?
- [ ] Canary plan with clear promotion/abort thresholds written down?
- [ ] Runbook with Rollback and Rollforward steps and owners?
- [ ] Monitoring dashboards and alerts linked to the change?
- [ ] Data backfill strategy and validation queries prepared?
- [ ] Communication plan (who to notify on rollback/rollforward)?
Worked examples
Example 1: Adding a new optional API field
Change: Add response field discountApplied to GET /v1/orders.
- Plan: Backward compatible (clients ignore unknown fields). Ship behind flag feature.discount_field.
- Rollout: Canary 5% → 25% → 100%.
- Rollback: Turn flag off. If still failing, revert to previous image.
- Rollforward: If a bug is found in formatting, hotfix a quick patch and continue rollout.
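A sketch of Example 1's flag-gated response field, assuming a hypothetical flags client and an order represented as a dict:

```python
def serialize_order(order: dict, flags) -> dict:
    body = {"id": order["id"], "total": order["total"]}
    # Only emit the new optional field while the flag is on; clients that
    # ignore unknown fields are unaffected either way.
    if flags.get("feature.discount_field", default=False):
        body["discountApplied"] = order.get("discount_applied", False)
    return body
```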
Example 2: Renaming a DB column with zero downtime
Change: Rename status → order_status.
- Expand: Add new nullable column order_status.
- Dual write: App writes to both columns.
- Backfill: Batch copy old → new.
- Switch reads: App prefers order_status with fallback to status.
- Contract: Remove old column later.
Rollback window: Steps 1-4 allow rollback by switching reads back. After step 5, rollback requires data restore.
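A batched backfill sketch for step 3 of Example 2, plus a validation query to run before switching reads; the connection handling and batch size are assumptions:

```python
BATCH_SIZE = 10_000  # keep transactions short on a 200M-row table

def backfill(conn):
    with conn.cursor() as cur:
        while True:
            cur.execute(
                """
                UPDATE orders
                   SET order_status = status
                 WHERE id IN (
                       SELECT id FROM orders
                        WHERE order_status IS NULL AND status IS NOT NULL
                        LIMIT %s)
                """,
                (BATCH_SIZE,),
            )
            conn.commit()              # commit per batch; safe to stop and resume
            if cur.rowcount == 0:
                break                  # nothing left to copy

def mismatched_rows(conn) -> int:
    # Validation: should return 0 before switching reads to order_status.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT count(*) FROM orders "
            "WHERE order_status IS DISTINCT FROM status"
        )
        return cur.fetchone()[0]
```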
Example 3: Hotfix vs rollback decision
Change: New release increases p99 latency by 40% in the canary slice.
- Decision gates: If p99 > 1.4x baseline for 10 minutes, auto-rollback the canary (stop promotion) and alert on-call.
- If the root cause is identified quickly and is risk-free to patch, ship a hotfix (rollforward). Otherwise, roll back fully and schedule a fix.
Step-by-step playbooks
Rollback playbook
- Freeze: Stop promotions; route traffic to the last good version.
- Kill switches: Disable related feature flags immediately.
- Revert: Deploy previous image tag to affected services.
- DB considerations: If in expand/dual-write stage, switch reads/writes to old path. If after contract, consider restore-from-backup or targeted data repair.
- Verify: Check errors, latency, and key business metrics return to baseline.
- Communicate: Post-mortem notes, impact window, next steps.
Rollforward playbook (hotfix)
- Stabilize: Keep canary small. Apply minimal-risk patch.
- Test: Run smoke and contract tests.
- Deploy: Ship hotfix to canary → progressive rollout.
- Observe: Gate on metrics; abort if thresholds breach.
- Clean up: Remove temporary toggles and any debug code.
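For the Test step, a minimal smoke/contract test sketch (pytest + requests); the base URL, endpoint, and expected body are assumptions:

```python
import requests

CANARY_URL = "https://canary.example.internal"   # hypothetical canary base URL

def test_orders_endpoint_healthy():
    resp = requests.get(f"{CANARY_URL}/v1/orders", timeout=5)
    assert resp.status_code == 200

def test_old_response_contract_still_holds():
    resp = requests.get(f"{CANARY_URL}/v1/orders", timeout=5)
    body = resp.json()
    # Contract check: fields existing clients depend on are still present;
    # new fields may appear and are simply ignored here.
    assert "orders" in body
```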
Observability: what to watch
- Error budget: Is the change consuming budget too quickly?
- Golden signals: error rate, p95/p99 latency, saturation, traffic.
- Business KPIs: conversions, signups, order placements.
- Data health: backfill lag, duplicate writes, constraint violations.
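A quick arithmetic sketch of error-budget burn, assuming a 99.9% monthly availability SLO:

```python
SLO = 0.999
MONTH_MINUTES = 30 * 24 * 60
ERROR_BUDGET_MINUTES = MONTH_MINUTES * (1 - SLO)    # ~43.2 "bad" minutes per month

def burn_rate(bad_minutes_last_hour: float) -> float:
    # Ratio of budget consumed in the last hour to the steady rate that would
    # exactly exhaust the budget over the month; > 1.0 means the change is
    # consuming budget too quickly.
    allowed_per_hour = ERROR_BUDGET_MINUTES / (MONTH_MINUTES / 60)
    return bad_minutes_last_hour / allowed_per_hour
```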
Self-check questions
- Do you have clear abort thresholds?
- Can you prove the previous version works now?
- Is data consistent after rollback or hotfix?
Common mistakes and how to self-check
- Mistake: Assuming rollback is always possible after data migrations. Self-check: Identify the last safe rollback point before contract steps.
- Mistake: No kill switches. Self-check: Ensure a flag can instantly disable new logic.
- Mistake: Contracting too early. Self-check: Wait until reads are exclusively from the new path and metrics are stable over time.
- Mistake: Canary without gates. Self-check: Write numeric thresholds and timers in the runbook.
- Mistake: No validation queries for backfills. Self-check: Prepare idempotent validation SQL and sampling checks.
Exercises
These mirror the tasks below. Draft your answers, then compare with the solutions.
Exercise 1: Plan a safe rollout for a new required field
Change: You need to introduce a required request field customerType to POST /v1/orders. Some clients may not send it yet.
- Design a plan to ship without breaking clients.
- Provide both rollback and rollforward steps.
- Include database and API compatibility tactics.
- Write promotion/abort thresholds.
Exercise 2: Zero-downtime rename with rollback path
Change: PostgreSQL table orders, rename column status to order_status on a 200M-row table.
- List exact expand/dual/backfill/switch/contract steps.
- Show validation queries and batch strategy.
- Document the last safe rollback point and how to execute it.
- [ ] I defined flags and safe defaults.
- [ ] I documented the last safe rollback point.
- [ ] I wrote concrete metrics and time windows for gates.
- [ ] I included data validation and backfill strategies.
Practical projects
- Create a sample service with a feature-flagged endpoint. Implement a canary rollout script and a rollback runbook. Prove it by flipping the flag and observing metrics.
- Build a migration tool that performs batched backfills with retries and progress logging. Add a dry-run mode and validation queries.
- Write a "deployment decision" checklist template your team can reuse. Include thresholds, escalation contacts, and communication steps.
Learning path
- Week 1: Practice canary rollouts with flags on a non-critical endpoint.
- Week 2: Implement expand-and-contract on a staging DB with synthetic load.
- Week 3: Add automated gates in CI/CD (pre-deploy checks and post-deploy monitors).
- Week 4: Run a game day: simulate an incident, perform rollback, then rollforward.
Next steps
- Turn these patterns into templates and automation (pipelines, scripts, dashboards).
- Remove long-lived flags and dead code after stabilization.
- Share a short post-mortem after each rollback/rollforward to spread learning.
Mini challenge
Your service needs to change an enum value used by clients (e.g., PENDING → QUEUED). In 5-7 bullet points, outline a plan that avoids breaking existing clients and includes the last safe rollback point.
Hint
Consider versioning, dual-acceptance in request parsing, mapping both values in reads, and a gradual deprecation window.
About your progress
The quick test below is available to everyone. If you're logged in, your progress and results will be saved automatically.