Menu

Topic 4 of 8

Blue Green And Canary Releases

Learn Blue Green And Canary Releases for free with explanations, exercises, and a quick test (for API Engineer).

Published: January 21, 2026 | Updated: January 21, 2026

Why this matters

As an API Engineer, you ship changes that other teams and customers rely on. Blue-green and canary releases reduce downtime and risk by letting you test new versions safely in production, roll back fast, and verify real traffic before full rollout.

  • Release new API versions with near-zero downtime
  • Catch regressions with real production traffic on a small subset
  • Roll back instantly if health metrics degrade
  • Ship database changes without breaking consumers

Concept explained simply

Two core patterns:

  • Blue-Green: Run two identical environments. Blue = live. Green = new version. Switch traffic to Green when ready. If something breaks, switch back to Blue.
  • Canary: Gradually send a small percentage of traffic to the new version (the canary). Watch metrics. If good, increase; if bad, stop or roll back.
Mental model

Think of the load balancer as a volume knob. Blue-green flips the source (track A to track B) instantly. Canary slowly raises the volume of the new track while lowering the old, listening for static (errors) as you go.

Prerequisites

  • Stateless app instances or sticky sessions handled at the edge
  • Automated builds and deployments (CI/CD)
  • Health checks and readiness probes
  • Metrics and alerting (latency, error rates, saturation)
  • Feature flags for risky changes (optional but helpful)

When to use what

  • Use Blue-Green when you need instant rollback and your app is easy to duplicate (typical for stateless APIs).
  • Use Canary when risk is higher, user segments matter, or you need to measure impact gradually.
  • Combine both: deploy Green, then do a canary shift to Green before full switchover.

Step-by-step flows

Blue-Green via Load Balancer (step cards)
  1. Prepare Green: Deploy the new version in parallel to Blue. Keep it out of public traffic.
  2. Warm up: Run database migrations that are backward-compatible. Pre-warm caches if possible.
  3. Validate: Run smoke tests and health checks against Green.
  4. Shift: Update the load balancer to route 100% traffic from Blue to Green atomically.
  5. Observe: Watch error rate, latency, CPU/memory, and logs for a defined bake time (e.g., 15–30 minutes).
  6. Rollback (if needed): Switch LB back to Blue immediately.
  7. Decommission: After a safe window, shut down Blue or keep it as fallback briefly.
Canary via Service Mesh or LB Weights (step cards)
  1. Deploy new version alongside old version.
  2. Start at 1–5% traffic to the canary.
  3. Define promotion gates: SLO error rate, p95 latency, and request volume thresholds.
  4. Increase traffic in steps (e.g., 1% β†’ 5% β†’ 10% β†’ 25% β†’ 50% β†’ 100%), checking metrics at each step.
  5. Auto-rollback rule: if metrics exceed thresholds, drop canary to 0% and page the on-call.
  6. Complete rollout: retire the old version after steady-state success.
Database-safe releases (mini tasks)
  • Additive migrations first (new columns/tables) β†’ deploy app compatible with both old and new β†’ remove old fields later.
  • Avoid destructive changes during rollout. Use expand-and-contract.
  • Backfill data asynchronously before flipping read paths.

Worked examples

Example 1: Stateless REST API (Blue-Green)

Context: v1.7 to v1.8; adds a new endpoint; no schema drops.

  1. Deploy v1.8 to Green with readiness probes enabled.
  2. Run smoke tests: 200 on critical endpoints; check logs.
  3. Migrate DB: add nullable column; no drops.
  4. Flip LB to Green; monitor: error rate < 1%, p95 latency < 300ms for 20 min.
  5. Success β†’ keep Blue on standby for 1 hour, then retire.
Example 2: Feature with risky cache changes (Canary)

Context: new caching layer might cause stale reads.

  1. Deploy new version behind a feature flag.
  2. Start canary at 5% with flag ON for canary only.
  3. Watch cache hit ratio, origin latency, and 5xx. Thresholds: 5xx < 0.5%, p95 < 400ms.
  4. Increase to 25% β†’ 50% if stable; otherwise auto-rollback and disable flag.
Example 3: WebSockets and long-lived connections

Context: Blue-Green with sticky sessions.

  1. Drain Blue by stopping new connections; keep existing until TTL (e.g., 5 minutes).
  2. Shift new connections to Green; verify handshake success and message delivery.
  3. After drain window, terminate remaining Blue connections and finalize.

Monitoring and rollback

  • Key metrics: error rate, p95/p99 latency, saturation (CPU/mem), request volume, dependency health.
  • Release gates: declare thresholds before deploying; automate checks.
  • Rollback plan: one-click or one-command revert; keep last stable artifacts ready.
Example thresholds
  • Error rate: < 1% overall, < 2% per endpoint
  • p95 latency: within 20% of baseline
  • Dependency errors: no spike > 1.5x baseline

Traffic routing tips

  • Use LB weights or service mesh for precise percentages.
  • DNS-based cutovers need low TTL to avoid long propagation.
  • Handle sessions: prefer stateless JWT or stickiness during transitions.
  • Shadow traffic (optional): mirror requests to new version without affecting users to observe behavior.

Edge cases to plan for

  • Background jobs: version them; avoid double-processing on rollback.
  • Caches: coordinate cache keys between versions to avoid pollution.
  • Idempotency: ensure retries won’t create duplicate side effects.
  • Rate limits: treat canary traffic as normal users to get realistic metrics.

Who this is for

  • API Engineers and Backend Developers shipping services regularly
  • Platform/DevOps engineers building safe deployment pipelines
  • Team leads defining release policies and SLOs

Common mistakes and self-check

  • Skipping health checks β†’ Self-check: Are readiness and liveness probes required and tested?
  • Destructive DB changes during rollout β†’ Self-check: Are you using expand-and-contract?
  • Unclear rollback triggers β†’ Self-check: Are thresholds written and automated?
  • No bake time β†’ Self-check: Is there a timed observation window before decommissioning?
  • Forgetting background workers β†’ Self-check: Do workers align with app versioning?

Exercises (build confidence)

These exercises are also listed below in the Exercises section with space for your answers. Progress saving is available for logged-in users; the exercises and test are available to everyone.

  1. Exercise 1: Plan a blue-green deployment for a stateless API with instant rollback.
  2. Exercise 2: Design a canary policy with traffic steps and promotion/rollback gates.
  • Checklist: have you defined metrics, thresholds, bake time, and rollback commands?

Practical projects

  • Create a demo service with two versions and implement blue-green cutover using an NGINX or cloud LB config.
  • Implement a canary rollout in Kubernetes using a service mesh or weighted services.
  • Perform a safe DB migration with expand-and-contract and verify both versions run concurrently.

Learning path

  • Start with health checks and instrumentation (metrics/logging).
  • Automate CI/CD build and deploy steps.
  • Add blue-green release to one service; document rollback procedure.
  • Introduce canary for riskier changes; codify thresholds and gates.
  • Scale to multiple services with shared policies and templates.

Next steps

  • Do the exercises below and compare with the sample solutions.
  • Take the quick test to validate understanding.
  • Apply one strategy to your next deployment and review results in a retro.

Mini challenge

You must deploy a version that changes a response field name. Propose a plan that avoids breaking clients and supports fast rollback. Keep the old field available while introducing the new one, and show how you will remove it later.

Practice Exercises

2 exercises to complete

Instructions

You maintain an orders API. You need near-zero downtime and instant rollback. Draft a blue-green plan.

  • List components you will duplicate for Green (app instances, config, secrets, environment variables).
  • Define health checks and smoke tests before cutover.
  • Specify the exact cutover action (e.g., LB target group swap) and a 20-minute bake time.
  • Define metrics and thresholds (error rate, p95) and the rollback trigger.
  • Include a step to keep Blue available for 1 hour post-cutover before decommissioning.
Expected Output
A concise deployment runbook (5–10 steps) with thresholds and a one-command rollback.

Blue Green And Canary Releases β€” Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

8 questions70% to pass

Have questions about Blue Green And Canary Releases?

AI Assistant

Ask questions about this tool