Who this is for
- Platform Engineers setting up resilient, low-latency services.
- Backend Engineers who deploy globally and need disaster recovery.
- SREs planning failover, capacity, and incident response across regions.
Prerequisites
- Basic networking: DNS, load balancers, VPCs, subnets.
- Fundamentals of high availability (HA) and redundancy.
- Comfort with containerized apps or VM deployments.
Learning path
- Before this: single-region HA, DNS basics, health checks.
- In this lesson: regions vs AZs, active-active vs active-passive, RPO/RTO, data and traffic strategies.
- After this: global load balancers, cross-region database replication patterns, IaC for multi-region topologies, DR testing.
Why this matters
- Design disaster recovery so an entire region can go down with minimal impact.
- Reduce user latency by serving from closer regions.
- Meet compliance and data residency rules.
- Control costs with the right replication and routing choices.
Concept explained simply
A region is a geographic cluster of datacenters. An Availability Zone (AZ) is one or more isolated datacenters within a region. Multi-AZ protects you from the failure of a single datacenter. Multi-region protects you from a regional outage.
Two common patterns:
- Active-active: multiple regions serve traffic at the same time. Lower latency; higher complexity for data consistency.
- Active-passive: one region serves traffic; another is on standby. Simpler; recovery takes longer because the standby must be promoted and traffic shifted.
RPO (Recovery Point Objective) is the maximum acceptable data loss window. RTO (Recovery Time Objective) is how quickly you must restore service. Choose replication and failover that meet both.
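To make "meet both" concrete, here is a minimal sketch that compares a design's estimated data loss and recovery time against the objectives. The class names, field names, and timings are hypothetical and only illustrate the arithmetic; they are not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    rpo_seconds: float  # maximum acceptable data-loss window
    rto_seconds: float  # maximum acceptable time to restore service


@dataclass(frozen=True)
class DesignEstimate:
    worst_case_replication_lag_seconds: float
    detection_seconds: float      # time for health checks to declare the region down
    promotion_seconds: float      # time to promote the replica to primary
    traffic_shift_seconds: float  # time for DNS/GLB changes to take effect

    @property
    def expected_rpo(self) -> float:
        # With async replication, unreplicated writes are lost on failover.
        return self.worst_case_replication_lag_seconds

    @property
    def expected_rto(self) -> float:
        return self.detection_seconds + self.promotion_seconds + self.traffic_shift_seconds


def meets_objectives(design: DesignEstimate, objectives: Objectives) -> bool:
    return (design.expected_rpo <= objectives.rpo_seconds
            and design.expected_rto <= objectives.rto_seconds)


# Assumed numbers: RPO 5 min, RTO 30 min; 30 s lag, 14 min total recovery.
targets = Objectives(rpo_seconds=300, rto_seconds=1800)
design = DesignEstimate(30, 120, 600, 120)
print(meets_objectives(design, targets))  # True
```

If either comparison fails, the usual levers are tighter replication (for RPO) or more automation in detection, promotion, and routing (for RTO).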
Mental model
Imagine two warehouses (regions) with the same products (services and data). Active-active means both ship orders; inventories must stay in sync. Active-passive means one ships while the other stays ready to take over.
- Routing = which warehouse ships your package (DNS/global LB).
- Data sync = keeping inventories aligned (replication strategy).
- Runbooks = the playbook people follow when one warehouse is unavailable.
Core building blocks
Regions, AZs, and blast radius
- Multi-AZ: redundancy inside one region.
- Multi-region: isolates regional failures (power, fiber cuts, provider issues).
- Design to limit blast radius: fail fast locally, fall back globally.
Traffic routing
- Latency-based DNS or global load balancer to steer users to the nearest healthy region.
- Health checks and automatic failover to remove an unhealthy region.
- Keep DNS TTL low enough (for example, 30–60s) that clients pick up a failover quickly, but not so low that resolvers flood your DNS with repeated lookups.
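The routing decision above can be sketched as a small function: drop unhealthy regions, then pick the lowest-latency one that remains. This is only an illustration of the logic a latency-based DNS or global load balancer applies. The `RegionStatus` type and the latency figures are assumptions, not any provider's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegionStatus:
    name: str
    latency_ms: float  # latency measured from the client's vantage point
    healthy: bool      # result of the most recent health check


def pick_region(regions: list[RegionStatus]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if none is healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.latency_ms).name


regions = [
    RegionStatus("us-east", latency_ms=40.0, healthy=True),
    RegionStatus("eu-west", latency_ms=110.0, healthy=True),
]
print(pick_region(regions))  # us-east

# Simulate a failed health check in the nearest region.
regions[0] = RegionStatus("us-east", latency_ms=40.0, healthy=False)
print(pick_region(regions))  # eu-west
```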
Data and consistency
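- Leader-follower (single-writer) with async replication: common and simple, with a small potential data loss (RPO ≈ replication lag at the moment of failure).
- Multi-leader (multi-writer): each region accepts writes locally, but concurrent writes need conflict resolution; making cross-region writes synchronous to avoid conflicts adds write latency.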
- Eventual vs strong consistency trade-offs depend on business needs.
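To show what conflict resolution means in a multi-leader setup, here is a minimal sketch of last-write-wins (LWW), one of the simplest strategies. It is offered as an illustration, not a recommendation: LWW silently discards the losing write and assumes clocks across regions are reasonably synchronized. The `Versioned` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # writer's wall-clock time; assumes synced clocks
    region: str        # tie-breaker so every region picks the same winner


def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the version with the larger (timestamp, region)."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two regions update the same key concurrently.
from_us = Versioned("shipping=express", timestamp_ms=100, region="us-east")
from_eu = Versioned("shipping=standard", timestamp_ms=105, region="eu-west")

# Both regions converge on the same winner regardless of argument order.
print(merge_lww(from_us, from_eu).value)  # shipping=standard
print(merge_lww(from_eu, from_us).value)  # shipping=standard
```

If losing a write is unacceptable for the business, that points toward a single-writer topology or a merge rule tailored to the data.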
State and sessions
- Prefer stateless services; store session state in cookies/JWT or a globally replicated store.
- Use sticky sessions only if you must; they reduce resilience.
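A stateless session can be sketched as a signed token that any region verifies with a shared secret, so no server-side session lookup or stickiness is needed. The example below is a simplified JWT-like token built only from the standard library to show the idea. It is not a real JWT; the function names are hypothetical, and a production system would more likely use a vetted JWT library and managed secret distribution.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def issue_token(user_id: str, secret: bytes, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    now = time.time() if now is None else now
    payload = json.dumps({"sub": user_id, "exp": int(now) + ttl_seconds}).encode()
    signature = hmac.new(secret, payload, hashlib.sha256).digest()
    return f"{_b64(payload)}.{_b64(signature)}"


def verify_token(token: str, secret: bytes,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the user id if the signature is valid and the token is unexpired."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:  # wrong number of parts or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None
    claims = json.loads(payload)
    return claims["sub"] if claims["exp"] > now else None


# Issued in one region, verified in another that holds the same secret.
secret = b"shared-across-regions"  # placeholder; load from a secret manager
token = issue_token("alice", secret, ttl_seconds=3600)
print(verify_token(token, secret))  # alice
```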
Networking between regions
- Private interconnects, VPN, or provider transit backbones to move data securely.
- Consider bandwidth, egress costs, and encryption in transit.
Deployment, testing, and runbooks
- Automate regional rollouts (blue/green or canary).
- Rehearse failovers (game days). Document clear, ordered steps.
Observability and time
- Centralized metrics and logs with region tags.
- Consistent time sync and dashboards to compare regions side-by-side.
Cost and compliance
- Cross-region data transfer is billed as egress by most providers.
- Duplicate infrastructure doubles some costs; right-size each region.
- Honor data residency: keep certain data in-region and replicate only permitted aggregates.
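A back-of-the-envelope estimate helps when weighing replication choices against transfer charges. The sketch below multiplies sustained replication throughput by a per-GB price. The price used is a placeholder assumption, not a quote from any provider, so substitute the rate from your own bill.

```python
def monthly_replication_egress_cost(
    write_bytes_per_second: float,
    price_per_gb: float,
    seconds_per_month: int = 30 * 24 * 3600,
) -> float:
    """Estimate the monthly cross-region transfer cost for a replication stream."""
    gigabytes = write_bytes_per_second * seconds_per_month / 1e9
    return gigabytes * price_per_gb


# Assumed inputs: 1 MB/s of sustained replicated writes at a placeholder $0.02/GB.
cost = monthly_replication_egress_cost(1_000_000, price_per_gb=0.02)
print(f"${cost:.2f} per month")  # $51.84 per month
```

The same calculation applies to each additional replica region, which is one reason to replicate only the data that actually needs to be there.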
Worked examples
Example 1 — Active-active web tier with single-writer database
- Goal: global latency under 200ms for most users; RPO ≤ 5 min; RTO ≤ 30 min.
- Pattern: active-active stateless frontends in two regions; single write-primary database in Region A; async read replica in Region B.
- Routing: latency-based global DNS/GLB with health checks.
- Sessions: stateless JWT; no stickiness required.
- Failover: if Region A fails, promote Region B replica, flip write endpoint, reroute traffic. Expected RPO ~ replication lag (e.g., seconds). RTO driven by promotion + routing time.
Example 2 — Active-passive API with warm standby
- Goal: minimize cost while meeting RPO 15 min, RTO 60 min.
- Pattern: Region A serves; Region B keeps IaC-defined but scaled-down instances and a warm DB replica.
- Routing: primary DNS record for Region A; failover record for Region B.
- Failover steps: scale up Region B, promote DB, switch DNS, verify, then gradually scale traffic.
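The failover steps above can be sketched as an ordered runbook that stops at the first failed step and records how long each step took. The step actions here are placeholder lambdas; in practice each would call your own scaling, database, and DNS tooling, which this example does not attempt to model.

```python
import time
from typing import Callable

Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[tuple[str, bool, float]]:
    """Run steps in order; stop at the first failure. Returns (name, ok, seconds)."""
    results = []
    for name, action in steps:
        start = time.monotonic()
        ok = action()
        results.append((name, ok, time.monotonic() - start))
        if not ok:
            break  # do not shift traffic onto a region that failed an earlier step
    return results


# Placeholder actions that always succeed; replace with real automation.
runbook: list[Step] = [
    ("scale up Region B", lambda: True),
    ("promote DB", lambda: True),
    ("switch DNS", lambda: True),
    ("verify", lambda: True),
    ("gradually scale traffic", lambda: True),
]

for name, ok, seconds in run_runbook(runbook):
    print(f"{name}: {'ok' if ok else 'FAILED'} ({seconds:.3f}s)")
```

Recording per-step timings makes it easy to see which step dominates the RTO.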
Example 3 — Read-local, write-remote pattern
- Goal: fast reads everywhere, writes go to a single consistent source.
- Pattern: cache hot data in each region with short TTL; write-through to primary region; background invalidations on update events.
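- Trade-off: writes from non-primary regions pay a cross-region round trip in exchange for a single consistent source; reads are very fast but can be stale until the cache TTL expires or an invalidation arrives.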
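The read-local, write-remote pattern can be sketched with a per-region cache in front of a primary store. In this simplified example the primary is a plain dict standing in for the remote database, the clock is injectable so the staleness window can be demonstrated, and all names are hypothetical.

```python
import time
from typing import Callable, Optional


class RegionalCache:
    """Per-region read cache in front of a primary store (a dict here)."""

    def __init__(self, primary: dict, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._primary = primary
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict = {}  # key -> (value, expires_at)

    def read(self, key: str) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]                     # fast local hit
        value = self._primary.get(key)          # miss: cross-region read
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def write(self, key: str, value: object) -> None:
        self._primary[key] = value              # write-through to the primary
        self._entries[key] = (value, self._clock() + self._ttl)

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)            # triggered by an update event


primary: dict = {}
now = [0.0]
us = RegionalCache(primary, ttl_seconds=5, clock=lambda: now[0])
eu = RegionalCache(primary, ttl_seconds=5, clock=lambda: now[0])

us.write("price", 10)
print(eu.read("price"))  # 10 (fetched from primary, then cached in eu)
us.write("price", 12)
print(eu.read("price"))  # 10 (stale until TTL expiry or invalidation)
eu.invalidate("price")
print(eu.read("price"))  # 12
```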
Practical setup steps
- Choose two regions close to major user clusters; document compliance constraints.
- Make services stateless; externalize sessions to JWT or a shared store.
- Select database topology: start with single-writer + async cross-region replica.
- Configure global DNS/GLB with health checks and a 30–60s TTL.
- Secure inter-region links and monitor replication lag.
- Write a clear failover runbook; include promotion, traffic shift, and verification steps.
- Run a game day: simulate a region outage and measure RPO/RTO.
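For the game day in the last step, measured RPO and RTO can be derived from three timestamps captured during the drill. The sketch below assumes you can record when the outage began, the commit time of the newest write that reached the replica, and when the secondary region first passed an end-to-end check. The field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameDayTimeline:
    outage_start: float           # when the region was taken down (epoch seconds)
    last_replicated_write: float  # commit time of the newest write on the replica
    service_restored: float       # first successful end-to-end check after failover

    @property
    def measured_rpo(self) -> float:
        # Writes committed after this point and before the outage were lost.
        return max(0.0, self.outage_start - self.last_replicated_write)

    @property
    def measured_rto(self) -> float:
        return self.service_restored - self.outage_start


# Made-up drill timings for illustration.
drill = GameDayTimeline(outage_start=1000.0,
                        last_replicated_write=988.0,
                        service_restored=2200.0)
print(drill.measured_rpo)  # 12.0
print(drill.measured_rto)  # 1200.0
```

Comparing these measurements with the targets after each drill shows whether the design or the runbook needs to change.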
Exercises
Exercise 1: Design a two-region architecture for a global web app
Constraints: two regions (for example, us-east, eu-west), global audience, target user latency under 200ms for most users, RPO ≤ 5 minutes, RTO ≤ 30 minutes, managed database with cross-region read replicas. Non-functional: sessions must work across regions; minimize data loss; use health-checked failover.
Tasks:
- Choose active-active or active-passive.
- Define routing method, session approach, database topology, cache design, and failover runbook.
- Pick DNS TTL and alerting signals for failover.
Exercise 2: RPO/RTO and failover runbook ordering
Part A: You have async cross-region replication with an average lag of 15 seconds and continuous streaming. Primary region goes down hard. What is your worst-case RPO?
Part B: Order these steps for failover to the secondary region:
- Promote replica to primary (enable writes).
- Set global read-only or stop writes to the failed region.
- Shift traffic to secondary via GLB/DNS.
- Warm critical caches (or accept cold-start).
- Resume normal read-write traffic.
Self-check items:
- Did you specify RPO and RTO and map them to replication and routing?
- Did you keep services stateless or define a clear session store?
- Did you define health checks and realistic DNS TTL?
- Did your failover order prevent split-brain writes?
Common mistakes and how to self-check
- Mistake: Confusing multi-AZ with multi-region. Self-check: Can a regional fiber cut still impact you? If yes, add a second region.
- Mistake: Ignoring session state. Self-check: Kill a region in staging; do users stay logged in?
- Mistake: DNS TTL too high. Self-check: How long until 80% of users move to the new region?
- Mistake: Unbounded replication lag. Self-check: Alert if lag exceeds your RPO threshold.
- Mistake: No runbook rehearsal. Self-check: Schedule and complete a quarterly game day.
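The replication-lag self-check above can be sketched as a small threshold function. The early-warning fraction is an assumption added here so that on-call engineers hear about rising lag before the RPO is actually breached; tune it to your own alerting policy.

```python
def lag_alert_level(lag_seconds: float, rpo_seconds: float,
                    warn_fraction: float = 0.5) -> str:
    """Classify replication lag relative to the RPO target."""
    if lag_seconds >= rpo_seconds:
        return "critical"  # a failover now would violate the RPO
    if lag_seconds >= warn_fraction * rpo_seconds:
        return "warning"   # lag is consuming a large share of the RPO budget
    return "ok"


# With an RPO of 5 minutes (300 s):
print(lag_alert_level(15, 300))   # ok
print(lag_alert_level(150, 300))  # warning
print(lag_alert_level(300, 300))  # critical
```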
Practical projects
- Two-region demo app: static frontends + API in two regions, latency-based routing, single-writer DB with replica. Acceptance: manual failover completes in under 20 minutes with ≤ 30 seconds data loss.
- DR drill in staging: break primary region networking, promote secondary, and restore. Acceptance: documented timings for each step and updated runbook.
- Data residency prototype: keep PII in-region and replicate only aggregates. Acceptance: automated checks ensure PII tables never cross borders.
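As a possible starting point for the automated residency check in the last project, the sketch below compares a replication plan against per-table rules and reports any PII table that would be copied outside its allowed regions. The rule format, table names, and region names are all hypothetical. A real check would read them from your schema metadata and replication configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRule:
    contains_pii: bool
    allowed_regions: frozenset  # regions where this table may be stored


def residency_violations(rules: dict, replication: dict) -> list:
    """Return (table, region) pairs that break the residency rules."""
    violations = []
    for table, targets in replication.items():
        rule = rules.get(table)
        if rule is None:
            # Fail closed: an unclassified table must not be replicated.
            violations.extend((table, region) for region in sorted(targets))
        elif rule.contains_pii:
            outside = set(targets) - rule.allowed_regions
            violations.extend((table, region) for region in sorted(outside))
    return violations


rules = {
    "users_pii": TableRule(True, frozenset({"eu-west"})),
    "daily_aggregates": TableRule(False, frozenset({"eu-west", "us-east"})),
}
replication_plan = {
    "users_pii": {"us-east"},         # would move PII out of its allowed region
    "daily_aggregates": {"us-east"},  # aggregates are permitted to replicate
}
print(residency_violations(rules, replication_plan))  # [('users_pii', 'us-east')]
```

Running a check like this in CI against the replication configuration turns the acceptance criterion into a failing build rather than a manual review.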
Next steps
- Learn global load balancing strategies and health check design.
- Deep dive into cross-region database replication and conflict resolution.
- Automate multi-region deployments with infrastructure as code and pipelines.
Mini challenge
Write a one-page architecture brief for a multi-region service with RPO 1 minute and RTO 10 minutes. Specify routing, database topology, failover triggers, TTLs, and a verification checklist. Keep it implementable by a small team.