Who this is for
- Backend engineers preparing for system design interviews.
- Developers running services that need uptime, scalability, and predictable latency.
- Anyone moving from single-instance apps to multi-instance, highly available systems.
Prerequisites
- Basic HTTP/HTTPS knowledge (methods, headers, status codes).
- Familiarity with stateless vs stateful services.
- Comfort with reading simple architecture diagrams and logs.
Why this matters
In real backend work you will:
- Split traffic across multiple instances to remove single points of failure.
- Keep services responsive during spikes and deploys.
- Roll out new versions safely (canary/blue-green) without customer impact.
- Route traffic by path/host to the right service (APIs, WebSockets, static files).
Real tasks you might do
- Add a new API pod and verify the load balancer starts sending traffic only after it passes health checks.
- Investigate uneven traffic: adjust from round-robin to least-connections.
- Introduce cookie-based session affinity for a short-lived experiment, then remove it after moving sessions to a shared store.
- Set up connection draining so in-flight requests finish during rolling deployments.
Concept explained simply
A load balancer (LB) sits in front of multiple server instances and decides which instance handles each incoming request or connection. The goal: share work fairly, keep latency low, and survive instance failures.
- Layer 4 (L4): routes by IP/port (fast, protocol-agnostic: TCP/UDP).
- Layer 7 (L7): understands HTTP(S)/gRPC; can route by path, host, headers, or method.
Core jobs of an LB:
- Distribute: choose a target using an algorithm.
- Health check: only send to healthy instances.
- Retry or fail fast: if a target fails, pick another where safe.
- Observe: expose metrics (RPS, errors, latency, active connections).
Mental model
Think of the LB as an air-traffic controller. Planes (requests) arrive constantly. The controller checks which runways (instances) are open and least busy, then assigns planes. If a runway closes, planes are diverted. The controller also decides special rules (e.g., heavy cargo goes to the long runway).
Algorithms and when to use
Round Robin
Cycles through instances in order. Simple and good when instances are identical and requests are similar.
Weighted Round Robin
Like round robin but gives more traffic to stronger instances (e.g., more CPU). Good when capacity differs.
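Both variants can be sketched as one generator: plain round robin is just the weighted form with every weight set to 1. The instance names and the 3:3:1 weights below are invented for illustration.

```python
import itertools

# Hypothetical fleet: weights model relative capacity (two large, one medium).
WEIGHTS = {"large-1": 3, "large-2": 3, "medium-1": 1}

def weighted_round_robin(weights):
    """Cycle through instances, repeating each in proportion to its weight.

    Plain round robin is the special case where every weight is 1.
    """
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

picker = weighted_round_robin(WEIGHTS)
one_cycle = [next(picker) for _ in range(sum(WEIGHTS.values()))]
```

Over each full cycle the large instances receive three requests for every one the medium instance gets. Production balancers use smoother interleavings (e.g. nginx's "smooth weighted round robin") so a heavy instance's share is not bunched together.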
Least Connections
Sends new requests to the instance with the fewest active connections. Better when request durations vary.
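The selection rule is a one-liner, assuming the balancer tracks in-flight requests per instance; the counts below are an invented snapshot.

```python
# Invented snapshot of in-flight request counts per instance.
active = {"api-1": 4, "api-2": 1, "api-3": 2}

def pick_least_connections(active_conns):
    """Return the instance with the fewest in-flight requests."""
    return min(active_conns, key=active_conns.get)

target = pick_least_connections(active)
active[target] += 1   # request dispatched; decrement again when it completes
```

The bookkeeping (increment on dispatch, decrement on completion) is what makes this algorithm adapt to slow requests that round robin would ignore.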
Least Response Time
Chooses the instance with the lowest observed latency and fewest connections. Good for latency-sensitive APIs.
Random (Power of Two Choices)
Pick two random instances and choose the less loaded of the two. Near-optimal with low overhead.
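The rule also fits in a few lines. This sketch assumes per-instance connection counts are available, as in the least-connections example; the counts are invented, with "api-1" clearly overloaded.

```python
import random

def power_of_two_choices(active_conns, rng=random):
    """Sample two distinct instances at random; keep the less loaded one."""
    a, b = rng.sample(list(active_conns), 2)
    return a if active_conns[a] <= active_conns[b] else b

# Invented counts; "api-1" is strictly the most loaded instance.
active = {"api-1": 40, "api-2": 3, "api-3": 7, "api-4": 5}
picks = [power_of_two_choices(active) for _ in range(1000)]
```

Because the overloaded instance loses every pairwise comparison, it is effectively never chosen, yet the balancer only ever inspects two counters per request instead of scanning the whole fleet.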
IP Hash / Source Affinity
Clients with the same source IP land on the same instance. Useful for simple stickiness, but many clients behind one NAT share an IP (creating hot spots), and mobile clients change IPs mid-session, breaking affinity.
Cookie-based Session Affinity
LB sets a cookie so a user keeps hitting the same instance. Quick fix for stateful apps, but can cause hot spots and complicate autoscaling.
Consistent Hashing
Maps keys (like userID) to instances with minimal remapping when instances change. Great for caches and sharded state.
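A minimal hash ring sketch. Virtual nodes are included because without them the key distribution across a handful of instances is very uneven; the node names and `vnodes` count are illustrative.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Stable hash for ring placement (MD5 is fine for placement, not security)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=50):
        # Each node appears at `vnodes` points on the ring for an even spread.
        points = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [p for p, _ in points]
        self._nodes = [n for _, n in points]

    def lookup(self, key: str) -> str:
        # The first ring point clockwise of the key's hash owns the key.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._nodes)
        return self._nodes[i]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
owner = ring.lookup("user:42")   # deterministic for a fixed node set
```

Adding or removing a node only reassigns the keys adjacent to that node's ring points; everything else keeps its owner, which is exactly the property caches need.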
Essential features
- Health checks: active (periodic probes) and passive (watching failures). Configure timeouts and thresholds to avoid flapping.
- Connection draining: stop sending new requests to an instance during deploy; let in-flight finish.
- Slow start: ramp up traffic to new instances gradually.
- Retries with budgets: only retry idempotent requests; cap total retries to avoid storms.
- TLS termination: LB handles HTTPS; backends speak HTTP or mTLS as needed.
- Path/host routing: send /api to API service, /assets to static servers, different hosts to different apps.
- Rate limiting and WAF rules: protect downstreams.
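The "retries with budgets" rule above can be sketched as a wrapper. The idempotent-method set follows standard HTTP semantics; the `send` callable and the cap of two retries are illustrative assumptions, not a recommendation.

```python
# Methods that are safe to retry per standard HTTP semantics.
IDEMPOTENT = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

def call_with_retries(send, method, max_retries=2):
    """Retry only idempotent requests, and cap total attempts.

    `send` is a hypothetical callable that performs one attempt and
    raises ConnectionError on failure.
    """
    attempts = 1 + (max_retries if method in IDEMPOTENT else 0)
    last_err = None
    for _ in range(attempts):
        try:
            return send()
        except ConnectionError as err:
            last_err = err
    raise last_err
```

Real balancers enforce the budget fleet-wide (e.g. retries as a percentage of total requests), because even capped per-request retries can multiply load during an outage.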
L4 vs L7 quick guide
- Choose L4 when you need protocol-agnostic speed (TCP/UDP, TLS or gRPC passthrough, raw WebSockets) and minimal logic.
- Choose L7 when you need smart routing, header inspection, HTTP/2 features, or per-route policies.
Worked examples
Example 1: Spiky API traffic
Problem: Traffic spikes; some requests slow down. Instances are identical, request durations vary a lot.
Approach: Switch algorithm from round robin to least connections or power-of-two-choices. Enable slow start for new instances. Add passive health checks and set connection draining during deploys.
Outcome: Better distribution under uneven workloads; fewer tail latencies.
Example 2: Stateful sessions
Problem: A legacy app stores sessions in memory. Users get logged out on failover.
Approach: Temporarily enable cookie-based session affinity on the LB. As a long-term fix, move sessions to a shared store (e.g., Redis) and remove affinity to improve balancing and resilience.
Outcome: Immediate stability with a path to stateless scaling.
Example 3: Cache sharding
Problem: A distributed cache must keep keys on consistent nodes during scale-out.
Approach: Use consistent hashing at the LB or client library so when nodes are added/removed only a small fraction of keys move.
Outcome: Higher cache hit rate and smoother scaling.
Step-by-step: design a load-balanced API
- Define constraints: target RPS, p95 latency, failure domains (zones/regions), and peak vs average load.
- Choose LB layer: L7 for smart routing and observability; L4 for raw speed or non-HTTP protocols.
- Pick algorithm: start with round robin; prefer least connections or power-of-two for uneven workloads.
- Health checks: set endpoint, timeout, interval, healthy/unhealthy thresholds.
- Deploy safety: enable connection draining and slow start; cap retries per request.
- State and stickiness: avoid stickiness by externalizing state; if needed, use short-lived cookie affinity.
- Autoscaling signals: scale on CPU + request latency + queue depth, not just one metric.
- Observe: track RPS, active conns, p95/p99 latency, 5xx rates per instance; alert on anomalies.
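The healthy/unhealthy thresholds from the health-check step can be modeled as a tiny state machine; the default thresholds here are illustrative, not a recommendation.

```python
class HealthTracker:
    """Flip health state only after N consecutive contrary probes,
    so a single slow probe does not cause flapping."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy = True
        self._streak = 0            # consecutive probes contradicting current state
        self._up = healthy_threshold
        self._down = unhealthy_threshold

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0        # state confirmed; reset the contrary streak
        else:
            self._streak += 1
            needed = self._down if self.healthy else self._up
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy
```

Note the asymmetry most balancers use: marking an instance healthy again usually requires fewer consecutive successes than marking it unhealthy requires failures, so capacity returns quickly after a blip.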
Mini tasks for each step
- Write a 1-sentence SLO: “p95 < 200ms at 500 RPS.”
- Choose: L4 or L7? State your reason in one line.
- Pick your initial algorithm and when you’d switch.
- Define health check path, interval (s), and thresholds (n).
- Set a drain timeout and retry budget (max retries per request).
- Decide: sticky on/off? If on, how to remove later?
- Define two autoscaling triggers you trust.
- Select 3 dashboards: per-target latency, errors, saturation.
Exercises
Work through the exercise below, then record your answers in your notes.
Exercise 1 — Pick and justify a strategy
Scenario: You run a read-heavy REST API with occasional long-running requests. Two instance types: 2 large and 3 medium. You deploy twice per day and need safe rollouts.
- Choose L4 or L7 and justify.
- Pick a balancing algorithm and whether to weight instances.
- Define health check path, interval, timeout, healthy/unhealthy thresholds.
- Set connection draining and slow start values.
- Define a retry policy (which requests are safe to retry?).
- Decide if you need stickiness. If yes, how to phase it out?
Self-check checklist
- Did you avoid retrying non-idempotent requests?
- Did you account for different instance sizes?
- Did you include drain/slow-start for deploys?
- Is your health check specific and fast?
- Is there a plan to remove stickiness (if used)?
Common mistakes and how to self-check
- Relying on sticky sessions forever. Self-check: can any instance handle any request after a cache warm-up?
- No connection draining. Self-check: do deploys cause visible 499/5xx spikes?
- Only CPU-based autoscaling. Self-check: add latency/queue depth signals.
- Aggressive retries. Self-check: ensure idempotency; set retry budgets.
- Weak health checks. Self-check: probe a lightweight but app-specific endpoint that confirms dependencies.
- One AZ only. Self-check: simulate losing an AZ; does traffic rebalance cleanly?
- Ignoring p99 latency. Self-check: track tail latency per instance and per route.
Practical projects
- Spin up two API instances and place Nginx/HAProxy in front. Try round robin vs least connections. Record latency changes under mixed workloads.
- Enable cookie-based affinity, trigger a deploy, and observe how draining + slow start affect errors. Then remove stickiness after moving sessions to a shared store.
- Simulate consistent hashing with a small script. Add/remove nodes and measure the percentage of keys that move.
- Add health checks that verify DB connectivity. Kill one instance and confirm the LB stops sending traffic within your thresholds.
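The consistent-hashing project above can be done with a short script like this, which compares naive mod-N placement against a hash ring when a fourth node joins; node names, key counts, and the `vnodes` value are arbitrary.

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return [p for p, _ in points], [n for _, n in points]

def ring_owner(key, ring):
    hashes, names = ring
    return names[bisect.bisect(hashes, h(key)) % len(names)]

keys = [f"key-{i}" for i in range(10_000)]

# Naive placement: owner = hash(key) % N. Going from 3 to 4 nodes
# remaps most keys, since almost every hash lands on a new residue.
mod_moved = sum(h(k) % 3 != h(k) % 4 for k in keys)

# Hash ring: only the keys claimed by the new node move (about 1/4 here).
old_ring = build_ring(["n1", "n2", "n3"])
new_ring = build_ring(["n1", "n2", "n3", "n4"])
ring_moved = sum(ring_owner(k, old_ring) != ring_owner(k, new_ring) for k in keys)

print(f"mod-N moved {mod_moved / len(keys):.0%} of keys; "
      f"ring moved {ring_moved / len(keys):.0%}")
```

Run it, then vary `vnodes` and the node count to see how the moved fraction tracks 1/N for the ring while staying large for mod-N.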
Learning path
- Before: HTTP fundamentals, stateless design, basic observability.
- Now: Load balancing algorithms, health checks, connection management.
- Next: Rate limiting, caching layers, service discovery, global traffic management (multi-region, DNS-based GSLB).
Next steps
- Complete the exercise and compare with the solution.
- Take the quick test at the end; logging in lets you save your progress.
- Apply one project in a sandbox environment and measure the effect on latency and error rates.
Mini challenge
Your login service must handle 10x traffic for a product launch with minimal cost increase.
- Pick an algorithm and justify in one sentence.
- Define drain timeout, slow start, and retry rules.
- List 3 metrics and alert thresholds you’ll watch during the launch.