Who this is for
- Backend engineers preparing for system design interviews.
- Developers running services that need uptime, scalability, and predictable latency.
- Anyone moving from single-instance apps to multi-instance, highly available systems.
Prerequisites
- Basic HTTP/HTTPS knowledge (methods, headers, status codes).
- Familiarity with stateless vs stateful services.
- Comfort with reading simple architecture diagrams and logs.
Why this matters
In real backend work you will:
- Split traffic across multiple instances to remove single points of failure.
- Keep services responsive during spikes and deploys.
- Roll out new versions safely (canary/blue-green) without customer impact.
- Route traffic by path/host to the right service (APIs, WebSockets, static files).
Real tasks you might do
- Add a new API pod and verify the load balancer starts sending traffic only after it passes health checks.
- Investigate uneven traffic: adjust from round-robin to least-connections.
- Introduce cookie-based session affinity for a short-lived experiment, then remove it after moving sessions to a shared store.
- Set up connection draining so in-flight requests finish during rolling deployments.
Concept explained simply
A load balancer (LB) sits in front of multiple server instances and decides which instance handles each incoming request or connection. The goal: share work fairly, keep latency low, and survive instance failures.
- Layer 4 (L4): routes by IP/port (fast, protocol-agnostic: TCP/UDP).
- Layer 7 (L7): understands HTTP(S)/gRPC; can route by path, host, headers, or method.
Core jobs of an LB:
- Distribute: choose a target using an algorithm.
- Health check: only send to healthy instances.
- Retry or fail fast: if a target fails, pick another where safe.
- Observe: expose metrics (RPS, errors, latency, active connections).
Mental model
Think of the LB as an air-traffic controller. Planes (requests) arrive constantly. The controller checks which runways (instances) are open and least busy, then assigns planes. If a runway closes, planes are diverted. The controller also decides special rules (e.g., heavy cargo goes to the long runway).
Algorithms and when to use
Round Robin
Cycles through instances in order. Simple and good when instances are identical and requests are similar.
Weighted Round Robin
Like round robin but gives more traffic to stronger instances (e.g., more CPU). Good when capacity differs.
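Both variants can be sketched as one generator: plain round robin is just the weighted form with every weight set to 1. The instance names and the 3:3:1 weights below are invented for illustration.

```python
import itertools

# Hypothetical fleet: weights model relative capacity (two large, one medium).
WEIGHTS = {"large-1": 3, "large-2": 3, "medium-1": 1}

def weighted_round_robin(weights):
    """Cycle through instances, repeating each in proportion to its weight.

    Plain round robin is the special case where every weight is 1.
    """
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)

picker = weighted_round_robin(WEIGHTS)
one_cycle = [next(picker) for _ in range(sum(WEIGHTS.values()))]
```

Over each full cycle the large instances receive three requests for every one the medium instance gets. Production balancers use smoother interleavings (e.g. nginx's "smooth weighted round robin") so a heavy instance's share is not bunched together.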
Least Connections
Sends new requests to the instance with the fewest active connections. Better when request durations vary.
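The selection rule is a one-liner, assuming the balancer tracks in-flight requests per instance; the counts below are an invented snapshot.

```python
# Invented snapshot of in-flight request counts per instance.
active = {"api-1": 4, "api-2": 1, "api-3": 2}

def pick_least_connections(active_conns):
    """Return the instance with the fewest in-flight requests."""
    return min(active_conns, key=active_conns.get)

target = pick_least_connections(active)
active[target] += 1   # request dispatched; decrement again when it completes
```

The bookkeeping (increment on dispatch, decrement on completion) is what makes this algorithm adapt to slow requests that round robin would ignore.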
Least Response Time
Chooses the instance with the lowest observed latency and fewest connections. Good for latency-sensitive APIs.
Random (Power of Two Choices)
Pick two random instances and choose the less loaded of the two. Near-optimal with low overhead.
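The rule also fits in a few lines. This sketch assumes per-instance connection counts are available, as in the least-connections example; the counts are invented, with "api-1" clearly overloaded.

```python
import random

def power_of_two_choices(active_conns, rng=random):
    """Sample two distinct instances at random; keep the less loaded one."""
    a, b = rng.sample(list(active_conns), 2)
    return a if active_conns[a] <= active_conns[b] else b

# Invented counts; "api-1" is strictly the most loaded instance.
active = {"api-1": 40, "api-2": 3, "api-3": 7, "api-4": 5}
picks = [power_of_two_choices(active) for _ in range(1000)]
```

Because the overloaded instance loses every pairwise comparison, it is effectively never chosen, yet the balancer only ever inspects two counters per request instead of scanning the whole fleet.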
IP Hash / Source Affinity
Clients with the same source IP land on the same instance. Useful for simple stickiness, but many clients behind one NAT share an IP (creating hot spots), and mobile clients change IPs mid-session, breaking affinity.
Cookie-based Session Affinity
LB sets a cookie so a user keeps hitting the same instance. Quick fix for stateful apps, but can cause hot spots and complicate autoscaling.
Consistent Hashing
Maps keys (like userID) to instances with minimal remapping when instances change. Great for caches and sharded state.
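A minimal hash ring sketch. Virtual nodes are included because without them the key distribution across a handful of instances is very uneven; the node names and `vnodes` count are illustrative.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Stable hash for ring placement (MD5 is fine for placement, not security)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=50):
        # Each node appears at `vnodes` points on the ring for an even spread.
        points = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [p for p, _ in points]
        self._nodes = [n for _, n in points]

    def lookup(self, key: str) -> str:
        # The first ring point clockwise of the key's hash owns the key.
        i = bisect.bisect(self._hashes, _h(key)) % len(self._nodes)
        return self._nodes[i]

ring = HashRing(["cache-1", "cache-2", "cache-3"])
owner = ring.lookup("user:42")   # deterministic for a fixed node set
```

Adding or removing a node only reassigns the keys adjacent to that node's ring points; everything else keeps its owner, which is exactly the property caches need.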
Essential features
- Health checks: active (periodic probes) and passive (watching failures). Configure timeouts and thresholds to avoid flapping.
- Connection draining: stop sending new requests to an instance during deploy; let in-flight finish.
- Slow start: ramp up traffic to new instances gradually.
- Retries with budgets: only retry idempotent requests; cap total retries to avoid storms.
- TLS termination: LB handles HTTPS; backends speak HTTP or mTLS as needed.
- Path/host routing: send /api to API service, /assets to static servers, different hosts to different apps.
- Rate limiting and WAF rules: protect downstreams.
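The "retries with budgets" rule above can be sketched as a wrapper. The idempotent-method set follows standard HTTP semantics; the `send` callable and the cap of two retries are illustrative assumptions, not a recommendation.

```python
# Methods that are safe to retry per standard HTTP semantics.
IDEMPOTENT = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

def call_with_retries(send, method, max_retries=2):
    """Retry only idempotent requests, and cap total attempts.

    `send` is a hypothetical callable that performs one attempt and
    raises ConnectionError on failure.
    """
    attempts = 1 + (max_retries if method in IDEMPOTENT else 0)
    last_err = None
    for _ in range(attempts):
        try:
            return send()
        except ConnectionError as err:
            last_err = err
    raise last_err
```

Real balancers enforce the budget fleet-wide (e.g. retries as a percentage of total requests), because even capped per-request retries can multiply load during an outage.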
L4 vs L7 quick guide
- Choose L4 when you need protocol-agnostic speed (TCP/UDP, TLS or gRPC passthrough, raw WebSockets) and minimal logic.
- Choose L7 when you need smart routing, header inspection, HTTP/2 features, or per-route policies.
Worked examples
Example 1: Spiky API traffic
Problem: Traffic spikes; some requests slow down. Instances are identical, request durations vary a lot.
Approach: Switch algorithm from round robin to least connections or power-of-two-choices. Enable slow start for new instances. Add passive health checks and set connection draining during deploys.
Outcome: Better distribution under uneven workloads; fewer tail latencies.
Example 2: Stateful sessions
Problem: A legacy app stores sessions in memory. Users get logged out on failover.
Approach: Temporarily enable cookie-based session affinity on the LB. As a long-term fix, move sessions to a shared store (e.g., Redis) and remove affinity to improve balancing and resilience.
Outcome: Immediate stability with a path to stateless scaling.
Example 3: Cache sharding
Problem: A distributed cache must keep keys on consistent nodes during scale-out.
Approach: Use consistent hashing at the LB or client library so when nodes are added/removed only a small fraction of keys move.
Outcome: Higher cache hit rate and smoother scaling.
Step-by-step: design a load-balanced API
- Define constraints: target RPS, p95 latency, failure domains (zones/regions), and peak vs average load.
- Choose LB layer: L7 for smart routing and observability; L4 for raw speed or non-HTTP protocols.
- Pick algorithm: start with round robin; prefer least connections or power-of-two for uneven workloads.
- Health checks: set endpoint, timeout, interval, healthy/unhealthy thresholds.
- Deploy safety: enable connection draining and slow start; cap retries per request.
- State and stickiness: avoid stickiness by externalizing state; if needed, use short-lived cookie affinity.
- Autoscaling signals: scale on CPU + request latency + queue depth, not just one metric.
- Observe: track RPS, active conns, p95/p99 latency, 5xx rates per instance; alert on anomalies.
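The healthy/unhealthy thresholds from the health-check step can be modeled as a tiny state machine; the default thresholds here are illustrative, not a recommendation.

```python
class HealthTracker:
    """Flip health state only after N consecutive contrary probes,
    so a single slow probe does not cause flapping."""

    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy = True
        self._streak = 0            # consecutive probes contradicting current state
        self._up = healthy_threshold
        self._down = unhealthy_threshold

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0        # state confirmed; reset the contrary streak
        else:
            self._streak += 1
            needed = self._down if self.healthy else self._up
            if self._streak >= needed:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy
```

Note the asymmetry most balancers use: marking an instance healthy again usually requires fewer consecutive successes than marking it unhealthy requires failures, so capacity returns quickly after a blip.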
Mini tasks for each step
- Write a 1-sentence SLO: “p95 < 200ms at 500 RPS.”
- Choose: L4 or L7? State your reason in one line.
- Pick your initial algorithm and when you’d switch.
- Define health check path, interval (s), and thresholds (n).
- Set a drain timeout and retry budget (max retries per request).
- Decide: sticky on/off? If on, how to remove later?
- Define two autoscaling triggers you trust.
- Select 3 dashboards: per-target latency, errors, saturation.
Exercises
Work through the exercise below, then record your answers in your notes.
Exercise 1 — Pick and justify a strategy
Scenario: You run a read-heavy REST API with occasional long-running requests. Two instance types: 2 large and 3 medium. You deploy twice per day and need safe rollouts.
- Choose L4 or L7 and justify.
- Pick a balancing algorithm and whether to weight instances.
- Define health check path, interval, timeout, healthy/unhealthy thresholds.
- Set connection draining and slow start values.
- Define a retry policy (which requests are safe to retry?).
- Decide if you need stickiness. If yes, how to phase it out?
Self-check checklist
- Did you avoid retrying non-idempotent requests?
- Did you account for different instance sizes?
- Did you include drain/slow-start for deploys?
- Is your health check specific and fast?
- Is there a plan to remove stickiness (if used)?
Common mistakes and how to self-check
- Relying on sticky sessions forever. Self-check: can any instance handle any request after a cache warm-up?
- No connection draining. Self-check: do deploys cause visible 499/5xx spikes?
- Only CPU-based autoscaling. Self-check: add latency/queue depth signals.
- Aggressive retries. Self-check: ensure idempotency; set retry budgets.
- Weak health checks. Self-check: probe a lightweight but app-specific endpoint that confirms dependencies.
- One AZ only. Self-check: simulate losing an AZ; does traffic rebalance cleanly?
- Ignoring p99 latency. Self-check: track tail latency per instance and per route.
Practical projects
- Spin up two API instances and place Nginx/HAProxy in front. Try round robin vs least connections. Record latency changes under mixed workloads.
- Enable cookie-based affinity, trigger a deploy, and observe how draining + slow start affect errors. Then remove stickiness after moving sessions to a shared store.
- Simulate consistent hashing with a small script. Add/remove nodes and measure the percentage of keys that move.
- Add health checks that verify DB connectivity. Kill one instance and confirm the LB stops sending traffic within your thresholds.
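The consistent-hashing project above can be done with a short script like this, which compares naive mod-N placement against a hash ring when a fourth node joins; node names, key counts, and the `vnodes` value are arbitrary.

```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=100):
    points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
    return [p for p, _ in points], [n for _, n in points]

def ring_owner(key, ring):
    hashes, names = ring
    return names[bisect.bisect(hashes, h(key)) % len(names)]

keys = [f"key-{i}" for i in range(10_000)]

# Naive placement: owner = hash(key) % N. Going from 3 to 4 nodes
# remaps most keys, since almost every hash lands on a new residue.
mod_moved = sum(h(k) % 3 != h(k) % 4 for k in keys)

# Hash ring: only the keys claimed by the new node move (about 1/4 here).
old_ring = build_ring(["n1", "n2", "n3"])
new_ring = build_ring(["n1", "n2", "n3", "n4"])
ring_moved = sum(ring_owner(k, old_ring) != ring_owner(k, new_ring) for k in keys)

print(f"mod-N moved {mod_moved / len(keys):.0%} of keys; "
      f"ring moved {ring_moved / len(keys):.0%}")
```

Run it, then vary `vnodes` and the node count to see how the moved fraction tracks 1/N for the ring while staying large for mod-N.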
Learning path
- Before: HTTP fundamentals, stateless design, basic observability.
- Now: Load balancing algorithms, health checks, connection management.
- Next: Rate limiting, caching layers, service discovery, global traffic management (multi-region, DNS-based GSLB).
Next steps
- Complete the exercise and compare with the solution.
- Take the quick test at the end; logging in lets you save your progress.
- Apply one project in a sandbox environment and measure the effect on latency and error rates.
Mini challenge
Your login service must handle 10x traffic for a product launch with minimal cost increase.
- Pick an algorithm and justify in one sentence.
- Define drain timeout, slow start, and retry rules.
- List 3 metrics and alert thresholds you’ll watch during the launch.