Who this is for
- Backend engineers moving from monoliths to microservices.
- Developers deploying services on containers or VMs and needing reliable communication between services.
- Engineers preparing to work with Kubernetes, Consul, Eureka, or service meshes.
Prerequisites
- Comfortable with HTTP and basic networking (DNS, ports, IPs).
- Basic understanding of microservices and stateless services.
- Familiarity with health checks and load balancing concepts helps.
Why this matters
In microservices, instances start, stop, scale, and move. Hardcoding hostnames or IPs breaks quickly. Service discovery lets services find each other safely, route traffic to healthy instances, respect versions/regions, and survive failures.
Real tasks you will face:
- Directing API traffic to the newest healthy version during a rolling deploy.
- Failing over to another zone when an availability zone degrades.
- Reducing cascading failures with timeouts, retries, and circuit breakers tied to discovery data.
Concept explained simply
Service discovery answers two questions at runtime: “Where is service X right now?” and “Which instance should I call?” It relies on a registry of healthy instances and a lookup mechanism that clients or gateways use to choose a target.
Mental model
Think of an airport arrivals board. Airlines (services) continuously publish which gates (IPs/ports) are active. Passengers (callers) look up the board just in time to find the right gate. If a gate changes, the board updates quickly. The board is your service registry.
Core components and patterns
- Service registry: A consistent source of truth for live instances (e.g., stored as records with address, port, health, metadata like version/region). Health is updated via heartbeats or active checks.
- Registration and deregistration: Instances self-register on start and deregister on shutdown, or an orchestrator does it. Expiry/TTL removes stale instances.
- Health checks: HTTP/TCP checks, TTL heartbeats. Only healthy instances are discoverable.
- Client-side discovery: The caller queries the registry and picks an instance (e.g., round-robin, least-connections, weighted). Pros: smart clients; Cons: more complex clients.
- Server-side discovery: A load balancer or gateway queries the registry, and clients call a stable endpoint. Pros: simpler clients; Cons: central component needs to scale.
- DNS-based discovery: Use A/AAAA/SRV records to resolve service names to healthy endpoints. Simple and ubiquitous; caching and TTLs must be tuned.
- Service mesh (sidecar): A local proxy per service handles discovery, load balancing, retries, and mTLS, keeping app code simple.
- Load balancing strategies: Round-robin, random, least-connections, power-of-two-choices, weighted distributions, zone-aware routing.
- Resilience policies: Timeouts, retries with backoff and jitter, circuit breakers, and outlier detection integrated with discovery.
- Metadata-driven routing: Tags/labels for version (v1/v2), region, canary, or capabilities. Clients use filters to pick compatible instances.
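The registry, health-check, and metadata ideas above can be sketched as a tiny in-memory registry with TTL-based heartbeats. This is a minimal sketch; the field names and the `register`/`heartbeat`/`query` API are illustrative, not any real registry's interface:

```python
import time

class Registry:
    """Minimal in-memory service registry: heartbeat health with TTL expiry."""

    def __init__(self, ttl_seconds=10):
        self.ttl = ttl_seconds
        self.instances = {}  # instance id -> record

    def register(self, service, instance_id, address, port, tags=()):
        self.instances[instance_id] = {
            "service": service, "address": address, "port": port,
            "tags": set(tags), "last_heartbeat": time.monotonic(),
        }

    def heartbeat(self, instance_id):
        # Instances call this periodically; missing heartbeats expire them.
        if instance_id in self.instances:
            self.instances[instance_id]["last_heartbeat"] = time.monotonic()

    def deregister(self, instance_id):
        self.instances.pop(instance_id, None)

    def query(self, service, required_tags=()):
        """Return only live (non-expired) instances carrying all required tags."""
        now = time.monotonic()
        return [
            r for r in self.instances.values()
            if r["service"] == service
            and now - r["last_heartbeat"] <= self.ttl
            and set(required_tags) <= r["tags"]
        ]

reg = Registry(ttl_seconds=10)
reg.register("orders", "ord-1", "10.0.1.10", 8080, tags=["v2", "zone-a"])
reg.register("orders", "ord-3", "10.0.3.33", 8080, tags=["v1", "zone-b"])
live_v2 = reg.query("orders", required_tags=["v2"])
print([r["address"] for r in live_v2])  # ['10.0.1.10']
```

A real registry would persist records and replicate them for HA; the TTL filter in `query` is what keeps stale instances from being discoverable.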
How Kubernetes fits
Kubernetes offers built-in discovery: Services get stable names; DNS resolves them to cluster IPs or directly to Endpoints (headless Services). Readiness probes ensure only ready Pods receive traffic. This is a mix of server-side and DNS-based discovery.
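In Kubernetes terms, the pieces above map to a Service (stable name, DNS entry) plus a readiness probe that gates which Pods receive traffic. A minimal sketch, where the names (`orders`, `/healthz`, the image) are all placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: example.com/orders:v2   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:                # Pod is routable only while this passes
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
```

Pods that fail the readiness probe are removed from the Service's endpoints automatically, which is exactly the "only healthy instances are discoverable" rule from earlier.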
Worked examples
Example 1: Client-side discovery with health and version pinning
Goal: A Payments service calls Orders only if it is healthy and version is v2.
// Registry records (conceptual)
orders/instances:
- id: ord-1 address: 10.0.1.10 port: 8080 health: passing tags: ["v2","zone-a"]
- id: ord-2 address: 10.0.2.21 port: 8080 health: warning tags: ["v2","zone-b"]
- id: ord-3 address: 10.0.3.33 port: 8080 health: passing tags: ["v1","zone-b"]
// Caller logic (pseudocode)
instances = registry.query(service="orders")
healthy = filter(instances, i => i.health == "passing" && hasTag(i, "v2"))
choice = leastConnections(healthy)
call(choice.address, choice.port)
Result: ord-1 is selected. ord-2 is avoided due to degraded health; ord-3 is avoided due to version mismatch.
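The same selection logic can be run as a small Python sketch. The records mirror the conceptual registry entries above; `active_conns` is a hypothetical client-maintained counter used for least-connections:

```python
# Instance records mirroring the conceptual registry entries above.
instances = [
    {"id": "ord-1", "address": "10.0.1.10", "port": 8080,
     "health": "passing", "tags": {"v2", "zone-a"}, "active_conns": 4},
    {"id": "ord-2", "address": "10.0.2.21", "port": 8080,
     "health": "warning", "tags": {"v2", "zone-b"}, "active_conns": 1},
    {"id": "ord-3", "address": "10.0.3.33", "port": 8080,
     "health": "passing", "tags": {"v1", "zone-b"}, "active_conns": 0},
]

# Filter for health and version, then pick by least-connections.
healthy_v2 = [i for i in instances
              if i["health"] == "passing" and "v2" in i["tags"]]
choice = min(healthy_v2, key=lambda i: i["active_conns"])
print(choice["id"])  # ord-1
```

Note that ord-3 has the fewest connections but is never considered: metadata filtering runs before load balancing.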
Example 2: Server-side discovery via gateway
- Client calls api-gateway at a stable address.
- Gateway asks registry for healthy inventory instances in the same zone as the caller.
- Gateway applies weighted round-robin (80% to v2, 20% to canary v3).
- On failure, gateway retries once with jitter to a different instance.
Benefit: client stays simple; rollout control and retries are centralized.
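The gateway's weighted split can be sketched in a few lines of Python; the backend IDs are illustrative, and `random.choices` does the 80/20 weighting:

```python
import random

# Two pools: stable v2 backends and a v3 canary (IDs are made up).
pools = {"v2": ["inv-1", "inv-2"], "v3-canary": ["inv-canary"]}

def pick_backend(rng=random):
    # Weighted choice of pool first (80% v2, 20% canary), then an instance.
    pool = rng.choices(["v2", "v3-canary"], weights=[80, 20])[0]
    return rng.choice(pools[pool])

# Over many picks, roughly 80% land on v2 backends.
picks = [pick_backend() for _ in range(10_000)]
v2_share = sum(p != "inv-canary" for p in picks) / len(picks)
print(round(v2_share, 1))  # 0.8
```

Ramping the canary is then just a change to the weights, which is why centralizing this in the gateway makes rollouts easy to control.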
Example 3: DNS SRV-based discovery
_orders._tcp.svc.company.local 60 IN SRV 10 50 8080 ord-1.svc.company.local
_orders._tcp.svc.company.local 60 IN SRV 10 50 8080 ord-2.svc.company.local
Client resolves SRV records to get target and port. TTL=60s means changes propagate within a minute. Clients should avoid long-lived DNS caches and re-resolve periodically.
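SRV-style selection can be sketched in Python: pick among the records with the lowest priority, weighted by the weight field. The tuples mirror the (priority, weight, port, target) layout above; the backup record is hypothetical, added to show that higher-priority-number records are used only as fallback:

```python
import random

# (priority, weight, port, target) tuples, as in the SRV records above.
records = [
    (10, 50, 8080, "ord-1.svc.company.local"),
    (10, 50, 8080, "ord-2.svc.company.local"),
    (20, 100, 8080, "ord-backup.svc.company.local"),  # hypothetical fallback
]

def pick_srv(recs, rng=random):
    # Only records at the lowest priority value compete; weight splits traffic.
    lowest = min(r[0] for r in recs)
    candidates = [r for r in recs if r[0] == lowest]
    chosen = rng.choices(candidates, weights=[r[1] for r in candidates])[0]
    return chosen[3], chosen[2]  # (target, port)

target, port = pick_srv(records)
print(port)  # 8080
```

With equal weights of 50, ord-1 and ord-2 each get about half the traffic; the backup is never selected while a priority-10 record exists.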
Example 4: Readiness vs liveness
A service passes liveness but fails readiness during startup while warming caches. Discovery should exclude it until readiness passes, preventing cold starts from serving traffic prematurely.
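The routing rule that falls out of this distinction is small but worth making explicit. A sketch with illustrative field names:

```python
# An instance can be live (process up) but not ready (caches still cold).
# Discovery should route only when both are true.
def routable(instance):
    return instance["live"] and instance["ready"]

warming = {"id": "svc-1", "live": True, "ready": False}  # still warming caches
warmed  = {"id": "svc-2", "live": True, "ready": True}

print([i["id"] for i in (warming, warmed) if routable(i)])  # ['svc-2']
```

Liveness alone answers "should this instance be restarted?"; readiness answers "should it receive traffic?" — discovery cares about the second.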
Exercises
Do these in order. You can check solutions below each task.
Exercise 1: Design registry entries with safe rollout
Create registry records for a service named catalog with three instances. Two are v2 in different zones; one is a canary v3. Only healthy instances should be routable. Include fields: id, address, port, health, tags (version, zone, canary yes/no).
- Include one failing instance and explain why it should be excluded.
- Mark the canary so callers can weight it lower.
Show solution
catalog/instances:
- id: cat-a address: 10.1.0.11 port: 7000 health: passing tags: ["v2","zone-a","canary:no"]
- id: cat-b address: 10.1.1.22 port: 7000 health: critical tags: ["v2","zone-b","canary:no"]
- id: cat-c address: 10.1.2.33 port: 7001 health: passing tags: ["v3","zone-b","canary:yes"]
// Routing
// Exclude cat-b due to critical health.
// Weight: cat-a 80%, cat-c 20% (canary).
Exercise 2: Implement picker with timeouts and retries
Write pseudocode for a client-side discovery picker that:
- Filters for health=passing and tag version=v2.
- Uses power-of-two-choices to select an instance by least-connections.
- Applies a 300ms request timeout and at most 1 retry to a different instance with exponential backoff (200ms base + jitter).
Show solution
instances = registry.query("checkout")
candidates = filter(instances, i => i.health == "passing" && hasTag(i, "v2"))

function pickPow2(cs) {
  a = randomChoice(cs); b = randomChoice(cs)
  return (a.activeConns <= b.activeConns) ? a : b
}

function callWithRetry(req) {
  target = pickPow2(candidates)  // hoisted so the retry can exclude it
  try {
    return httpCall(target, req, timeout=300ms)
  } catch (e) {
    backoff = 200ms + random(0..50ms)  // base + jitter
    sleep(backoff)
    alt = pickPow2(candidates.filter(c => c.id != target.id))
    return httpCall(alt, req, timeout=300ms)
  }
}
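An executable Python version of the same solution, for checking your own answer against. `http_call` is faked so the example is self-contained; the "checkout" instances and their connection counts are made up:

```python
import random
import time

# Made-up checkout instances; chk-3 is filtered out by the version tag.
candidates = [
    {"id": "chk-1", "health": "passing", "tags": {"v2"}, "active_conns": 3},
    {"id": "chk-2", "health": "passing", "tags": {"v2"}, "active_conns": 1},
    {"id": "chk-3", "health": "passing", "tags": {"v1"}, "active_conns": 0},
]
candidates = [c for c in candidates
              if c["health"] == "passing" and "v2" in c["tags"]]

def pick_pow2(cs, rng=random):
    # Power-of-two-choices: sample two, keep the less loaded one.
    a, b = rng.choice(cs), rng.choice(cs)
    return a if a["active_conns"] <= b["active_conns"] else b

def http_call(target, req, timeout):
    # Stand-in for a real HTTP client call; chk-1 simulates a slow instance.
    if target["id"] == "chk-1":
        raise TimeoutError("simulated timeout")
    return f"200 OK from {target['id']}"

def call_with_retry(req, rng=random):
    target = pick_pow2(candidates, rng)
    try:
        return http_call(target, req, timeout=0.3)  # 300ms budget
    except TimeoutError:
        time.sleep(0.2 + rng.uniform(0, 0.05))      # 200ms base + jitter
        others = [c for c in candidates if c["id"] != target["id"]]
        alt = pick_pow2(others, rng)
        return http_call(alt, req, timeout=0.3)

print(call_with_retry({"path": "/pay"}))  # 200 OK from chk-2
```

Whichever v2 instance is picked first, the call ends at chk-2: either directly, or after one backed-off retry when chk-1 times out.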
Self-check checklist
- I excluded non-ready or failing instances from routing.
- I used metadata (tags) to control version or canary traffic.
- I picked a balancing strategy and could explain why.
- I set explicit timeouts and bounded retries with jitter.
- I considered zone awareness when applicable.
Common mistakes and how to catch them
- Relying on IP lists in config: Instances change. Self-check: does deployment require a config update to add capacity? If yes, fix discovery.
- Ignoring readiness: New pods get traffic too early. Self-check: simulate cold start; does error rate spike? Gate routing by readiness.
- Unbounded retries: Amplifies outages. Self-check: ensure max attempts and total timeout budget are enforced.
- Sticky caches: DNS or client caches never refresh. Self-check: rotate an instance; verify callers discover changes within TTL.
- No metadata: Can’t canary or region-route. Self-check: are version/zone tags available and validated?
- Single registry without HA: Discovery becomes a single point of failure. Self-check: simulate registry node loss; does discovery continue?
Practical projects
- Blue/Green + Canary: Tag services with v2 and canary:yes, route 10% to canary, then ramp to 100% while watching error rates.
- Zone-aware routing: Prefer same-zone instances and only cross-zone on saturation. Measure latency improvements.
- Outlier detection: Eject an instance that exceeds error threshold for 1 minute; auto-return on recovery.
Learning path
- Learn client-side vs server-side discovery trade-offs.
- Add health checks and readiness gates to your services.
- Introduce metadata tags to control routing (version, region, canary).
- Implement timeouts, retries with backoff, and circuit breaking.
- Practice DNS-based discovery and understand TTL behavior.
- Explore service mesh features if you need mTLS and policy centralization.
Next steps
- Instrument discovery decisions with logs/metrics (chosen instance, retry counts, timeouts).
- Practice failure injection: kill an instance and confirm traffic shifts automatically.
- Move on to request routing and API gateway patterns.
Mini challenge
Write a short policy for a staging environment:
- Route 5% of traffic to v3 canary when error rate < 1% over 5 minutes; otherwise fall back to 0%.
- Prefer same-zone traffic; cross-zone only if all same-zone instances exceed 80% connection load.
- Global timeout budget per request: 600ms, with at most 1 retry.
Hint
Express it as metadata filters plus a weighted policy, then define guardrails (error threshold, connection load) and a fallback.