Who this is for
- Backend engineers moving from monoliths to microservices.
- Developers deploying services on containers or VMs and needing reliable communication between services.
- Engineers preparing to work with Kubernetes, Consul, Eureka, or service meshes.
Prerequisites
- Comfortable with HTTP and basic networking (DNS, ports, IPs).
- Basic understanding of microservices and stateless services.
- Familiarity with health checks and load balancing concepts helps.
Why this matters
In microservices, instances start, stop, scale, and move. Hardcoding hostnames or IPs breaks quickly. Service discovery lets services find each other safely, route traffic to healthy instances, respect versions/regions, and survive failures.
Real tasks you will face:
- Directing API traffic to the newest healthy version during a rolling deploy.
- Failing over to another zone when an availability zone degrades.
- Reducing cascading failures with timeouts, retries, and circuit breakers tied to discovery data.
Concept explained simply
Service discovery answers two questions at runtime: “Where is service X right now?” and “Which instance should I call?” It relies on a registry of healthy instances and a lookup mechanism that clients or gateways use to choose a target.
Mental model
Think of an airport arrivals board. Airlines (services) continuously publish which gates (IPs/ports) are active. Passengers (callers) look up the board just in time to find the right gate. If a gate changes, the board updates quickly. The board is your service registry.
Core components and patterns
- Service registry: A consistent source of truth for live instances (e.g., stored as records with address, port, health, metadata like version/region). Health is updated via heartbeats or active checks.
- Registration and deregistration: Instances self-register on start and deregister on shutdown, or an orchestrator does it. Expiry/TTL removes stale instances.
- Health checks: HTTP/TCP checks, TTL heartbeats. Only healthy instances are discoverable.
- Client-side discovery: The caller queries the registry and picks an instance (e.g., round-robin, least-connections, weighted). Pros: smart clients; Cons: more complex clients.
- Server-side discovery: A load balancer or gateway queries the registry, and clients call a stable endpoint. Pros: simpler clients; Cons: central component needs to scale.
- DNS-based discovery: Use A/AAAA/SRV records to resolve service names to healthy endpoints. Simple and ubiquitous; caching and TTLs must be tuned.
- Service mesh (sidecar): A local proxy per service handles discovery, load balancing, retries, and mTLS, keeping app code simple.
- Load balancing strategies: Round-robin, random, least-connections, power-of-two-choices, weighted distributions, zone-aware routing.
- Resilience policies: Timeouts, retries with backoff and jitter, circuit breakers, and outlier detection integrated with discovery.
- Metadata-driven routing: Tags/labels for version (v1/v2), region, canary, or capabilities. Clients use filters to pick compatible instances.
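The registry, health-check, and metadata ideas above can be sketched as a tiny in-memory registry with TTL-based heartbeats. This is a minimal sketch; the field names and the `register`/`heartbeat`/`query` API are illustrative, not any real registry's interface:

```python
import time

class Registry:
    """Minimal in-memory service registry: heartbeat health with TTL expiry."""

    def __init__(self, ttl_seconds=10):
        self.ttl = ttl_seconds
        self.instances = {}  # instance id -> record

    def register(self, service, instance_id, address, port, tags=()):
        self.instances[instance_id] = {
            "service": service, "address": address, "port": port,
            "tags": set(tags), "last_heartbeat": time.monotonic(),
        }

    def heartbeat(self, instance_id):
        # Instances call this periodically; missing heartbeats expire them.
        if instance_id in self.instances:
            self.instances[instance_id]["last_heartbeat"] = time.monotonic()

    def deregister(self, instance_id):
        self.instances.pop(instance_id, None)

    def query(self, service, required_tags=()):
        """Return only live (non-expired) instances carrying all required tags."""
        now = time.monotonic()
        return [
            r for r in self.instances.values()
            if r["service"] == service
            and now - r["last_heartbeat"] <= self.ttl
            and set(required_tags) <= r["tags"]
        ]

reg = Registry(ttl_seconds=10)
reg.register("orders", "ord-1", "10.0.1.10", 8080, tags=["v2", "zone-a"])
reg.register("orders", "ord-3", "10.0.3.33", 8080, tags=["v1", "zone-b"])
live_v2 = reg.query("orders", required_tags=["v2"])
print([r["address"] for r in live_v2])  # ['10.0.1.10']
```

A real registry would persist records and replicate them for HA; the TTL filter in `query` is what keeps stale instances from being discoverable.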
How Kubernetes fits
Kubernetes offers built-in discovery: Services get stable names; DNS resolves them to cluster IPs or directly to Endpoints (headless Services). Readiness probes ensure only ready Pods receive traffic. This is a mix of server-side and DNS-based discovery.
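In Kubernetes terms, the pieces above map to a Service (stable name, DNS entry) plus a readiness probe that gates which Pods receive traffic. A minimal sketch, where the names (`orders`, `/healthz`, the image) are all placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders
spec:
  selector:
    app: orders
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders
  template:
    metadata:
      labels:
        app: orders
    spec:
      containers:
        - name: orders
          image: example.com/orders:v2   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:                # Pod is routable only while this passes
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 5
```

Pods that fail the readiness probe are removed from the Service's endpoints automatically, which is exactly the "only healthy instances are discoverable" rule from earlier.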
Worked examples
Example 1: Client-side discovery with health and version pinning
Goal: A Payments service calls Orders only if it is healthy and version is v2.
// Registry records (conceptual)
orders/instances:
- id: ord-1 address: 10.0.1.10 port: 8080 health: passing tags: ["v2","zone-a"]
- id: ord-2 address: 10.0.2.21 port: 8080 health: warning tags: ["v2","zone-b"]
- id: ord-3 address: 10.0.3.33 port: 8080 health: passing tags: ["v1","zone-b"]
// Caller logic (pseudocode)
instances = registry.query(service="orders")
healthy = filter(instances, i => i.health == "passing" && hasTag(i, "v2"))
choice = leastConnections(healthy)
call(choice.address, choice.port)
Result: ord-1 is selected. ord-2 is avoided due to degraded health; ord-3 is avoided due to version mismatch.
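The same selection logic can be run as a small Python sketch. The records mirror the conceptual registry entries above; `active_conns` is a hypothetical client-maintained counter used for least-connections:

```python
# Instance records mirroring the conceptual registry entries above.
instances = [
    {"id": "ord-1", "address": "10.0.1.10", "port": 8080,
     "health": "passing", "tags": {"v2", "zone-a"}, "active_conns": 4},
    {"id": "ord-2", "address": "10.0.2.21", "port": 8080,
     "health": "warning", "tags": {"v2", "zone-b"}, "active_conns": 1},
    {"id": "ord-3", "address": "10.0.3.33", "port": 8080,
     "health": "passing", "tags": {"v1", "zone-b"}, "active_conns": 0},
]

# Filter for health and version, then pick by least-connections.
healthy_v2 = [i for i in instances
              if i["health"] == "passing" and "v2" in i["tags"]]
choice = min(healthy_v2, key=lambda i: i["active_conns"])
print(choice["id"])  # ord-1
```

Note that ord-3 has the fewest connections but is never considered: metadata filtering runs before load balancing.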
Example 2: Server-side discovery via gateway
- Client calls api-gateway at a stable address.
- Gateway asks registry for healthy inventory instances in the same zone as the caller.
- Gateway applies weighted round-robin (80% to v2, 20% to canary v3).
- On failure, gateway retries once with jitter to a different instance.
Benefit: client stays simple; rollout control and retries are centralized.
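The gateway's weighted split can be sketched in a few lines of Python; the backend IDs are illustrative, and `random.choices` does the 80/20 weighting:

```python
import random

# Two pools: stable v2 backends and a v3 canary (IDs are made up).
pools = {"v2": ["inv-1", "inv-2"], "v3-canary": ["inv-canary"]}

def pick_backend(rng=random):
    # Weighted choice of pool first (80% v2, 20% canary), then an instance.
    pool = rng.choices(["v2", "v3-canary"], weights=[80, 20])[0]
    return rng.choice(pools[pool])

# Over many picks, roughly 80% land on v2 backends.
picks = [pick_backend() for _ in range(10_000)]
v2_share = sum(p != "inv-canary" for p in picks) / len(picks)
print(round(v2_share, 1))  # 0.8
```

Ramping the canary is then just a change to the weights, which is why centralizing this in the gateway makes rollouts easy to control.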
Example 3: DNS SRV-based discovery
_orders._tcp.svc.company.local 60 IN SRV 10 50 8080 ord-1.svc.company.local
_orders._tcp.svc.company.local 60 IN SRV 10 50 8080 ord-2.svc.company.local
Client resolves SRV records to get target and port. TTL=60s means changes propagate within a minute. Clients should avoid long-lived DNS caches and re-resolve periodically.
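SRV-style selection can be sketched in Python: pick among the records with the lowest priority, weighted by the weight field. The tuples mirror the (priority, weight, port, target) layout above; the backup record is hypothetical, added to show that higher-priority-number records are used only as fallback:

```python
import random

# (priority, weight, port, target) tuples, as in the SRV records above.
records = [
    (10, 50, 8080, "ord-1.svc.company.local"),
    (10, 50, 8080, "ord-2.svc.company.local"),
    (20, 100, 8080, "ord-backup.svc.company.local"),  # hypothetical fallback
]

def pick_srv(recs, rng=random):
    # Only records at the lowest priority value compete; weight splits traffic.
    lowest = min(r[0] for r in recs)
    candidates = [r for r in recs if r[0] == lowest]
    chosen = rng.choices(candidates, weights=[r[1] for r in candidates])[0]
    return chosen[3], chosen[2]  # (target, port)

target, port = pick_srv(records)
print(port)  # 8080
```

With equal weights of 50, ord-1 and ord-2 each get about half the traffic; the backup is never selected while a priority-10 record exists.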
Example 4: Readiness vs liveness
A service passes liveness but fails readiness during startup while warming caches. Discovery should exclude it until readiness passes, preventing cold starts from serving traffic prematurely.
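The routing rule that falls out of this distinction is small but worth making explicit. A sketch with illustrative field names:

```python
# An instance can be live (process up) but not ready (caches still cold).
# Discovery should route only when both are true.
def routable(instance):
    return instance["live"] and instance["ready"]

warming = {"id": "svc-1", "live": True, "ready": False}  # still warming caches
warmed  = {"id": "svc-2", "live": True, "ready": True}

print([i["id"] for i in (warming, warmed) if routable(i)])  # ['svc-2']
```

Liveness alone answers "should this instance be restarted?"; readiness answers "should it receive traffic?" — discovery cares about the second.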
Exercises
Do these in order. You can check solutions below each task.
Exercise 1: Design registry entries with safe rollout
Create registry records for a service named catalog with three instances. Two are v2 in different zones; one is a canary v3. Only healthy instances should be routable. Include fields: id, address, port, health, tags (version, zone, canary yes/no).
- Include one failing instance and explain why it should be excluded.
- Mark the canary so callers can weight it lower.
Show solution
catalog/instances:
- id: cat-a address: 10.1.0.11 port: 7000 health: passing tags: ["v2","zone-a","canary:no"]
- id: cat-b address: 10.1.1.22 port: 7000 health: critical tags: ["v2","zone-b","canary:no"]
- id: cat-c address: 10.1.2.33 port: 7001 health: passing tags: ["v3","zone-b","canary:yes"]
// Routing
// Exclude cat-b due to critical health.
// Weight: cat-a 80%, cat-c 20% (canary).
Exercise 2: Implement picker with timeouts and retries
Write pseudocode for a client-side discovery picker that:
- Filters for health=passing and tag version=v2.
- Uses power-of-two-choices to select an instance by least-connections.
- Applies a 300ms request timeout and at most 1 retry to a different instance with exponential backoff (200ms base + jitter).
Show solution
instances = registry.query("checkout")
candidates = filter(instances, i => i.health == "passing" && hasTag(i, "v2"))

function pickPow2(cs) {
  a = randomChoice(cs); b = randomChoice(cs)
  return (a.activeConns <= b.activeConns) ? a : b
}

function callWithRetry(req) {
  target = pickPow2(candidates)  // hoisted so the retry can exclude it
  try {
    return httpCall(target, req, timeout=300ms)
  } catch (e) {
    backoff = 200ms + random(0..50ms)  // base + jitter
    sleep(backoff)
    alt = pickPow2(candidates.filter(c => c.id != target.id))
    return httpCall(alt, req, timeout=300ms)
  }
}
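An executable Python version of the same solution, for checking your own answer against. `http_call` is faked so the example is self-contained; the "checkout" instances and their connection counts are made up:

```python
import random
import time

# Made-up checkout instances; chk-3 is filtered out by the version tag.
candidates = [
    {"id": "chk-1", "health": "passing", "tags": {"v2"}, "active_conns": 3},
    {"id": "chk-2", "health": "passing", "tags": {"v2"}, "active_conns": 1},
    {"id": "chk-3", "health": "passing", "tags": {"v1"}, "active_conns": 0},
]
candidates = [c for c in candidates
              if c["health"] == "passing" and "v2" in c["tags"]]

def pick_pow2(cs, rng=random):
    # Power-of-two-choices: sample two, keep the less loaded one.
    a, b = rng.choice(cs), rng.choice(cs)
    return a if a["active_conns"] <= b["active_conns"] else b

def http_call(target, req, timeout):
    # Stand-in for a real HTTP client call; chk-1 simulates a slow instance.
    if target["id"] == "chk-1":
        raise TimeoutError("simulated timeout")
    return f"200 OK from {target['id']}"

def call_with_retry(req, rng=random):
    target = pick_pow2(candidates, rng)
    try:
        return http_call(target, req, timeout=0.3)  # 300ms budget
    except TimeoutError:
        time.sleep(0.2 + rng.uniform(0, 0.05))      # 200ms base + jitter
        others = [c for c in candidates if c["id"] != target["id"]]
        alt = pick_pow2(others, rng)
        return http_call(alt, req, timeout=0.3)

print(call_with_retry({"path": "/pay"}))  # 200 OK from chk-2
```

Whichever v2 instance is picked first, the call ends at chk-2: either directly, or after one backed-off retry when chk-1 times out.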
Self-check checklist
- I excluded non-ready or failing instances from routing.
- I used metadata (tags) to control version or canary traffic.
- I picked a balancing strategy and could explain why.
- I set explicit timeouts and bounded retries with jitter.
- I considered zone awareness when applicable.
Common mistakes and how to catch them
- Relying on IP lists in config: Instances change. Self-check: does deployment require a config update to add capacity? If yes, fix discovery.
- Ignoring readiness: New pods get traffic too early. Self-check: simulate cold start; does error rate spike? Gate routing by readiness.
- Unbounded retries: Amplifies outages. Self-check: ensure max attempts and total timeout budget are enforced.
- Sticky caches: DNS or client caches never refresh. Self-check: rotate an instance; verify callers discover changes within TTL.
- No metadata: Can’t canary or region-route. Self-check: are version/zone tags available and validated?
- Single registry without HA: Discovery becomes a single point of failure. Self-check: simulate registry node loss; does discovery continue?
Practical projects
- Blue/Green + Canary: Tag services with v2 and canary:yes, route 10% to canary, then ramp to 100% while watching error rates.
- Zone-aware routing: Prefer same-zone instances and only cross-zone on saturation. Measure latency improvements.
- Outlier detection: Eject an instance that exceeds error threshold for 1 minute; auto-return on recovery.
Learning path
- Learn client-side vs server-side discovery trade-offs.
- Add health checks and readiness gates to your services.
- Introduce metadata tags to control routing (version, region, canary).
- Implement timeouts, retries with backoff, and circuit breaking.
- Practice DNS-based discovery and understand TTL behavior.
- Explore service mesh features if you need mTLS and policy centralization.
Next steps
- Instrument discovery decisions with logs/metrics (chosen instance, retry counts, timeouts).
- Practice failure injection: kill an instance and confirm traffic shifts automatically.
- Move on to request routing and API gateway patterns.
Mini challenge
Write a short policy for a staging environment:
- Route 5% of traffic to v3 canary when error rate < 1% over 5 minutes; otherwise fall back to 0%.
- Prefer same-zone traffic; cross-zone only if all same-zone instances exceed 80% connection load.
- Global timeout budget per request: 600ms, with at most 1 retry.
Hint
Express it as metadata filters plus a weighted policy, then define guardrails (error threshold, connection load) and a fallback.