Who this is for
Platform Engineers and Backend Engineers who need reliable, automated ways for services to find and talk to each other across environments (local, staging, production, multi-cluster, or multi-region).
Prerequisites
- Basic networking: IP, DNS, ports, TCP/HTTP.
- Containers and orchestration basics (Docker, Kubernetes concepts).
- Familiarity with load balancing and health checks.
Why this matters
Real platform tasks depend on service discovery:
- Rolling out new microservice versions without breaking callers.
- Autoscaling instances and letting clients find the new ones quickly.
- Failing over between zones/regions during incidents.
- Routing traffic to healthy instances only.
- De-risking blue/green and canary releases.
Concept explained simply
Service discovery is how one service (the caller) finds the network location of another service (the callee) at runtime. In dynamic environments, IPs change often; discovery gives you a stable name that resolves to the right, healthy endpoints.
Mental model
Think of a phone book that updates itself. Services register their current number (IP:port) and health. Clients look up a name (like payments) and get a current list of healthy numbers. A load balancer or the client chooses one to call.
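To make the phone-book model concrete, the sketch below shows a minimal in-memory registry in Python (illustrative only, not any particular product's API): instances register their address and health, and callers look up healthy endpoints by name.

import random

# name -> list of {"addr": "ip:port", "healthy": bool}
registry = {}

def register(name, addr):
    registry.setdefault(name, []).append({"addr": addr, "healthy": True})

def lookup(name):
    # Like the phone book: return only the currently healthy numbers for a name.
    return [e["addr"] for e in registry.get(name, []) if e["healthy"]]

register("payments", "10.0.12.34:8080")
register("payments", "10.0.12.35:8080")
target = random.choice(lookup("payments"))  # the caller picks one healthy endpoint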
Key building blocks
Naming
Stable names (e.g., payments) map to dynamic endpoints. In Kubernetes, a Service name becomes a DNS name (e.g., payments.default.svc.cluster.local).
Service registry
A database of service instances and their health. Examples in practice: Kubernetes Endpoints/EndpointSlice, Consul catalog, etcd-backed systems, or service mesh catalogs.
Health and liveness
Registries use health checks (HTTP/TCP checks, heartbeats, TTLs) to include only healthy instances. Unhealthy instances are removed until they recover.
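One common mechanism is a TTL on heartbeats: an instance stays in lookups only while it keeps checking in. A rough sketch (illustrative Python; real registries such as Consul implement checks differently):

import time

TTL_SECONDS = 15      # drop an instance if it has not checked in within this window
last_heartbeat = {}   # "ip:port" -> timestamp of last heartbeat

def heartbeat(addr):
    last_heartbeat[addr] = time.time()

def healthy_endpoints():
    now = time.time()
    # Only instances that checked in recently are handed to callers.
    return [addr for addr, t in last_heartbeat.items() if now - t <= TTL_SECONDS]

heartbeat("10.0.12.34:8080")
print(healthy_endpoints())  # the instance disappears once heartbeats stop for TTL_SECONDS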
Discovery patterns
- Client-side discovery: Clients query the registry/DNS, pick an instance, and connect. Pro: flexible; Con: client must implement logic.
- Server-side discovery: Clients call a stable load balancer or proxy; the proxy looks up endpoints and forwards. Pro: centralized logic; Con: extra hop.
- DNS-based discovery: Clients use DNS A/SRV records. Simple and widely supported; use TTLs thoughtfully to control staleness (see the resolution sketch after this list).
- Service mesh: Sidecar proxies (e.g., Envoy) handle discovery, mTLS, retries, and traffic policies on behalf of the app.
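For the DNS-based pattern, clients should honor the record's TTL instead of caching endpoints indefinitely. A minimal sketch, assuming the third-party dnspython package and an in-cluster resolver (the service name is illustrative):

import time
import dns.resolver  # third-party package: dnspython

_cache = {"addrs": [], "expires": 0.0}

def resolve_payments():
    # Re-resolve only after the cached answer's TTL has elapsed.
    if time.time() >= _cache["expires"]:
        answer = dns.resolver.resolve("payments.default.svc.cluster.local", "A")
        _cache["addrs"] = [r.address for r in answer]
        _cache["expires"] = time.time() + answer.rrset.ttl
    return _cache["addrs"]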
Resilience policies
Discovery works best with: timeouts, retries with jitter, circuit breakers, load balancing (round-robin, least-request), and backoff during failures.
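A minimal timeout-plus-retry sketch using only the Python standard library (the URL, attempt count, and backoff values are illustrative):

import random
import time
import urllib.request

def call_with_retries(url, attempts=3, timeout=0.5):
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:  # URLError, HTTPError, and socket timeouts are all OSError subclasses
            if attempt == attempts - 1:
                raise
            # Exponential backoff with full jitter avoids synchronized retry storms.
            time.sleep(random.uniform(0, 0.1 * (2 ** attempt)))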
Consistency and freshness
Registries may be eventually consistent. Expect brief staleness. Use health checks, short-but-safe TTLs, and retry policies to cope with changes.
Security
Combine discovery with auth/mTLS so only authorized services resolve and connect to backends. Limit who can register endpoints.
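For example, a caller can present a client certificate so backends accept connections only from workloads holding a valid identity. A sketch with the Python standard library (the certificate paths and endpoint are illustrative assumptions):

import ssl
import urllib.request

def mtls_context(ca, cert, key):
    # The client proves its identity with a certificate the backend is configured to trust.
    ctx = ssl.create_default_context(cafile=ca)
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    return ctx

# Usage (paths are illustrative):
# ctx = mtls_context("/etc/certs/ca.pem", "/etc/certs/client.pem", "/etc/certs/client-key.pem")
# urllib.request.urlopen("https://payments:8443/health", context=ctx, timeout=0.5)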
Worked examples
Example 1: Kubernetes Service (ClusterIP) with DNS
- Create a Deployment for payments with 3 replicas.
- Create a ClusterIP Service named payments on port 8080.
- Clients call http://payments:8080 inside the same namespace. Kubernetes DNS resolves payments to the Service IP, and kube-proxy forwards to healthy pods.
Sample manifests
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
    spec:
      containers:
        - name: svc
          image: payments:1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet: { path: /ready, port: 8080 }
---
apiVersion: v1
kind: Service
metadata:
  name: payments
spec:
  selector:
    app: payments
  ports:
    - name: http
      port: 8080
      targetPort: 8080
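From another pod in the same namespace, a caller can hit the Service name directly; a minimal sketch using the Python standard library (the /health path is an illustrative assumption):

import urllib.request

# "payments" resolves via cluster DNS to the Service's ClusterIP;
# kube-proxy then forwards the connection to a ready pod.
def check_payments():
    with urllib.request.urlopen("http://payments:8080/health", timeout=0.5) as resp:
        return resp.status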
Example 2: Kubernetes headless Service with SRV
For stateful services (e.g., databases), use a headless Service to return pod records directly so clients can connect to specific instances.
Sample manifest
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None
  selector:
    app: db
  ports:
    - name: tcp
      port: 5432
      targetPort: 5432
DNS returns A records for each pod (e.g., db-0.db, db-1.db). Stateful clients can select primaries/replicas as needed.
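A client can enumerate those per-pod records with a plain DNS lookup; a sketch using the Python standard library (assumes it runs in the same namespace as the db Service):

import socket

# For a headless Service, "db" resolves to one A record per ready pod,
# so the client sees every instance instead of a single virtual IP.
def db_pod_ips():
    infos = socket.getaddrinfo("db", 5432, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})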
Example 3: Client-side discovery with a registry (Consul-like)
- Each service registers itself with name, address, port, tags (e.g., version=2), and a health check.
- Clients query the registry (or local sidecar) for healthy instances of payments.
- Client picks one using least-request and calls it. On failures, it retries with jitter (see the selection sketch after the payload below).
Example registration payload
{
  "Name": "payments",
  "Address": "10.0.12.34",
  "Port": 8080,
  "Tags": ["version=2"],
  "Check": {"HTTP": "http://10.0.12.34:8080/health", "Interval": "5s"}
}
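The selection step from the flow above could look like the following sketch: filter to healthy instances and prefer the one with the fewest in-flight requests (illustrative Python, not Consul's client API; the in_flight counter is an assumed field maintained by the caller):

def pick_least_request(instances):
    # instances: list of {"addr": "ip:port", "healthy": bool, "in_flight": int}
    healthy = [i for i in instances if i["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy instances registered")
    # Least-request: prefer the instance currently handling the fewest calls.
    return min(healthy, key=lambda i: i["in_flight"])

instances = [
    {"addr": "10.0.12.34:8080", "healthy": True, "in_flight": 2},
    {"addr": "10.0.12.35:8080", "healthy": True, "in_flight": 0},
    {"addr": "10.0.12.36:8080", "healthy": False, "in_flight": 1},
]
print(pick_least_request(instances)["addr"])  # -> 10.0.12.35:8080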
Hands-on exercises
Exercise 1: Design a safe discovery plan in Kubernetes
Draft a plan for three services (api, payments, notifications) with suitable Service types, selectors, and health probes. Choose DNS names and explain how clients resolve them.
Exercise 2: Implement client-side choice logic
Given a small in-memory registry, decide which instance to call next, considering health and weights. Show your selection algorithm.
Self-check checklist
- You used readiness probes to keep failing pods out of rotation.
- You picked TTLs/refresh intervals that balance freshness and stability.
- You included retry with jitter and a sensible timeout.
- You considered version-aware routing for canaries.
- You planned for partial registry staleness.
Common mistakes and how to self-check
- Using liveness probes instead of readiness probes for traffic gating. Self-check: Are pods that fail startup still receiving traffic? Use readiness for routing decisions.
- Ignoring TTLs. Self-check: Do clients cache DNS for too long and miss new pods? Reduce TTL or increase refresh frequency.
- No timeouts/retries. Self-check: Do rare blips cause long client hangs? Add short timeouts and retry with jitter.
- Hardcoding IPs. Self-check: Any configuration with literal pod IPs? Replace with names and selectors.
- Single-zone thinking. Self-check: Can traffic shift if one zone fails? Use topology-aware routing or cross-zone endpoints.
Practical projects
- Blue/Green via labels:
  - Run payments v1 and v2 behind a single Service.
  - Use labels (version=v1/v2) and a mesh or ingress rule to shift 10% of traffic to v2.
  - Verify discovery updates as replicas scale.
- Headless DB with client pinning:
  - Deploy a 3-replica StatefulSet database.
  - Use a headless Service and teach the app to pin to db-0 for writes and the other replicas for reads.
  - Simulate a pod restart and verify failover.
- Consul-style local registry cache:
  - Write a small sidecar that polls a registry endpoint and exposes /endpoints locally.
  - Have clients query localhost for discovery to reduce central load.
  - Add an ETag or version field to handle incremental updates.
Learning path
- Revise DNS basics (A, AAAA, SRV, TTL).
- Learn Kubernetes Services, Endpoints/EndpointSlice, and readiness probes.
- Study client-side vs server-side discovery; add retries and timeouts.
- Introduce version-aware routing (blue/green, canary).
- Explore service mesh patterns (sidecars, mTLS, traffic policies).
Next steps
- Complete the quick test to validate understanding.
- Implement one practical project in a dev cluster.
- Document your team’s standard for discovery (naming, TTL, probes, retries).
Mini challenge
Your app calls search and profile services. During a partial outage, 30% of search pods fail readiness. Outline, in 5–7 bullet points, how your client should behave (timeouts, retries, fallback) and how the registry/DNS should reflect the change.