Who this is for
- Backend and platform engineers deploying services to containers, VMs, or serverless.
- Developers responsible for keeping services healthy behind load balancers or orchestrators like Kubernetes.
- Engineers preparing for on-call rotations and reliability-focused interviews.
Prerequisites
- Basic HTTP knowledge (status codes, routes).
- Familiarity with your app's dependencies (database, cache, queues).
- Optional but helpful: container basics and Kubernetes or another orchestrator.
Why this matters
Health checks and readiness probes decide if traffic should be sent to your service and when it should restart. Well-designed checks prevent incidents like:
- New versions receiving traffic before migrations finish.
- Crash loops caused by liveness checks that are too strict.
- Healthy pods staying in rotation during dependency outages, causing timeouts for users.
Real tasks you will do on the job
- Expose /healthz and /readyz endpoints with clear signals.
- Configure Kubernetes liveness, readiness, and startup probes with sensible timeouts and thresholds.
- Coordinate graceful shutdown so in-flight requests finish before the pod is removed from rotation.
- Differentiate critical vs optional dependencies in readiness logic.
Concept explained simply
Think of your service as a restaurant:
- Liveness: “Is the building still standing and staff awake?” If not, restart.
- Readiness: “Are we open for customers right now?” If the oven (DB) is broken, don’t seat new guests.
- Startup: “Are we fully prepared to open?” Don’t take customers while preheating or setting tables.
Mental model
- Liveness checks should be minimal and fast, focused on the process itself.
- Readiness checks determine if the instance can serve traffic safely given its dependencies.
- Startup checks gate the initial warm-up so liveness/readiness don’t flap during boot.
Key definitions and signals
- Liveness: Return 200 if the process is responsive and event loop/thread pool is not deadlocked. Avoid heavy work.
- Readiness: Return 200 only if critical dependencies are available and the instance is ready to accept traffic. Otherwise 503.
- Startup: Return 200 once migrations, caches, and warm-up are done; until then return 503.
- HTTP codes: 200/204 for healthy; 503 for temporarily not ready; 500 for internal failure (use sparingly for liveness).
- Latency budgets: Keep checks under 100–300 ms to avoid self-inflicted timeouts.
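The latency-budget point can be sketched with a plain TCP connect that is capped at 300 ms. This is a hedged example, not a prescribed library: `ping_tcp`, its host/port arguments, and the 0.3 s default are illustrative placeholders for whatever dependency probe your service actually uses.

```python
import socket

def ping_tcp(host: str, port: int, timeout: float = 0.3) -> bool:
    """Cheap dependency probe: attempt a TCP connect within the latency budget.

    Returns False on refusal or timeout instead of hanging, so the health
    endpoint that calls it stays fast even when the dependency is down.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeout, DNS failure
        return False
```

A real readiness handler would call something like this per dependency and still answer well inside the probe's `timeoutSeconds`.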
Design principles
- Keep liveness dumb and cheap. Don’t call external systems.
- Make readiness reflect user impact. Only include dependencies that must be healthy for safe traffic.
- Use startup probes during warm-up (migrations, JIT, cache warm) to prevent early restarts.
- Prefer fast failure to hanging. Put tight timeouts on every dependency check so a failing dependency makes readiness return 503 quickly instead of blocking the probe.
- Fail closed for critical dependencies, fail open for optional ones (return degraded=true in body, but 200 status).
- Include build/version and uptime in responses to aid debugging.
- During shutdown: mark not-ready first, wait out the drain period, then stop.
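The fail-closed/fail-open principle can be captured in one pure function. This is a minimal sketch under assumed conventions: `checks` maps a dependency name to an `(ok, critical)` pair, and the shape of the JSON body is illustrative, not a standard.

```python
def readiness(checks: dict) -> tuple[int, dict]:
    """Aggregate dependency checks into an HTTP status plus diagnostic body.

    Critical failures fail closed (503, pull the instance from rotation);
    optional failures fail open (200 with a degraded flag in the body).
    """
    body = {"status": "ok", "deps": {}}
    status = 200
    for name, (ok, critical) in checks.items():
        body["deps"][name] = "up" if ok else "down"
        if not ok:
            if critical:
                status = 503
                body["status"] = "unavailable"
            elif status == 200:
                body["status"] = "degraded"
                body["degraded"] = True
    return status, body
```

Keeping this logic as a pure function makes the fail-closed vs fail-open decision trivially unit-testable, separate from any HTTP framework.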
Worked examples
Example 1: REST API depending on Postgres (critical) and Redis (optional)
Endpoints:
- GET /healthz: 200 if event loop responsive and self-check ok.
- GET /readyz: 200 if Postgres ping <100 ms and pool has spare connections; Redis optional.
- GET /startupz: 200 only after migrations complete and HTTP server finished warm-up.
Behaviors:
- Postgres down: /healthz=200 (stay alive), /readyz=503 (stop receiving traffic).
- Redis down: /readyz=200 with body {"degraded": true, "redis": "down"}.
- During migrations: /startupz=503 so Kubernetes waits before checking readiness.
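The three behaviors above can be sketched end to end with the standard library. This is a hedged toy, not production code: the `state` dict simulates dependency health (real handlers would run live pings with tight timeouts), and the port, version string, and JSON shapes are illustrative.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.monotonic()

# Simulated dependency state; flip these to exercise the behaviors.
state = {"postgres_up": True, "redis_up": True, "migrations_done": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: process-only, never touches a dependency.
            self._reply(200, {"status": "ok",
                              "uptime": round(time.monotonic() - START, 1),
                              "version": "1.2.3"})
        elif self.path == "/readyz":
            body = {"postgres": "up" if state["postgres_up"] else "down",
                    "redis": "up" if state["redis_up"] else "down"}
            if not state["postgres_up"]:       # critical: fail closed
                self._reply(503, body)
            else:                              # optional: fail open
                body["degraded"] = not state["redis_up"]
                self._reply(200, body)
        elif self.path == "/startupz":
            self._reply(200 if state["migrations_done"] else 503, {})
        else:
            self._reply(404, {})

    def _reply(self, code, body):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep probe traffic out of the logs
        pass

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()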
Example 2: Worker consuming from a queue
- Liveness: 200 if worker loop heartbeat updated within last 10s.
- Readiness: 200 if connected to queue and can ack a test message or declare a heartbeat queue; otherwise 503.
- Startup: 200 after handlers registered and backoff scheduler initialized.
Why: The worker can be alive but not ready if the queue is unreachable.
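The heartbeat idea for the worker can be sketched as a tiny class. Names (`Heartbeat`, `beat`, `alive`) and the 10 s window are assumptions for illustration; the point is that liveness reads a timestamp the loop updates, rather than calling the queue.

```python
import time

class Heartbeat:
    """Liveness signal for a worker loop.

    The loop calls beat() each iteration; the /healthz handler calls alive().
    A deadlocked loop stops beating, so liveness fails after max_age seconds,
    while a queue outage leaves liveness green and only readiness failing.
    """

    def __init__(self, max_age: float = 10.0):
        self.max_age = max_age
        self.last = time.monotonic()

    def beat(self) -> None:
        self.last = time.monotonic()

    def alive(self) -> bool:
        return time.monotonic() - self.last < self.max_age
```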
Example 3: Feature flags and third-party API
- Third-party payments are not critical for most endpoints.
- Readiness returns 200 if core DB is healthy; includes field payments_api: "degraded" when failing.
- Route-level checks: Optionally guard payment routes with a fast preflight check or circuit breaker.
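A route-level guard for the payments API could look like the following consecutive-failure breaker. This is a deliberately minimal sketch (class name, threshold, and cooldown are invented for illustration); real deployments often reach for a library instead.

```python
import time

class CircuitBreaker:
    """Tiny breaker for an optional dependency such as a payments API.

    After `threshold` consecutive failures it opens for `cooldown` seconds;
    guarded routes then fail fast or serve a degraded response, while overall
    readiness stays 200 because the dependency is optional.
    """

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.threshold:
            return True
        # Half-open: permit one probe call once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```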
Implementation patterns
HTTP responses to aim for
GET /healthz -> 200 {"status":"ok","uptime":1234,"version":"1.2.3"}
GET /readyz -> 200 or 503; body includes dependency states
GET /startupz -> 200 when warm; 503 during boot/migrations
Kubernetes probe template (tune values)
livenessProbe:
  httpGet: { path: "/healthz", port: 8080 }
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet: { path: "/readyz", port: 8080 }
  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet: { path: "/startupz", port: 8080 }
  failureThreshold: 30
  periodSeconds: 5
Graceful shutdown (sequence)
- Receive SIGTERM.
- Start returning 503 on /readyz.
- Stop accepting new requests; keep serving in-flight.
- After drain timeout, exit cleanly.
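The sequence above can be sketched with the standard `signal` module. This is a simplified illustration: the flag names and `DRAIN_SECONDS` are assumptions, and a real server would drain by stopping its accept loop rather than sleeping inside the handler.

```python
import signal
import threading
import time

ready = threading.Event()
ready.set()                  # /readyz returns 200 while this is set
stopped = threading.Event()  # the serve loop exits when this is set

# Should exceed readiness periodSeconds x failureThreshold so every load
# balancer has observed not-ready before the process exits.
DRAIN_SECONDS = 10

def handle_sigterm(signum, frame):
    # 1. Flip readiness so the orchestrator stops routing new traffic.
    ready.clear()
    # 2. Wait while in-flight requests finish and probes observe not-ready.
    time.sleep(DRAIN_SECONDS)
    # 3. Tell the server loop to exit cleanly.
    stopped.set()

signal.signal(signal.SIGTERM, handle_sigterm)
```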
Common mistakes
- Putting DB calls in liveness: causes restart loops during transient DB outages. Fix: DB checks belong in readiness.
- Skipping the startup probe: liveness can kill slow-booting pods mid-migration, and readiness flapping during boot triggers unnecessary rollout failures.
- Slow checks: a 2s readiness check can cascade into timeouts under load. Keep checks under ~100–300 ms.
- Binary readiness only: return 200/503 plus a body with per-dependency states for diagnosis.
- Not handling shutdown: pods keep receiving traffic and drop requests. Always mark not ready before exit.
Self-check
- Can you kill Redis and still keep readiness=200 (degraded) and user-critical paths working?
- Does a DB outage switch readiness to 503 quickly (under your SLO) without restarting the pod?
- On deploy, do you avoid traffic until warm-up and migrations complete?
Exercises
Do these hands-on tasks.
Exercise 1: Design health endpoints for an API
Service: HTTP API with critical Postgres, optional Redis cache, and optional Payments API.
- Define responses for /healthz, /readyz, /startupz.
- Decide which dependencies affect each endpoint.
- Sketch a Kubernetes probe config (timeouts, periods, thresholds).
Checklist
- Healthz is dependency-free and fast.
- Readyz fails fast when Postgres is down.
- Startupz blocks until migrations and warm-up complete.
- Probe timeouts <= 1s; readiness period ~5s; liveness period ~10s.
Exercise 2: Worker readiness logic
Service: Background worker pulling jobs from a queue (critical). Metrics endpoint is on port 9090.
- Propose how /healthz and /readyz should behave.
- Write a probe config ensuring the worker is removed from rotation if it loses queue connectivity.
- Explain how you detect the worker loop is alive (no deadlock).
Checklist
- Liveness checks internal heartbeat, not the queue.
- Readiness depends on queue connectivity.
- Startup waits for handler registration and backoff scheduler.
Practical projects
- Project A: Build a tiny service exposing /healthz, /readyz, /startupz. Simulate DB up/down with a flag and observe Kubernetes probe behavior locally (kind/minikube or docker-compose healthchecks).
- Project B: Add graceful shutdown. On SIGTERM, flip readiness to 503, wait 10 seconds, then exit. Validate no requests are dropped with a load generator.
Learning path
- Implement basic /healthz with uptime and version.
- Add /readyz that checks only critical dependencies with tight timeouts.
- Add /startupz to gate warm-up and migrations.
- Configure orchestrator probes; test failure modes (DB down, cache down, third-party down).
- Implement graceful shutdown and verify draining.
- Instrument metrics (counters for readiness failures, check latencies) to observe behavior.
Next steps
- Extend readiness to report per-dependency status in JSON.
- Add circuit breakers around flaky dependencies; keep readiness focused on truly critical ones.
- Define alerting on readiness flap rate and liveness restarts.
Mini challenge
Your API has a cron-triggered daily DB maintenance lasting 2 minutes (queries slow to 2s). How do you avoid timeouts for users without restarting pods? Write the readiness behavior, probe timeouts/periods, and any feature flags or degraded modes you will use.