
Topic 6 of 8

Health Checks And Readiness Probes

Learn Health Checks And Readiness Probes for free with explanations, exercises, and a quick test (for Backend Engineer).

Published: January 20, 2026 | Updated: January 20, 2026

Who this is for

  • Backend and platform engineers deploying services to containers, VMs, or serverless.
  • Developers responsible for keeping services healthy behind load balancers or orchestrators like Kubernetes.
  • Engineers preparing for on-call rotations and reliability-focused interviews.

Prerequisites

  • Basic HTTP knowledge (status codes, routes).
  • Familiarity with your app's dependencies (database, cache, queues).
  • Optional but helpful: container basics and Kubernetes or another orchestrator.

Why this matters

Health checks and readiness probes decide if traffic should be sent to your service and when it should restart. Well-designed checks prevent incidents like:

  • New versions receiving traffic before migrations finish.
  • Crash loops caused by liveness checks that are too strict.
  • Healthy pods staying in rotation during dependency outages, causing timeouts for users.

Real tasks you will do on the job
  • Expose /healthz and /readyz endpoints with clear signals.
  • Configure Kubernetes liveness, readiness, and startup probes with sensible timeouts and thresholds.
  • Coordinate graceful shutdown so in-flight requests finish before the pod is removed from rotation.
  • Differentiate critical vs optional dependencies in readiness logic.

Concept explained simply

Think of your service as a restaurant:

  • Liveness: “Is the building still standing and staff awake?” If not, restart.
  • Readiness: “Are we open for customers right now?” If the oven (DB) is broken, don’t seat new guests.
  • Startup: “Are we fully prepared to open?” Don’t take customers while preheating or setting tables.

Mental model

  • Liveness checks should be minimal and fast, focused on the process itself.
  • Readiness checks determine if the instance can serve traffic safely given its dependencies.
  • Startup checks gate the initial warm-up so liveness/readiness don’t flap during boot.

Key definitions and signals

  • Liveness: Return 200 if the process is responsive and the event loop/thread pool is not deadlocked. Avoid heavy work; a minimal handler is sketched after this list.
  • Readiness: Return 200 only if critical dependencies are available and the instance is ready to accept traffic. Otherwise 503.
  • Startup: Return 200 once migrations, caches, and warm-up are done; until then return 503.
  • HTTP codes: 200/204 for healthy; 503 for temporarily not ready; 500 for internal failure (use sparingly for liveness).
  • Latency budgets: Keep checks under 100–300 ms to avoid self-inflicted timeouts.
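
To make the liveness signal concrete, here is a minimal sketch in TypeScript using Node's built-in http module. The version source and port are placeholder assumptions, not tied to any particular framework.

import http from "node:http";

const startedAt = Date.now();
const appVersion = process.env.APP_VERSION ?? "dev"; // assumed version source

// Liveness: no dependency calls, just proof the process can still answer quickly.
const server = http.createServer((req, res) => {
  if (req.url === "/healthz") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify({
      status: "ok",
      uptime: Math.floor((Date.now() - startedAt) / 1000),
      version: appVersion,
    }));
    return;
  }
  res.writeHead(404);
  res.end();
});

server.listen(8080);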

Design principles

  • Keep liveness dumb and cheap. Don’t call external systems.
  • Make readiness reflect user impact. Only include dependencies that must be healthy for safe traffic.
  • Use startup probes during warm-up (migrations, JIT, cache warm) to prevent early restarts.
  • Use short timeouts on dependency checks: a failing dependency should produce a fast 503 instead of a hanging probe.
  • Fail closed for critical dependencies and fail open for optional ones (return degraded=true in the body but keep the 200 status); the readiness sketch after this list shows both cases.
  • Include build/version and uptime in responses to aid debugging.
  • During shutdown: mark not-ready first, wait drain period, then stop.
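
The fail-closed/fail-open split is the part teams most often get wrong, so here is a sketch of readiness logic in TypeScript. The pingPostgres and pingRedis helpers are hypothetical stand-ins for real dependency checks against your own pool and client.

// Hypothetical dependency checks; replace with real pings that honor a tight timeout.
async function pingPostgres(timeoutMs: number): Promise<boolean> { return true; }
async function pingRedis(timeoutMs: number): Promise<boolean> { return true; }

interface ReadinessResult {
  statusCode: 200 | 503;
  body: { ready: boolean; degraded: boolean; postgres: string; redis: string };
}

// Fail closed on Postgres (critical), fail open on Redis (optional).
export async function checkReadiness(): Promise<ReadinessResult> {
  const [pgOk, redisOk] = await Promise.all([pingPostgres(100), pingRedis(100)]);

  if (!pgOk) {
    return {
      statusCode: 503,
      body: { ready: false, degraded: true, postgres: "down", redis: redisOk ? "ok" : "down" },
    };
  }
  return {
    statusCode: 200,
    body: { ready: true, degraded: !redisOk, postgres: "ok", redis: redisOk ? "ok" : "down" },
  };
}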

Worked examples

Example 1: REST API depending on Postgres (critical) and Redis (optional)

Endpoints:

  • GET /healthz: 200 if event loop responsive and self-check ok.
  • GET /readyz: 200 if Postgres ping <100 ms and pool has spare connections; Redis optional.
  • GET /startupz: 200 only after migrations complete and HTTP server finished warm-up.

Behaviors:

  • Postgres down: /healthz=200 (stay alive), /readyz=503 (stop receiving traffic).
  • Redis down: /readyz=200 with body {"degraded": true, "redis": "down"}.
  • During migrations: /startupz=503 so Kubernetes waits before checking readiness.
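
One way to implement the /startupz gate, assuming hypothetical runMigrations() and warmCaches() steps; the flag flips only after boot work finishes, so readiness and liveness never see a half-initialized process.

let startupComplete = false;

async function runMigrations(): Promise<void> { /* placeholder for your migration runner */ }
async function warmCaches(): Promise<void> { /* placeholder for cache warm-up */ }

// Boot sequence: run migrations, warm caches, then open the gate.
async function boot(): Promise<void> {
  await runMigrations();
  await warmCaches();
  startupComplete = true;
}

// /startupz handler logic: 503 until boot() completes, 200 afterwards.
function startupz(): { statusCode: number; body: object } {
  return startupComplete
    ? { statusCode: 200, body: { started: true } }
    : { statusCode: 503, body: { started: false } };
}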

Example 2: Worker consuming from a queue
  • Liveness: 200 if worker loop heartbeat updated within last 10s.
  • Readiness: 200 if connected to queue and can ack a test message or declare a heartbeat queue; otherwise 503.
  • Startup: 200 after handlers registered and backoff scheduler initialized.

Why: The worker can be alive but not ready if the queue is unreachable.
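
A sketch of the heartbeat idea in TypeScript; the 10-second threshold mirrors the example above, and processNextJob() is a placeholder for your real job handling.

let lastHeartbeat = Date.now();

// Placeholder: pull and handle one job; assumed to return within the heartbeat
// window (for example, via a short poll timeout) even when the queue is empty.
async function processNextJob(): Promise<void> { }

// The worker loop refreshes the heartbeat on every iteration, even when idle.
async function workerLoop(): Promise<void> {
  while (true) {
    lastHeartbeat = Date.now();
    await processNextJob();
  }
}

// Liveness: alive only if the loop has ticked within the last 10 seconds.
function healthzStatus(): number {
  return Date.now() - lastHeartbeat < 10_000 ? 200 : 500;
}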

Example 3: Feature flags and third-party API
  • Third-party payments are not critical for most endpoints.
  • Readiness returns 200 if core DB is healthy; includes field payments_api: "degraded" when failing.
  • Route-level checks: Optionally guard payment routes with a fast preflight check or circuit breaker.
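
A sketch of that route-level guard, with a hypothetical paymentsHealthy flag refreshed in the background rather than checked on every request.

let paymentsHealthy = true;

// Hypothetical third-party check with a tight time budget.
async function pingPaymentsApi(timeoutMs: number): Promise<boolean> { return true; }

// Refreshed periodically (assumed every 5 s) so request handling never waits on the third party.
setInterval(async () => {
  paymentsHealthy = await pingPaymentsApi(200);
}, 5_000);

// Preflight for payment routes only: fail those requests fast while keeping readiness at 200.
function guardPaymentRoute(): { statusCode: number; body: object } | null {
  return paymentsHealthy
    ? null // proceed with the request
    : { statusCode: 503, body: { error: "payments temporarily unavailable" } };
}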

Implementation patterns

HTTP responses to aim for
GET /healthz  -> 200 {"status":"ok","uptime":1234,"version":"1.2.3"}
GET /readyz   -> 200 or 503; body includes dependency states
GET /startupz -> 200 when warm; 503 during boot/migrations

Kubernetes probe template (tune values)
livenessProbe:
  httpGet: { path: "/healthz", port: 8080 }
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet: { path: "/readyz", port: 8080 }
  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet: { path: "/startupz", port: 8080 }
  failureThreshold: 30
  periodSeconds: 5
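
With the values above, the startup probe tolerates up to 30 × 5 s = 150 s of warm-up before the container is restarted, readiness pulls a pod out of rotation roughly 3 × 5 s = 15 s after a critical dependency starts failing, and liveness restarts a pod only after about 3 × 10 s = 30 s of failed checks. Treat these as starting points to tune, not universal defaults.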

Graceful shutdown (sequence)
  1. Receive SIGTERM.
  2. Start returning 503 on /readyz.
  3. Stop accepting new requests; keep serving in-flight.
  4. After drain timeout, exit cleanly.
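
A minimal sketch of that sequence for a Node HTTP server in TypeScript; the 10-second drain window is an assumption to align with your load balancer's deregistration delay and the readiness probe period.

import http from "node:http";

let ready = true;

const server = http.createServer((req, res) => {
  if (req.url === "/readyz") {
    res.writeHead(ready ? 200 : 503, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ ready }));
    return;
  }
  // ... application routes ...
  res.writeHead(200);
  res.end("ok");
});

server.listen(8080);

process.on("SIGTERM", () => {
  // Steps 1-2: flip /readyz to 503 so the instance leaves rotation.
  ready = false;
  // Steps 3-4: after the drain window, stop accepting new connections,
  // let in-flight requests finish, then exit cleanly.
  setTimeout(() => {
    server.close(() => process.exit(0));
  }, 10_000);
});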

Common mistakes and self-check

  • Putting DB calls in liveness: causes restart loops during transient DB outages. Fix: DB checks belong in readiness.
  • Skipping startup probe: readiness flaps during migrations and triggers unnecessary rollbacks.
  • Slow checks: a 2s readiness check can cascade into timeouts under load. Keep checks under ~100–300 ms.
  • Binary readiness only: a bare 200/503 is hard to debug. Fix: return the status code plus a body with per-dependency states for diagnosis.
  • Not handling shutdown: pods keep receiving traffic and drop requests. Always mark not ready before exit.

Self-check
  • Can you kill Redis and still keep readiness=200 (degraded) and user-critical paths working?
  • Does a DB outage switch readiness to 503 quickly (under your SLO) without restarting the pod?
  • On deploy, do you avoid traffic until warm-up and migrations complete?

Exercises

Do these hands-on tasks. The quick test is available to everyone. Only logged-in users will have their progress saved.

Exercise 1: Design health endpoints for an API

Service: HTTP API with critical Postgres, optional Redis cache, and optional Payments API.

  1. Define responses for /healthz, /readyz, /startupz.
  2. Decide which dependencies affect each endpoint.
  3. Sketch a Kubernetes probe config (timeouts, periods, thresholds).

Checklist
  • Healthz is dependency-free and fast.
  • Readyz fails fast when Postgres is down.
  • Startupz blocks until migrations and warm-up complete.
  • Probe timeouts <= 1s; readiness period ~5s; liveness period ~10s.

Exercise 2: Worker readiness logic

Service: Background worker pulling jobs from a queue (critical). Metrics endpoint is on port 9090.

  1. Propose how /healthz and /readyz should behave.
  2. Write a probe config ensuring the worker is removed from rotation if it loses queue connectivity.
  3. Explain how you detect the worker loop is alive (no deadlock).

Checklist
  • Liveness checks internal heartbeat, not the queue.
  • Readiness depends on queue connectivity.
  • Startup waits for handler registration and backoff scheduler.

Practical projects

  • Project A: Build a tiny service exposing /healthz, /readyz, /startupz. Simulate DB up/down with a flag and observe Kubernetes probe behavior locally (kind/minikube or docker-compose healthchecks).
  • Project B: Add graceful shutdown. On SIGTERM, flip readiness to 503, wait 10 seconds, then exit. Validate no requests are dropped with a load generator.

Learning path

  1. Implement basic /healthz with uptime and version.
  2. Add /readyz that checks only critical dependencies with tight timeouts.
  3. Add /startupz to gate warm-up and migrations.
  4. Configure orchestrator probes; test failure modes (DB down, cache down, third-party down).
  5. Implement graceful shutdown and verify draining.
  6. Instrument metrics (counters for readiness failures, check latencies) to observe behavior.
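
As a sketch of step 6, assuming the prom-client library for Node; the metric names here are made up, so adapt them to your conventions.

import client from "prom-client";

// Counter bumped whenever /readyz answers 503.
const readinessFailures = new client.Counter({
  name: "readyz_failures_total",
  help: "Readiness checks that returned 503",
});

// Histogram of individual dependency-check latencies, labeled by check name.
const checkDuration = new client.Histogram({
  name: "health_check_duration_seconds",
  help: "Latency of individual dependency checks",
  labelNames: ["check"],
  buckets: [0.01, 0.05, 0.1, 0.3, 1],
});

// Wrap a dependency check so its latency is always recorded, even on failure.
async function timedCheck(name: string, fn: () => Promise<boolean>): Promise<boolean> {
  const end = checkDuration.startTimer({ check: name });
  try {
    return await fn();
  } finally {
    end();
  }
}

// Call readinessFailures.inc() on every 503 from /readyz, and expose
// client.register.metrics() on a /metrics endpoint for scraping.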

Next steps

  • Extend readiness to report per-dependency status in JSON.
  • Add circuit breakers around flaky dependencies; keep readiness focused on truly critical ones.
  • Define alerting on readiness flap rate and liveness restarts.

Mini challenge

Your API has a cron-triggered daily DB maintenance lasting 2 minutes (queries slow to 2s). How do you avoid timeouts for users without restarting pods? Write the readiness behavior, probe timeouts/periods, and any feature flags or degraded modes you will use.

Practice Exercises

2 exercises to complete

Instructions

Service: HTTP API with critical Postgres, optional Redis cache, and optional Payments API.

  1. Define responses for /healthz, /readyz, /startupz (status codes and JSON bodies).
  2. Decide which dependencies affect each endpoint and why.
  3. Draft a Kubernetes probe config (liveness, readiness, startup) with timeouts, periods, and thresholds.

Expected Output
A short spec describing each endpoint behavior and a probe YAML showing sane values. Readiness should return 503 when Postgres is down; Redis/Payments should not fail readiness, only mark degraded.

Health Checks And Readiness Probes — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

