Who this is for
- Backend and platform engineers deploying services to containers, VMs, or serverless.
- Developers responsible for keeping services healthy behind load balancers or orchestrators like Kubernetes.
- Engineers preparing for on-call rotations and reliability-focused interviews.
Prerequisites
- Basic HTTP knowledge (status codes, routes).
- Familiarity with your app's dependencies (database, cache, queues).
- Optional but helpful: container basics and Kubernetes or another orchestrator.
Why this matters
Health checks and readiness probes decide if traffic should be sent to your service and when it should restart. Well-designed checks prevent incidents like:
- New versions receiving traffic before migrations finish.
- Crash loops caused by liveness checks that are too strict.
- Healthy pods staying in rotation during dependency outages, causing timeouts for users.
Real tasks you will do on the job
- Expose /healthz and /readyz endpoints with clear signals.
- Configure Kubernetes liveness, readiness, and startup probes with sensible timeouts and thresholds.
- Coordinate graceful shutdown so in-flight requests finish before the pod is removed from rotation.
- Differentiate critical vs optional dependencies in readiness logic.
Concept explained simply
Think of your service as a restaurant:
- Liveness: “Is the building still standing and staff awake?” If not, restart.
- Readiness: “Are we open for customers right now?” If the oven (DB) is broken, don’t seat new guests.
- Startup: “Are we fully prepared to open?” Don’t take customers while preheating or setting tables.
Mental model
- Liveness checks should be minimal and fast, focused on the process itself.
- Readiness checks determine if the instance can serve traffic safely given its dependencies.
- Startup checks gate the initial warm-up so liveness/readiness don’t flap during boot.
Key definitions and signals
- Liveness: Return 200 if the process is responsive and event loop/thread pool is not deadlocked. Avoid heavy work.
- Readiness: Return 200 only if critical dependencies are available and the instance is ready to accept traffic. Otherwise 503.
- Startup: Return 200 once migrations, caches, and warm-up are done; until then return 503.
- HTTP codes: 200/204 for healthy; 503 for temporarily not ready; 500 for internal failure (use sparingly for liveness).
- Latency budgets: Keep checks under 100–300 ms to avoid self-inflicted timeouts.
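The latency-budget point can be sketched with a plain TCP connect that is capped at 300 ms. This is a hedged example, not a prescribed library: `ping_tcp`, its host/port arguments, and the 0.3 s default are illustrative placeholders for whatever dependency probe your service actually uses.

```python
import socket

def ping_tcp(host: str, port: int, timeout: float = 0.3) -> bool:
    """Cheap dependency probe: attempt a TCP connect within the latency budget.

    Returns False on refusal or timeout instead of hanging, so the health
    endpoint that calls it stays fast even when the dependency is down.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeout, DNS failure
        return False
```

A real readiness handler would call something like this per dependency and still answer well inside the probe's `timeoutSeconds`.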
Design principles
- Keep liveness dumb and cheap. Don’t call external systems.
- Make readiness reflect user impact. Only include dependencies that must be healthy for safe traffic.
- Use startup probes during warm-up (migrations, JIT, cache warm) to prevent early restarts.
- Prefer fast failure to hanging. Put tight timeouts on every dependency check so a failing dependency makes readiness return 503 quickly instead of blocking the probe.
- Fail closed for critical dependencies, fail open for optional ones (return degraded=true in body, but 200 status).
- Include build/version and uptime in responses to aid debugging.
- During shutdown: mark not-ready first, wait out the drain period, then stop.
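The fail-closed/fail-open principle can be captured in one pure function. This is a minimal sketch under assumed conventions: `checks` maps a dependency name to an `(ok, critical)` pair, and the shape of the JSON body is illustrative, not a standard.

```python
def readiness(checks: dict) -> tuple[int, dict]:
    """Aggregate dependency checks into an HTTP status plus diagnostic body.

    Critical failures fail closed (503, pull the instance from rotation);
    optional failures fail open (200 with a degraded flag in the body).
    """
    body = {"status": "ok", "deps": {}}
    status = 200
    for name, (ok, critical) in checks.items():
        body["deps"][name] = "up" if ok else "down"
        if not ok:
            if critical:
                status = 503
                body["status"] = "unavailable"
            elif status == 200:
                body["status"] = "degraded"
                body["degraded"] = True
    return status, body
```

Keeping this logic as a pure function makes the fail-closed vs fail-open decision trivially unit-testable, separate from any HTTP framework.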
Worked examples
Example 1: REST API depending on Postgres (critical) and Redis (optional)
Endpoints:
- GET /healthz: 200 if event loop responsive and self-check ok.
- GET /readyz: 200 if Postgres ping <100 ms and pool has spare connections; Redis optional.
- GET /startupz: 200 only after migrations complete and HTTP server finished warm-up.
Behaviors:
- Postgres down: /healthz=200 (stay alive), /readyz=503 (stop receiving traffic).
- Redis down: /readyz=200 with body {"degraded": true, "redis": "down"}.
- During migrations: /startupz=503 so Kubernetes waits before checking readiness.
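The three behaviors above can be sketched end to end with the standard library. This is a hedged toy, not production code: the `state` dict simulates dependency health (real handlers would run live pings with tight timeouts), and the port, version string, and JSON shapes are illustrative.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.monotonic()

# Simulated dependency state; flip these to exercise the behaviors.
state = {"postgres_up": True, "redis_up": True, "migrations_done": True}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: process-only, never touches a dependency.
            self._reply(200, {"status": "ok",
                              "uptime": round(time.monotonic() - START, 1),
                              "version": "1.2.3"})
        elif self.path == "/readyz":
            body = {"postgres": "up" if state["postgres_up"] else "down",
                    "redis": "up" if state["redis_up"] else "down"}
            if not state["postgres_up"]:       # critical: fail closed
                self._reply(503, body)
            else:                              # optional: fail open
                body["degraded"] = not state["redis_up"]
                self._reply(200, body)
        elif self.path == "/startupz":
            self._reply(200 if state["migrations_done"] else 503, {})
        else:
            self._reply(404, {})

    def _reply(self, code, body):
        data = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep probe traffic out of the logs
        pass

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()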
Example 2: Worker consuming from a queue
- Liveness: 200 if worker loop heartbeat updated within last 10s.
- Readiness: 200 if connected to queue and can ack a test message or declare a heartbeat queue; otherwise 503.
- Startup: 200 after handlers registered and backoff scheduler initialized.
Why: The worker can be alive but not ready if the queue is unreachable.
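The heartbeat idea for the worker can be sketched as a tiny class. Names (`Heartbeat`, `beat`, `alive`) and the 10 s window are assumptions for illustration; the point is that liveness reads a timestamp the loop updates, rather than calling the queue.

```python
import time

class Heartbeat:
    """Liveness signal for a worker loop.

    The loop calls beat() each iteration; the /healthz handler calls alive().
    A deadlocked loop stops beating, so liveness fails after max_age seconds,
    while a queue outage leaves liveness green and only readiness failing.
    """

    def __init__(self, max_age: float = 10.0):
        self.max_age = max_age
        self.last = time.monotonic()

    def beat(self) -> None:
        self.last = time.monotonic()

    def alive(self) -> bool:
        return time.monotonic() - self.last < self.max_age
```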
Example 3: Feature flags and third-party API
- Third-party payments are not critical for most endpoints.
- Readiness returns 200 if core DB is healthy; includes field payments_api: "degraded" when failing.
- Route-level checks: Optionally guard payment routes with a fast preflight check or circuit breaker.
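A route-level guard for the payments API could look like the following consecutive-failure breaker. This is a deliberately minimal sketch (class name, threshold, and cooldown are invented for illustration); real deployments often reach for a library instead.

```python
import time

class CircuitBreaker:
    """Tiny breaker for an optional dependency such as a payments API.

    After `threshold` consecutive failures it opens for `cooldown` seconds;
    guarded routes then fail fast or serve a degraded response, while overall
    readiness stays 200 because the dependency is optional.
    """

    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.threshold:
            return True
        # Half-open: permit one probe call once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```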
Implementation patterns
HTTP responses to aim for
GET /healthz -> 200 {"status":"ok","uptime":1234,"version":"1.2.3"}
GET /readyz -> 200 or 503; body includes dependency states
GET /startupz -> 200 when warm; 503 during boot/migrations
Kubernetes probe template (tune values)
livenessProbe:
  httpGet: { path: "/healthz", port: 8080 }
  initialDelaySeconds: 10
  timeoutSeconds: 1
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet: { path: "/readyz", port: 8080 }
  initialDelaySeconds: 5
  timeoutSeconds: 1
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet: { path: "/startupz", port: 8080 }
  failureThreshold: 30
  periodSeconds: 5
Graceful shutdown (sequence)
- Receive SIGTERM.
- Start returning 503 on /readyz.
- Stop accepting new requests; keep serving in-flight.
- After drain timeout, exit cleanly.
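The sequence above can be sketched with the standard `signal` module. This is a simplified illustration: the flag names and `DRAIN_SECONDS` are assumptions, and a real server would drain by stopping its accept loop rather than sleeping inside the handler.

```python
import signal
import threading
import time

ready = threading.Event()
ready.set()                  # /readyz returns 200 while this is set
stopped = threading.Event()  # the serve loop exits when this is set

# Should exceed readiness periodSeconds x failureThreshold so every load
# balancer has observed not-ready before the process exits.
DRAIN_SECONDS = 10

def handle_sigterm(signum, frame):
    # 1. Flip readiness so the orchestrator stops routing new traffic.
    ready.clear()
    # 2. Wait while in-flight requests finish and probes observe not-ready.
    time.sleep(DRAIN_SECONDS)
    # 3. Tell the server loop to exit cleanly.
    stopped.set()

signal.signal(signal.SIGTERM, handle_sigterm)
```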
Common mistakes
- Putting DB calls in liveness: causes restart loops during transient DB outages. Fix: DB checks belong in readiness.
- Skipping the startup probe: liveness can kill slow-booting pods mid-migration, and readiness flapping during boot triggers unnecessary rollout failures.
- Slow checks: a 2s readiness check can cascade into timeouts under load. Keep checks under ~100–300 ms.
- Binary readiness only: return 200/503 plus a body with per-dependency states for diagnosis.
- Not handling shutdown: pods keep receiving traffic and drop requests. Always mark not ready before exit.
Self-check
- Can you kill Redis and still keep readiness=200 (degraded) and user-critical paths working?
- Does a DB outage switch readiness to 503 quickly (under your SLO) without restarting the pod?
- On deploy, do you avoid traffic until warm-up and migrations complete?
Exercises
Do these hands-on tasks.
Exercise 1: Design health endpoints for an API
Service: HTTP API with critical Postgres, optional Redis cache, and optional Payments API.
- Define responses for /healthz, /readyz, /startupz.
- Decide which dependencies affect each endpoint.
- Sketch a Kubernetes probe config (timeouts, periods, thresholds).
Checklist
- Healthz is dependency-free and fast.
- Readyz fails fast when Postgres is down.
- Startupz blocks until migrations and warm-up complete.
- Probe timeouts <= 1s; readiness period ~5s; liveness period ~10s.
Exercise 2: Worker readiness logic
Service: Background worker pulling jobs from a queue (critical). Metrics endpoint is on port 9090.
- Propose how /healthz and /readyz should behave.
- Write a probe config ensuring the worker is removed from rotation if it loses queue connectivity.
- Explain how you detect the worker loop is alive (no deadlock).
Checklist
- Liveness checks internal heartbeat, not the queue.
- Readiness depends on queue connectivity.
- Startup waits for handler registration and backoff scheduler.
Practical projects
- Project A: Build a tiny service exposing /healthz, /readyz, /startupz. Simulate DB up/down with a flag and observe Kubernetes probe behavior locally (kind/minikube or docker-compose healthchecks).
- Project B: Add graceful shutdown. On SIGTERM, flip readiness to 503, wait 10 seconds, then exit. Validate no requests are dropped with a load generator.
Learning path
- Implement basic /healthz with uptime and version.
- Add /readyz that checks only critical dependencies with tight timeouts.
- Add /startupz to gate warm-up and migrations.
- Configure orchestrator probes; test failure modes (DB down, cache down, third-party down).
- Implement graceful shutdown and verify draining.
- Instrument metrics (counters for readiness failures, check latencies) to observe behavior.
Next steps
- Extend readiness to report per-dependency status in JSON.
- Add circuit breakers around flaky dependencies; keep readiness focused on truly critical ones.
- Define alerting on readiness flap rate and liveness restarts.
Mini challenge
Your API has a cron-triggered daily DB maintenance lasting 2 minutes (queries slow to 2s). How do you avoid timeouts for users without restarting pods? Write the readiness behavior, probe timeouts/periods, and any feature flags or degraded modes you will use.