Monitoring Dependencies

Learn Monitoring Dependencies for free with explanations, exercises, and a quick test (for Backend Engineers).

Published: January 20, 2026 | Updated: January 20, 2026

Why this matters

Modern backends rely on databases, caches, message brokers, DNS, and third‑party APIs. When any dependency slows down or fails, your service is blamed. Monitoring dependencies lets you spot issues early, reduce blast radius, and roll out graceful fallbacks instead of outages.

  • Real tasks you will do: set SLIs/SLOs for dependencies, alert on burn rate, add synthetic checks, track connection pool saturation, implement circuit breakers and retries, and build dashboards that correlate upstream latency with downstream performance.
  • Result: fewer incidents, faster mitigation, and measurable reliability.

Concept explained simply

Dependency monitoring means watching the health, speed, and errors of the external systems your service uses. Look at it through three lenses:

  • Black‑box: call it like a user would (synthetic checks).
  • White‑box: read internal metrics (pool wait time, replica lag).
  • Tracing: follow a request across services with a correlation ID.

Mental model

Picture a pipeline: Your service sits in the middle. Each dependency is a valve that can become slow, leaky, or clogged. Measure flow (throughput), pressure (latency), leaks (errors), and valve limits (saturation). Keep budgets per valve: when a valve misbehaves, protect the rest with timeouts, retries, and breakers.

What to monitor by dependency type

Databases (e.g., Postgres, MySQL)
  • Client-side: query latency (p50/p95/p99), error rate, connection pool wait time/saturation, timeouts.
  • Server-side: CPU/IO pressure, locks/deadlocks, replication lag, slow queries.
Caches (e.g., Redis, Memcached)
  • Hit ratio, command latency (p95/p99), evictions, timeouts, reconnects.
Message brokers/streams (e.g., Kafka, RabbitMQ)
  • Consumer lag, publish/consume error rate, broker availability, throughput, rebalances.
Third‑party APIs (payments, auth, email)
  • Tail latency (p95/p99), HTTP error rate by endpoint, timeouts, rate limit responses, circuit breaker state.
Network, DNS, TLS
  • DNS resolution time/failures, TLS expiry, handshake latency, packet loss between zones/regions.
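
The client-side signals above are straightforward to export with a metrics library. The sketch below is a minimal example assuming the Python prometheus_client package; the metric names mirror the ones used later in this topic, and the label sets and buckets are illustrative rather than prescriptive.

    # Minimal sketch: client-side dependency metrics, assuming the prometheus_client
    # package. Metric names, labels, and buckets are illustrative.
    from prometheus_client import Counter, Gauge, Histogram

    DB_QUERY_SECONDS = Histogram(
        "db_query_duration_seconds",
        "Latency of queries against a database dependency",
        ["dependency", "operation", "region"],
        buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
    )
    DB_POOL_WAIT_SECONDS = Histogram(
        "db_client_pool_wait_seconds",
        "Time spent waiting to acquire a connection from the pool",
        ["dependency", "region"],
    )
    DB_ERRORS = Counter(
        "db_errors_total",
        "Database client errors by error type",
        ["dependency", "error", "region"],
    )
    DB_CONN_IN_USE = Gauge(
        "db_conn_in_use",
        "Connections currently checked out of the pool",
        ["dependency", "region"],
    )

    def observe_query(dependency, operation, region, seconds, error=None):
        # Call from your DB client wrapper after every query attempt.
        DB_QUERY_SECONDS.labels(dependency, operation, region).observe(seconds)
        if error is not None:
            DB_ERRORS.labels(dependency, error, region).inc()

Keeping dependency, operation/endpoint, and region as labels is what later lets dashboards and alerts slice per dependency.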

Key SLIs for dependencies

  • Availability: successful calls / total calls (by dependency and endpoint).
  • Latency: p95 and p99 per dependency; also connection acquisition time for pools.
  • Errors: timeouts, 5xx, 4xx that matter (e.g., rate limits), exceptions by type.
  • Saturation: connection pool in use vs max, queue backlog/lag, thread/FD limits.

Pair SLIs with SLOs and alert on burn rate rather than single spikes to reduce noise.
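
For intuition (illustrative numbers): with a 99.9% success SLO the error budget is 0.1% of calls. If 1% of calls to a dependency fail for an hour, the burn rate is 1% / 0.1% = 10×, which would exhaust a 30-day budget in about three days and is worth paging on; a brief spike to 0.2% (a 2× burn) usually is not.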

Alerting strategy for dependencies

  • Multi-window burn rate: short window (fast detection) + long window (sustained issue).
  • Actionable thresholds: alerts that suggest next action (e.g., enable fallback, scale pool).
  • Symptom-based: alert on user impact metrics, then include dependency panels for diagnosis.
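
To make the first point concrete, here is a minimal sketch of a multi-window burn-rate check. The 14.4× threshold over a 1-hour plus 5-minute window pair is a commonly cited starting point, and the function and its inputs are assumptions, not a standard API.

    # Minimal sketch: multi-window burn-rate page condition.
    # error_rate_* are fractions of failed calls in each window (0.02 = 2%).
    def should_page(error_rate_1h, error_rate_5m, slo=0.999, burn_threshold=14.4):
        budget = 1.0 - slo                 # allowed error fraction, e.g. 0.001
        burn_1h = error_rate_1h / budget   # long window: the problem is sustained
        burn_5m = error_rate_5m / budget   # short window: the problem is still happening
        return burn_1h >= burn_threshold and burn_5m >= burn_threshold

    # Example: 2% errors in both windows against a 99.9% SLO is a 20x burn -> page.
    print(should_page(0.02, 0.02))  # True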

Step 1: Define SLOs per dependency (e.g., 99.9% success, p95<300ms).

Step 2: Instrument client metrics with labels: dependency, endpoint, region, outcome.

Step 3: Add timeouts, capped retries with jitter, and circuit breaker.
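
A minimal sketch of this step for an HTTP dependency, using only Python's standard library; the URL, 800 ms per-try timeout, and two-retry cap are illustrative defaults.

    # Minimal sketch: per-try timeout plus capped retries with exponential backoff
    # and full jitter (standard library only; values are illustrative defaults).
    import random
    import time
    import urllib.error
    import urllib.request

    def call_dependency(url, per_try_timeout=0.8, max_retries=2):
        for attempt in range(max_retries + 1):
            try:
                with urllib.request.urlopen(url, timeout=per_try_timeout) as resp:
                    return resp.status
            except (urllib.error.URLError, TimeoutError):
                if attempt == max_retries:
                    raise  # retry budget exhausted; let the caller or breaker decide
                # Sleep a random amount up to an exponentially growing cap.
                time.sleep(random.uniform(0, 0.1 * (2 ** attempt)))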

Step 4: Build synthetic checks to key endpoints/queries.

Step 5: Create dashboards: SLIs, saturation, recent changes (deploys, config).

Step 6: Alert on burn rate; page only when action is required.

Worked examples

Example 1 — Postgres dependency
  • Instrument: db_client_pool_wait_seconds (p95), db_query_duration_seconds (p95/p99), db_errors_total labeled by error, db_conn_in_use/max.
  • SLO: p95 query < 120ms, error rate < 0.5%.
  • Synthetic: a lightweight "SELECT 1" from the app container with a 200ms timeout (see the sketch after this example).
  • Alerts: pool_wait p95 > 50ms for 5m AND in_use/max > 0.9; error rate > 1% for 10m.
  • Runbook: increase pool, reduce per-request concurrency, add/optimize indexes, check slow queries, verify replica lag.
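
A minimal sketch of that synthetic check, assuming the psycopg2 driver and a DATABASE_URL environment variable; the 200 ms limit is applied as a statement timeout.

    # Minimal sketch: "SELECT 1" synthetic check with a 200 ms statement timeout.
    import os
    import sys
    import time
    import psycopg2  # assumes the psycopg2 driver is installed

    def check_postgres(dsn):
        start = time.monotonic()
        try:
            conn = psycopg2.connect(dsn, connect_timeout=1,
                                    options="-c statement_timeout=200")
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
                    cur.fetchone()
            finally:
                conn.close()
            status = "ok"
        except Exception as exc:
            status = f"fail reason={type(exc).__name__}"
        latency_ms = (time.monotonic() - start) * 1000
        # One line per run so it is easy to scrape or alert on.
        print(f"postgres_check status={status} latency_ms={latency_ms:.1f}")
        return status == "ok"

    if __name__ == "__main__":
        sys.exit(0 if check_postgres(os.environ["DATABASE_URL"]) else 1)
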
Example 2 — Payment provider API
  • Instrument: http_client_requests_total by endpoint/status, http_client_request_duration_seconds (p95/p99), retry_count, breaker_state.
  • Controls: per-try timeout 800ms, max 2 retries with exponential backoff + jitter, circuit breaker opens when error rate > 20% for 2m (see the sketch after this example).
  • Fallback: queue charges for later and show “pending” to users.
  • Alerts: burn-rate on success SLO 99.5%; special alert if rate-limited responses spike.
  • Dashboard: overlay provider status webhook with your SLIs to confirm external cause.
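
One way to implement the breaker from those controls is a small error-rate window around the client call. This sketch is library-agnostic and simplified: the 20% / 2-minute thresholds mirror this example, while the 10-sample minimum and 30 s cooldown are assumptions.

    # Minimal sketch: error-rate circuit breaker for an outbound client.
    import time
    from collections import deque

    class CircuitBreaker:
        def __init__(self, error_threshold=0.20, window_seconds=120,
                     cooldown_seconds=30, min_samples=10):
            self.error_threshold = error_threshold
            self.window_seconds = window_seconds
            self.cooldown_seconds = cooldown_seconds
            self.min_samples = min_samples
            self.events = deque()      # (timestamp, succeeded) pairs in the window
            self.opened_at = None

        def allow(self):
            # Closed, or open long enough that one trial call may pass (half-open).
            if self.opened_at is None:
                return True
            return time.monotonic() - self.opened_at >= self.cooldown_seconds

        def record(self, succeeded):
            now = time.monotonic()
            self.events.append((now, succeeded))
            while self.events and now - self.events[0][0] > self.window_seconds:
                self.events.popleft()
            failures = sum(1 for _, ok in self.events if not ok)
            error_rate = failures / len(self.events)
            if len(self.events) >= self.min_samples and error_rate > self.error_threshold:
                self.opened_at = now           # open (or stay open): reject calls
            elif succeeded:
                self.opened_at = None          # trial call succeeded: close again

When allow() returns False, take the fallback from this example: queue the charge and show “pending” instead of calling the provider, and record() every outcome so the breaker sees both successes and failures.
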
Example 3 — Kafka consumer
  • Instrument: consumer_lag, rebalance_count, processing_latency, dead_letter_messages_total.
  • SLO: 99% of messages processed within 5 minutes of publish.
  • Alerts: lag > threshold based on throughput (e.g., > 5× rate for 10m; see the sketch after this example), sustained rebalances.
  • Runbook: scale consumers, check partition skew, inspect slow processing, verify broker health.
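
A related way to express a throughput-based lag threshold is to convert lag into estimated drain time and compare it with the 5-minute processing SLO. The sketch below is illustrative: how lag and throughput are sampled is left to your metrics system, and the alert should still require the condition to hold for a sustained window (e.g., 10 minutes).

    # Minimal sketch: estimate how long the current backlog would take to drain
    # and compare it with the processing SLO (99% within 5 minutes in this example).
    def lag_breaches_slo(consumer_lag_msgs, consume_rate_msgs_per_sec, slo_seconds=300):
        if consume_rate_msgs_per_sec <= 0:
            # Consumers are stalled with work outstanding: treat as breaching.
            return consumer_lag_msgs > 0
        drain_seconds = consumer_lag_msgs / consume_rate_msgs_per_sec
        return drain_seconds > slo_seconds

    # Example: 90,000 messages behind at 200 msg/s takes ~450 s to drain -> breaching.
    print(lag_breaches_slo(90_000, 200))  # True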

Exercises

Do these now. They mirror the graded exercises below and build confidence.

  1. Exercise 1 — Build a dependency dashboard slice (ex1)
    Given a service that uses Postgres, Redis, and a Payment API, choose SLIs and write two actionable alerts. Include labels for dependency and region.
  2. Exercise 2 — Resilient HTTP synthetic check (ex2)
    Create a command or small script that checks an endpoint with a timeout, capped retries, and jittered backoff. Emit a single-line result including latency and status.

Checklist:
  • [ ] I selected success, latency, and saturation metrics per dependency.
  • [ ] I added per-try timeouts and capped retries.
  • [ ] I wrote at least one burn-rate style alert.
  • [ ] My synthetic check returns a clear pass/fail line.

Common mistakes and how to self-check

  • Only tracking uptime. Fix: add tail latency and pool wait time.
  • No timeouts or unbounded retries. Fix: per-try timeout + limited retries + jitter.
  • Aggregating across endpoints. Fix: label by endpoint and dependency to find hotspots.
  • Chasing cause over symptom. Fix: alert on user-impact metrics; use dependency panels to diagnose.
  • No correlation IDs. Fix: propagate a request ID across all dependency calls and into logs/traces.
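
For the last fix, a minimal sketch of propagating a request ID into an outbound call and the matching log line; the X-Request-ID header name is a common convention assumed here, not a standard your providers necessarily use.

    # Minimal sketch: propagate one correlation ID into an outbound call and its log line.
    import logging
    import urllib.request
    import uuid

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s request_id=%(request_id)s %(message)s")
    log = logging.getLogger("deps")

    def call_with_correlation(url, request_id=None):
        # Reuse the inbound request's ID when you have one; mint one otherwise.
        request_id = request_id or str(uuid.uuid4())
        req = urllib.request.Request(url, headers={"X-Request-ID": request_id})
        log.info("calling dependency %s", url, extra={"request_id": request_id})
        with urllib.request.urlopen(req, timeout=2) as resp:
            return request_id, resp.status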

Self-check: induce a safe test fault in non-prod (add 300ms delay to a dependency or lower a connection limit). Does your dashboard reveal the problem? Do alerts fire appropriately without paging for a blip?
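
One lightweight way to run that experiment without extra tooling is an opt-in delay wrapper in the client path; the environment variable names below are hypothetical, and the guard keeps the delay out of production.

    # Minimal sketch: opt-in latency injection for non-prod fault testing.
    # FAULT_DELAY_MS and ENVIRONMENT are hypothetical variable names.
    import os
    import time

    def with_injected_latency(call, *args, **kwargs):
        delay_ms = int(os.environ.get("FAULT_DELAY_MS", "0"))
        if delay_ms > 0 and os.environ.get("ENVIRONMENT") != "production":
            time.sleep(delay_ms / 1000.0)   # e.g. FAULT_DELAY_MS=300 adds 300 ms
        return call(*args, **kwargs)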

Practical projects

  • Build a "Dependency Health" dashboard: top 5 dependencies with p95 latency, error rate, saturation, and a sparkline for last 6 hours.
  • Implement a client wrapper with timeouts, retries (exponential + jitter), and a circuit breaker; export metrics by dependency and endpoint.
  • Add a synthetic canary that runs from two regions and posts results to your metrics system.
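
A minimal sketch of such a canary run, assuming a REGION environment variable is injected by whatever schedules it in each region; how the result line reaches your metrics system (push gateway, log pipeline, etc.) is left open.

    # Minimal sketch: HTTP canary emitting one tagged result line per run.
    import os
    import time
    import urllib.error
    import urllib.request

    def run_canary(url):
        region = os.environ.get("REGION", "unknown")
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                status = resp.status
        except (urllib.error.URLError, TimeoutError):
            status = 0                      # 0 means the request failed entirely
        latency_ms = (time.monotonic() - start) * 1000
        print(f"canary region={region} url={url} status={status} latency_ms={latency_ms:.1f}")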

Who this is for

  • Backend and platform engineers responsible for production services.
  • On-call engineers improving reliability and incident response.

Prerequisites

  • Basic HTTP and networking knowledge.
  • Familiarity with your metrics/logs/tracing stack.
  • Ability to change client configs (timeouts, retries) in your service.

Learning path

  1. Instrument client metrics for each dependency.
  2. Add timeouts, retries, and circuit breaker with safe defaults.
  3. Create synthetic checks and tag results with dependency and region.
  4. Define SLOs and set burn-rate alerts.
  5. Run a latency injection experiment; tune thresholds.

Next steps

  • Complete the exercises below and compare with the solutions.
  • Take the quick test to confirm understanding. The test is available to everyone; only logged‑in users get saved progress.
  • Apply one improvement in your service today (e.g., add per-try timeouts).

Mini challenge

Pick one dependency. In 60 minutes, add p95 latency, error rate, and a synthetic check to your dashboard. Create one actionable alert with a runbook link. Share the before/after with your team.

Ready to test yourself?

Take the quick test below.

Practice Exercises

2 exercises to complete

Instructions

Your service uses Postgres, Redis, and a Payment API in two regions: eu-west and us-east. Choose SLIs and write two alerts that would be actionable during an incident.

  1. List 3–5 SLIs per dependency (include p95 latency, success rate, and one saturation metric).
  2. Propose two alerts using clear conditions. One should be a burn-rate style alert, the other a saturation alert.
  3. Include labels: dependency, endpoint (where applicable), and region.

Assume you export Prometheus-style metrics like http_client_request_duration_seconds and db_client_pool_wait_seconds.

Expected Output
  • A short list of SLIs per dependency with labels.
  • Two alert statements with thresholds and windows.
  • A one-line action hint for each alert (what the on-call should do first).

Monitoring Dependencies — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
