Why this matters
As an API Engineer, you are responsible for the reliability of endpoints customers use every minute. A clear, actionable dashboard helps you answer: Is it up? Is it fast? Is it throwing errors? Who is affected? With the right dashboard, you reduce time-to-detect and time-to-recover, make data-driven decisions, and communicate status quickly during incidents.
- Daily work: verify deployment impact on latency and errors
- On-call: triage spikes in 5xx and slow p95/p99
- Planning: track SLOs and find noisy endpoints to optimize
- Collaboration: share a single view for backend, SRE, and product teams
Concept explained simply
An endpoint health dashboard is a single page showing the RED metrics per endpoint: Rate (traffic), Errors, and Duration (latency). It also shows dependencies and saturation so you can see the cause, not just the symptom.
- Rate: requests per second per endpoint and method
- Errors: proportion and distribution of 4xx/5xx with top error codes
- Duration: latency percentiles (p50, p95, p99) overall and by region/version
- Saturation: resource pressure of the service or backing systems (CPU, memory, DB connections)
- Dependencies: upstream/downstream components that can degrade the endpoint
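To make these signals concrete, here is a minimal, tool-agnostic sketch that summarizes one endpoint from a window of raw request records. The `Request` fields and function names are illustrative assumptions, not any specific library's API.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    method: str        # e.g., "GET"
    path: str          # e.g., "/orders"
    status: int        # HTTP status code
    duration_ms: float


def red_summary(requests: list[Request], window_seconds: float) -> dict:
    """Rate, Errors, and Duration for one endpoint over a time window."""
    if not requests:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_ms": None}
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # quantiles(..., n=20) returns 19 cut points; index 18 is roughly p95.
    p95 = quantiles(durations, n=20)[18] if len(durations) > 1 else durations[0]
    return {
        "rate_rps": len(requests) / window_seconds,  # Rate
        "error_ratio": errors / len(requests),       # Errors (5xx share)
        "p95_ms": p95,                               # Duration (tail latency)
    }
```

In practice your metrics backend computes this for you; the point is that every panel on the dashboard reduces to a few aggregations like these per endpoint and method.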
Mental model
Think of the dashboard as a funnel:
- Symptom at the top: is the endpoint failing or slow?
- Scope: which path, method, region, version, or customer segment?
- Cause: what changed (deploys), which dependency is failing, which component is saturated?
- Action: page the right owner and follow the runbook steps.
Core metrics and panels to include
- Availability: percent of successful requests (e.g., non-5xx) per endpoint
- Latency: p50/p95/p99 over time; separate server time from network if possible
- Error rate: percentage with breakdown by status code and top error labels
- Traffic: RPS by method and endpoint; highlight sudden jumps or drops
- Saturation: CPU/memory for services; DB connections; thread pools; queue depth
- Dependency health: upstream/downstream error and latency for the same time window
- Annotations: deploys, feature flags, incidents
- SLI vs SLO: show SLI lines with target thresholds as reference bands
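As one way to feed the availability and error-rate panels above, the sketch below derives the availability SLI and a status-class breakdown from per-status request counts. The input shape is an assumption, not a particular gateway's export format.

```python
from collections import Counter


def availability_sli(status_counts: Counter) -> float:
    """Fraction of non-5xx responses (a common availability SLI)."""
    total = sum(status_counts.values())
    if total == 0:
        return 1.0
    server_errors = sum(c for code, c in status_counts.items() if 500 <= code <= 599)
    return 1.0 - server_errors / total


def status_class_breakdown(status_counts: Counter) -> dict:
    """Share of traffic per status class, e.g., {"2xx": 0.97, "4xx": 0.02, "5xx": 0.01}."""
    total = sum(status_counts.values())
    if total == 0:
        return {}
    classes = Counter()
    for code, count in status_counts.items():
        classes[f"{code // 100}xx"] += count
    return {cls: count / total for cls, count in classes.items()}


# Example for one endpoint over the last 5 minutes
counts = Counter({200: 9_700, 404: 200, 500: 80, 503: 20})
print(availability_sli(counts))        # 0.99
print(status_class_breakdown(counts))  # {'2xx': 0.97, '4xx': 0.02, '5xx': 0.01}
```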
Tip: Sane defaults for percentiles
- p50 shows the typical request
- p95 reflects what most users experience, including moderate slowdowns
- p99 surfaces tail issues that harm reliability
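A quick, synthetic illustration of why averages mislead: in the sample below the mean and p50 look healthy, and only p99 exposes the slow tail.

```python
from statistics import mean, quantiles

# Synthetic sample: 98 fast requests and 2 very slow ones.
latencies_ms = sorted([80.0] * 98 + [1500.0] * 2)

cuts = quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"mean={mean(latencies_ms):.0f} ms  p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
# mean=108 ms  p50=80 ms  p95=80 ms  p99=1500 ms
```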
Data sources and SLO thresholds
- API gateway or load balancer metrics/logs for rate, status code, and latency
- Service metrics for internal timings and saturation
- Tracing for per-request breakdowns and slow spans
- Uptime/synthetic checks to validate external reachability
- Logs for errors, stack traces, and rare edge cases
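A synthetic check can be as small as a scheduled request that records reachability, status, and latency. Below is a standard-library sketch; the URL is a placeholder, and real probes should also cover auth and multiple regions.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> dict:
    """One synthetic check: reachability, status code, and wall-clock latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:            # reachable, but an error status
        status = exc.code
    except (urllib.error.URLError, TimeoutError):    # unreachable or timed out
        status = None
    return {
        "url": url,
        "ok": status is not None and status < 500,
        "status": status,
        "latency_ms": (time.monotonic() - start) * 1000,
    }


# Run from a scheduler (cron, CI job) and ship the result to your metrics store.
print(probe("https://api.example.com/health"))
```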
Example SLO targets (adapt to your product):
- Availability: 99.9% monthly (allowed error budget ≈ 43 minutes/month)
- Latency: p95 < 300 ms for read endpoints; p95 < 600 ms for write endpoints
- Error rate: < 1% 5xx on critical endpoints
Use these as starting points. Calibrate with product impact and existing performance.
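The error-budget figure above is simple arithmetic on the SLO; a small sketch (assuming a 30-day month) shows how availability targets translate into allowed downtime.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed full-outage minutes per period for an availability SLO."""
    return (1.0 - slo) * days * 24 * 60


for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} minutes/month")
# 99.90% -> 43.2 minutes/month
# 99.95% -> 21.6 minutes/month
# 99.99% -> 4.3 minutes/month
```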
Worked examples
1) Spike in 5xx on /checkout
- Panels to check: error rate by status, RPS, latency p95, service saturation, DB errors
- Look for: sudden deploy annotation near spike; DB connection exhaustion; timeouts
- Action: roll back or turn the feature flag off; scale the DB pool; update the runbook with findings
2) Latency regression for EU users on /search
- Panels to check: latency by region, CDN/cache hit ratio, upstream dependency latency
- Look for: region-specific dependency issues; cache misses; network path changes
- Action: warm cache; route EU traffic to nearest healthy region; investigate dependency
3) Third-party outage impacting /payments
- Panels to check: 502/504 rates, third-party call latency/errors, circuit breaker state
- Look for: elevated upstream error rate; increased retries; open circuits
- Action: enable the fallback; cap retries to avoid a retry storm; communicate degraded mode to stakeholders
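In the third-party case, "enable the fallback" and "cap retries" are commonly implemented with a circuit breaker around the provider call. The sketch below is a deliberately simplified illustration, not a production-ready implementation.

```python
import time


class CircuitBreaker:
    """Open the circuit after repeated failures; allow a probe after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # timestamp when the circuit opened

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let a request through once the cooldown has passed.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()


def charge_payment(breaker: CircuitBreaker, call_provider, fallback):
    """Use the provider while the circuit is closed; otherwise serve the fallback."""
    if not breaker.allow_request():
        return fallback()          # degraded mode, no retry storm
    try:
        result = call_provider()
        breaker.record_success()
        return result
    except Exception:
        breaker.record_failure()
        return fallback()
```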
Build a minimal endpoint health dashboard in 30 minutes
- Choose top 3–5 endpoints by business impact (e.g., /login, /checkout, /orders).
- Add RED panels per endpoint:
- [ ] RPS (stacked by method)
- [ ] Error rate % (stacked by status class)
- [ ] Latency p50/p95/p99
- Slice by key dimensions:
- [ ] Region
- [ ] API version
- [ ] Client type (mobile/web/service)
- Add dependency panels:
- [ ] DB latency and errors
- [ ] External API dependency health
- Overlay deploy annotations and feature flags.
- Display SLI lines with SLO target bands (e.g., p95 < 300 ms).
- Add a small runbook note: "If 5xx > 2% for 5 minutes → check deploys, DB pool, third-party status."
- Place a mini "Triage now" checklist:
- [ ] Identify affected endpoint/method
- [ ] Confirm scope (region/version/client)
- [ ] Check last deploy/flag
- [ ] Inspect dependency health
- [ ] Page the correct owner (add contact)
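If your tooling supports dashboards-as-code, the checklist above can be captured as a small, reviewable spec. The structure below is a generic, hypothetical schema; adapt the keys to whatever your dashboard tool actually expects.

```python
# Hypothetical spec for one critical endpoint; metric and key names are illustrative.
CHECKOUT_DASHBOARD = {
    "title": "Endpoint health: /checkout",
    "annotations": ["deploys", "feature_flags"],
    "slices": ["region", "api_version", "client_type"],
    "panels": [
        {"name": "RPS by method", "metric": "http_requests_per_second",
         "group_by": ["method"]},
        {"name": "Error rate % by status class", "metric": "http_error_ratio",
         "group_by": ["status_class"], "alert": {"threshold": 0.02, "for": "5m"}},
        {"name": "Latency p50/p95/p99", "metric": "http_request_duration_ms",
         "percentiles": [50, 95, 99], "slo_band_ms": 300},
        {"name": "DB latency and errors", "metric": "db_query_duration_ms"},
        {"name": "Payment provider health", "metric": "upstream_error_ratio"},
    ],
    "runbook_note": "If 5xx > 2% for 5 minutes: check deploys, DB pool, third-party status.",
}
```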
Alert hygiene and runbooks
- Alert on user impact (SLIs), not raw CPU: e.g., 5xx > 2% for 5 minutes, p95 latency breach for 10 minutes
- Use multi-burn-rate SLO alerts (fast + slow windows) to catch both acute and smoldering issues; a sketch follows this list
- Group alerts by endpoint; avoid duplicate pages across services
- Keep a visible runbook panel with first steps, common causes, and escalation path
- Annotate incidents so postmortems tie back to visual evidence
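A rough sketch of the multi-burn-rate idea: express the observed error rate as a multiple of the rate that would exactly exhaust the budget, and require both a fast and a slow window to exceed the factor before paging. The window sizes and the 14.4 factor follow common practice for a 99.9% SLO and should be tuned to your own targets.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the error budget is being spent (1.0 = exactly on budget pace)."""
    budget = 1.0 - slo
    return error_ratio / budget if budget > 0 else float("inf")


def should_page(fast_error_ratio: float, slow_error_ratio: float, slo: float = 0.999) -> bool:
    """Fast window (e.g., 5 min) catches acute breakage; slow window (e.g., 1 h) filters blips."""
    fast_burn = burn_rate(fast_error_ratio, slo)
    slow_burn = burn_rate(slow_error_ratio, slo)
    # Require both windows so short spikes and slow drifts do not each page on their own.
    return fast_burn > 14.4 and slow_burn > 14.4


# Example: 2% errors over the last 5 minutes and 1.6% over the last hour at a 99.9% SLO
print(should_page(0.02, 0.016))  # True -> page (budget burning 16-20x too fast)
```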
Runbook template snippet
- Validate impact: SLI panels and user reports
- Check recent changes: code deploys, config, flags
- Compare regions/versions to isolate scope
- Inspect dependencies and saturation
- Mitigate: rollback, scale, disable feature, rate-limit
- Document findings on dashboard notes
Common mistakes and self-check
- Only showing averages: hides tail latency. Self-check: do you have p95 and p99?
- Mixing endpoints in one panel without filters: hides which path is broken. Self-check: can you isolate by endpoint and method?
- No time-aligned dependency panels: slows root-cause analysis. Self-check: do service and DB graphs share the same time range?
- Alerting on infrastructure only: misses user pain. Self-check: do alerts use SLI thresholds?
- Too many panels: slows triage. Self-check: can the on-call engineer decide on next steps within 2 minutes?
Exercises
Work through these hands-on in your metrics tool, or sketch them on paper.
Exercise 1: Design a dashboard for /orders
Scenario: Traffic spikes during promotions. Users report timeouts on /orders POST. You have gateway metrics (RPS, status, latency), service metrics (CPU, threads), DB metrics (connections, slow queries), and tracing.
Task: List the panels you will add, the slices (dimensions), and the SLO target lines. Write a triage checklist of 5 bullet points.
- [ ] Panels listed (RED + dependencies)
- [ ] Slices: region/version/client
- [ ] SLO bands drawn
- [ ] Triage checklist written
- [ ] One alert condition defined
Practical projects
- Project A: Build a "Critical Endpoints" dashboard for the top 3 user flows. Include RED metrics, dependency health, deploy annotations, and a mini runbook.
- Project B: Add SLO widgets with fast/slow burn alerts for one endpoint. Document how false positives were reduced.
- Project C: Create a "Regional View" sub-dashboard comparing latency p95 and 5xx between two regions and propose routing or caching changes.
Who this is for
- API Engineers responsible for production reliability
- SREs supporting backend services
- Developers on-call for customer-facing endpoints
Prerequisites
- Basic HTTP and REST knowledge
- Familiarity with metrics (counters, gauges, histograms) and logs
- Access to your org’s metrics/tracing/uptime tools (or sample data)
Learning path
- Before this: Request/response metrics, structured logging, basic tracing
- This subskill: Build grounded dashboards for endpoint health with RED + SLO
- After this: Alerting strategy with SLO burn rates, incident response playbooks, capacity and performance tuning
Next steps
- Instrument one more endpoint and compare trends week over week
- Connect your dashboard to clear owners and on-call rotations
- Schedule a quarterly dashboard review to prune and improve panels
Mini challenge
Pick one high-impact endpoint. In 25 minutes, add a p95 latency panel split by region and draw an SLO band. In the next incident, note whether this panel sped up your triage.