Why this matters
Every production API must answer three questions at all times: How much traffic are we serving (RPS)? How many requests fail (errors)? How fast are responses (latency)? These metrics guide on-call decisions, capacity planning, and user experience.
- On-call triage: Spot traffic spikes or latency regressions before users complain.
- Release safety: Compare RPS, error rate, and p95 latency before/after a deploy.
- SLI/SLO tracking: Quantify reliability and speed with percentiles and budgets.
- Capacity planning: Ensure headroom for peak RPS without tail latency explosions.
Who this is for
- API Engineers and backend developers responsible for uptime and performance.
- SREs adding alerts and SLOs to services.
- Team leads needing crisp, shared language for incident review.
Prerequisites
- Basic HTTP knowledge (status codes, request/response).
- Comfort with simple math (rates, percentages, percentiles).
- Familiarity with metrics concepts like counters, gauges, and histograms.
Concept explained simply
Three signals: the RED method
- Rate (RPS): How many requests per second your API handles.
- Errors: What fraction of requests fail (usually 5xx).
- Duration (Latency): How long requests take, captured by percentiles (p50, p90, p95, p99).
Mental model
Imagine your service as a highway:
- RPS = number of cars entering per second.
- Errors = cars that break down on the road (5xx across the fleet).
- Latency = travel time. Averages hide traffic jams; percentiles expose tail delays.
Deep dive: Counters, gauges, histograms, summaries
- Counters: ever-increasing numbers (e.g., requests_total). Use rate() over time windows.
- Gauges: current values (e.g., in-flight requests).
- Histograms: bucketed observations; enable server-side percentiles and aggregation.
- Summaries: compute percentiles locally; cannot be aggregated across instances.
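As a toy sketch of the counter/gauge distinction and why counters are only read as rates (the in-process variables here are illustrative, not a real metrics client):

```python
# Counter: monotonically increasing; meaningful only as a rate over a window.
requests_total = 0
# Gauge: a current value that can go up and down (e.g., in-flight requests).
in_flight = 0

def handle_request():
    global requests_total, in_flight
    in_flight += 1        # gauge rises while the request is active
    requests_total += 1   # counter only ever increases
    in_flight -= 1        # gauge falls when the request completes

# rate(): take two counter readings, divide the delta by the elapsed window.
count_before = requests_total
for _ in range(300):
    handle_request()
count_after = requests_total
window_seconds = 5.0  # assume 5 s elapsed between the two readings
rps = (count_after - count_before) / window_seconds  # 300 / 5 = 60 req/s
```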
Key definitions
- RPS (requests per second): RPS = increase_in_requests / seconds_in_window.
- Error rate: Usually 5xx / total_requests over a window. Treat 4xx separately (client behavior).
- Latency percentiles: p50 = typical, p95 = slow-but-common, p99 = tail. Alerting on p99 is noisy; prefer multi-minute windows and burn-rate logic.
When to use which percentile?
- p50: Median experience; useful for regressions, not for paging.
- p95: Good balance for most end-user latency SLOs.
- p99: Critical for low-latency products; use for dashboards and well-tuned alerts.
How to measure
From counters
// Over a 5-minute window (300s)
RPS = (requests_total[t_now] - requests_total[t_5m_ago]) / 300
Error rate = requests_5xx_delta / requests_total_delta
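The two formulas above can be sketched in Python (counter deltas over a fixed window; function and variable names are illustrative):

```python
def rps(total_delta: int, window_seconds: float) -> float:
    """Requests per second from a counter increase over a fixed window."""
    return total_delta / window_seconds

def error_rate(err_5xx_delta: int, total_delta: int) -> float:
    """Fraction of requests in the window that failed with 5xx."""
    return err_5xx_delta / total_delta if total_delta else 0.0

# A 5-minute (300 s) window with 18,000 requests and 270 server errors:
print(rps(18_000, 300))         # 60.0 req/s
print(error_rate(270, 18_000))  # 0.015 → 1.5%
```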
From histograms (percentiles)
Find the bucket that contains the desired percentile, then interpolate inside the bucket if possible.
// Example buckets (cumulative counts)
<=50ms: 1500
<=100ms: 4300
<=200ms: 9200
<=400ms: 9800
<=+Inf: 10000
// p95 target rank = 0.95 * 10000 = 9500
// 200ms bucket has 9200, 400ms has 9800 → p95 is between 200 and 400ms
fraction = (9500-9200)/(9800-9200) = 300/600 = 0.5
p95 ≈ 200ms + 0.5*(400-200) = 300ms
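The bucket-search-plus-interpolation procedure above, as a small sketch (this mirrors the usual linear-interpolation approach for cumulative histogram buckets; the +Inf bucket has no upper edge, so the sketch falls back to the previous bound there):

```python
def percentile_from_buckets(buckets, q):
    """Estimate a percentile from cumulative histogram buckets.

    buckets: list of (upper_bound_ms, cumulative_count), sorted ascending,
             ending with (float('inf'), total_count).
    q: quantile in [0, 1], e.g. 0.95.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # tail bucket: no upper edge to interpolate to
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count

# The example buckets from this section:
buckets = [(50, 1500), (100, 4300), (200, 9200), (400, 9800), (float("inf"), 10000)]
print(percentile_from_buckets(buckets, 0.95))  # ≈ 300 ms
```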
Worked examples
Example 1: Compute RPS and error rate
In 5 minutes, requests_total increased by 18,000; http_5xx increased by 270.
- RPS = 18,000 / 300 = 60 req/s.
- Error rate = 270 / 18,000 = 0.015 = 1.5%.
Example 2: Estimate p95 latency from buckets
Using the histogram in the previous section, p95 ≈ 300 ms.
Example 3: Check an SLO against observed data
SLO: p95 latency ≤ 250 ms over 30 minutes. Observed p95 = 300 ms for two consecutive 5-minute windows.
- Result: SLO is not met for those windows; consider paging if breach is sustained (avoid single-window noise).
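The "sustained breach" rule from Example 3 can be sketched as a check over consecutive windows (the window length and the two-window threshold are assumptions for illustration, not a standard):

```python
def sustained_breach(p95_windows_ms, target_ms, consecutive=2):
    """True if the last `consecutive` windows all exceed the latency target."""
    recent = p95_windows_ms[-consecutive:]
    return len(recent) == consecutive and all(v > target_ms for v in recent)

# Observed p95 per 5-minute window (ms), against a 250 ms target:
print(sustained_breach([240, 255, 300, 300], 250))  # True → consider paging
print(sustained_breach([240, 300], 250))            # False → single-window noise
```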
Common mistakes and self-check
- Using average latency: Averages hide tail slowness. Prefer p95/p99.
- Mixing 4xx with errors: 4xx reflect client behavior; usually exclude from server error rate.
- Alerting on p99 spikes over 1 minute: Too noisy. Use multi-window burn-rate or sustained breaches.
- Comparing different windows: Always state the window (e.g., 5m, 15m).
- Percent vs percentage points: An increase from 1% to 2% is +1 percentage point, a 100% relative increase.
- Not labeling units: Always include ms for latency, req/s for RPS.
Self-check
- Do your metric names include units or are units documented?
- Are counters correctly rate-converted over a fixed window?
- Are you distinguishing 5xx (server) from 4xx (client) in alerts?
- Do dashboards show p50, p95, and p99 side-by-side?
Exercises
Work through these yourself first, then check the solutions if you get stuck.
Exercise 1: RPS and error rate
In the last 5 minutes: requests_total increased by 18,000; http_5xx increased by 270. Compute RPS and error rate.
Hint
- 5 minutes = 300 seconds.
- Error rate = 5xx_delta / total_delta.
Answer
RPS ≈ 60 req/s; error rate ≈ 1.5%.
Exercise 2: p95 from histogram
Cumulative bucket counts (ms): 50:1500, 100:4300, 200:9200, 400:9800, +Inf:10000. Estimate p95.
Hint
- Find the bucket containing the 95th percentile rank.
- Interpolate inside the bucket.
Answer
p95 ≈ 300 ms.
Checklist
- Include units (ms, req/s).
- Use fixed time windows (e.g., 5m) for rates.
- Separate 5xx from 4xx.
- Show p50, p95, p99 on one chart.
Practical projects
- Instrument a simple HTTP endpoint with metrics: requests_total, requests_5xx_total, request_duration histogram (ms). Add labels for method, route, status.
- Create a dashboard with three panels: RPS (5m rate), Error rate (5xx/total, 5m), Latency percentiles (p50/p95/p99).
- Write two alerts: sustained error rate > 2x budget over 5m and 30m; p95 latency > 1.5x target over 10m.
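A library-agnostic sketch of the first project's instrumentation, with the metric and label names from the bullets above (a real service would use a metrics client such as prometheus_client rather than these in-process dicts):

```python
from collections import defaultdict

# Counter keyed by (method, route, status); 5xx totals fall out of the
# status label rather than needing a separate counter.
requests_total = defaultdict(int)

# Duration histogram: cumulative counts per upper bound (ms) per (method, route).
BOUNDS_MS = [50, 100, 200, 400, float("inf")]
duration_buckets = defaultdict(lambda: [0] * len(BOUNDS_MS))

def record(method, route, status, duration_ms):
    requests_total[(method, route, status)] += 1
    # Cumulative-bucket convention: increment every bucket whose upper
    # bound covers this observation.
    buckets = duration_buckets[(method, route)]
    for i, bound in enumerate(BOUNDS_MS):
        if duration_ms <= bound:
            buckets[i] += 1

record("GET", "/users", 200, 42.0)
record("GET", "/users", 500, 310.0)
print(requests_total[("GET", "/users", 200)])  # 1
print(duration_buckets[("GET", "/users")])     # [1, 1, 1, 2, 2]
```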
Learning path
- Next: Dashboards and alert rules (practical thresholds, burn-rate).
- Then: Tracing basics to connect high latency to slow database calls.
- Then: SLOs and error budgets for product-level commitments.
- Optional: Capacity planning with queueing intuition for tail latency.
Next steps
- Finish the exercises below.
- Take the Quick Test to check your understanding.
- Apply one alert improvement to your service this week.
Mini challenge
In 10 minutes you served 120,000 requests; 5xx count = 600; p95 latency = 420 ms. Your targets: error rate ≤ 0.2%, p95 ≤ 350 ms. What’s your call?
Show solution
- Error rate = 600/120,000 = 0.5% → above 0.2% target.
- p95 = 420 ms → above 350 ms target.
- Action: Page if both are sustained; roll back recent changes or enable feature flag mitigation.