Why this matters
As an API engineer, you ship services other teams and customers rely on. Clear latency budgets and SLAs help you design fast request paths, make trade-offs, and know when you're at risk. You'll use them to:
- Set realistic, measurable latency targets across services and dependencies.
- Decide when to optimize code vs. add caching vs. simplify request flows.
- Communicate performance promises to product, support, and customers.
- Detect regressions before they break your SLA and lead to credits.
Who this is for
Backend/API engineers, SREs, and tech leads who need to set, measure, and meet latency targets for user-facing or partner APIs.
Prerequisites
- Basic HTTP/REST or gRPC knowledge.
- Familiarity with logging/metrics (timers, percentiles).
- Ability to read simple latency histograms and dashboards.
Concept explained simply
A latency budget is the time you allow for each part of a request so the total meets your target. Think of a request as a commute: you allot minutes for walking, the bus, and the elevator. If the bus is slower, you must walk faster, take an earlier bus, or change routes.
Key terms (open)
- SLI (Service Level Indicator): The measurement. Example: p95 request latency in ms.
- SLO (Service Level Objective): Your internal target. Example: p95 ≤ 300 ms over 28 days.
- SLA (Service Level Agreement): Customer-facing commitment, often with credits. Example: 99.9% of requests ≤ 500 ms per calendar month.
- Latency budget: Allocation of time across the request path to meet an SLO/SLA.
Mental model
Draw the request path and assign a time cap for each hop so the sum fits the endâtoâend target with headroom. Measure actuals continuously and adjust allocations when reality changes.
Targets and percentiles
- Median (p50) shows typical speed; tails (p95, p99) show worst user experiences.
- Choose percentiles based on risk: external SLAs often use 99.9%; internal SLOs often use 95%-99% to balance engineering cost.
- Always specify the time window and scope: e.g., p95 over rolling 28 days, excluding client aborts.
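To make the percentile definitions above concrete, here is a minimal sketch that computes p50/p95/p99 from raw latency samples using nearest-rank percentiles. The sample values are made up, and production systems usually derive percentiles from histograms rather than keeping every sample.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: p is in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latency samples (ms) for one endpoint.
latencies_ms = [42, 48, 51, 55, 60, 72, 90, 120, 210, 480]
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

Note how a single 480 ms outlier dominates both p95 and p99 here: tails move long before the median does.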
Allocating budgets across the request path
Suppose the product SLO is: API p95 ≤ 300 ms (server time). A typical HTTP path:
- Edge/LB: 20 ms
- App service: 140 ms
- Cache: 10 ms
- Primary DB: 80 ms
- Downstream service: 30 ms
- Headroom: 20 ms
Sum = 300 ms. Headroom protects you from small spikes and GC pauses. If a hop regularly exceeds its budget, you either optimize it, compensate elsewhere, or adjust the design (e.g., precompute, cache, or make the call asynchronous).
Quick math pattern
Let L_total be the end-to-end p95 target. Allocate L_total = Σ(hop budgets) + headroom. Start with baselines (current p95 per hop), then set budgets slightly below baselines to create improvement pressure while staying feasible.
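A minimal sketch of this math, assuming the example path above; the per-hop budgets and the "measured" values are illustrative, and in practice you would feed in your own tracing data.

```python
# Per-hop budgets (ms) mirroring the example path, plus headroom and target.
HOP_BUDGETS_MS = {"edge_lb": 20, "app": 140, "cache": 10, "db": 80, "downstream": 30}
HEADROOM_MS = 20
TARGET_P95_MS = 300

def check_budget(budgets, headroom, target, measured_p95):
    """Verify budgets + headroom fit the target and flag hops over budget."""
    total = sum(budgets.values()) + headroom
    status = "ok" if total <= target else "over budget"
    print(f"allocated {total} ms vs target {target} ms ({status})")
    for hop, budget in budgets.items():
        actual = measured_p95.get(hop)
        if actual is not None and actual > budget:
            print(f"  {hop}: measured p95 {actual} ms exceeds its {budget} ms budget")

# Hypothetical measured p95 values per hop, e.g. from distributed tracing.
check_budget(HOP_BUDGETS_MS, HEADROOM_MS, TARGET_P95_MS,
             {"edge_lb": 18, "app": 155, "cache": 9, "db": 76, "downstream": 28})
```

Running a check like this against fresh measurements each week keeps allocations honest as reality drifts.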
Measurement and instrumentation
- Collect SLIs with highâresolution timers and histograms (not just averages).
- Tag by endpoint, method, and dependency to see where time is spent.
- Use request IDs and distributed tracing to validate perâhop budgets.
- Define windows carefully: rolling 28 days or calendar month; document exclusions (e.g., client cancellations).
- Alert on burn rates: how quickly you're consuming your error or latency budget.
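A minimal instrumentation sketch, assuming the prometheus_client Python library; the metric name, label names, bucket boundaries, and the timed helper are illustrative choices, not a prescribed setup.

```python
import time
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "api_request_latency_seconds",
    "Server-side request latency",
    labelnames=["endpoint", "method", "dependency"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.5],  # tune around your SLO
)

def timed(endpoint, method, dependency, fn, *args, **kwargs):
    """Time fn and record it in the histogram, tagged per endpoint and hop."""
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        REQUEST_LATENCY.labels(
            endpoint=endpoint, method=method, dependency=dependency
        ).observe(time.perf_counter() - start)
```

Histogram buckets (rather than a single average) are what let you read p95/p99 and validate per-hop budgets later.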
Reliability tieâin
Error budgets track availability; latency budgets track performance. Both protect user experience and engineering velocity. Slower services can degrade reliability if clients retry aggressively, causing cascading load. Keep both budgets in view.
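As a sketch of keeping retries inside the budget (all names and numbers here are illustrative): cap attempts, add jitter, and stop retrying once the remaining budget is gone, so a slow dependency does not turn into a retry storm.

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay_s=0.05, deadline_s=0.4):
    """Call fn with capped, jittered retries under an overall deadline."""
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Full jitter: random delay up to an exponentially growing cap.
            delay = random.uniform(0, base_delay_s * (2 ** attempt))
            if time.monotonic() - start + delay > deadline_s:
                raise  # no budget left; fail fast rather than retry
            time.sleep(delay)
```

The deadline should be derived from the hop's latency budget, so retries can never push the request past the end-to-end target.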
Worked examples
Example 1: Derive remaining server budget
Goal: End-to-end p95 ≤ 500 ms from client click to response. Known non-server times: client rendering 120 ms p95, network 70 ms p95, CDN 30 ms p95. What is the server p95 budget if you reserve 30 ms of headroom?
- Non-server total = 120 + 70 + 30 = 220 ms.
- Remaining = 500 - 220 = 280 ms.
- Server budget = 280 - 30 headroom = 250 ms.
So allocate ≤ 250 ms p95 for the server path.
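The same arithmetic as a tiny helper (a sketch; the figures are the ones from this example):

```python
def server_budget_ms(end_to_end_ms, non_server_ms, headroom_ms):
    """Budget left for the server path after non-server time and headroom."""
    return end_to_end_ms - sum(non_server_ms) - headroom_ms

# Client rendering 120 ms, network 70 ms, CDN 30 ms; headroom 30 ms.
print(server_budget_ms(500, [120, 70, 30], 30))  # -> 250
```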
Example 2: Split a server budget across dependencies
Server p95 budget: 250 ms. Current p95s: app 80, DB 120, cache 15, downstream service 60 = 275 ms (over!). Options:
- Reduce DB p95 to 100 ms via index + query cleanup (-20 ms).
- Make downstream call optional with cached fallback p95 25 ms (-35 ms on tail).
New estimated total: 80 + 100 + 15 + 25 = 220 ms, leaving 30 ms headroom.
Example 3: Check SLA wording and feasibility
Proposed SLA: "99.9% of requests ≤ 400 ms per calendar month." Current p95 = 220 ms, p99 = 380 ms, p99.9 = 620 ms. Risk: p99.9 exceeds 400 ms. Options:
- Improve tail (e.g., timeouts, bulkheads, cache warming) to bring p99.9 under 400 ms.
- Adjust the SLA to 500 ms, or to 99.5% of requests under 400 ms.
Never publish an SLA you cannot meet with margin.
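A minimal feasibility check, assuming you track the relevant percentile; the 0.9 safety margin is an illustrative choice, and the figures are the ones quoted in this example.

```python
def sla_feasible(measured_ms, sla_percentile, sla_limit_ms, margin=0.9):
    """Only accept an SLA if the measured tail sits below margin * limit."""
    return measured_ms[sla_percentile] <= margin * sla_limit_ms

measured = {"p95": 220, "p99": 380, "p99.9": 620}
print(sla_feasible(measured, "p99.9", 400))  # False: 620 ms blows a 400 ms limit
```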
Hands-on exercises
These mirror the graded exercises below. Do them here first, then submit in the exercise section.
- Exercise 1: You own an API with SLO: server p95 ≤ 280 ms over 28 days. Observed p95s: LB 15, app 110, DB 140, downstream 55. Propose a feasible budget allocation and two optimizations to fit the SLO with 20 ms headroom.
- Exercise 2: SLA proposal: "99.9% of requests ≤ 300 ms per calendar month." You have: p95 160, p99 260, p99.9 420. Should you accept this SLA? If not, specify a safer SLA or engineering actions.
Checklist before you ship
- End-to-end target and percentile are explicitly stated.
- Budgets assigned for each hop, with at least 10% headroom.
- SLI definitions include window and exclusions.
- Dashboards show histograms and per-hop timings.
- Alerts tied to burn rate or tail latency spikes.
- Fallbacks and timeouts are tested in production-like conditions.
Common mistakes and self-checks
- Using averages instead of percentiles. Self-check: Do you see p95/p99 on dashboards?
- No headroom. Self-check: Is there ≥10% unallocated time?
- Unbounded fan-out. Self-check: Is concurrency capped on parallel downstream calls?
- Retry storms. Self-check: Are retries capped and jittered? Are timeouts shorter than the SLA limit?
- Cold paths ignored. Self-check: Have you measured cold starts, cache misses, and GC pauses?
Practical projects
- Instrument an endpoint with histograms and produce a monthly SLO report with p50/p90/p95/p99.
- Draw a request path and assign budgets; then run a load test to validate allocations.
- Add a fallback/cache to a slow dependency and compare p99.9 before/after.
Learning path
- First: Latency budgets and SLAs (this lesson).
- Next: Caching strategies, timeouts/retries, backpressure.
- Then: Capacity planning and load testing for tail latency.
Next steps
- Define SLIs and SLOs for one critical endpoint this week.
- Add an alert when p99 exceeds 80% of your SLA limit for 15 minutes.
- Review your dependency graph; add caps and timeouts where missing.
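The second bullet above suggests alerting when p99 stays above 80% of the SLA limit for 15 minutes. In practice you would express this as a rule in your monitoring system; the rolling-window check below is only a rough Python sketch of the condition, with illustrative constants.

```python
from collections import deque

SLA_LIMIT_MS = 500                      # illustrative SLA limit
THRESHOLD_MS = 0.8 * SLA_LIMIT_MS       # alert threshold: 80% of the limit
WINDOW_MINUTES = 15

recent_p99 = deque(maxlen=WINDOW_MINUTES)  # one p99 reading per minute

def record_minute(p99_ms):
    """Append the latest per-minute p99; return True when the alert should fire."""
    recent_p99.append(p99_ms)
    return len(recent_p99) == WINDOW_MINUTES and all(v > THRESHOLD_MS for v in recent_p99)
```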
Mini challenge
Pick one endpoint. Reduce its p99 by 25% without changing the p50. Hint: focus on the slowest 1%: cache misses, noisy neighbors, retries, and DB hotspots.
Quick Test