Who this is for
- Backend engineers building or operating APIs, services, and jobs.
- Developers integrating third-party APIs (payments, email, SMS, maps).
- Anyone preparing for system design interviews who wants real-world tactics.
Prerequisites
- Basic HTTP and REST knowledge.
- Familiarity with latency and throughput concepts.
- Comfort with logs and metrics (latency percentiles, error rates).
Why this matters
Production systems fail: networks drop packets, dependencies slow down, and jobs crash. Without good failure handling and timeouts, your service can hang threads, overload databases, and create cascading outages.
Real tasks you will do on the job:
- Set per-call timeouts and retries for third-party APIs.
- Implement circuit breakers to stop hammering an unhealthy dependency.
- Design idempotent endpoints and workers to avoid duplicate work or charges.
- Define a request timeout budget and propagate deadlines to downstream calls.
- Gracefully degrade when partial data is available (e.g., show cached prices if live pricing is slow).
Concept explained simply
Think of your service like a restaurant kitchen. Orders arrive; some ingredients (dependencies) run out or take longer. You set rules: how long to wait, when to try again, when to stop taking certain orders, and what simplified menu to serve if a station is down. That's failure and timeout handling.
Mental model
- Timeouts: A timer on each step. If it rings, stop waiting and move on.
- Retries with backoff + jitter: Try again later, but wait longer each time and add randomness to avoid thundering herds.
- Circuit breaker: A switch that opens after too many failures to protect your system; half-open lets a few test requests through; closed is normal.
- Idempotency: Repeating the same request has the same effect as making it once (safe to retry).
- Bulkheads: Separate resource pools so one failing dependency doesn't sink the whole ship.
- Graceful degradation: Offer a simpler but acceptable result when the ideal path fails.
- Deadline propagation: Pass the remaining time budget to downstream services so everyone cooperates.
Deep dive: Timeouts vs. deadlines vs. cancellation
- Per-call timeout: Maximum time to spend on one operation.
- Deadline: Absolute time when the entire request must finish; propagate to downstream calls.
- Cancellation: Stop work when the client times out or disconnects; free resources immediately.
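The relationship between a per-call timeout and a request deadline can be sketched in a few lines. This is a minimal illustration, not a library API: the `Deadline` class, its method names, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time


class Deadline:
    """Absolute point in time by which the whole request must finish."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        """True once no time is left; callers should stop starting new work."""
        return self.remaining() == 0.0

    def timeout_for(self, per_call_timeout_s):
        """Per-call timeout, clipped so it never outlives the deadline."""
        return min(per_call_timeout_s, self.remaining())
```

A caller would create one `Deadline` per incoming request, pass it (or its remaining time) to every downstream call, and use `timeout_for(...)` as the actual timeout of each operation.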
Core patterns and tools
- Retry policy: exponential backoff (e.g., base 100–200 ms) + jitter; cap retries and total time.
- Per-dependency timeouts: set based on observed p95–p99 latencies with a safety margin.
- Circuit breaker: failure rate and slow call rate thresholds; time window; half-open probing.
- Idempotency: keys, unique constraints, upserts, put-if-absent, or dedupe stores with TTL.
- Bulkheads: separate thread pools/connection pools per dependency.
- Graceful degradation: cached data, defaults, stale-while-revalidate, hide non-critical widgets.
- Hedging (advanced): duplicate a request after a delay to reduce tail latency; use only for idempotent reads.
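Of the patterns above, hedging is probably the least familiar, so here is a rough asyncio sketch of it. The function name `hedged`, its parameters, and the `fake_read` helper are hypothetical. The sketch returns the first call to finish, even if that call failed; a production version would likely wait for the other copy in that case.

```python
import asyncio


async def hedged(make_call, hedge_delay_s):
    """Run make_call(); if it is still pending after hedge_delay_s, start a
    second copy and return whichever finishes first. Idempotent reads only."""
    first = asyncio.create_task(make_call())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result()  # fast path: no duplicate request was sent
    second = asyncio.create_task(make_call())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop the loser so it does not waste resources
    return done.pop().result()


async def fake_read(delay_s, value):
    """Hypothetical stand-in for an idempotent GET."""
    await asyncio.sleep(delay_s)
    return value
```

The hedge delay is usually set near the dependency's p95 so that only the slowest few percent of requests are duplicated.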
Recommended sensible defaults
- Client-to-service HTTP timeout: 2–5 s for user requests, 10–30 s for batch jobs.
- External API calls: 1–3 s timeout; maximum 1–2 retries with backoff + jitter.
- Backoff: 100 ms, then 300 ms, then 900 ms (cap at ~1–2 s), add ±50% jitter.
- Circuit breaker: open on ≥50% failures over the last 20–50 requests, or when too many calls are slower than a p99-based threshold; half-open after 10–30 s.
- Dedupe TTL: align to business window (e.g., 24 h for payment idempotency keys).
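The backoff default above (100 ms, 300 ms, 900 ms, capped, with ±50% jitter) can be expressed as a small helper. This is a sketch under assumptions: the function name, parameter names, and the choice to apply jitter after the cap are illustrative, not prescribed by any library.

```python
import random


def backoff_delays(retries, base_s=0.1, factor=3.0, cap_s=2.0,
                   jitter=0.5, rng=random.random):
    """Return one delay per retry: base * factor**n, capped, then jittered."""
    delays = []
    for attempt in range(retries):
        nominal = min(cap_s, base_s * factor ** attempt)
        # rng() is in [0, 1); map it onto [1 - jitter, 1 + jitter).
        spread = 1.0 + jitter * (2.0 * rng() - 1.0)
        # Jitter is applied after the cap, so a delay may exceed cap_s by up to `jitter`.
        delays.append(nominal * spread)
    return delays
```

Passing `rng=lambda: 0.5` removes the randomness, which is handy for checking the nominal schedule in tests.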
Worked examples
1) Payment provider is flaky for a few seconds
- Set per-attempt timeout: 1.5 s (provider p95 ≈ 700 ms, p99 ≈ 1.2 s).
- Retries: at most 1 (after a 200 ms backoff ± jitter).
- Idempotency: require a client-generated idempotency key; store result keyed by that value for 24 h.
- Outcome: either quick success, a single retry, or a fast fail within ~3.2 s worst case (1.5 s + 0.2 s backoff + 1.5 s).
Why not more retries?
Every retry increases load on a struggling provider and lengthens user wait time. A single retry recovers transient hiccups without creating a retry storm.
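The policy in this example might look like the following sketch. Everything named here is an assumption for illustration: `TransientError`, `call_with_retry`, and the `send(payload, idempotency_key=..., timeout_s=...)` callable stand in for whatever client the payment provider actually offers.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, 5xx)."""


def call_with_retry(send, payload, idempotency_key, max_retries=1,
                    timeout_s=1.5, backoff_s=0.2, jitter=0.5, sleep=time.sleep):
    """Call send(...) with a per-attempt timeout and at most max_retries retries.

    The same idempotency key is reused on every attempt so the provider can
    recognise a retry and avoid charging twice.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(payload, idempotency_key=idempotency_key,
                        timeout_s=timeout_s)
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: fail fast instead of hammering the provider
            sleep(backoff_s * random.uniform(1 - jitter, 1 + jitter))
```

Only errors the caller classifies as transient are retried; anything else propagates immediately.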
2) Circuit breaker for a slow inventory service
- Open the breaker if failure rate ≥ 50% or if ≥ 60% calls exceed 800 ms over the last 50 requests.
- When open: short-circuit requests and serve cached inventory for 30 s.
- Half-open after 30 s: allow 5 trial requests; if successful, close breaker; otherwise re-open.
Graceful degradation plan
- Display stock status from cache; disable "real-time availability" badge.
- Record a metric tag "degraded=true" for observability.
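A count-based breaker along these lines could be sketched as below. It is deliberately simplified and every name in it is an assumption: it tracks only the failure rate (a slow call can be reported as `failed=True` by the caller, but the separate 60% slow-call threshold is not modelled), and in the half-open state it counts successful trials without limiting how many run concurrently.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window=50, min_calls=20, failure_rate=0.5,
                 open_s=30.0, trial_calls=5, clock=time.monotonic):
        self.results = deque(maxlen=window)  # True = failure; recent calls only
        self.min_calls = min_calls
        self.failure_rate = failure_rate
        self.open_s = open_s
        self.trial_calls = trial_calls
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.trial_successes = 0

    def allow(self):
        """Return True if a call may go through right now."""
        if self.state == "open" and self.clock() - self.opened_at >= self.open_s:
            self.state = "half_open"  # cool-down elapsed: let trial calls through
            self.trial_successes = 0
        return self.state != "open"

    def record(self, failed):
        """Report the outcome of a call that allow() let through."""
        if self.state == "half_open":
            if failed:
                self._open()  # any failed trial re-opens the breaker
            else:
                self.trial_successes += 1
                if self.trial_successes >= self.trial_calls:
                    self.state = "closed"
                    self.results.clear()
            return
        self.results.append(failed)
        if (len(self.results) >= self.min_calls
                and sum(self.results) / len(self.results) >= self.failure_rate):
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
```

Callers check `allow()` before each request; when it returns False they serve the cached value and tag the response as degraded.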
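3) Fan-out aggregator with a timeout budget
- Overall request budget: 1,000 ms. Local processing needs ≈ 150 ms.
- Remaining 850 ms is shared by services A, B, and C. If they are called one after another, allocate 250 ms each + 100 ms slack (3 × 250 + 100 = 850). If they are called in parallel, each call can use up to 750 ms, leaving the same 100 ms slack.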
- Propagate a deadline to A/B/C; cancel downstream work when the deadline expires.
- Return partial results if any service times out; mark fields as "partial".
Self-check
- Do you avoid starting work that cannot finish before the deadline?
- Do you release threads and connections immediately on timeout?
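For the parallel case, a shared deadline with partial results can be sketched with asyncio. The names `fan_out` and `fake_service` are hypothetical, and the sketch uses `None` to mark a missing field, which assumes real results are never `None`.

```python
import asyncio


async def fan_out(calls, budget_s):
    """Run all calls in parallel under one shared deadline.

    calls maps a field name to a zero-argument callable returning a coroutine.
    Returns (results, partial): results holds None for anything that failed or
    did not finish in time, and partial says whether any field is missing.
    """
    tasks = {name: asyncio.create_task(fn()) for name, fn in calls.items()}
    done, pending = await asyncio.wait(list(tasks.values()), timeout=budget_s)
    for task in pending:
        task.cancel()  # deadline hit: stop downstream work and free resources
    results = {}
    for name, task in tasks.items():
        ok = task in done and task.exception() is None
        results[name] = task.result() if ok else None
    return results, any(value is None for value in results.values())


async def fake_service(delay_s, value):
    """Hypothetical downstream call that answers after delay_s seconds."""
    await asyncio.sleep(delay_s)
    return value
```

With the numbers from this example, the aggregator would call `fan_out(calls, 0.75)` and keep the remaining 100 ms as slack for assembling the response.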
4) Idempotent job worker
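- Before processing job J with key K, attempt put-if-absent(K) in a fast store (TTL 24 h), storing a "processing" status.
- If K is present and marked done, skip (duplicate). If absent, process, then mark the result status under K.
- On crash mid-process: the job will retry and find K still marked "processing". Let it resume after a short lease expires, and make the side-effect step itself idempotent (for example, pass K as the idempotency key) so reaching that step again causes no duplicate side effects.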
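The put-if-absent step can be illustrated with an in-memory stand-in for the fast store. `DedupeStore` and `process_once` are hypothetical names. The sketch treats any unexpired key as a duplicate, so it does not recover a job that crashed between claiming the key and marking it done; a real worker needs extra handling for that case.

```python
import time


class DedupeStore:
    """In-memory stand-in for a fast store with put-if-absent and TTL."""

    def __init__(self, clock=time.monotonic):
        self._items = {}  # key -> (value, expires_at)
        self._clock = clock

    def put_if_absent(self, key, value, ttl_s):
        """Store value and return True only if key is absent or expired."""
        entry = self._items.get(key)
        if entry is not None and entry[1] > self._clock():
            return False
        self._items[key] = (value, self._clock() + ttl_s)
        return True

    def set(self, key, value, ttl_s):
        """Overwrite the value under key and restart its TTL."""
        self._items[key] = (value, self._clock() + ttl_s)


def process_once(store, key, side_effect, ttl_s=24 * 3600):
    """Run side_effect() at most once per key within the TTL window."""
    if not store.put_if_absent(key, "processing", ttl_s):
        return "skipped"  # an earlier delivery of this job already claimed the key
    side_effect()
    store.set(key, "done", ttl_s)
    return "processed"
```

In a shared store such as Redis the claim must be a single atomic operation; the dictionary here is only safe because the example is single-threaded.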
Common mistakes (and how to self-check)
- No timeouts: Hanging calls pile up threads and sockets. Self-check: grep configs for missing timeouts; ensure every I/O has explicit timeouts.
- Unlimited retries: Causes retry storms. Self-check: enforce max retries and total time budget.
- Non-idempotent retries: Duplicated charges/orders. Self-check: require idempotency keys or unique constraints for side-effects.
- Single shared pool: One slow dependency consumes all threads. Self-check: separate pools (bulkheads) per dependency.
- No deadline propagation: Downstream keeps working after the client left. Self-check: pass cancellation/deadline contexts.
- Ignoring partial responses: Failing the whole request when a partial result would have worked. Self-check: clearly define acceptable degraded outputs.
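To make the "single shared pool" mistake concrete, here is one way a per-dependency bulkhead could be sketched with a semaphore. `Bulkhead` and `BulkheadFull` are hypothetical names, and rejecting immediately instead of queueing is a design choice of this example, not the only option.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency already has its maximum number of calls in flight."""


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot for this dependency")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()  # always give the slot back, even on error
```

Creating one `Bulkhead` per dependency means a slow inventory service can exhaust only its own slots, while calls to other dependencies keep flowing.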
Exercises
Hands-on practice. You can compare with the solutions at the end of each exercise.
Exercise 1: Design a timeout and retry policy for an email API
Constraints:
- Observed latencies: p50=120 ms, p95=400 ms, p99=800 ms.
- User request budget: 1,200 ms total; other work needs 400 ms.
- API is idempotent if you pass an Idempotency-Key.
Task: Propose per-attempt timeout, number of retries, backoff schedule, jitter, and how you will ensure idempotency. Keep within budget.
Solution
Budget for email call: ~800 ms (1,200 − 400). Choose:
- Per-attempt timeout: 450 ms (covers p95, allows one retry).
- Retries: 1 max.
- Backoff: 150 ms ± 50% jitter.
- Total worst case: 450 + 150 + 450 = 1,050 ms, which exceeds the 800 ms budget. Either enforce the 800 ms deadline so the retry is cut short, or use the tighter settings below.
- Include Idempotency-Key header; store send result keyed by that value for 24 h.
To stay within a hard 800 ms cap: use per-attempt 350 ms, backoff 100 ms (no upward jitter), 1 retry → 350 + 100 + 350 = 800 ms. The trade-off is that 350 ms is below p95 (400 ms), so slightly more first attempts will time out.
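The worst-case arithmetic in this solution is easy to check with a small helper. The function name and parameters are made up for the example, and it assumes a fixed (non-growing) backoff between attempts.

```python
def worst_case_ms(timeout_ms, retries, backoff_ms, jitter=0.0):
    """Longest possible wall time: every attempt times out and every
    backoff lands on the high end of its jitter range."""
    attempts = retries + 1
    return attempts * timeout_ms + retries * backoff_ms * (1.0 + jitter)
```

For instance, `worst_case_ms(450, 1, 150)` gives 1,050 ms and `worst_case_ms(350, 1, 100)` gives 800 ms; setting `jitter=0.5` shows how much upward jitter adds on top of the nominal figure.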
Exercise 2: Make a worker idempotent
Your job processes invoices and charges a card. Retries can happen. Design idempotency so the customer is never double-charged.
- Inputs: invoice_id, amount, card_token
- Payment gateway supports an idempotency key
Task: Outline the exact steps and data structures you will use.
Solution
- Compute key K = "charge:" + invoice_id.
- Attempt put-if-absent(K) in a fast store (e.g., Redis) with TTL 48 h; value = "processing".
- If present and status = "success", skip (already charged). If "processing", delay/backoff and retry.
- Call gateway with Idempotency-Key=K and per-attempt timeout.
- On success, update store K → {status: "success", charge_id} and persist mapping in DB with a unique constraint on invoice_id.
- On failure/timeouts: if retriable, retry; if terminal, set K → {status: "failed"}.
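The steps above can be sketched as a single function. This is an illustration under several assumptions: a plain dict stands in for the fast store (so there is no TTL and no atomicity across processes), `gateway_charge` is a hypothetical callable rather than a real gateway client, `TerminalError` is an invented error type, and the database write with its unique constraint is left out.

```python
class TerminalError(Exception):
    """Hypothetical error for failures that must not be retried (e.g. card declined)."""


def charge_invoice(store, gateway_charge, invoice_id, amount, card_token):
    """Charge an invoice at most once; returns (outcome, charge_id or None)."""
    key = f"charge:{invoice_id}"
    claim = {"status": "processing"}
    # setdefault returns the existing entry if there is one, else stores our claim.
    entry = store.setdefault(key, claim)
    if entry is not claim:
        if entry["status"] == "success":
            return ("already_charged", entry["charge_id"])
        if entry["status"] == "processing":
            return ("retry_later", None)  # another worker holds the key
        return ("failed", None)  # a terminal failure was recorded earlier
    try:
        charge_id = gateway_charge(amount, card_token, idempotency_key=key)
    except TerminalError:
        store[key] = {"status": "failed"}
        return ("failed", None)
    except Exception:
        # Retriable: release the claim. The gateway-side idempotency key keeps
        # a later retry from charging twice.
        del store[key]
        raise
    store[key] = {"status": "success", "charge_id": charge_id}
    return ("charged", charge_id)
```

The same key K is used both for the local dedupe entry and as the gateway's idempotency key, so a retry is safe even if the local store loses its entry.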
Checklist to validate your designs
- Every external call has a per-attempt timeout.
- Retries are capped and include exponential backoff + jitter.
- Idempotency keys and/or unique constraints prevent duplicate side-effects.
- Circuit breaker thresholds are defined and tested.
- Deadline/cancellation is propagated to all downstream calls.
- There is a clear degraded response plan.
Practical projects
- Build a tiny aggregator service that calls two public mock endpoints; implement a 1 s deadline, partial responses, and a circuit breaker for one dependency.
- Write a job worker that reads tasks from a queue, performs a side-effect, and guarantees idempotency using a dedupe store + database unique constraint.
- Add an optional hedging feature to a read-only API client; enable only for idempotent GETs and measure p99 improvements in a local test harness.
Mini challenge
Your service has a 900 ms SLO. You fan-out to two services in parallel, each with p99 350 ms. Propose a budget split, timeouts, and a fallback if one times out. Keep your solution within the 900 ms envelope.
One possible answer
- Local work: 100 ms; remaining 800 ms.
- Per-call timeouts: 350 ms each; allow 50 ms slack; no retries in the fan-out path.
- Deadline propagation: 800 ms budget passed downstream.
- Fallback: return data from whichever service responds; mark missing fields as "unavailable" and log a degradation metric.
Learning path
- Start with timeouts and retries (this page).
- Add circuit breakers and bulkheads.
- Implement idempotency for writes and jobs.
- Practice deadline propagation and partial responses.
- Evaluate advanced patterns (hedging, request collapsing) only after basics are stable.
Next steps
- Apply these patterns to one real endpoint this week; ship behind a feature flag.
- Instrument metrics: timeouts, retry counts, breaker state, and degraded responses.
- Run a failure injection drill: simulate slow dependency and verify graceful degradation.