How to learn handling failures and timeouts (System Design Basics) as a Backend Engineer, for free

Who this is for

Backend Engineers building or operating APIs, services, and jobs.
Developers integrating third‑party APIs (payments, email, SMS, maps).
Anyone preparing for system design interviews who wants real-world tactics.

Prerequisites

Basic HTTP and REST knowledge.
Familiarity with latency and throughput concepts.
Comfort with logs and metrics (latency percentiles, error rates).

Why this matters

Production systems fail: networks drop packets, dependencies slow down, and jobs crash. Without good failure handling and timeouts, your service can hang threads, overload databases, and create cascading outages.

Real tasks you will do on the job:

Set per-call timeouts and retries for third‑party APIs.
Implement circuit breakers to stop hammering an unhealthy dependency.
Design idempotent endpoints and workers to avoid duplicate work or charges.
Define a request timeout budget and propagate deadlines to downstream calls.
Gracefully degrade when partial data is available (e.g., show cached prices if live pricing is slow).

Concept explained simply

Think of your service like a restaurant kitchen. Orders arrive; some ingredients (dependencies) run out or take longer. You set rules: how long to wait, when to try again, when to stop taking certain orders, and what simplified menu to serve if a station is down. Those rules are your failure and timeout handling.

Mental model

Timeouts: A timer on each step. If it rings, stop waiting and move on.
Retries with backoff + jitter: Try again later, but wait longer each time and add randomness to avoid thundering herds.
Circuit breaker: A switch that opens after too many failures to protect your system; half‑open lets a few test requests through; closed is normal.
Idempotency: Repeating the same request has the same effect as making it once, so it is safe to retry.
Bulkheads: Separate resource pools so one failing dependency doesn’t sink the whole ship.
Graceful degradation: Offer a simpler but acceptable result when the ideal path fails.
Deadline propagation: Pass the remaining time budget to downstream services so everyone cooperates.

Deep dive: Timeouts vs. deadlines vs. cancellation

Per-call timeout: Maximum time to spend on one operation.
Deadline: Absolute time when the entire request must finish; propagate to downstream calls.
Cancellation: Stop work when the client times out or disconnects; free resources immediately.
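
A minimal sketch of how a deadline and a per-call timeout can work together. The Deadline class and its method names are hypothetical helpers for illustration, not part of any library; the clock is injectable so the behavior can be checked without waiting.

```python
import time


class Deadline:
    """Absolute point in time by which the whole request must finish."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        """True once no time is left; callers should cancel and free resources."""
        return self.remaining() == 0.0

    def timeout_for(self, per_call_cap_s: float) -> float:
        """Per-call timeout: the smaller of the call's own cap and the time left."""
        return min(per_call_cap_s, self.remaining())
```

Each downstream call would then use deadline.timeout_for(cap) as its timeout, so a call made late in the request never waits longer than the request itself is allowed to.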

Core patterns and tools

Retry policy: exponential backoff (e.g., base 100–200 ms) + jitter; cap retries and total time.
Per‑dependency timeouts: set based on observed p95–p99 latencies with safety margin.
Circuit breaker: failure rate and slow call rate thresholds; time window; half‑open probing.
Idempotency: keys, unique constraints, upserts, put-if-absent, or dedupe stores with TTL.
Bulkheads: separate thread pools/connection pools per dependency.
Graceful degradation: cached data, defaults, stale‑while‑revalidate, hide non‑critical widgets.
Hedging (advanced): duplicate a request after a delay to reduce tail latency; use only for idempotent reads.
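
To make the bulkhead idea concrete, here is a small sketch that caps concurrent calls to one dependency with a semaphore and rejects extra calls instead of queueing them. Bulkhead and BulkheadFull are hypothetical names; a real service might use separate thread or connection pools instead.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the dependency's compartment has no free slot."""


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: if the compartment is full, fail fast
        # so callers of other dependencies are not starved.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("dependency compartment is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Creating one Bulkhead per dependency means a slow payment provider can exhaust only its own slots, while calls to other dependencies keep flowing.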

Recommended sensible defaults

Client-to-service HTTP timeout: 2–5 s for user requests, 10–30 s for batch jobs.
External API calls: 1–3 s timeout; maximum 1–2 retries with backoff + jitter.
Backoff: 100 ms, then 300 ms, then 900 ms (cap at ~1–2 s), add ±50% jitter.
Circuit breaker: open on ≥50% failures over the last 20–50 requests, or when too many calls in that window are slow (for example, most calls exceeding the dependency's normal p99 latency); half‑open after 10–30 s.
Dedupe TTL: align to business window (e.g., 24 h for payment idempotency keys).
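
The backoff default above can be sketched as a small function. The function name and parameters are illustrative assumptions; the random source is injectable so the schedule can be checked deterministically.

```python
import random


def backoff_delays(base_ms=100, factor=3, cap_ms=2000, retries=3,
                   jitter=0.5, rng=random.random):
    """Return one jittered delay (ms) per retry: base, base*factor, ..., capped."""
    delays = []
    for attempt in range(retries):
        # Nominal delay grows geometrically and is capped before jitter,
        # so a jittered delay can still exceed cap_ms by up to the jitter share.
        nominal = min(cap_ms, base_ms * factor ** attempt)
        # Scale by a random factor in [1 - jitter, 1 + jitter).
        scale = 1 + jitter * (2 * rng() - 1)
        delays.append(nominal * scale)
    return delays
```

With the jitter centered (rng returning 0.5) the schedule is exactly 100, 300, 900 ms; in production each client draws its own random factor, which spreads retries out instead of synchronizing them.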

Worked examples

1) Payment provider is flaky for a few seconds

Set per‑attempt timeout: 1.5 s (provider p95 ≈ 700 ms, p99 ≈ 1.2 s).
Retries: at most 1, after a 200 ms backoff with jitter.
Idempotency: require a client‑generated idempotency key; store result keyed by that value for 24 h.
Outcome: either quick success, a single retry, or a fast fail. Worst case is 1.5 s + ~0.2 s backoff + 1.5 s ≈ 3.2 s.

Why not more retries?

Every retry increases load on a struggling provider and lengthens user wait time. A single retry recovers transient hiccups without creating a retry storm.
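
The payment policy above might look like the following sketch. The send callable stands in for a hypothetical provider client that accepts an idempotency key and a timeout and raises TimeoutError when the attempt times out; none of these names come from a real SDK.

```python
import random
import time


def charge_with_retry(send, idempotency_key, timeout_s=1.5, max_retries=1,
                      backoff_s=0.2, sleep=time.sleep):
    """Call send(key, timeout_s) up to 1 + max_retries times, reusing one key."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            # The same key on every attempt lets the provider dedupe the charge.
            return send(idempotency_key, timeout_s)
        except TimeoutError as exc:
            last_error = exc
            if attempt < max_retries:
                # +/-50% jitter around the base backoff.
                sleep(backoff_s * random.uniform(0.5, 1.5))
    raise last_error
```

Only timeouts are retried here; a real client would also decide which error responses are retriable, which this sketch leaves out.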

2) Circuit breaker for a slow inventory service

Open the breaker if the failure rate is ≥ 50% or if ≥ 60% of calls exceed 800 ms over the last 50 requests.
When open: short‑circuit requests and serve cached inventory for 30 s.
Half‑open after 30 s: allow 5 trial requests; if successful, close breaker; otherwise re‑open.

Graceful degradation plan

Display stock status from cache; disable “real‑time availability” badge.
Record a metric tag "degraded=true" for observability.
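
The breaker rules in this example can be sketched as a small state machine. This is an illustrative, single-threaded sketch with hypothetical names, not a production library; it evaluates the thresholds only once the 50-request window is full, and in half‑open it judges trial calls by success or failure only.

```python
import time
from collections import deque


class CircuitBreaker:
    def __init__(self, window=50, failure_rate=0.5, slow_rate=0.6, slow_s=0.8,
                 open_s=30.0, trial_calls=5, clock=time.monotonic):
        self.window, self.failure_rate, self.slow_rate = window, failure_rate, slow_rate
        self.slow_s, self.open_s, self.trial_calls = slow_s, open_s, trial_calls
        self._clock = clock
        self._outcomes = deque(maxlen=window)  # (failed, slow) per recent call
        self.state = "closed"
        self._opened_at = 0.0
        self._trials_started = 0
        self._trial_successes = 0

    def allow(self) -> bool:
        """Return True if the caller may attempt the real call now."""
        if self.state == "open" and self._clock() - self._opened_at >= self.open_s:
            self.state = "half_open"
            self._trials_started = 0
            self._trial_successes = 0
        if self.state == "half_open":
            if self._trials_started >= self.trial_calls:
                return False  # trial quota used; serve the fallback
            self._trials_started += 1
            return True
        return self.state == "closed"

    def record(self, ok: bool, duration_s: float) -> None:
        """Report the outcome of a call that allow() admitted."""
        if self.state == "half_open":
            if not ok:
                self._open()  # any failed trial re-opens the breaker
            else:
                self._trial_successes += 1
                if self._trial_successes >= self.trial_calls:
                    self.state = "closed"
                    self._outcomes.clear()
            return
        if self.state == "open":
            return  # late result from before the breaker opened; ignore
        self._outcomes.append((not ok, duration_s > self.slow_s))
        if len(self._outcomes) == self.window:
            failures = sum(f for f, _ in self._outcomes)
            slow = sum(s for _, s in self._outcomes)
            if (failures / self.window >= self.failure_rate
                    or slow / self.window >= self.slow_rate):
                self._open()

    def _open(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._outcomes.clear()
```

A caller checks allow() before each inventory request; when it returns False, the caller serves cached inventory and tags the response as degraded.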

3) Fan‑out aggregator with a timeout budget

Overall request budget: 1,000 ms. Local processing needs ≈ 150 ms.
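Remaining 850 ms split across services A, B, C. If they are called one after another, allocate 250 ms each + 100 ms slack (3 × 250 + 100 = 850). If they are called in parallel, each call can be given a timeout of up to 750 ms with the same 100 ms slack.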
Propagate a deadline to A/B/C; cancel downstream work when the deadline expires.
Return partial results if any service times out; mark fields as "partial".
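
A sketch of the fan‑out with partial results, assuming the three calls run in parallel and each is an async function. The fan_out helper and the per-call timeout parameter are illustrative assumptions.

```python
import asyncio


async def fan_out(calls, per_call_timeout_s=0.25):
    """Run all calls in parallel; a call that times out yields the 'partial' marker."""

    async def guarded(name, make_call):
        try:
            # wait_for cancels the underlying coroutine when the timeout fires,
            # so abandoned downstream work is not left running.
            return name, await asyncio.wait_for(make_call(), per_call_timeout_s)
        except asyncio.TimeoutError:
            return name, "partial"

    pairs = await asyncio.gather(*(guarded(n, c) for n, c in calls.items()))
    return dict(pairs)
```

Called as asyncio.run(fan_out({"A": call_a, "B": call_b, "C": call_c})), this returns one entry per service, with slow services replaced by the marker rather than failing the whole response.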

Self-check

Do you avoid starting work that cannot finish before the deadline?
Do you release threads and connections immediately on timeout?

4) Idempotent job worker

Before processing job J with key K, attempt put-if-absent(K) in a fast store (TTL 24 h).
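If K is already present, skip the job (duplicate). If absent, process the job, then record the result status under K.
On a crash mid‑process the job will be retried, but K is already present, so the retry is skipped and the side effect is never repeated. The cost is that a job that crashed before its side effect stays unfinished until the TTL expires; if it must complete, store a status under K (e.g., "processing" vs. "done") and make the side effect itself idempotent so a retry can safely resume it.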
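
The worker steps above can be sketched with an in-memory stand-in for the fast store. DedupeStore and handle_job are hypothetical names, and TTL handling is omitted; a real store would expire keys after 24 h.

```python
class DedupeStore:
    """In-memory stand-in for a fast store with put-if-absent (TTL omitted)."""

    def __init__(self):
        self._data = {}

    def put_if_absent(self, key, value) -> bool:
        """Store value under key only if the key is new; report whether it was stored."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def set(self, key, value) -> None:
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def handle_job(store, key, side_effect):
    """Run side_effect at most once per key."""
    if not store.put_if_absent(key, "processing"):
        return "skipped"  # another attempt already claimed this key
    side_effect()
    store.set(key, "done")
    return "processed"
```

Note that the key is claimed before the side effect runs, so a crash between the two leaves the key at "processing" and later attempts are skipped: this sketch chooses no duplicates over guaranteed completion.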

Common mistakes (and how to self‑check)

No timeouts: Hanging calls pile up threads and sockets. Self‑check: grep configs for missing timeouts; ensure every I/O has explicit timeouts.
Unlimited retries: Causes retry storms. Self‑check: enforce max retries and total time budget.
Non‑idempotent retries: Duplicated charges/orders. Self‑check: require idempotency keys or unique constraints for side‑effects.
Single shared pool: One slow dependency consumes all threads. Self‑check: separate pools (bulkheads) per dependency.
No deadline propagation: Downstream keeps working after the client left. Self‑check: pass cancellation/deadline contexts.
Ignoring partial responses: Failing the whole request when a partial result would have worked. Self‑check: clearly define acceptable degraded outputs.

Exercises

Hands-on practice. You can compare with the solutions at the end of each exercise.

Exercise 1: Design a timeout and retry policy for an email API

Constraints:

Observed latencies: p50=120 ms, p95=400 ms, p99=800 ms.
User request budget: 1,200 ms total; other work needs 400 ms.
API is idempotent if you pass an Idempotency-Key.

Task: Propose per‑attempt timeout, number of retries, backoff schedule, jitter, and how you will ensure idempotency. Keep within budget.

Solution

Budget for email call: ~800 ms (1,200 − 400). Choose:

Per‑attempt timeout: 450 ms (covers p95, allows one retry).
Retries: 1 max.
Backoff: 150 ms ± 50% jitter.
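Total worst case: 450 + 150 + 450 = 1,050 ms (1,125 ms with maximum jitter). That exceeds the 800 ms left after other work, so it fits only if the email call starts at the beginning of the request and overlaps with the other 400 ms of work, staying under the 1,200 ms total. Dropping the timeout to 400 ms still gives 950 ms.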
Include Idempotency-Key header; store send result keyed by that value for 24 h.

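If you must hard‑cap at 800 ms: use per‑attempt 350 ms, backoff 100 ms, 1 retry → 350 + 100 + 350 = 800 ms. Apply jitter only downward here (e.g., 50–100 ms) so the total stays within the cap, and accept that a 350 ms timeout sits below p95 (400 ms), so more than 5% of first attempts will time out and retry.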
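
The budget arithmetic in this solution can be checked with a small helper. The function is an illustrative assumption and models a constant backoff between attempts, which is enough for a single retry.

```python
def worst_case_ms(timeout_ms, retries, backoff_ms, jitter=0.0):
    """Upper bound on time spent: every attempt times out, every backoff at max jitter."""
    attempts = retries + 1
    return attempts * timeout_ms + retries * backoff_ms * (1 + jitter)


# The two policies discussed above, without jitter.
first_choice = worst_case_ms(450, retries=1, backoff_ms=150)   # 1050
hard_capped = worst_case_ms(350, retries=1, backoff_ms=100)    # 800
```

Running the same helper with jitter=0.5 shows how much headroom upward jitter consumes, which is why the hard‑capped variant has none to spare.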

Exercise 2: Make a worker idempotent

Your job processes invoices and charges a card. Retries can happen. Design idempotency so the customer is never double‑charged.

Inputs: invoice_id, amount, card_token
Payment gateway supports an idempotency key

Task: Outline the exact steps and data structures you will use.

Solution

Compute key K = "charge:" + invoice_id.
Attempt put-if-absent(K) in a fast store (e.g., Redis) with TTL 48 h; value = "processing".
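If present and status = "success", skip (already charged). If "processing", another attempt may be in flight or may have crashed: delay/backoff and retry, and if the entry stays "processing" beyond a lease timeout, continue to the gateway call; reusing Idempotency-Key=K makes that call safe to repeat.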
Call gateway with Idempotency-Key=K and per‑attempt timeout.
On success, update store K → {status: "success", charge_id} and persist mapping in DB with a unique constraint on invoice_id.
On failure/timeouts: if retriable, retry; if terminal, set K → {status: "failed"}.
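
The database side of this solution, the unique constraint on invoice_id, can be sketched with SQLite. The table layout, function names, and the gateway_charge callable are assumptions for illustration; gateway_charge stands in for a gateway client that dedupes on the idempotency key. The dedupe-store steps are left out to keep the focus on the constraint.

```python
import sqlite3


def open_db():
    """Create an in-memory database with one row allowed per invoice."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE charges (invoice_id TEXT PRIMARY KEY, charge_id TEXT NOT NULL)"
    )
    return db


def record_charge(db, invoice_id, charge_id) -> bool:
    """Insert the invoice -> charge mapping; False if the invoice was already recorded."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO charges (invoice_id, charge_id) VALUES (?, ?)",
                (invoice_id, charge_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def charge_invoice(db, gateway_charge, invoice_id, amount, card_token):
    """Charge an invoice once, returning the existing charge on repeats."""
    key = "charge:" + invoice_id
    row = db.execute(
        "SELECT charge_id FROM charges WHERE invoice_id = ?", (invoice_id,)
    ).fetchone()
    if row:
        return row[0]  # already charged and recorded
    # Safe to repeat after a crash: the gateway dedupes on the same key.
    charge_id = gateway_charge(key, amount, card_token)
    record_charge(db, invoice_id, charge_id)
    return charge_id
```

The constraint is the last line of defense: even if two workers race past the lookup, only one row per invoice can be written, and the gateway key ensures both received the same charge.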

Checklist to validate your designs

Every external call has a per‑attempt timeout.
Retries are capped and include exponential backoff + jitter.
Idempotency keys and/or unique constraints prevent duplicate side‑effects.
Circuit breaker thresholds are defined and tested.
Deadline/cancellation is propagated to all downstream calls.
There is a clear degraded response plan.

Practical projects

Build a tiny aggregator service that calls two public mock endpoints; implement a 1 s deadline, partial responses, and a circuit breaker for one dependency.
Write a job worker that reads tasks from a queue, performs a side‑effect, and guarantees idempotency using a dedupe store + database unique constraint.
Add an optional hedging feature to a read‑only API client; enable only for idempotent GETs and measure p99 improvements in a local test harness.

Mini challenge

Your service has a 900 ms SLO. You fan‑out to two services in parallel, each with p99 350 ms. Propose a budget split, timeouts, and a fallback if one times out. Keep your solution within the 900 ms envelope.

One possible answer

Local work: 100 ms; remaining 800 ms.
Per-call timeouts: 350 ms each; allow 50 ms slack; no retries in the fan‑out path.
Deadline propagation: 800 ms budget passed downstream.
Fallback: return data from whichever service responds; mark missing fields as "unavailable" and log a degradation metric.

Learning path

Start with timeouts and retries (this page).
Add circuit breakers and bulkheads.
Implement idempotency for writes and jobs.
Practice deadline propagation and partial responses.
Evaluate advanced patterns (hedging, request collapsing) only after basics are stable.

Next steps

Apply these patterns to one real endpoint this week; ship behind a feature flag.
Instrument metrics: timeouts, retry counts, breaker state, and degraded responses.
Run a failure injection drill: simulate slow dependency and verify graceful degradation.
