Retries, Backoff, and Timeouts

Learn retries, backoff, and timeouts for free with explanations, exercises, and a quick test (for API Engineers).

Published: January 21, 2026 | Updated: January 21, 2026

Who this is for

This subskill is for API Engineers, backend developers, and SREs who design or call networked services and want reliable, fast, and cost-effective APIs.

Prerequisites

  • Comfortable with HTTP basics (methods, status codes)
  • Know how to make network calls in at least one language
  • Basic understanding of concurrency and errors/exceptions

Why this matters

Real-life API calls fail: transient network errors, overloaded services, and slow dependencies happen. Smart retries, backoff, and timeouts keep your system fast for users and cheap to run:

  • Reduce user-visible errors without hammering a struggling service
  • Meet SLAs/SLOs by bounding request time and avoiding cascading failures
  • Protect downstream systems by keeping retry load fair and bounded

Typical tasks you'll face

  • Choosing retry policies for HTTP 429 and 5xx responses
  • Setting timeouts to match end-to-end request deadlines
  • Implementing exponential backoff with jitter
  • Using idempotency keys to safely retry POST requests
  • Propagating deadlines across services

Concept explained simply

Retries: try again when a failure might be temporary. Backoff: wait longer between tries to avoid overload. Timeouts: stop waiting after a limit so work does not stall forever.

Mental model

Picture a sand timer for each call. When the sand runs out (timeout), you stop waiting. You may then retry, pausing a little longer before each new attempt (backoff). You retry only when it is safe and likely to help, and you stop before your overall time budget is gone.

Key decisions you must make

  • Timeouts: per-try timeout vs overall deadline (time budget)
  • Retry policy: which errors to retry, max attempts, when to stop
  • Backoff: constant, exponential, or exponential with jitter; minimum and maximum delay
  • Idempotency: ensure safe retries for write operations (keys or idempotent ops)
  • Fairness and protection: honor Retry-After header, use circuit breakers, avoid thundering herds
  • Observability: log attempt number, error reason, delay, remaining deadline

Worked examples

Example 1: Calling a payment API

Goal: Fast checkout with safe retries for transient errors.

  • Retryable errors: HTTP 408, 429, 500, 502, 503, 504
  • Non-retryable: 400, 401, 403, 404, validation errors
  • Idempotency: Use an Idempotency-Key for POST /charge
  • Timeouts: overall deadline 3s; per-try timeout 900ms; 3 attempts max
  • Backoff: exponential with full jitter, base 100ms, cap 600ms
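// Try timeline (worst case: every attempt runs to its 900ms timeout):
// t=0ms: attempt #1 (timeout 900ms)
// t=900ms: attempt #1 times out; wait random in [0..100]ms
// t~(900-1000)ms: attempt #2
// t~(1800-1900)ms: attempt #2 times out; wait random in [0..200]ms
// t~(1800-2100)ms: attempt #3 (its timeout is capped by what remains of the 3s deadline)

Why this works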

Idempotency avoids double charges. Jitter spreads load. Per-try timeout keeps room for later attempts before the 3s deadline.
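The budget arithmetic for this example can be checked mechanically. The sketch below uses a hypothetical helper (`worst_case_ms` and its parameter names are illustrative, not from any library) and assumes the worst case: every attempt runs to its per-try timeout and every jittered wait takes its maximum value.

```python
def worst_case_ms(attempts: int, per_try_timeout_ms: int,
                  base_ms: int, cap_ms: int, factor: int = 2) -> int:
    """Upper bound on total time: all tries time out, all waits hit their max."""
    tries = attempts * per_try_timeout_ms
    # One wait sits between consecutive attempts, so there are attempts - 1 waits.
    waits = sum(min(cap_ms, base_ms * factor ** i) for i in range(attempts - 1))
    return tries + waits


# Example 1: 3 attempts, 900 ms per try, base 100 ms, cap 600 ms.
total = worst_case_ms(attempts=3, per_try_timeout_ms=900, base_ms=100, cap_ms=600)
print(total)  # 3000, which just fits the 3 s overall deadline
```

If the result exceeds the overall deadline, the last attempt will be cut short, so either lower the per-try timeout or reduce the attempt count.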

Example 2: Worker consuming messages

Processing an item may fail due to a temporary dependency outage.

  • Policy: retry up to 5 times with exponential backoff and jitter
  • Backoff schedule (no jitter): 200ms, 400ms, 800ms, 1600ms, 1600ms (doubling from 200ms, capped at 1600ms)
  • With full jitter, each delay becomes random in [0..delay]
  • Dead-letter after max attempts; include reason and last error

Tip

Persist attempt count and next-visibility time (or schedule) so restarts do not lose retry state.
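The schedule in this example can be reproduced with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the base of 200 ms, factor of 2, and cap of 1600 ms are inferred from the listed delays rather than stated in the example.

```python
import random


def backoff_schedule(base_ms: int, factor: int, cap_ms: int, retries: int) -> list:
    """Un-jittered delay before each retry: base * factor**n, capped at cap_ms."""
    return [min(cap_ms, base_ms * factor ** n) for n in range(retries)]


def full_jitter(delay_ms: int, rng: random.Random) -> float:
    """Full jitter: pick uniformly in [0, delay_ms]."""
    return rng.uniform(0, delay_ms)


schedule = backoff_schedule(base_ms=200, factor=2, cap_ms=1600, retries=5)
print(schedule)  # [200, 400, 800, 1600, 1600]

rng = random.Random(42)  # seeded only so the demo is repeatable
jittered = [full_jitter(d, rng) for d in schedule]
```

A worker would store the chosen delay as the message's next-visibility time, so the value survives a restart.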

Example 3: Fan-out service with a deadline

A request fans out to 3 downstream services. The client gives you a 1200ms deadline.

  • Reserve 100ms for your own work and margins
  • Budget ~1100ms for downstream calls total
  • Run them in parallel, each with per-try timeout 400ms and at most 1 retry if the overall deadline allows
  • If any single response is enough to answer the request, cancel the remaining calls as soon as the first successful one arrives

Outcome

You bound total latency, avoid wasting attempts when the deadline is near, and reduce tail latency.
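One way to sketch this fan-out is with `asyncio`. All names below (`call_with_retry`, `fan_out`, and the `fast` and `slow` stubs) are hypothetical, and the sketch retries immediately on timeout without backoff to stay short. It assumes every downstream result is wanted, so it does not cancel on first success.

```python
import asyncio
import time


async def call_with_retry(call, per_try_timeout_s: float, deadline: float):
    """Try `call` at most twice; skip the retry if the deadline has passed."""
    for _ in range(2):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            return await asyncio.wait_for(call(), timeout=min(per_try_timeout_s, remaining))
        except asyncio.TimeoutError:
            continue
    return None  # the caller decides how to degrade when a dependency is missing


async def fan_out(calls, client_deadline_s=1.2, reserve_s=0.1, per_try_timeout_s=0.4):
    # Reserve time for our own work, then share one deadline across all calls.
    deadline = time.monotonic() + client_deadline_s - reserve_s
    tasks = [call_with_retry(c, per_try_timeout_s, deadline) for c in calls]
    return await asyncio.gather(*tasks)


# Hypothetical downstream stubs: two fast services and one that is too slow.
async def fast():
    await asyncio.sleep(0.01)
    return "ok"


async def slow():
    await asyncio.sleep(5)
    return "late"


results = asyncio.run(fan_out([fast, fast, slow]))
print(results)  # ['ok', 'ok', None]
```

The slow call times out twice (about 800 ms in total) and yields `None`, so the whole request still finishes inside the 1200 ms client deadline.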

Implementation patterns

Retry with exponential backoff + jitter (pseudocode)
function retryWithBackoff(op, isRetryable, opts) {
  // opts: maxAttempts, baseMs, factor, capMs, perTryTimeoutMs, overallDeadlineMs,
  //       jitter: 'none'|'full'|'equal'
  // op(timeout) must enforce the timeout it is given and return {ok, error}
  deadline = now() + opts.overallDeadlineMs
  for attempt in 1..opts.maxAttempts {
    remaining = deadline - now()
    if (remaining <= 0) throw DeadlineExceeded
    // never let a single try outlive the overall deadline
    timeout = min(opts.perTryTimeoutMs, remaining)
    result = op(timeout)
    if (result.ok) return result
    if (!isRetryable(result.error) || attempt == opts.maxAttempts) throw result.error
    // compute backoff: exponential growth, capped at capMs
    raw = min(opts.capMs, opts.baseMs * (opts.factor ** (attempt - 1)))
    delay = raw
    if (opts.jitter == 'full') delay = rand(0, raw)
    else if (opts.jitter == 'equal') delay = raw/2 + rand(0, raw/2)
    // never sleep past the deadline, and never for a negative duration
    sleep(max(0, min(delay, deadline - now())))
  }
}
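For readers who want something runnable, here is one possible Python rendering of the same loop. It is a sketch under stated assumptions, not a reference implementation: it signals failure with exceptions instead of a result object, always uses full jitter, and `HttpError`, `DeadlineExceeded`, and `RETRYABLE_STATUSES` are hypothetical names defined here for illustration.

```python
import random
import time


class DeadlineExceeded(Exception):
    pass


class HttpError(Exception):
    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


RETRYABLE_STATUSES = {408, 429, 500, 502, 503, 504}  # list from Example 1


def is_retryable(error: Exception) -> bool:
    return isinstance(error, HttpError) and error.status in RETRYABLE_STATUSES


def retry_with_backoff(op, is_retryable, *, max_attempts=3, base_ms=100, factor=2,
                       cap_ms=600, per_try_timeout_ms=900, overall_deadline_ms=3000,
                       sleep=time.sleep, rng=random.random):
    deadline = time.monotonic() + overall_deadline_ms / 1000
    for attempt in range(1, max_attempts + 1):
        remaining_ms = (deadline - time.monotonic()) * 1000
        if remaining_ms <= 0:
            raise DeadlineExceeded()
        try:
            # op is expected to enforce the timeout it is given.
            return op(min(per_try_timeout_ms, remaining_ms))
        except Exception as error:
            if not is_retryable(error) or attempt == max_attempts:
                raise
        raw_ms = min(cap_ms, base_ms * factor ** (attempt - 1))
        delay_ms = rng() * raw_ms  # full jitter: uniform in [0, raw_ms)
        remaining_ms = (deadline - time.monotonic()) * 1000
        # Never sleep past the deadline or for a negative duration.
        sleep(max(0.0, min(delay_ms, remaining_ms)) / 1000)


# Demo: an operation that fails twice with 503, then succeeds.
calls = []


def flaky_op(timeout_ms):
    calls.append(timeout_ms)
    if len(calls) < 3:
        raise HttpError(503)
    return "charged"


result = retry_with_backoff(flaky_op, is_retryable, sleep=lambda s: None)
print(result, len(calls))  # charged 3
```

Passing `sleep` and `rng` as parameters keeps the function testable: the demo skips the real waits by injecting a no-op sleep.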
Deadline propagation
// Include remaining deadline in downstream requests, e.g. header "X-Deadline-Ms"
// Each service uses min(perTryTimeout, remainingDeadline) and stops attempts before deadline.
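As an illustration, the helpers below pass a remaining budget from hop to hop. They are a sketch with hypothetical names; `X-Deadline-Ms` is the example header from the note above, not a standard, and the 20 ms hop margin is an assumed value. Sending a relative budget rather than an absolute timestamp avoids depending on synchronized clocks between hosts.

```python
import time

DEADLINE_HEADER = "X-Deadline-Ms"  # example header name; not a standard


def now_ms() -> int:
    """Monotonic clock in whole milliseconds."""
    return time.monotonic_ns() // 1_000_000


def deadline_from_headers(headers: dict, now: int, default_budget_ms: int = 1000) -> int:
    """Turn an incoming remaining-budget header into a local absolute deadline."""
    budget_ms = int(headers.get(DEADLINE_HEADER, default_budget_ms))
    return now + budget_ms


def remaining_ms(deadline: int, now: int) -> int:
    return max(0, deadline - now)


def per_try_timeout_ms(configured_ms: int, deadline: int, now: int) -> int:
    """min(perTryTimeout, remainingDeadline), as described above."""
    return min(configured_ms, remaining_ms(deadline, now))


def outgoing_headers(deadline: int, now: int, hop_margin_ms: int = 20) -> dict:
    """Budget for the next hop: what remains, minus a small margin for transit."""
    return {DEADLINE_HEADER: str(max(0, remaining_ms(deadline, now) - hop_margin_ms))}


# Fixed clock values keep the demo deterministic; use now_ms() in real code.
arrived = 100_000
deadline = deadline_from_headers({DEADLINE_HEADER: "1200"}, arrived)
later = arrived + 900  # 900 ms of the budget already spent
print(per_try_timeout_ms(400, deadline, later))  # 300
print(outgoing_headers(deadline, later))         # {'X-Deadline-Ms': '280'}
```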
Idempotency keys for safe POST retries
// Client: generate a unique key per logical operation (e.g., payment attempt)
// Send header: Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
// Server: store key -> result; return the same result for duplicates.
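The server side of this pattern can be sketched as a lookup before the side effect. The class below is a hypothetical, in-memory illustration: a real service would need a durable store, key expiry, and protection against two concurrent requests carrying the same key.

```python
import uuid


class IdempotentChargeHandler:
    """In-memory sketch: replays the stored response for a repeated key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            # Duplicate request: return the earlier result, do not charge again.
            return self._results[idempotency_key]
        self.charges_made += 1
        response = {
            "status": 201,
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = response
        return response


server = IdempotentChargeHandler()
key = str(uuid.uuid4())           # client: one key per logical operation
first = server.charge(key, 1999)
retry = server.charge(key, 1999)  # e.g. the first response was lost in transit
print(first == retry, server.charges_made)  # True 1
```

The client must reuse the same key on every retry of the same logical operation and generate a fresh key for a new one.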

Practical checklists

Per-call policy checklist
  • Defined overall deadline (client or service-level)
  • Per-try timeout leaves room for at least one retry
  • Retryable error list agreed (e.g., 408, 429, 5xx)
  • Max attempts set (commonly 2–4 for user traffic)
  • Backoff strategy (exponential) and jitter (full/equal)
  • Respect Retry-After (seconds or HTTP-date)
  • Idempotency guaranteed for writes
  • Metrics: attempt_count, last_error, total_latency
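The Retry-After item deserves a concrete look, because the header comes in two forms. The sketch below (the function name is hypothetical) uses only the Python standard library and returns `None` when the value cannot be parsed, leaving the fallback to the caller's normal backoff.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(value: str, now: datetime) -> Optional[float]:
    """Return the requested wait in seconds, or None if the header is unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means "retry now", not a negative wait.
    return max(0.0, (when - now).total_seconds())


now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120", now))                            # 120.0
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now))  # 60.0
```

Whatever the server asks for, the wait should still be capped by the remaining overall deadline. If the requested delay does not fit, fail fast instead of sleeping.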
Service protection checklist
  • Rate limiters and quotas
  • Circuit breaker on persistent failures
  • Reasonable caps on backoff to prevent long tail
  • Timeouts on your outbound calls shorter than the time your own callers allow you, so slowness does not propagate upstream
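To make the circuit breaker item concrete, here is a deliberately small sketch. The class and its thresholds are hypothetical. It is simplified in one respect: once the reset timeout passes, it allows every call through until the next recorded result, whereas production breakers usually admit a single trial call.

```python
class CircuitBreaker:
    """Opens after N consecutive failures; allows trial calls after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Open: reject until the cool-down has elapsed, then permit a trial call.
        return now - self.opened_at >= self.reset_timeout_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now


breaker = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
for t in (0.0, 1.0, 2.0):
    breaker.record_failure(now=t)
print(breaker.allow(now=10.0))  # False: circuit is open, fail fast
print(breaker.allow(now=32.0))  # True: cool-down elapsed, trial call allowed
```

A retry loop would check `allow` before each attempt and skip the call entirely while the circuit is open, which keeps retries from piling onto a dependency that is already failing.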

Exercises

These mirror the graded tasks below. Do them here, then record your final answer in the exercise section.

Exercise 1: Safe-to-retry status codes

Scenario: A client performs GET and POST (with Idempotency-Key) requests. Choose which HTTP statuses are safe to retry.

  • Candidate codes: 400, 401, 404, 408, 409, 423, 425, 429, 500, 502, 503, 504
  • Assumptions: network errors and timeouts are retryable; server honors Idempotency-Key for POST.
Hint

Prefer retrying temporary/overload/timeouts; avoid client mistakes (4xx) except a few special cases (e.g., 408/425/429).

Exercise 2: Backoff schedule (no jitter)

Compute the per-attempt delays with exponential backoff, no jitter.

  • base = 100ms, factor = 2, cap = 1600ms, attempts = 5
Hint

Multiply by factor each attempt, but do not exceed the cap.

Common mistakes and how to self-check

  • Retrying non-idempotent operations without safeguards. Self-check: Do you use idempotency keys or a naturally idempotent method (such as PUT or DELETE)?
  • No jitter. Self-check: Are your clients synchronized, causing spikes on retries?
  • Infinite or excessive retries. Self-check: Is there a hard max attempts or deadline?
  • Per-try timeout equals overall deadline. Self-check: Is there time left for another try?
  • Ignoring Retry-After. Self-check: Do you parse and respect it?
  • Retry storms during outages. Self-check: Do you have circuit breakers and backpressure?
  • Missing observability. Self-check: Can you see attempt counts and delay in logs/metrics?

Practical projects

  • Build a small HTTP client wrapper that implements retries with exponential backoff and full jitter, deadline propagation, and metrics.
  • Create a demo payment-like API that enforces idempotency keys and returns 429 with Retry-After. Write a client that respects it.
  • Implement a worker that processes jobs with retry, capped backoff, and a dead-letter queue. Include a dashboard of attempts and outcomes.

Learning path

  1. Start here: learn retry safety, timeouts, and jitter basics.
  2. Add idempotency for write operations.
  3. Introduce deadlines and circuit breakers for resilience.
  4. Instrument and tune based on SLOs and error budgets.

Next steps

  • Complete the exercises below and compare with the solutions.
  • Take the Quick Test to confirm understanding. The test is available to everyone; only logged-in users get saved progress.
  • Apply patterns in a small service and collect metrics for one week.

Mini challenge

In your main service, choose one critical outbound call. Define: retryable errors, max attempts (2–3), per-try timeout, overall deadline, and a jittered backoff. Deploy and verify latency/error metrics improved over 48 hours.

Practice Exercises

2 exercises to complete

Instructions

Decide which of the following HTTP status codes are safe to retry for:

  • GET requests
  • POST requests that include a valid Idempotency-Key

Candidate codes: 400, 401, 404, 408, 409, 423, 425, 429, 500, 502, 503, 504

Assume the server honors Idempotency-Key semantics for POST. Provide your two lists.

Expected Output
GET: 408, 425, 429, 500, 502, 503, 504
POST (with Idempotency-Key): 408, 425, 429, 500, 502, 503, 504
Others: do not retry

Retries, Backoff, and Timeouts: Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.
