Who this is for
Backend engineers who run services that depend on other services or third-party APIs and must keep user-facing features usable during incidents, partial failures, or high load.
Prerequisites
- Basic HTTP, REST/gRPC, and status codes
- Understanding of timeouts, retries, and idempotency
- Familiarity with caching (memory/Redis) and message queues
Why this matters
- Real tasks: keep checkout working when a payment provider is flaky; load product pages when recommendations are slow; serve defaults when profile images can’t be fetched.
- Business impact: revenue and trust drop sharply if the entire feature fails instead of gracefully degrading.
- Ops impact: graceful degradation reduces pager noise and gives SREs time to repair root causes.
Concept explained simply
Graceful degradation means your system delivers a simpler but acceptable experience when a dependency is slow, down, or overloaded. Instead of an error page, the user sees partial content, cached data, or a clear fallback.
Mental model
Think of your service as a building with fire doors. When one room has a problem, close that door (isolate) and keep the rest of the building open (serve partial functionality). Prefer fast, correct-enough results over slow or failing perfect results.
Key patterns
- Timeouts: fail fast rather than waiting indefinitely. Set each timeout slightly above the dependency's normal p95 latency.
- Bounded retries with jitter: a small number of retries (e.g., 2) for transient errors, with randomized backoff to avoid synchronized retry storms.
- Circuit breaker: open after repeated failures to stop hammering a dead dependency; half-open to probe recovery (see the sketch after this list).
- Bulkheads: isolate resources (threads/connections) per dependency so one does not starve others.
- Fallback data: cached/stale data, default values, or simplified computation.
- Partial responses: return what you have; mark optional sections as unavailable.
- Feature flags: quickly disable noncritical features under stress.
- Async deferral: accept the request but queue the heavy work; notify later or complete it in the background.
- Rate limiting/adaptive shedding: protect yourself by shedding the lowest-priority work first.
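To make the first three patterns concrete, here is a minimal, self-contained sketch in Python: bounded retries with jittered backoff wrapped by a simple failure-count circuit breaker, falling back when the circuit is open or the retries are exhausted. The per-call timeout itself would come from your HTTP client; all names and thresholds here are illustrative, not recommendations.

import random, time

class CircuitBreaker:
    """Opens after max_failures consecutive failures; allows calls again after reset_after seconds."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True                         # closed: normal operation
        # open: allow calls again (half-open) once the cool-down has passed
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None               # close the circuit
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # (re)open and restart the cool-down

def call_with_fallback(primary, fallback, breaker, retries=2, base_delay=0.05):
    """Try primary with bounded, jittered retries; on failure or an open circuit, use fallback."""
    if breaker.allow():
        for attempt in range(retries + 1):
            try:
                result = primary()              # in practice, retry only idempotent, transient errors
                breaker.record(success=True)
                return result
            except Exception:
                breaker.record(success=False)
                if attempt < retries:
                    time.sleep(base_delay * (2 ** attempt) * random.random())  # jittered backoff
    return fallback()

# Example: a flaky dependency degrading to a cached default
def flaky_dependency():
    if random.random() < 0.7:
        raise TimeoutError("simulated slow/failed call")
    return "live data"

breaker = CircuitBreaker(max_failures=3, reset_after=10.0)
print(call_with_fallback(flaky_dependency, lambda: "cached data", breaker))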
Worked examples
Example 1 — Payment provider outage
Goal: Keep checkout usable.
- Create the order record and reserve inventory locally.
- Call the primary payment provider with a 2s timeout and one retry with jitter.
- If the primary fails or its circuit is open, fail over to a secondary provider if you have one.
- Otherwise, queue a payment-capture job for delayed processing (sketched below), keep the order in a pending-payment state, show “Order placed, payment processing”, and send confirmation once the capture succeeds.
// Pseudocode
if circuit.closed and pay(primary) == OK:
    complete_order()
else if secondary_available and pay(secondary) == OK:
    complete_order()
else:
    enqueue(capture_job)             // capture payment later, idempotently
    mark_order_pending_payment()
    return 202 Accepted              // “Order placed, payment processing”
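A minimal, self-contained sketch of the deferred-capture branch above, using an in-memory queue and a stubbed provider call. In production you would use a durable queue, bounded retries with dead-lettering, and a provider-side idempotency key; everything here is illustrative.

import collections, uuid

orders = {}                          # order_id -> status; stand-in for your order store
capture_queue = collections.deque()  # stand-in for a durable job queue

def place_order():
    order_id = str(uuid.uuid4())
    orders[order_id] = "pending_payment"
    capture_queue.append(order_id)   # defer the capture instead of failing checkout
    return order_id

def capture_payment(order_id):
    """Stub for the real provider call; pass an idempotency key in production."""
    return True

def run_capture_worker():
    while capture_queue:
        order_id = capture_queue.popleft()
        if orders.get(order_id) == "paid":     # idempotency guard: jobs may be delivered twice
            continue
        if capture_payment(order_id):
            orders[order_id] = "paid"          # then send the confirmation notification
        else:
            capture_queue.append(order_id)     # in production: bounded retries, then dead-letter

order_id = place_order()
run_capture_worker()
print(order_id, orders[order_id])              # -> ... paid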
Example 2 — Recommendations are slow
- Try live recommendations with 150ms timeout.
- If timeout: use stale cache up to 30 minutes old; if none, hide the widget.
- Render the rest of the product page immediately.
Result: Page stays fast; recommendations degrade gracefully.
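A minimal, self-contained sketch of this fallback chain; the in-memory CACHE and the fetch_live callable stand in for Redis and the real recommendations client.

import time

CACHE = {}                     # product_id -> (fetched_at, recommendations); stand-in for Redis
MAX_STALENESS_S = 30 * 60      # serve cached results up to 30 minutes old

def recommendations_for(product_id, fetch_live):
    """Return (recs, mode) where mode is 'live', 'stale', or 'hidden'."""
    try:
        recs = fetch_live(timeout=0.150)                  # 150 ms budget; raises on timeout/error
        CACHE[product_id] = (time.time(), recs)
        return recs, "live"
    except Exception:
        cached = CACHE.get(product_id)
        if cached and time.time() - cached[0] <= MAX_STALENESS_S:
            return cached[1], "stale"                     # bounded staleness, annotated for the caller
        return None, "hidden"                             # nothing usable: hide the widget

# The rest of the page renders regardless; only this widget's content varies.
recs, mode = recommendations_for("sku-123", fetch_live=lambda timeout: ["raincoat", "boots"])
print(mode, recs)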
Example 3 — Avatar/image service down
- Try the CDN URL with a 300ms timeout.
- If it fails: serve an initials-based placeholder or the locally cached last-known image.
- Log degradation with a cheap, non-blocking metric.
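A small sketch of the placeholder fallback; the helper names are hypothetical, and the real CDN client and metrics call are stubbed out.

from dataclasses import dataclass

@dataclass
class User:
    id: str
    name: str

def initials_placeholder(name):
    initials = "".join(part[0].upper() for part in name.split()[:2]) or "?"
    return f"/static/placeholder/{initials}.svg"          # served locally, always available

def avatar_url(user, fetch_cdn_url):
    try:
        return fetch_cdn_url(user.id, timeout=0.300)      # 300 ms budget
    except Exception:
        # in production, also bump a cheap, non-blocking "avatar.degraded" counter here
        return initials_placeholder(user.name)

def broken_cdn(user_id, timeout):
    raise TimeoutError("CDN unavailable")

print(avatar_url(User("u1", "Ada Lovelace"), broken_cdn))  # -> /static/placeholder/AL.svg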
Example 4 — Reviews API error
- Fetch products from the core service (critical).
- Fetch reviews (noncritical) with a 200ms timeout and no retry.
- If the reviews call fails: return the product list without reviews and include a field like reviews_available=false.
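A short sketch of the partial response; fetch_products and fetch_reviews are stand-ins for the real clients.

def product_listing(fetch_products, fetch_reviews):
    products = fetch_products()                        # critical: let failures propagate to the caller
    try:
        reviews = fetch_reviews(timeout=0.200)         # noncritical: 200 ms budget, no retry
        return {"products": products, "reviews": reviews, "reviews_available": True}
    except Exception:
        return {"products": products, "reviews": [], "reviews_available": False}

def failing_reviews(timeout):
    raise TimeoutError("reviews API down")

resp = product_listing(lambda: [{"id": 1, "name": "Widget"}], failing_reviews)
print(resp["reviews_available"])                       # -> False; the client hides the reviews section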
How to design fallbacks (step-by-step)
- Classify dependencies: critical vs noncritical; strong vs eventual consistency.
- Define SLO tiers: what can be partial/missing without violating user expectations.
- Set time budgets: allocate per-dependency timeouts to protect the overall p95 (see the sketch after this list).
- Choose a fallback: cached/stale, defaults, hide/disable, async deferral, or secondary provider.
- Guard: circuit breaker, bulkheads, and rate limiting.
- Observe: metrics for success/error, circuit state, degraded_mode=true counters.
- Test: chaos drills; inject faults and verify user experience and logs.
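A sketch of the classification and time-budget steps above, expressed as configuration; the dependency names, numbers, and fallback labels are purely illustrative.

# Illustrative classification and per-call budgets; numbers are placeholders, not recommendations.
OVERALL_BUDGET_MS = 500   # target p95 for the user-facing request, with headroom below the SLA

DEPENDENCIES = {
    "core_data": {"critical": True,  "timeout_ms": 250, "fallback": None},           # no fallback: fail the request
    "profile":   {"critical": True,  "timeout_ms": 150, "fallback": "stale_cache"},
    "widgets":   {"critical": False, "timeout_ms": 100, "fallback": "hide_section"},
}

# Sequential calls must fit inside the overall budget; parallel calls share wall-clock time.
assert sum(d["timeout_ms"] for d in DEPENDENCIES.values()) <= OVERALL_BUDGET_MS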
Implementation checklist
- Timeouts are lower than the user-request SLA and higher than normal p95 latency.
- Retries are bounded (≤2) and only on idempotent, transient errors.
- Circuit breaker thresholds tuned; fallback paths covered by tests.
- Bulkhead limits per dependency to prevent resource starvation.
- Fallback data sources defined (cache, defaults, secondary provider).
- Degraded mode telemetry: counters, traces, and structured logs (see the sketch after this checklist).
- Clear user messaging for partial features (no sensitive details exposed).
- Chaos test playbooks exist and are run regularly.
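One way to wire the telemetry item above, sketched with the standard logging module and an in-memory counter standing in for a real metrics client (StatsD, Prometheus, etc.); the field names are illustrative.

import collections, json, logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")
degraded_counter = collections.Counter()   # stand-in for a real metrics client

def record_degradation(dependency, fallback_used):
    degraded_counter[dependency] += 1
    # Structured log line: filter on degraded=true in your log pipeline and alert on the counter.
    log.warning(json.dumps({"degraded": True, "dependency": dependency, "fallback": fallback_used}))

record_degradation("recommendations", "stale_cache")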
Exercises
Exercise 1 — Email API fallback design
Design how your service sends transactional emails when the primary email provider is down or slow. Include timeouts, retries, circuit breaker, queueing, and a secondary provider.
Exercise 2 — Product page degradation
Define exact degradation rules for a product page that calls: Pricing, Inventory, Reviews, and Recommendations.
Open checklist
- Have you set timeouts for each dependency?
- Did you separate critical vs noncritical features?
- Do you have a cache/default for each noncritical dependency?
- Is the circuit breaker behavior defined and observable?
- Do you log and meter degraded responses?
Common mistakes and self-check
- Unlimited retries causing storms. Self-check: are retries capped with jitter and backoff?
- Setting timeouts equal to the user SLA, leaving no time for fallbacks. Self-check: do you have headroom?
- Mixing critical and noncritical calls in one pool. Self-check: bulkheads per dependency?
- Fallbacks that are slower than primaries. Self-check: measure fallback latency.
- No observability for degraded mode. Self-check: metrics and logs include a degraded flag?
- Returning stale data without TTL. Self-check: do you bound staleness and annotate responses?
Practical projects
- Wrap a third-party API with timeouts, retries, and a circuit breaker, plus a cache-aside fallback.
- Add feature flags to disable optional widgets when a dependency p95 exceeds a threshold.
- Implement a background job queue for deferred work with idempotent handlers and dead-lettering.
- Create chaos scenarios that simulate latency spikes and verify that degraded mode triggers correctly.
Mini challenge
Your feed service depends on: Profiles (critical), Likes (noncritical), and Ads (noncritical but revenue-generating). Profiles are slow (p99 of 3s), Likes error on 20% of calls, and Ads are fine. In 10 minutes, propose exact timeouts, retries, and circuit breaker states, and describe what the user sees for each dependency. Keep the total response under 600ms.
Next steps
- Instrument degraded mode metrics and alerts.
- Run a fault-injection drill and record what changed in user-visible behavior.
- Document a runbook: when to flip feature flags and where to find fallback logs.
Learning path
- Before this: Timeouts, Retries, Idempotency
- Now: Graceful Degradation and Fallbacks
- Next: Rate Limiting, Load Shedding, and Backpressure