Graceful Degradation And Fallbacks

Learn Graceful Degradation And Fallbacks for free with explanations, exercises, and a quick test (for Backend Engineers).

Published: January 20, 2026 | Updated: January 20, 2026

Who this is for

Backend engineers who run services that depend on other services or third-party APIs and must keep user-facing features usable during incidents, partial failures, or high load.

Prerequisites

  • Basic HTTP, REST/gRPC, and status codes
  • Understanding of timeouts, retries, and idempotency
  • Familiarity with caching (memory/Redis) and message queues

Note: The quick test is available to everyone; only logged-in users get saved progress.

Why this matters

  • Real tasks: keep checkout working when a payment provider is flaky; load product pages when recommendations are slow; serve defaults when profile images can’t be fetched.
  • Business impact: revenue and trust drop sharply if the entire feature fails instead of gracefully degrading.
  • Ops impact: graceful degradation reduces pager noise and gives SREs time to repair root causes.

Concept explained simply

Graceful degradation means your system delivers a simpler but acceptable experience when a dependency is slow, down, or overloaded. Instead of an error page, the user sees partial content, cached data, or a clear fallback.

Mental model

Think of your service as a building with fire doors. When one room has a problem, close that door (isolate) and keep the rest of the building open (serve partial functionality). Prefer fast, correct-enough results over slow or failing perfect results.

Key patterns

  • Timeouts: fail fast rather than waiting. Set them slightly above the dependency's normal p95 latency.
  • Bounded retries with jitter: a small number of retries (e.g., 2) for transient errors; see the sketch after this list.
  • Circuit breaker: open after failures to stop hammering a dead dependency; half-open to probe recovery.
  • Bulkheads: isolate resources (threads/connections) per dependency so one does not starve others.
  • Fallback data: cached/stale data, default values, or simplified computation.
  • Partial responses: return what you have; mark optional sections as unavailable.
  • Feature flags: quickly disable noncritical features under stress.
  • Async deferral: accept the request but queue the heavy work; notify later or complete in background.
  • Rate limiting/adaptive shedding: protect yourself by shedding lowest-priority work first.
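
A minimal sketch of the first two patterns (hard per-attempt timeouts plus bounded, jittered retries) in Python, assuming the requests HTTP client; the function name, timeout, and backoff numbers are illustrative, not prescriptive.

import random
import time

import requests


def get_with_budget(url, timeout_s=0.5, max_retries=2):
    # Fail fast: each attempt gets a hard timeout instead of waiting indefinitely.
    for attempt in range(max_retries + 1):
        try:
            resp = requests.get(url, timeout=timeout_s)
            if resp.status_code < 500:
                return resp                                  # success, or a non-retryable 4xx
        except requests.exceptions.RequestException:
            pass                                             # timeout or connection error: maybe retry
        if attempt < max_retries:
            # Exponential backoff with full jitter spreads retries out and avoids storms.
            time.sleep(random.uniform(0, 0.1 * (2 ** attempt)))
    return None                                              # budget exhausted: caller picks the fallback

Returning None rather than raising keeps the degradation decision with the caller, next to the fallback data.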

Worked examples

Example 1 — Payment provider outage

Goal: Keep checkout usable.

  1. Create the order record and reserve inventory locally.
  2. Call the primary payment provider with a 2s timeout and 1 retry with jitter.
  3. If the call fails or the circuit is open, queue a payment-capture job for delayed processing and show “Order placed, payment processing”.
  4. If you have a secondary provider, fail over; otherwise keep the order in a pending-payment state and send confirmation once payment succeeds.

# Runnable sketch in Python; pay, complete_order, enqueue, and
# mark_order_pending_payment are assumed helpers from your codebase.
def charge_or_defer(order, primary, secondary, circuit):
    if circuit.closed and pay(primary, order):
        return complete_order(order)                      # normal path
    if secondary is not None and pay(secondary, order):
        return complete_order(order)                      # failover path
    # Both providers unavailable: defer the capture and keep checkout usable.
    enqueue("payment-capture", order_id=order.id, idempotency_key=str(order.id))
    mark_order_pending_payment(order)
    return {"status": "pending_payment"}, 202             # 202 Accepted
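
Design note: returning 202 Accepted with a pending-payment order keeps checkout responsive, and the queued capture job carries an idempotency key so a retried capture does not risk charging the customer twice.
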
Example 2 — Recommendations are slow
  1. Try live recommendations with 150ms timeout.
  2. If it times out: use a stale cache entry up to 30 minutes old; if none exists, hide the widget (see the sketch after this example).
  3. Render the rest of the product page immediately.

Result: Page stays fast; recommendations degrade gracefully.
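
A minimal sketch of steps 1–2 in Python, assuming a Redis-style cache with get/set; fetch_live_recommendations is a hypothetical helper, and the 150ms timeout and 30-minute staleness bound come from the steps above.

import json
import time

STALE_TTL_S = 30 * 60  # accept cached recommendations up to 30 minutes old


def recommendations_with_fallback(cache, product_id):
    key = f"recs:{product_id}"
    try:
        items = fetch_live_recommendations(product_id, timeout_s=0.150)  # hypothetical helper
        cache.set(key, json.dumps({"ts": time.time(), "items": items}))
        return {"show_widget": True, "items": items, "degraded": False}
    except Exception:
        raw = cache.get(key)
        if raw:
            entry = json.loads(raw)
            if time.time() - entry["ts"] <= STALE_TTL_S:   # bound staleness explicitly
                return {"show_widget": True, "items": entry["items"], "degraded": True}
        return {"show_widget": False, "items": [], "degraded": True}  # hide the widget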

Example 3 — Avatar/image service down
  1. Try the CDN URL with a 300ms timeout.
  2. If fail: serve initials-based placeholder or locally cached last-known image.
  3. Log degradation with a cheap, non-blocking metric.

Example 4 — Reviews API error
  1. Fetch products from core service (critical).
  2. Fetch reviews (noncritical) with 200ms timeout, no retry.
  3. If it fails: return the product list without reviews and include a field like reviews_available=false (see the sketch below).
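
A minimal sketch of this split in Python; fetch_products and fetch_reviews are hypothetical helpers standing in for the core and reviews services.

def product_listing(category_id):
    products = fetch_products(category_id)                      # critical: let failures propagate
    try:
        reviews = fetch_reviews(category_id, timeout_s=0.200)   # noncritical: tight timeout, no retry
        return {"products": products, "reviews": reviews, "reviews_available": True}
    except Exception:
        # Degrade: ship the product list anyway and say explicitly that reviews are missing.
        return {"products": products, "reviews": [], "reviews_available": False}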

How to design fallbacks (step-by-step)

  1. Classify dependencies: critical vs noncritical; strong vs eventual consistency.
  2. Define SLO tiers: what can be partial/missing without violating user expectations.
  3. Set time budgets: allocate per-dependency timeouts to protect the overall p95.
  4. Choose a fallback: cached/stale, defaults, hide/disable, async deferral, or secondary provider.
  5. Guard: circuit breaker, bulkheads, and rate limiting (a minimal circuit-breaker sketch follows this list).
  6. Observe: metrics for success/error, circuit state, degraded_mode=true counters.
  7. Test: chaos drills; inject faults and verify user experience and logs.
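
For step 5, here is a minimal circuit-breaker sketch in Python with closed, open, and half-open behavior; the failure threshold and cool-down are illustrative and would normally be tuned per dependency.

import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None                             # None = closed

    def allow_request(self):
        if self.opened_at is None:
            return True                                   # closed: normal traffic
        if time.time() - self.opened_at >= self.cooldown_s:
            return True                                   # half-open: let a probe through
        return False                                      # open: fail fast, serve the fallback

    def record_success(self):
        self.failures = 0
        self.opened_at = None                             # probe succeeded: close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()                  # trip, or re-open after a failed probe

In this simplified version every call after the cool-down counts as a probe; production breakers usually admit a single probe at a time and track a rolling error rate rather than consecutive failures.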

Implementation checklist

  • Timeouts are lower than the user-facing request SLA and higher than the dependency's normal p95.
  • Retries are bounded (≤2) and used only for transient errors on idempotent operations.
  • Circuit breaker thresholds tuned; fallback paths covered by tests.
  • Bulkhead limits per dependency to prevent resource starvation.
  • Fallback data sources defined (cache, defaults, secondary provider).
  • Degraded-mode telemetry: counters, traces, and structured logs (see the sketch after this checklist).
  • Clear user messaging for partial features (no sensitive details exposed).
  • Chaos test playbooks exist and are run regularly.
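
To illustrate the telemetry item, a small Python sketch that tags a structured log line and an in-process counter with a degraded flag; the field names and logger name are assumptions, not a standard schema.

import json
import logging
from collections import Counter

logger = logging.getLogger("degradation")
degraded_counter = Counter()   # export to your metrics backend elsewhere


def record_degradation(dependency, fallback):
    # Count and log every degraded response so dashboards and alerts can see it.
    degraded_counter[(dependency, fallback)] += 1
    logger.warning(json.dumps({
        "event": "degraded_response",
        "dependency": dependency,   # e.g. "recommendations"
        "fallback": fallback,       # e.g. "stale_cache", "hidden", "default"
        "degraded_mode": True,
    }))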

Exercises

Exercise 1 — Email API fallback design

Design how your service sends transactional emails when the primary email provider is down or slow. Include timeouts, retries, circuit breaker, queueing, and a secondary provider.

Exercise 2 — Product page degradation

Define exact degradation rules for a product page that calls: Pricing, Inventory, Reviews, and Recommendations.

Checklist
  • Have you set timeouts for each dependency?
  • Did you separate critical vs noncritical features?
  • Do you have a cache/default for each noncritical dependency?
  • Is the circuit breaker behavior defined and observable?
  • Do you log and meter degraded responses?

Common mistakes and self-check

  • Unlimited retries causing storms. Self-check: are retries capped with jitter and backoff?
  • Same timeout as user SLA leaving no time for fallbacks. Self-check: do you have headroom?
  • Mixing critical and noncritical calls in one pool. Self-check: do you have bulkheads per dependency (see the sketch after this list)?
  • Fallbacks that are slower than primaries. Self-check: measure fallback latency.
  • No observability for degraded mode. Self-check: metrics and logs include a degraded flag?
  • Returning stale data without TTL. Self-check: do you bound staleness and annotate responses?
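
For the pooling mistake, a minimal bulkhead sketch in Python using one counting semaphore per dependency; the per-dependency limits are illustrative.

import threading

# One small, separate cap per dependency so a slow dependency cannot
# exhaust the shared worker pool used by critical calls.
BULKHEADS = {
    "payments": threading.BoundedSemaphore(20),
    "recommendations": threading.BoundedSemaphore(5),
}


def call_with_bulkhead(dependency, func, *args, **kwargs):
    sem = BULKHEADS[dependency]
    if not sem.acquire(blocking=False):                     # this dependency's slots are full
        raise RuntimeError(f"bulkhead full: {dependency}")  # caller degrades immediately
    try:
        return func(*args, **kwargs)
    finally:
        sem.release()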

Practical projects

  • Wrap a third-party API with timeouts, retries, and a circuit breaker, plus a cache-aside fallback.
  • Add feature flags to disable optional widgets when a dependency p95 exceeds a threshold.
  • Implement a background job queue for deferred work with idempotent handlers and dead-lettering.
  • Create chaos scenarios that simulate latency spikes and verify that degraded mode triggers correctly.

Mini challenge

Your feed service depends on: Profiles (critical), Likes (noncritical), and Ads (noncritical revenue). Profiles are slow (p99 3s), Likes are erroring 20% of calls, Ads are fine. In 10 minutes, propose exact timeouts, retries, circuit breaker states, and what the user sees for each dependency. Keep total response under 600ms.

Next steps

  • Instrument degraded mode metrics and alerts.
  • Run a fault-injection drill and record what changed in user-visible behavior.
  • Document a runbook: when to flip feature flags and where to find fallback logs.

Learning path

  • Before this: Timeouts, Retries, Idempotency
  • Now: Graceful Degradation and Fallbacks
  • Next: Rate Limiting, Load Shedding, and Backpressure

Practice Exercises

Exercise 1 — Instructions

Your service sends order-confirmation emails via Provider A. Requirements: do not block checkout; attempt Provider B if A fails; if both fail, queue for retry. Set clear timeouts and retry rules. Describe:

  • Timeouts and retry counts (with backoff/jitter)
  • Circuit breaker thresholds and half-open probing
  • Queue schema fields for deferred sends
  • What the user sees immediately
  • Observability you will add

Expected Output
A concise design describing Provider A first, Provider B as failover, 1–2 bounded retries with jitter, 1–2s timeouts, circuit breaker open on consecutive failures, message queued with idempotency key, and user receives order success page while email is deferred.

Graceful Degradation And Fallbacks — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
