Why this matters
Circuit breakers keep your APIs responsive when dependencies slow down or fail. As an API Engineer, you will:
- Shield struggling dependencies (auth, payments, search) from extra load and stop their failures from cascading into your API.
- Degrade gracefully with fast failures and fallbacks instead of letting timeouts pile up.
- Recover quickly by probing a failing dependency safely.
Who this is for
- API engineers and backend developers shipping services that call other services or third‑party APIs.
- Platform/SRE folks designing reliability policies.
Prerequisites
- Basic HTTP/REST and async request handling.
- Know what timeouts and retries are (including exponential backoff).
- Comfort reading pseudocode or simple config examples.
Concept explained simply
A circuit breaker wraps a call to a dependency and watches for failures/slow responses. It has three states:
- Closed: Normal traffic flows. Errors are counted. If failures exceed a threshold in a window, it trips to Open.
- Open: Calls fail fast (no waiting). After a cool-down period, it moves to Half‑Open.
- Half‑Open: A small number of trial requests are allowed. If they succeed, the breaker closes; if not, it goes back to Open.
Key dials you choose (a config sketch follows this list):
- Error/slow threshold (e.g., 50% failures or p95 latency > target).
- Minimum number of calls before evaluating (e.g., 20 requests).
- Open state duration (cool‑down) before probing (e.g., 30s).
- Half‑Open probe count (e.g., 1–10 requests).
- Fallback behavior (cached data, default response, or hard fail).
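To make the dials concrete, here is a minimal configuration sketch as a Python dataclass. The field names and defaults are illustrative assumptions, not any specific library's API; real libraries (resilience4j, Polly, and similar) expose equivalent knobs under their own names.

from dataclasses import dataclass

@dataclass
class BreakerConfig:
    failure_threshold: float = 0.5       # trip when >= 50% of windowed calls fail
    latency_slo_ms: int = 500            # optionally count slower responses as failures
    min_calls: int = 20                  # evaluate only after this many calls in the window
    window_size: int = 50                # rolling window measured in calls
    open_duration_s: int = 30            # cool-down before moving to Half-Open
    half_open_probes: int = 5            # trial requests allowed while Half-Open
    required_probe_successes: int = 4    # probe successes needed to close again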
Mental model
Like an electrical breaker, it protects the house (your service) from drawing too much from a failing appliance (dependency). Trip early, stop the damage, then cautiously test if it’s safe again.
Worked examples
Example 1 — Latency spike in search service
Scenario: Your API calls /search. Normally p95 latency is 120 ms. During traffic surge, p95 grows to 2 s and timeouts cascade.
- Breaker policy: closed if p95 <= 500 ms and error rate < 20% across last 50 calls.
- Trip rule: if p95 > 500 ms OR error rate >= 20% for a window of 50 calls → Open for 45s.
- Half‑Open: allow 5 trial requests; if 4+ succeed under 400 ms → Close, else Open 60s.
- Fallback: return empty results with a hint flag in response; log event.
Outcome: Your API stays responsive; some users see empty results quickly instead of waiting 2 s or more and then timing out.
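A sketch of this policy in code, reusing the illustrative BreakerConfig from earlier; the endpoint and the fallback shape are assumptions for the example:

# Policy for the /search dependency, mirroring the numbers above
search_breaker = BreakerConfig(
    window_size=50,
    min_calls=50,
    failure_threshold=0.20,      # trip at >= 20% errors in the window ...
    latency_slo_ms=500,          # ... or p95 above 500 ms
    open_duration_s=45,
    half_open_probes=5,
    required_probe_successes=4,  # 4 of 5 probes; the stricter 400 ms probe bar needs its own dial
)

def search_fallback(query):
    # Degraded response: empty results plus a flag so the client can show a hint
    return {"query": query, "results": [], "degraded": True}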
Example 2 — Auth token introspection outage
Scenario: Token introspection endpoint is down for a few minutes.
- Breaker: trip after 10 failures within the last 20 requests; Open for 30s.
- Fallback: if token is in local cache and unexpired → accept. Otherwise fail fast with 503.
- Half‑Open: allow 1 probe at a time; Close only after 2 consecutive probes succeed.
Outcome: Known-good tokens continue; unknown tokens fail fast, protecting your service from lock-ups.
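A minimal sketch of the cache-backed fallback described above. The breaker wrapper, the introspect_remote call, and the cache shape are assumptions, not a specific library:

import time

class BreakerOpenError(Exception): ...       # raised by the assumed breaker wrapper when it is Open
class ServiceUnavailable(Exception): ...     # mapped to an HTTP 503 by the framework layer

def introspect_with_fallback(token, breaker, cache):
    try:
        # Normal path: call the introspection endpoint through the breaker with a tight timeout
        return breaker.call(lambda: introspect_remote(token, timeout=0.3))
    except (BreakerOpenError, TimeoutError):
        entry = cache.get(token)             # local cache of previously validated tokens
        if entry is not None and entry["expires_at"] > time.time():
            return entry["claims"]           # known-good, unexpired token: accept
        raise ServiceUnavailable("token introspection unavailable")  # unknown token: fail fast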
Example 3 — Flaky third‑party payments
Scenario: Payment provider intermittently times out.
- Retry policy: at most 2 retries with exponential backoff (100 ms, 300 ms) and jitter.
- Breaker threshold: if 50% of last 40 attempts fail/time out → Open for 20s.
- Half‑Open: allow 3 probes; require all 3 to succeed < 600 ms.
- Fallback: respond with 202 Accepted: “Payment processing delayed.” Queue for later processing.
Outcome: Users avoid long waits and operations can reconcile queued payments later; pair the retries with idempotency keys so a retried request cannot charge twice.
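A small sketch of the bounded retry with exponential backoff and jitter described in this example; it is plain Python rather than a payment SDK, and would run inside the breaker-wrapped call:

import random
import time

def call_with_retries(fn, max_retries=2, base_delay_s=0.1):
    # Call fn, retrying up to max_retries times with exponential backoff and jitter.
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_retries:
                raise                                     # give up; the breaker counts this failure
            delay = base_delay_s * (3 ** attempt)         # ~100 ms, then ~300 ms
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids synchronized retry storms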
Design choices and defaults
- What counts as a failure: include timeouts and dependency errors; optionally count responses slower than your SLO as failures. Rejections while the breaker is open are failures from the caller's point of view, but should not feed the breaker's own window.
- Windows: a rolling window by count (e.g., last 50 calls) is simple and stable; time windows (e.g., 10s) work well with high traffic. A count-based window sketch follows this list.
- Timeouts + retries + breaker: always set timeouts first; use limited retries with backoff; let the breaker trip if underlying health is bad.
- Isolation: combine with bulkheads (separate thread pools) so a slow dependency can’t consume all resources.
- Telemetry: record state transitions and reasons to help tuning.
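A count-based rolling window can be a bounded deque of recent outcomes; this sketch assumes each entry records success and latency in milliseconds:

from collections import deque

class RollingWindow:
    def __init__(self, size=50):
        self.calls = deque(maxlen=size)   # each entry: (success: bool, latency_ms: float)

    def record(self, success, latency_ms):
        self.calls.append((success, latency_ms))

    def failure_rate(self):
        if not self.calls:
            return 0.0
        return sum(1 for ok, _ in self.calls if not ok) / len(self.calls)

    def p95_latency(self):
        latencies = sorted(lat for _, lat in self.calls)
        return latencies[int(0.95 * (len(latencies) - 1))] if latencies else 0.0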
Implementation sketch (pseudocode)
// CallWithBreaker(depCall, cfg)
if breaker.state == OPEN:
    if now < breaker.nextProbeAt:
        return fallbackOrFastFail()            // still cooling down: fail fast
    breaker.state = HALF_OPEN                  // cool-down over: allow a few probes
    breaker.probesUsed = 0
    breaker.successProbes = 0

if breaker.state == HALF_OPEN and breaker.probesUsed >= cfg.halfOpenProbes:
    return fallbackOrFastFail()                // probe budget spent; wait for a verdict

result = depCall.withTimeout(cfg.timeout).withLimitedRetries(cfg.retries)
updateWindow(result)

if breaker.state == HALF_OPEN:
    breaker.probesUsed += 1
    if result.success and result.latency <= cfg.latencySLO:
        breaker.successProbes += 1
        if breaker.successProbes >= cfg.requiredProbeSuccesses:
            breaker.state = CLOSED
            resetMetrics()
    else:
        breaker.state = OPEN                   // probe failed: back to cool-down
        breaker.nextProbeAt = now + cfg.openDuration
        return fallbackOrFastFail()
else: // CLOSED
    if windowFailureRate() >= cfg.failureThreshold or windowP95() > cfg.latencyThreshold:
        breaker.state = OPEN                   // window shows an unhealthy dependency: trip
        breaker.nextProbeAt = now + cfg.openDuration
        if not result.success:
            return fallbackOrFastFail()        // only the failing call gets the fallback
return result
Self-check checklist
- I set timeouts lower than my client timeout budgets.
- I defined what counts as a failure (errors, timeouts, slow responses).
- I chose thresholds and windows that match traffic volume.
- I have a clear fallback (or intentional fast-fail) per endpoint.
- I log breaker state changes with context (service, reason, window stats).
- I tested Half‑Open behavior under recovery.
Common mistakes and how to self-check
- Too-aggressive trips: If your breaker opens during minor blips, increase the minimum calls per window or raise thresholds. Check transition logs for breakers that opened with only a few samples in the window.
- Infinite retries: Limit retries; otherwise you overload the dependency faster.
- No fallback: Decide per endpoint: cached data, default, queued work, or fail fast. Ensure user-visible messages are clear.
- Shared pool exhaustion: Without bulkheads, threads get stuck even if the breaker opens late. Use separate pools per dependency (a minimal bulkhead sketch follows this list).
- Forgetting slow=fail: If users care about latency, count slow responses as failures for breaker evaluation.
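For the bulkhead point above, a bounded semaphore per dependency is often enough; this sketch assumes a threaded service and illustrative limits:

import threading

# One bounded pool (here a semaphore) per dependency, so a slow search
# service cannot consume the capacity reserved for payments.
BULKHEADS = {
    "search": threading.BoundedSemaphore(20),
    "payments": threading.BoundedSemaphore(10),
}

def call_with_bulkhead(dependency, fn):
    sem = BULKHEADS[dependency]
    if not sem.acquire(blocking=False):   # no queuing: shed load immediately when the pool is full
        raise RuntimeError(f"bulkhead full for {dependency}")
    try:
        return fn()
    finally:
        sem.release()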
Learning path
- Set client timeouts and small, bounded retries (a client-side example follows this list).
- Add a simple failure-rate breaker on one dependency.
- Introduce latency-based trips and Half‑Open probes.
- Add fallbacks and telemetry for state transitions.
- Combine with bulkheads and load-shedding.
- Tune thresholds in staging load tests; then ship gradually.
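For the first step, most HTTP clients expose both dials directly. For example, with Python's requests and urllib3 (real libraries; the URL and numbers are just a starting point):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=2, backoff_factor=0.1,       # at most 2 retries with exponential backoff
                status_forcelist=[502, 503, 504])  # retry only transient upstream errors
session.mount("https://", HTTPAdapter(max_retries=retries))

# Connect timeout 1 s, read timeout 2 s: keep both below your caller's overall budget.
resp = session.get("https://example.com/search", timeout=(1.0, 2.0))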
Practical projects
- Project 1: Wrap a slow dummy endpoint with a breaker. Simulate 50% failure and verify Open/Half‑Open transitions (a simulation sketch follows this list).
- Project 2: Add a cache fallback for a product detail call. Measure user-perceived latency before/after.
- Project 3: Build a dashboard panel showing breaker state, open durations, and reasons.
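For Project 1, a flaky dummy dependency is enough to exercise the transitions. This sketch fails roughly half of its calls; call_with_breaker stands in for whatever wrapper you build in the project:

import random

def flaky_search():
    # Dummy dependency: fails about 50% of the time
    if random.random() < 0.5:
        raise TimeoutError("simulated timeout")
    return {"results": ["ok"]}

# Drive enough traffic to fill the window, trip the breaker, wait out the
# cool-down, and watch the Half-Open probes either close or reopen it.
for i in range(200):
    try:
        call_with_breaker(flaky_search)          # hypothetical wrapper from your project
    except Exception as exc:
        print(f"call {i}: {type(exc).__name__}")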
Exercises
Complete these before the quick test. Your answers are not auto-saved unless you are logged in.
Exercise 1 — Choose thresholds
Traffic: ~200 rps. Normal error rate 0.5%, p95 latency 150 ms. During incidents, error rate hits 30% and p95 1.5 s for a few minutes. Define a breaker config (window, thresholds, open/half‑open) that trips appropriately but avoids noise.
Exercise 2 — Fallback design
Your service calls a recommendation API on homepage render. If it’s slow or failing, you still want a fast page. Design a fallback that is safe and measurable.
- Write your answer as a small config snippet or bullet list.
- Sanity-check: would this trip during normal spikes? Can you explain each dial?
Mini challenge
Pick one critical dependency. Propose Closed→Open criteria, Open duration, and Half‑Open probe rules in three lines. Keep it production-practical.
Hint
Use: min 50 calls per window, 50% failures or p95 > target, open 30–60s, 3–5 probes, strict success criteria.
Next steps
- Integrate breaker metrics into alerts (e.g., “breaker opened > X times in 10m”).
- Run a load test that forces the breaker to open and recover; adjust thresholds.
- Document fallbacks so product and support teams know user impact.
Note: The quick test is available to everyone. Only logged-in users have their progress saved.