Circuit Breakers Basics

Learn Circuit Breakers Basics for free with explanations, exercises, and a quick test (for API Engineer).

Published: January 21, 2026 | Updated: January 21, 2026

Why this matters

Circuit breakers keep your APIs responsive when dependencies slow down or fail. As an API Engineer, you will:

  • Protect your service, and the dependencies it calls (auth, payments, search), from cascading failures.
  • Degrade gracefully with fast failures and fallbacks instead of letting timed-out requests pile up.
  • Recover quickly by probing a failing dependency safely.

Who this is for

  • API engineers and backend developers shipping services that call other services or third‑party APIs.
  • Platform/SRE folks designing reliability policies.

Prerequisites

  • Basic HTTP/REST and async request handling.
  • Know what timeouts and retries are (including exponential backoff).
  • Comfort reading pseudocode or simple config examples.

Concept explained simply

A circuit breaker wraps a call to a dependency and watches for failures/slow responses. It has three states:

  • Closed: Normal traffic flows. Errors are counted. If failures exceed a threshold in a window, it trips to Open.
  • Open: Calls fail fast (no waiting). After a cool-down period, it moves to Half‑Open.
  • Half‑Open: A small number of trial requests are allowed. If they succeed, the breaker closes; if not, it goes back to Open.
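The three states and their transitions can be written as a small lookup table. This is a minimal sketch: the event names (`threshold_exceeded`, `cooldown_elapsed`, `probes_succeeded`, `probe_failed`) are hypothetical labels chosen for illustration, not part of any library.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# (current state, event) -> next state. Event names are illustrative assumptions.
TRANSITIONS = {
    (State.CLOSED, "threshold_exceeded"): State.OPEN,
    (State.OPEN, "cooldown_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "probes_succeeded"): State.CLOSED,
    (State.HALF_OPEN, "probe_failed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or stay in place if the event does not apply."""
    return TRANSITIONS.get((state, event), state)
```

Note that there is no direct path from Open to Closed: recovery always passes through Half-Open probes.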

Key dials you choose:

  • Error/slow threshold (e.g., 50% failures or p95 latency > target).
  • Minimum number of calls before evaluating (e.g., 20 requests).
  • Open state duration (cool‑down) before probing (e.g., 30s).
  • Half‑Open probe count (e.g., 1–10 requests).
  • Fallback behavior (cached data, default response, or hard fail).
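One way to keep these dials explicit is a small, validated config object. The sketch below is an assumption about how you might structure it: the class name, field names, and default values are illustrative (the defaults simply echo the examples in the list above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    """Hypothetical container for the dials listed above."""
    failure_rate_threshold: float = 0.5  # trip at >= 50% failures
    slow_call_ms: float = 500.0          # assumed latency target for "slow"
    min_calls: int = 20                  # do not evaluate below this sample size
    open_duration_s: float = 30.0        # cool-down before probing
    half_open_probes: int = 5            # trial requests allowed in Half-Open
    fallback: str = "fail_fast"          # "cached", "default", or "fail_fast"

    def __post_init__(self) -> None:
        if not 0.0 < self.failure_rate_threshold <= 1.0:
            raise ValueError("failure_rate_threshold must be in (0, 1]")
        if self.min_calls < 1 or self.half_open_probes < 1:
            raise ValueError("min_calls and half_open_probes must be >= 1")
        if self.open_duration_s <= 0:
            raise ValueError("open_duration_s must be positive")
        if self.fallback not in ("cached", "default", "fail_fast"):
            raise ValueError("unknown fallback mode")
```

Validating at construction time surfaces a mistyped threshold at startup rather than during an incident.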

Mental model

Like an electrical breaker, it protects the house (your service) from a faulty appliance (the dependency) that draws too much. Trip early, stop the damage, then cautiously test if it’s safe again.

Worked examples

Example 1 — Latency spike in search service

Scenario: Your API calls /search. Normally p95 latency is 120 ms. During a traffic surge, p95 grows to 2 s and timeouts cascade.

  • Breaker policy: closed if p95 <= 500 ms and error rate < 20% across last 50 calls.
  • Trip rule: if p95 > 500 ms OR error rate >= 20% for a window of 50 calls → Open for 45s.
  • Half‑Open: allow 5 trial requests; if 4+ succeed under 400 ms → Close, else Open 60s.
  • Fallback: return empty results with a hint flag in response; log event.

Outcome: Your API stays responsive. Some users see empty results quickly instead of waiting 2 s or more and timing out.
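The trip rule in this example can be sketched as a count-based window over the last 50 calls. This is an illustration under assumptions: the `CallWindow` class is hypothetical, p95 is computed with the nearest-rank method, and the rule is only evaluated once the window is full.

```python
import math
from collections import deque


class CallWindow:
    """Rolling window of the most recent calls as (ok, latency_ms) pairs."""

    def __init__(self, size: int = 50):
        self.calls = deque(maxlen=size)

    def record(self, ok: bool, latency_ms: float) -> None:
        self.calls.append((ok, latency_ms))

    def error_rate(self) -> float:
        if not self.calls:
            return 0.0
        return sum(1 for ok, _ in self.calls if not ok) / len(self.calls)

    def p95_ms(self) -> float:
        latencies = sorted(lat for _, lat in self.calls)
        if not latencies:
            return 0.0
        rank = math.ceil(0.95 * len(latencies))  # nearest-rank percentile
        return latencies[rank - 1]

    def should_trip(self) -> bool:
        # Example 1 rule: p95 > 500 ms OR error rate >= 20%, over a full window.
        full = len(self.calls) == self.calls.maxlen
        return full and (self.p95_ms() > 500 or self.error_rate() >= 0.20)
```

With 50 samples, nearest-rank p95 is the 48th smallest latency, so two slow outliers do not trip the breaker but three or more do.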

Example 2 — Auth token introspection outage

Scenario: Token introspection endpoint is down for a few minutes.

  • Breaker: trip after 10 failures within the last 20 requests; Open for 30s.
  • Fallback: if token is in local cache and unexpired → accept. Otherwise fail fast with 503.
  • Half‑Open: allow 1 probe request at a time. After 2 consecutive successful probes → Close.

Outcome: Known-good tokens continue; unknown tokens fail fast, protecting your service from lock-ups.
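The fallback rule for this example might look like the sketch below. Everything here is hypothetical: `authorize`, the `cache` mapping of token to expiry timestamp, and the `introspect` callable stand in for whatever your auth layer provides.

```python
import time


def authorize(token, cache, breaker_open, introspect, now=None):
    """Return (http_status, source) for a token check.

    cache: dict mapping token -> expiry time (epoch seconds), an assumed shape.
    introspect: callable returning True if the token is active (assumed).
    """
    now = time.time() if now is None else now
    if not breaker_open:
        active = introspect(token)
        return (200, "introspection") if active else (401, "introspection")
    # Breaker is open: never call the failing endpoint.
    expires_at = cache.get(token)
    if expires_at is not None and expires_at > now:
        return (200, "cache")
    return (503, "fast-fail")
```

A trade-off worth stating: while the breaker is open, a token revoked at the provider is still accepted until its cached expiry passes.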

Example 3 — Flaky third‑party payments

Scenario: Payment provider intermittently times out.

  • Retry policy: at most 2 retries with exponential backoff (100 ms, 300 ms) and jitter.
  • Breaker threshold: if 50% of last 40 attempts fail/time out → Open for 20s.
  • Half‑Open: allow 3 probes; require all 3 to succeed < 600 ms.
  • Fallback: respond with 202 Accepted: “Payment processing delayed.” Queue for later processing.

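Outcome: Users avoid long waits, and operations can reconcile queued payments. Avoiding double charges also depends on retried and queued requests being idempotent (for example, by sending an idempotency key).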
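The retry policy in this example can be sketched as follows. The helper names, the growth factor of 3 (which yields the 100 ms and 300 ms delays above), and the ±20% jitter are assumptions for illustration. The sketch retries only on `TimeoutError` and assumes the payment call is safe to repeat.

```python
import random
import time


def backoff_delays(retries=2, base_s=0.1, factor=3.0, jitter=0.2, rng=random.random):
    """Delays before each retry: base_s * factor**attempt, spread by +/- jitter."""
    delays = []
    for attempt in range(retries):
        nominal = base_s * factor ** attempt
        spread = nominal * jitter
        delays.append(nominal - spread + 2 * spread * rng())
    return delays


def call_with_retries(call, retries=2, sleep=time.sleep):
    """Run call(); on TimeoutError retry up to `retries` times, then re-raise."""
    delays = backoff_delays(retries=retries)
    for attempt in range(retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == retries:
                raise
            sleep(delays[attempt])
```

Each user request can now cost up to three attempts, which is why the breaker threshold above counts attempts rather than requests.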

Design choices and defaults

  • What counts as a failure: count errors and timeouts; optionally count responses slower than your SLO. Do not count circuit-open rejections, because those calls never reached the dependency and would keep the failure rate artificially high.
  • Windows: rolling window by count (e.g., last 50 calls) is simple and stable; time windows (e.g., 10s) work well with high traffic.
  • Timeouts + retries + breaker: always set timeouts first; use limited retries with backoff; let the breaker trip if underlying health is bad.
  • Isolation: combine with bulkheads (separate thread pools) so a slow dependency can’t consume all resources.
  • Telemetry: record state transitions and reasons to help tuning.
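The isolation point can be illustrated with a concurrency cap per dependency. Note that this sketch swaps in a semaphore-based bulkhead rather than the separate thread pools mentioned above; the `Bulkhead` and `BulkheadFull` names are hypothetical.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the dependency already has the maximum calls in flight."""


class Bulkhead:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("too many concurrent calls")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

A semaphore limits how many callers can be stuck on one dependency, but unlike a dedicated thread pool it does not move the call off the caller's thread, so a timeout on the call itself is still required.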

Implementation sketch (pseudocode)

// CallWithBreaker(depCall, cfg)
// Assumes cfg.requiredProbeSuccesses <= cfg.halfOpenProbes.
if breaker.state == OPEN:
  if now < breaker.nextProbeAt:
    return fallbackOrFastFail()   // rejected calls are not recorded as failures
  breaker.state = HALF_OPEN
  breaker.successProbes = 0       // start each probing round from zero
  breaker.allowedProbes = cfg.halfOpenProbes

if breaker.state == HALF_OPEN:
  if breaker.allowedProbes == 0:
    return fallbackOrFastFail()   // probe budget used up
  breaker.allowedProbes -= 1

result = depCall.withTimeout(cfg.timeout).withLimitedRetries(cfg.retries)

if breaker.state == HALF_OPEN:
  if result.success and result.latency <= cfg.latencySLO:
    breaker.successProbes += 1
    if breaker.successProbes >= cfg.requiredProbeSuccesses:
      breaker.state = CLOSED
      resetMetrics()
  else:
    breaker.state = OPEN
    breaker.nextProbeAt = now + cfg.openDuration
    return fallbackOrFastFail()
else: // CLOSED
  updateWindow(result)
  // Evaluate only once the window holds enough samples.
  if windowCallCount() >= cfg.minCalls and
     (windowFailureRate() >= cfg.failureThreshold or windowP95() > cfg.latencyThreshold):
    breaker.state = OPEN
    breaker.nextProbeAt = now + cfg.openDuration
    if not result.success:
      return fallbackOrFastFail()   // a successful result is still returned below

return result
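For readers who want something runnable, here is a minimal Python sketch of the same state machine. It is a simplification under stated assumptions: single-threaded use, failure-rate trips only (no latency trips, retries, or fallbacks), and any exception from the wrapped call counts as a failure. The class and parameter names are hypothetical.

```python
import time
from collections import deque

CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"


class BreakerOpenError(Exception):
    """Raised when a call is rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, min_calls=20, window_size=50,
                 open_duration=30.0, required_probe_successes=3,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.open_duration = open_duration
        self.required_probe_successes = required_probe_successes
        self.clock = clock                       # injectable for tests
        self.window = deque(maxlen=window_size)  # True = success, False = failure
        self.state = CLOSED
        self.next_probe_at = 0.0
        self.probe_successes = 0

    def _open(self):
        self.state = OPEN
        self.next_probe_at = self.clock() + self.open_duration

    def call(self, fn):
        if self.state == OPEN:
            if self.clock() < self.next_probe_at:
                raise BreakerOpenError("circuit open")  # not counted as a failure
            self.state = HALF_OPEN
            self.probe_successes = 0
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == HALF_OPEN:
            self.probe_successes += 1
            if self.probe_successes >= self.required_probe_successes:
                self.state = CLOSED
                self.window.clear()  # start fresh after recovery
        else:
            self.window.append(True)

    def _on_failure(self):
        if self.state == HALF_OPEN:
            self._open()  # any failed probe reopens the circuit
            return
        self.window.append(False)
        failures = self.window.count(False)
        if (len(self.window) >= self.min_calls
                and failures / len(self.window) >= self.failure_threshold):
            self._open()
```

A caller would wrap each dependency call, for example `breaker.call(lambda: client.get(url))` with a hypothetical `client`, and catch `BreakerOpenError` to serve the fallback.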

Self-check checklist

  • I set dependency timeouts shorter than the timeout budget of the clients calling me.
  • I defined what counts as a failure (errors, timeouts, slow responses).
  • I chose thresholds and windows that match traffic volume.
  • I have a clear fallback (or intentional fast-fail) per endpoint.
  • I log breaker state changes with context (service, reason, window stats).
  • I tested Half‑Open behavior under recovery.
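For the logging item above, one option is to emit each state change as a single structured line. This is a sketch using Python's standard `logging` and `json` modules; the `log_transition` helper and its field names are assumptions, not a required schema.

```python
import json
import logging

logger = logging.getLogger("breaker")


def log_transition(service, old_state, new_state, reason, window_stats):
    """Log one breaker state change as a JSON line and return that line."""
    event = {
        "service": service,
        "from": old_state,
        "to": new_state,
        "reason": reason,
        **window_stats,  # e.g. sample count, error rate, p95
    }
    line = json.dumps(event, sort_keys=True)
    if new_state == "open":
        logger.warning(line)  # opening is the event operators care about most
    else:
        logger.info(line)
    return line
```

Including the sample count in `window_stats` makes it easy to spot breakers that opened on too little data.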

Common mistakes and how to self-check

  • Overly aggressive trips: If your breaker opens during minor blips, increase the minimum calls per window or raise thresholds. Check logs for breakers that opened with only a few samples.
  • Unbounded retries: Cap the retry count; otherwise retries multiply the load on a dependency that is already struggling.
  • No fallback: Decide per endpoint: cached data, default, queued work, or fail fast. Ensure user-visible messages are clear.
  • Shared pool exhaustion: Without bulkheads, a slow dependency can tie up every thread before the breaker opens. Use separate pools per dependency.
  • Forgetting slow=fail: If users care about latency, count slow responses as failures for breaker evaluation.

Learning path

  1. Set client timeouts and small, bounded retries.
  2. Add a simple failure-rate breaker on one dependency.
  3. Introduce latency-based trips and Half‑Open probes.
  4. Add fallbacks and telemetry for state transitions.
  5. Combine with bulkheads and load-shedding.
  6. Tune thresholds in staging load tests; then ship gradually.

Practical projects

  • Project 1: Wrap a slow dummy endpoint with a breaker. Simulate 50% failure and verify Open/Half‑Open transitions.
  • Project 2: Add a cache fallback for a product detail call. Measure user-perceived latency before/after.
  • Project 3: Build a dashboard panel showing breaker state, open durations, and reasons.

Exercises

Exercise 1 — Choose thresholds

Traffic: ~200 rps. Normal error rate 0.5%, p95 latency 150 ms. During incidents, error rate hits 30% and p95 1.5 s for a few minutes. Define a breaker config (window, thresholds, open/half‑open) that trips appropriately but avoids noise.

Exercise 2 — Fallback design

Your service calls a recommendation API on homepage render. If it’s slow or failing, you still want a fast page. Design a fallback that is safe and measurable.

  • Write your answer as a small config snippet or bullet list.
  • Sanity-check: would this trip during normal spikes? Can you explain each dial?

Mini challenge

Pick one critical dependency. Propose Closed→Open criteria, Open duration, and Half‑Open probe rules in three lines. Keep it production-practical.

Hint

Use: min 50 calls per window, 50% failures or p95 > target, open 30–60s, 3–5 probes, strict success criteria.

Next steps

  • Integrate breaker metrics into alerts (e.g., “breaker opened > X times in 10m”).
  • Run a load test that forces the breaker to open and recover; adjust thresholds.
  • Document fallbacks so product and support teams know user impact.
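The alert idea in the first step ("breaker opened > X times in 10m") can be sketched as a sliding time window over open events. The `OpenRateAlert` class and its defaults are hypothetical; in practice this logic usually lives in your metrics or alerting system rather than in application code.

```python
from collections import deque


class OpenRateAlert:
    """Fires when the breaker opened more than max_opens times in window_s seconds."""

    def __init__(self, max_opens: int = 3, window_s: float = 600.0):
        self.max_opens = max_opens
        self.window_s = window_s
        self._opened_at = deque()

    def record_open(self, now: float) -> bool:
        """Record an open event at time `now`; return True if the alert should fire."""
        self._opened_at.append(now)
        # Drop events that have aged out of the window.
        while self._opened_at and self._opened_at[0] <= now - self.window_s:
            self._opened_at.popleft()
        return len(self._opened_at) > self.max_opens
```

Alerting on repeated opens rather than on a single open avoids paging for a breaker that tripped once and recovered on its own.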
