Who this is for
Backend engineers who need services that stay reliable when dependencies fail: third-party APIs, databases, caches, internal microservices, and worker queues.
- You own or touch service-to-service calls.
- You want to avoid cascading failures and noisy retries.
- You need predictable latency under partial outages.
Prerequisites
- Comfortable with HTTP or RPC calls and timeouts.
- Basic understanding of threads/connection pools.
- Can read pseudocode and JSON-like configs.
Why this matters
In real backend work, dependencies fail: payment gateways time out, databases slow down, and internal services deploy bad versions. Without protection, your service can pile up requests, exhaust threads, and go down—hurting SLAs and users.
- Protect checkout from a flaky payment provider.
- Keep search responsive even when recommendations are slow.
- Prevent thread/connection pool exhaustion and cascading failures.
Concept explained simply
Circuit breaker: watches calls to a dependency. If the failure rate gets too high, it opens and short-circuits new calls for a cooldown period. After the cooldown it half-opens and lets a few trial calls through. If the trials succeed, it closes; if not, it reopens.
Bulkhead: isolates resources (threads, connections, queues) per dependency or feature. If one area floods, others keep working.
Deep dive: common circuit breaker signals
- Failure rate threshold (e.g., 50% failures within a sliding window).
- Slow call threshold (e.g., calls slower than 800 ms count as slow).
- Minimum calls before evaluating (avoid opening on tiny samples).
- Open state wait (cooldown) and half-open trial count.
- Which errors count: timeouts, 5xx, connection errors; often ignore 4xx validation errors.
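Taken together, these signals map directly onto a breaker config. Below is a minimal JSON-like sketch, written as a Python dict; the field names are illustrative and not tied to any particular library.

# Python (sketch) - illustrative breaker config, field names invented for this example
payment_breaker_config = {
    "failure_rate_threshold_pct": 50,     # open when >= 50% of recorded calls fail
    "slow_call_duration_ms": 800,         # calls slower than this count as slow
    "slow_call_rate_threshold_pct": 50,   # open when >= 50% of calls are slow
    "sliding_window": {"type": "count", "size": 20, "min_calls": 10},
    "open_wait_ms": 30_000,               # cooldown before half-open
    "half_open_permits": 3,               # trial calls allowed in half-open
    "record_as_failure": ["timeout", "5xx", "connection_error"],
    "ignore": ["4xx_validation"],         # client errors are not dependency failures
}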
Mental model
- Circuit breaker = a safety switch. It trips when the line overheats (too many failures) and resets after cooling down.
- Bulkhead = watertight compartments in a ship. A leak in one compartment doesn’t sink the ship.
Worked examples
Example 1: Payment API protection
Goal: Prevent checkout timeouts when the payment provider has a partial outage.
// Pseudocode
cb = CircuitBreaker(
  failureRateThreshold = 50,    // %
  slidingWindow = {type: "count", size: 20, minCalls: 10},
  slowCallDurationMs = 800,
  slowCallRateThreshold = 50,
  openWaitMs = 30000,           // 30s cooldown
  halfOpenPermits = 3           // trial calls
)

with cb.guard():
  resp = payClient.charge(request, timeoutMs=700)
  return resp

onOpen:                         // breaker rejected the call without trying it
  enqueueForRetryLater(request) // fallback
  return {status: "queued"}
Why it works: bounded timeout prevents request pileups; fallback avoids blocking the user; half-open probes recovery.
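To make the moving parts concrete, here is a minimal Python sketch of a count-based breaker with closed, open, and half-open states. It is deliberately simplified: no thread safety, no slow-call tracking, and names like CircuitOpenError are invented for this example. Treat it as a sketch of the technique, not a production implementation or any specific library's API.

# Python (sketch)
import time
from collections import deque

class CircuitOpenError(Exception):
    """Raised when the breaker short-circuits a call."""

class CircuitBreaker:
    """Count-based breaker sketch: closed -> open -> half-open -> closed. Not thread-safe."""

    def __init__(self, failure_rate_threshold=50, window_size=20, min_calls=10,
                 open_wait_s=30.0, half_open_permits=3):
        self.failure_rate_threshold = failure_rate_threshold
        self.window = deque(maxlen=window_size)      # True = failed call, False = success
        self.min_calls = min_calls
        self.open_wait_s = open_wait_s
        self.half_open_permits = half_open_permits
        self.state = "closed"
        self.opened_at = 0.0
        self.half_open_successes = 0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.open_wait_s:
                raise CircuitOpenError("short-circuited: breaker is open")
            self.state = "half_open"                 # cooldown elapsed: allow trial calls
            self.half_open_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "half_open":
            if failed:
                self._open()                         # any failed trial reopens immediately
            else:
                self.half_open_successes += 1
                if self.half_open_successes >= self.half_open_permits:
                    self.state = "closed"            # trials passed: resume normal traffic
                    self.window.clear()
            return
        self.window.append(failed)
        if len(self.window) >= self.min_calls:       # don't evaluate tiny samples
            failure_rate = 100 * sum(self.window) / len(self.window)
            if failure_rate >= self.failure_rate_threshold:
                self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = time.monotonic()

# Usage sketch (payClient.charge and enqueueForRetryLater are stand-ins from the pseudocode):
#   cb = CircuitBreaker()
#   try:
#       resp = cb.call(payClient.charge, request, timeoutMs=700)
#   except CircuitOpenError:
#       enqueueForRetryLater(request)    # fallback: queue it and tell the user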
Example 2: Isolating recommendation service with bulkheads
The search page calls recommendations, which is sometimes slow. We isolate it in its own pool so threads stuck waiting on recommendations can't starve search:
// Two pools: web request pool and per-dependency pool
webThreads = 32
recoPool = ThreadPool(size=4, queue=10)
result = runIn(recoPool) { recoClient.get(timeoutMs=600) }
if result.timeoutOrRejected:
  return pageWithoutReco // degrade gracefully
Outcome: Search stays responsive even if recommendations lag or saturate.
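A minimal Python version of this bulkhead is a small, dedicated ThreadPoolExecutor plus a hard timeout on the result; reco_pool, reco_client, and fetch_recommendations are placeholder names for this sketch. Note that ThreadPoolExecutor's internal queue is unbounded, so a hard queue limit like the pseudocode's queue=10 would need a semaphore or a custom executor in front of it.

# Python (sketch)
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Small, dedicated pool for the non-critical dependency. The web server's own
# worker pool stays separate, so a slow recommendations call can't consume it.
reco_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="reco")

def fetch_recommendations(reco_client, query):
    """Return recommendations, or None if the reco bulkhead is slow or failing."""
    future = reco_pool.submit(reco_client.get, query)   # runs inside the reco bulkhead
    try:
        return future.result(timeout=0.6)               # 600 ms budget for the whole call
    except FutureTimeout:
        return None                                     # too slow: render the page without reco
    except Exception:
        return None                                     # reco errors must never break search

The caller treats None as "render the page without recommendations", which is the graceful-degradation path from the pseudocode above.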
Example 3: Safe retries with jitter
Retries can amplify outages. Combine with circuit breakers and jitter:
for attempt in 1..3:
  try:
    return call(timeoutMs=500)
  catch transient:
    if circuitBreaker.isOpen():
      break                                    // dependency is known-bad; stop retrying
    sleep(50ms * 2^attempt + random(0..30ms))  // exponential backoff + jitter
fallback()
Key: keep retry counts low, bound every timeout, add random jitter to avoid synchronized spikes, and stop retrying when the breaker opens.
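The same loop as a runnable Python sketch. TransientError, the call and fallback callables, and breaker.is_open() are assumptions standing in for your retryable-error type, your client call, your cheap fallback, and whatever open-check your breaker exposes (with the sketch from Example 1, that check would be cb.state == "open").

# Python (sketch)
import random
import time

class TransientError(Exception):
    """Stand-in for errors worth retrying (timeouts, connection resets, 503s)."""

def call_with_retries(call, breaker, fallback, attempts=3,
                      base_delay_s=0.05, max_jitter_s=0.03):
    """Bounded retries with exponential backoff and jitter; stops early if the breaker opens."""
    for attempt in range(1, attempts + 1):
        try:
            return call()                      # the call itself carries its own bounded timeout
        except TransientError:
            if attempt == attempts or breaker.is_open():
                break                          # out of attempts, or the dependency is known-bad
            # Exponential backoff plus random jitter so many clients don't retry in lockstep.
            time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0, max_jitter_s))
    return fallback()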
Exercises
Do these directly after reading. They mirror the graded exercises below.
Exercise 1: Configure a safe circuit breaker for checkout → payment
- Trigger open when failures ≥ 50% over the last 20 calls (evaluate after at least 10 calls or 5 seconds).
- Consider slow calls > 800 ms as failures if they exceed 50%.
- Stay open for 30 seconds, then half-open with 3 trial calls; close if ≥ 2 succeed.
- Count timeouts, 5xx, and connection errors; ignore 4xx validation errors.
- Fallback: queue the payment and notify the user that it’s being processed.
Produce a JSON-like config object.
Exercise 2: Design bulkheads for three dependencies
Service S depends on:
- Payments: spiky latency, occasional timeouts.
- Catalog: mostly reliable and fast.
- Recommendations: slow and non-critical.
Given an 8-core instance, propose thread pool sizes and queue limits per dependency, plus timeouts. Goal: keep core flows responsive if one dependency degrades.
Self-check checklist
- Did you set explicit timeouts for every dependency?
- Does the breaker have a minimum-call threshold?
- Are bulkhead pools separated for non-critical dependencies?
- Is there a user-facing or internal fallback path?
- Did you avoid unbounded queues?
Common mistakes (and how to self-check)
- No timeouts on I/O calls. Self-check: verify every client call has a concrete timeout smaller than your SLA budget (see the sketch after this list).
- Opening the breaker on tiny samples. Self-check: ensure minCalls or minWindow time is set.
- Retry storms. Self-check: limit retries, add backoff + jitter, and stop when the breaker is open.
- Shared global pool for everything. Self-check: confirm critical and non-critical dependencies have separate pools/queues.
- Ignoring slow calls. Self-check: record slow-call ratio to catch brownouts, not just hard failures.
- Falling back to another slow dependency. Self-check: fallback must be local/cheap (cache, default, queue).
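As a concrete version of the first and last self-checks, a guarded call might look like the sketch below. It assumes the requests library; the URL and the in-process local_price_cache are placeholders for this example.

# Python (sketch)
import requests

local_price_cache = {}   # cheap, in-process fallback data

def get_price(sku):
    """Explicit timeout well under the SLA budget, with a local (not remote) fallback."""
    try:
        resp = requests.get(f"https://pricing.internal/price/{sku}",
                            timeout=(0.1, 0.5))          # (connect, read) seconds
        resp.raise_for_status()
        price = resp.json()["price"]
        local_price_cache[sku] = price                   # refresh the cheap fallback as we go
        return price
    except requests.RequestException:
        # Stale-but-local beats slow-and-remote: never fall back to another network call.
        return local_price_cache.get(sku)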
Practical projects
- Project 1: Wrap a simulated flaky HTTP endpoint with a circuit breaker and measure latency distribution before/after under load.
- Project 2: Build a page that calls two services (critical and optional). Use separate pools and show graceful degradation when the optional service slows.
- Project 3: Implement bounded retries with jitter and compare traffic during a synthetic outage with and without the breaker.
Learning path
- Before: Timeouts, retries, idempotency, and backoff.
- Now: Circuit breakers and bulkheads to prevent cascade failures.
- Next: Rate limiting, backpressure, hedged requests, and graceful degradation patterns.
Next steps
- Add metrics: track failure rate, slow-call rate, open/half-open durations, and rejection counts (see the sketch after this list).
- Tune thresholds using real traffic percentiles (e.g., use p95 latency as a starting point for the slow-call threshold).
- Document fallbacks and user messaging for degraded modes.
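A minimal sketch of the per-dependency counters worth exporting; the names are illustrative, and in practice these would feed your metrics library rather than live in a plain dataclass.

# Python (sketch)
from dataclasses import dataclass, field
import time

@dataclass
class BreakerMetrics:
    """Illustrative per-dependency counters for a circuit breaker."""
    calls: int = 0
    failures: int = 0
    slow_calls: int = 0
    rejected: int = 0                                     # short-circuited while open
    state_changed_at: float = field(default_factory=time.monotonic)

    def record_call(self, duration_s: float, failed: bool, slow_threshold_s: float = 0.8):
        self.calls += 1
        self.failures += failed                           # bool counts as 0/1
        self.slow_calls += duration_s >= slow_threshold_s

    @property
    def failure_rate(self) -> float:
        return self.failures / self.calls if self.calls else 0.0

    @property
    def slow_call_rate(self) -> float:
        return self.slow_calls / self.calls if self.calls else 0.0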
Mini challenge
Your service calls Inventory (critical), Pricing (critical), and Reviews (optional). Inventory is sometimes slow; Reviews is frequently slow; Pricing is stable. On 4 cores, propose breakers and bulkheads that keep checkout under 1.2 s p95 during an Inventory brownout. Write a short config and one-sentence fallback per dependency.
Quick test