Circuit Breakers And Bulkheads Basics

Learn the basics of circuit breakers and bulkheads with explanations, exercises, and a quick test (for backend engineers).

Published: January 20, 2026 | Updated: January 20, 2026

Who this is for

Backend engineers whose services must stay reliable when dependencies fail: third-party APIs, databases, caches, internal microservices, and worker queues.

  • You own or touch service-to-service calls.
  • You want to avoid cascading failures and noisy retries.
  • You need predictable latency under partial outages.

Prerequisites

  • Comfortable with HTTP or RPC calls and timeouts.
  • Basic understanding of threads/connection pools.
  • Can read pseudocode and JSON-like configs.

Why this matters

In real backend work, dependencies fail: payment gateways time out, databases slow down, and internal services deploy bad versions. Without protection, your service can pile up requests, exhaust its threads, and go down, hurting SLAs and users. Circuit breakers and bulkheads let you:

  • Protect checkout from a flaky payment provider.
  • Keep search responsive even when recommendations are slow.
  • Prevent thread/connection pool exhaustion and cascading failures.

Concept explained simply

Circuit breaker: watches calls to a dependency. If the failure rate crosses a threshold, it opens and rejects new calls immediately (short-circuits them) for a cooldown period. After the cooldown, it half-opens and lets a few trial calls through. If the trials succeed, it closes and normal traffic resumes; if they fail, it reopens.

Bulkhead: isolates resources (threads, connections, queues) per dependency or feature. If one area floods, others keep working.

Deep dive: common circuit breaker signals
  • Failure rate threshold (e.g., 50% failures within a sliding window).
  • Slow call threshold (e.g., calls slower than 800 ms count as slow).
  • Minimum calls before evaluating (avoid opening on tiny samples).
  • Open state wait (cooldown) and half-open trial count.
  • Which errors count: timeouts, 5xx, connection errors; often ignore 4xx validation errors.
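The signals above can be sketched as a small count-based window that decides when a breaker should open. This is a minimal illustration, not any particular library's API: the names Outcome, counts_as_failure, and CallWindow are hypothetical, and the defaults simply reuse the example numbers from the list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Outcome:
    failed: bool        # True for timeouts, 5xx, and connection errors
    duration_ms: float


def counts_as_failure(status=None, timed_out=False, connection_error=False):
    """Classify one call; 4xx validation errors are deliberately not failures."""
    return timed_out or connection_error or (status is not None and status >= 500)


class CallWindow:
    """Count-based sliding window over the most recent calls (hypothetical helper)."""

    def __init__(self, size=20, min_calls=10, failure_rate_pct=50.0,
                 slow_call_ms=800.0, slow_rate_pct=50.0):
        self.calls = deque(maxlen=size)   # old outcomes fall off automatically
        self.min_calls = min_calls
        self.failure_rate_pct = failure_rate_pct
        self.slow_call_ms = slow_call_ms
        self.slow_rate_pct = slow_rate_pct

    def record(self, outcome):
        self.calls.append(outcome)

    def should_open(self):
        n = len(self.calls)
        if n < self.min_calls:
            return False                  # too few samples to judge
        failure_rate = 100.0 * sum(c.failed for c in self.calls) / n
        slow_rate = 100.0 * sum(c.duration_ms > self.slow_call_ms
                                for c in self.calls) / n
        return (failure_rate >= self.failure_rate_pct
                or slow_rate >= self.slow_rate_pct)
```

The sketch covers only the "should it open?" decision; the cooldown and half-open trials are separate state that a full breaker would track on top of this window.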

Mental model

  • Circuit breaker = a safety switch. It trips when the line overheats (too many failures) and resets after cooling down.
  • Bulkhead = watertight compartments in a ship. A leak in one compartment doesn’t sink the ship.

Worked examples

Example 1: Payment API protection

Goal: Prevent checkout timeouts when the payment provider has a partial outage.

// Pseudocode
cb = CircuitBreaker(
  failureRateThreshold = 50,            // % of calls in the window that failed
  slidingWindow = {type: "count", size: 20, minCalls: 10},
  slowCallDurationMs = 800,             // calls slower than this count as slow
  slowCallRateThreshold = 50,           // % of slow calls that opens the breaker
  openWaitMs = 30000,                   // 30s cooldown
  halfOpenPermits = 3                   // trial calls
)

try:
  with cb.guard():                      // rejects immediately while the breaker is open
    return payClient.charge(request, timeoutMs=700)
catch CircuitOpen:
  enqueueForRetryLater(request)         // fallback
  return {status: "queued"}

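Why it works: the bounded timeout prevents request pileups, the fallback avoids blocking the user, and the half-open state probes for recovery. Note that with a 700 ms client timeout, calls are effectively cut off as timeouts before they can reach the 800 ms slow-call threshold, so the slow-call rule only comes into play if the timeout is raised above it.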
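For readers who want something runnable, here is a minimal Python sketch of the closed, open, and half-open cycle from Example 1. It makes several simplifying assumptions that are not in the pseudocode: it is single-threaded, it tracks only failures (no slow-call rate), any exception counts as a failure, and the breaker closes only after all trial calls succeed. The names CircuitOpenError, charge_or_queue, and the clock parameter are hypothetical.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_rate_pct=50, window_size=20, min_calls=10,
                 open_wait_s=30.0, half_open_permits=3, clock=time.monotonic):
        self.failure_rate_pct = failure_rate_pct
        self.min_calls = min_calls
        self.open_wait_s = open_wait_s
        self.half_open_permits = half_open_permits
        self.clock = clock                        # injectable so tests can fake time
        self.window = deque(maxlen=window_size)   # True = failed call
        self.state = "closed"
        self.opened_at = 0.0
        self.trials = 0

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.window.clear()

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_wait_s:
                raise CircuitOpenError()          # still cooling down
            self.state = "half_open"              # cooldown over: allow trials
            self.trials = 0
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()                          # a failed trial reopens
            return
        self.window.append(True)
        n = len(self.window)
        if n >= self.min_calls and 100 * sum(self.window) / n >= self.failure_rate_pct:
            self._trip()

    def _on_success(self):
        if self.state == "half_open":
            self.trials += 1
            if self.trials >= self.half_open_permits:
                self.state = "closed"             # all trials succeeded
            return
        self.window.append(False)


def charge_or_queue(breaker, charge, enqueue, request):
    """Try the payment call; if the breaker is open, queue the request instead."""
    try:
        return breaker.call(lambda: charge(request))
    except CircuitOpenError:
        enqueue(request)
        return {"status": "queued"}
```

Here charge stands in for payClient.charge with its own timeout, and enqueue for enqueueForRetryLater. A production breaker would also need locking and a cap on concurrent half-open trials.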

Example 2: Isolating recommendation service with bulkheads

The search page calls recommendations, which is sometimes slow. We give it its own small pool so slow recommendation calls cannot tie up the threads that serve search:

// Two pools: the web request pool and a small dedicated pool for recommendations
webThreads = 32
recoPool = ThreadPool(size=4, queue=10)   // bounded queue: extra work is rejected, not buffered

result = runIn(recoPool) { recoClient.get(timeoutMs=600) }
if result.timeoutOrRejected:
  return pageWithoutReco // degrade gracefully

Outcome: Search stays responsive even if recommendations lag or saturate.
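A rough Python equivalent of this bulkhead, as a sketch under stated assumptions: the Bulkhead class and its run method are hypothetical names, and because ThreadPoolExecutor queues work without a limit, a semaphore is used here to cap running plus queued calls at workers + queue_limit.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Bulkhead:
    """Dedicated pool for one dependency with a hard cap on pending work."""

    def __init__(self, workers=4, queue_limit=10):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # One slot per call that is either running or waiting in the queue.
        self._slots = threading.BoundedSemaphore(workers + queue_limit)

    def run(self, fn, timeout_s, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback                       # saturated: reject immediately
        future = self._pool.submit(fn)
        # Free the slot when the task finishes or is cancelled.
        future.add_done_callback(lambda _f: self._slots.release())
        try:
            return future.result(timeout=timeout_s)
        except Exception:                         # timeout or error inside fn
            future.cancel()                       # only helps if still queued
            return fallback
```

Usage would look like reco_pool.run(fetch_recommendations, 0.6, None), rendering the page without recommendations when the result is None. One caveat: a timed-out call keeps its worker thread busy until the underlying request returns, so the client call itself still needs its own timeout.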

Example 3: Safe retries with jitter

Retries can amplify outages. Combine with circuit breakers and jitter:

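for attempt in 1..3:
  if circuitBreaker.isOpen():
    break                                        // stop retrying once the breaker opens
  try:
    return call(timeoutMs=500)
  catch transient:
    if attempt < 3:                              // no point sleeping after the last attempt
      sleep(50ms * 2^attempt + random(0..30ms))  // exponential backoff plus jitter
return fallback()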

Key: low retry counts, bounded timeouts, random jitter to avoid synchronized spikes, and stop when the breaker opens.
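The same retry loop as a runnable Python sketch. TransientError, call_with_retries, and the breaker_is_open callable are hypothetical names; the delay formula is one reading of the pseudocode's base, backoff, and jitter values (50 ms times 2^attempt, plus up to 30 ms of random jitter).

```python
import random
import time


class TransientError(Exception):
    """Stand-in for errors worth retrying (timeouts, connection resets)."""


def call_with_retries(call, breaker_is_open, fallback, attempts=3,
                      base_s=0.05, max_jitter_s=0.03, sleep=time.sleep):
    for attempt in range(1, attempts + 1):
        if breaker_is_open():
            break                                 # do not add load to a failing dependency
        try:
            return call()
        except TransientError:
            if attempt == attempts:
                break                             # out of attempts: skip the final sleep
            # Exponential backoff plus random jitter to desynchronize clients.
            delay = base_s * 2 ** attempt + random.uniform(0, max_jitter_s)
            sleep(delay)
    return fallback()
```

The sleep parameter exists only so the delays can be observed or skipped in tests; in normal use the default time.sleep applies.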

Exercises

Do these directly after reading. They mirror the graded exercises below.

Exercise 1: Configure a safe circuit breaker for checkout → payment

  1. Trigger open when failures ≥ 50% over the last 20 calls (evaluate after at least 10 calls or 5 seconds).
  2. Treat calls slower than 800 ms as slow, and open the breaker if the slow-call rate reaches 50%.
  3. Stay open for 30 seconds, then half-open with 3 trial calls; close if ≥ 2 succeed.
  4. Count timeouts, 5xx, and connection errors; ignore 4xx validation errors.
  5. Fallback: queue the payment and notify the user that it’s being processed.

Produce a JSON-like config object.

Exercise 2: Design bulkheads for three dependencies

Service S depends on:

  • Payments: spiky latency, occasional timeouts.
  • Catalog: mostly reliable and fast.
  • Recommendations: slow and non-critical.

Given an 8-core instance, propose thread pool sizes and queue limits per dependency, plus timeouts. Goal: keep core flows responsive if one dependency degrades.
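One possible starting point for the pool sizes is Little's law: the number of calls in flight is roughly the arrival rate multiplied by how long each call takes. The helper below is a hypothetical sketch, and the traffic figures in the usage note are made-up inputs, not part of the exercise.

```python
import math


def pool_size(requests_per_s, latency_s, headroom=1.5):
    """Estimate threads needed: in-flight calls ~= arrival rate x latency."""
    return math.ceil(requests_per_s * latency_s * headroom)
```

For instance, an assumed 20 calls per second at 0.5 s each gives pool_size(20, 0.5), which is 15 threads with 50% headroom. Treat the result as a first guess to validate under load, and keep non-critical pools deliberately small.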

Self-check checklist
  • Did you set explicit timeouts for every dependency?
  • Does the breaker have a minimum-call threshold?
  • Are bulkhead pools separated for non-critical dependencies?
  • Is there a user-facing or internal fallback path?
  • Did you avoid unbounded queues?

Common mistakes (and how to self-check)

  • No timeouts on I/O calls. Self-check: verify every client call has a concrete timeout smaller than your SLA budget.
  • Opening the breaker on tiny samples. Self-check: ensure minCalls or minWindow time is set.
  • Retry storms. Self-check: limit retries, add backoff + jitter, and stop when the breaker is open.
  • Shared global pool for everything. Self-check: confirm critical and non-critical dependencies have separate pools/queues.
  • Ignoring slow calls. Self-check: record slow-call ratio to catch brownouts, not just hard failures.
  • Falling back to another slow dependency. Self-check: fallback must be local/cheap (cache, default, queue).

Practical projects

  • Project 1: Wrap a simulated flaky HTTP endpoint with a circuit breaker and measure latency distribution before/after under load.
  • Project 2: Build a page that calls two services (critical and optional). Use separate pools and show graceful degradation when the optional service slows.
  • Project 3: Implement bounded retries with jitter and compare traffic during a synthetic outage with and without the breaker.

Learning path

  • Before: Timeouts, retries, idempotency, and backoff.
  • Now: Circuit breakers and bulkheads to prevent cascade failures.
  • Next: Rate limiting, backpressure, hedged requests, and graceful degradation patterns.

Next steps

  • Add metrics: track failure rate, slow-call rate, open/half-open durations, and rejection counts.
  • Tune thresholds using real traffic percentiles (e.g., use p95 latency as a starting point for the slow-call threshold).
  • Document fallbacks and user messaging for degraded modes.
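As a small illustration of the percentile-based tuning above, the sketch below derives a starting slow-call threshold from recorded latencies using Python's standard library. The function name slow_call_threshold_ms is hypothetical, and the input is assumed to be a list of per-call latencies in milliseconds.

```python
import statistics


def slow_call_threshold_ms(latencies_ms):
    """Return the p95 of observed latencies as a starting slow-call threshold."""
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies_ms, n=100, method="inclusive")[94]
```

Recompute the value from fresh samples after traffic or dependency changes, since a threshold tuned on old data can hide new brownouts.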

Mini challenge

Your service calls Inventory (critical), Pricing (critical), and Reviews (optional). Inventory is sometimes slow; Reviews is frequently slow; Pricing is stable. On 4 cores, propose breakers and bulkheads that keep checkout under 1.2 s p95 during an Inventory brownout. Write a short config and one-sentence fallback per dependency.

Practice Exercises

Instructions

Create a JSON-like configuration that:

  • Opens when failures ≥ 50% over the last 20 calls; evaluate only after at least 10 calls or a 5-second window.
  • Counts slow calls > 800 ms as failures if slow-call rate ≥ 50%.
  • Stays open for 30 seconds; half-opens with 3 trial calls; closes if ≥ 2 succeed.
  • Records timeouts, 5xx, and connection errors; ignores 4xx validation errors.
  • Fallback: queue-and-notify the user.

Expected Output

{
  "name": "payment_api_cb",
  "failureRateThreshold": 50,
  "slidingWindow": {"type": "count", "size": 20, "minCalls": 10, "maxWindowTimeMs": 5000},
  "slowCallDurationThresholdMs": 800,
  "slowCallRateThreshold": 50,
  "waitDurationInOpenStateMs": 30000,
  "permittedNumberOfCallsInHalfOpenState": 3,
  "halfOpenSuccessThreshold": 2,
  "recordedFailures": ["5xx", "timeout", "connection_error"],
  "ignoreExceptions": ["validation_error_4xx"],
  "fallback": "queue_and_notify"
}

Circuit Breakers And Bulkheads Basics — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.
