Why this matters
As an API Engineer, you rarely control all systems involved in a request. A user signup may create a profile, enqueue a welcome email, write analytics, and notify a CRM. Some steps will eventually fail. Your job is to ensure correctness, clarity to callers, and recoverability when only part of the work completes.
- Real tasks: design idempotent endpoints with retries; build bulk APIs that report per-item status; implement sagas and compensating actions; set up DLQs; return useful error details without leaking internals.
- Outcomes: fewer incidents, faster recovery, predictable client behavior, consistent data over time.
Who this is for
- API and backend engineers building multi-service workflows or integrating external providers.
- Developers responsible for reliability of bulk operations, webhooks, and asynchronous jobs.
Prerequisites
- HTTP basics (methods, status codes), JSON.
- Familiarity with REST or RPC, and queues or event streams.
- Understanding of retries, timeouts, and logging.
Concept explained simply
Partial failure means some steps of a multi-step operation succeed while others fail. Example: order record created, payment charged, but inventory reservation fails. You cannot just "roll back the world" across distributed systems. Instead, you design for eventual correction: compensate, retry safely, or surface partial success clearly.
Mental model
- Railway model: every forward step has a matching reverse switch (compensation) in case the next track is blocked.
- Two piles: atomic core (must be all-or-nothing) vs. peripheral effects (can be eventual). Protect the atomic core; make peripherals idempotent and retryable.
Core patterns you will use
Idempotency keys
Accept an Idempotency-Key (or dedupe token) so clients can safely retry. The server caches the first result and replays it for any later request with the same key. Store the request hash together with the final outcome so you can detect accidental reuse of a key with different inputs.
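A minimal sketch of the replay-and-detect behavior described above. The names IdempotencyStore, StoredResult, and IdempotencyConflict are hypothetical, and the in-memory dict stands in for whatever shared store a real service would use.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class StoredResult:
    request_hash: str
    response: dict


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


class IdempotencyStore:
    """In-memory stand-in for a shared dedupe store (hypothetical)."""

    def __init__(self):
        self._results = {}

    def execute(self, key, body, handler):
        # Hash a canonical form of the body so key order does not matter.
        request_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()
        stored = self._results.get(key)
        if stored is not None:
            if stored.request_hash != request_hash:
                raise IdempotencyConflict(key)
            return stored.response  # replay the first outcome
        response = handler(body)
        self._results[key] = StoredResult(request_hash, response)
        return response
```

A production version would also need an atomic "insert if absent" on the key and an expiry (the replay window); both are left out here to keep the sketch short.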
Retries with backoff and jitter
- Use exponential backoff with jitter to avoid thundering herds.
- Bound retries (e.g., 3-5 attempts) and set a total timeout budget.
- Only retry safe, idempotent operations; do not retry non-idempotent effects without an idempotency strategy.
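The points above can be sketched as follows. This is one possible shape, using the "full jitter" variant of exponential backoff; RetryableError, the default base and cap, and the injectable sleep and clock parameters are assumptions made for illustration and testing.

```python
import random
import time


class RetryableError(Exception):
    """Marks a failure that is safe to retry (hypothetical)."""


def backoff_delays(max_attempts=4, base=0.5, cap=30.0, rng=random.random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(op, max_attempts=4, budget_seconds=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call op() up to max_attempts times within a total time budget."""
    deadline = clock() + budget_seconds
    delays = backoff_delays(max_attempts)
    while True:
        try:
            return op()
        except RetryableError:
            delay = next(delays, None)
            # Give up when attempts are exhausted or the budget would be exceeded.
            if delay is None or clock() + delay > deadline:
                raise
            sleep(delay)
```

Only errors explicitly raised as RetryableError are retried; anything else propagates immediately, which matches the rule of not retrying effects that are not known to be safe.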
At-least-once vs exactly-once
Exactly-once end-to-end is rare. Aim for at-least-once delivery with idempotent consumers and deduplication by event ID or business key.
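As a rough illustration of an idempotent consumer, the sketch below drops events whose ID has already been processed. The class name and the event_id field are assumptions, and the in-memory set stands in for a durable dedupe store.

```python
class IdempotentConsumer:
    """Skips events whose ID was already handled (in-memory sketch)."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # a real system would use a durable store

    def handle(self, event):
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate delivery: skip it
        self._handler(event)
        # Record the ID only after the handler succeeds, so a failure
        # leads to a redelivery rather than a silently dropped event.
        self._seen.add(event_id)
        return True
```

Because the ID is recorded after the handler runs, a crash between the two steps can still run the handler twice; the handler should therefore be idempotent itself, or the effect and the ID should be written in one transaction.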
Outbox pattern (transactional messaging)
Write domain data and an "outbox" event in the same database transaction. A dispatcher reads the outbox and publishes to queues/webhooks. Prevents losing side effects when the publisher or network fails.
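A small sketch of the idea using SQLite from the Python standard library. The table layout, the function names, and the publish callback are assumptions chosen for illustration, not a prescribed schema.

```python
import json
import sqlite3


def init_db(conn):
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_type TEXT, payload TEXT, published INTEGER DEFAULT 0)"
    )


def create_order(conn, order_id):
    # 'with conn' commits both inserts together or rolls both back.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'PENDING')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderCreated", json.dumps({"order_id": order_id})),
        )


def dispatch_once(conn, publish):
    """Publish unpublished rows in order; stop at the first failure."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox"
        " WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, event_type, payload in rows:
        try:
            publish(event_type, json.loads(payload))
        except Exception:
            break  # leave the row unpublished; the next run retries it
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE id = ?", (row_id,)
            )
        sent += 1
    return sent
```

If the dispatcher crashes after publishing but before marking the row, the event is published again on the next run. That is at-least-once delivery, which is why consumers need to deduplicate.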
Saga pattern (compensating actions)
A saga models a long-running workflow across services as a sequence of steps, each paired with a compensation. Coordinate the steps either by orchestration (a central coordinator) or by choreography (services reacting to events). Keep compensations side-effect-safe and idempotent.
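An orchestrated saga can be sketched in a few lines. The run_saga function and the (name, action, compensation) step shape are hypothetical; a real coordinator would also persist progress so it can resume after a crash.

```python
class SagaFailed(Exception):
    """Raised after a step fails and earlier steps were compensated."""


def run_saga(steps, context):
    """Run steps in order; on failure, compensate in reverse order.

    steps: list of (name, action, compensation) tuples, where action and
    compensation are callables that take the shared context.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action(context)
        except Exception as exc:
            # Undo only the steps that actually completed, newest first.
            for _, undo in reversed(completed):
                undo(context)
            raise SagaFailed(f"step {name!r} failed") from exc
        completed.append((name, compensation))
```

The failed step itself is not compensated here, on the assumption that a failed action left nothing behind; if an action can fail after a partial effect, its compensation must be state-aware.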
Partial success API responses
- For bulk requests, return per-item statuses with error reasons.
- Use clear fields like success, error_code, retry_after_seconds. HTTP 200 or 207 can represent mixed results; be consistent in your API.
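One way to assemble such a response is sketched below. The RETRYABLE table, the error codes, and the choice of 200 for all-success and 207 for mixed results are assumptions; what matters is that the API applies one convention consistently.

```python
# Hypothetical error codes that are worth retrying, with a suggested delay.
RETRYABLE = {"RATE_LIMIT": 60}


def item_status(item_id, error_code=None):
    """Build the per-item entry for one processed item."""
    if error_code is None:
        return {"id": item_id, "status": "ok"}
    delay = RETRYABLE.get(error_code)
    if delay is None:
        return {"id": item_id, "status": "failed",
                "error_code": error_code, "retry_after_seconds": 0}
    return {"id": item_id, "status": "retry",
            "error_code": error_code, "retry_after_seconds": delay}


def bulk_response(outcomes):
    """outcomes: list of (item_id, error_code_or_None) pairs."""
    items = [item_status(item_id, code) for item_id, code in outcomes]
    succeeded = sum(1 for item in items if item["status"] == "ok")
    body = {
        "summary": {"total": len(items), "succeeded": succeeded,
                    "failed": len(items) - succeeded},
        "items": items,
    }
    status_code = 200 if succeeded == len(items) else 207
    return status_code, body
```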
Circuit breakers and fallbacks
When a dependency is unhealthy, fail fast, degrade non-critical features, or queue work asynchronously.
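A simplified circuit breaker, to make "fail fast" concrete. The thresholds, the class names, and the injectable clock are assumptions; production breakers usually add per-endpoint state, metrics, and a limit on concurrent trial calls.

```python
import time


class CircuitOpen(Exception):
    """Raised instead of calling a dependency that is considered down."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, op):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_seconds:
                raise CircuitOpen()  # fail fast, do not touch the dependency
            # Reset period elapsed: fall through and allow one trial call.
        try:
            result = op()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open)
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller can catch CircuitOpen and fall back to a degraded response or enqueue the work for later, as described above.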
Dead-letter queues (DLQ) and poison messages
After max retries, send to DLQ with context for human/automated remediation. Ensure reprocessing tools can safely replay.
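The sketch below shows the basic routing: retry a message a bounded number of times, then park it in a DLQ with enough context to investigate. The function name and the DLQ entry fields are hypothetical, and backoff between attempts is omitted for brevity.

```python
def process_with_dlq(messages, handler, dlq, max_attempts=3):
    """Try each message up to max_attempts times, then dead-letter it."""
    for message in messages:
        last_error = None
        for _ in range(max_attempts):
            try:
                handler(message)
                break  # handled; move on to the next message
            except Exception as exc:
                last_error = repr(exc)
        else:
            # The loop ended without a break: every attempt failed.
            dlq.append({
                "message": message,
                "attempts": max_attempts,
                "last_error": last_error,
            })
```

A poison message therefore blocks the consumer for at most max_attempts tries instead of forever, and the stored error helps whoever replays the DLQ later.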
Worked examples
Example 1 — Bulk import with partial success
Client uploads 100 contacts. Some rows are invalid; CRM is rate-limiting.
Request: POST /v1/contacts:bulk
Headers: Idempotency-Key: 932a... (UUID)
Body: { "items": [ {"id":"c1","email":"a@x"}, ... ] }
Response (mixed results): 207 Multi-Status
{
  "summary": {"total":100, "succeeded":92, "failed":8},
  "items": [
    {"id":"c1","status":"ok"},
    {"id":"c2","status":"failed","error_code":"INVALID_EMAIL","retry_after_seconds":0},
    {"id":"c3","status":"retry","error_code":"RATE_LIMIT","retry_after_seconds":60}
  ]
}
- Client can retry only the items marked retry after the suggested delay.
- Idempotency-Key guarantees rerunning the same batch won't duplicate records.
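On the client side, the retry step might look like this sketch. The plan_retry helper is hypothetical and assumes the response shape shown in this example.

```python
def plan_retry(request_items, response_body):
    """Return (items to resend, seconds to wait) from a mixed response."""
    retry_entries = [
        entry for entry in response_body["items"]
        if entry["status"] == "retry"
    ]
    retry_ids = {entry["id"] for entry in retry_entries}
    # Wait for the longest suggested delay so every item is eligible.
    delay = max(
        (entry["retry_after_seconds"] for entry in retry_entries),
        default=0,
    )
    to_resend = [item for item in request_items if item["id"] in retry_ids]
    return to_resend, delay
```

Items marked failed are deliberately left out: resending an invalid email would fail the same way.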
Example 2 — Order, payment, inventory (saga)
- Create Order (PENDING) and write Outbox event OrderCreated in same DB transaction.
- Inventory service reserves stock. If it fails, compensate by canceling the order.
- Payment service charges card. If charge succeeds but shipment creation fails, compensate with RefundPayment and ReleaseInventory.
Each step is idempotent by business key (order_id). Compensations check current state before acting, so repeated calls are safe.
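A state-aware compensation can be as simple as the sketch below. The order dict, its field names, and the state values are assumptions; in a real service the check and the update would happen atomically in the owning service's database.

```python
def refund_payment(order):
    """Compensation: refund only if the order is currently charged."""
    if order["payment"] != "CHARGED":
        return False  # never charged, or already refunded: nothing to do
    order["payment"] = "REFUNDED"
    order["refunds_issued"] += 1
    return True


def release_inventory(order):
    """Compensation: release stock only if it is currently reserved."""
    if order["inventory"] != "RESERVED":
        return False
    order["inventory"] = "RELEASED"
    return True
```

Calling either function a second time is a no-op, so a coordinator that retries compensations after a timeout cannot issue a double refund.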
Example 3 — Webhook delivery
- Delivery attempts: 0s, 5s, 30s, 2m (jitter ±20%).
- 3s timeout per attempt; stop after 6 tries; then DLQ with payload and headers.
- Receiver deduplicates by X-Event-ID. Your system treats any 2xx as success, 4xx non-retryable, 5xx retryable.
On each attempt: on a 5xx or a timeout, retry; on a 4xx, stop, mark the delivery as failed, and publish it to the DLQ.
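The delivery rules above can be sketched as follows. Several details are assumptions: the listed delays are treated as the wait before each attempt, the last delay is reused for tries beyond the four listed values, responses outside 2xx and 5xx are treated as non-retryable, and send is a hypothetical callable returning (status_code, timed_out) that enforces the per-attempt timeout itself.

```python
import random
import time

BASE_DELAYS = [0, 5, 30, 120]  # seconds: 0s, 5s, 30s, 2m


def classify(status_code, timed_out=False):
    """Map one attempt's outcome to 'success', 'retry', or 'fail'."""
    if timed_out or status_code >= 500:
        return "retry"
    if 200 <= status_code < 300:
        return "success"
    return "fail"  # 4xx, and (by assumption) anything else unexpected


def jittered(delay, rng=random.uniform):
    """Apply +/-20% jitter to a base delay."""
    return delay * rng(0.8, 1.2)


def deliver(send, max_tries=6, sleep=time.sleep):
    """Return 'success', or 'dlq' when the event should be dead-lettered."""
    for attempt in range(max_tries):
        base = BASE_DELAYS[min(attempt, len(BASE_DELAYS) - 1)]
        sleep(jittered(base))
        outcome = classify(*send())
        if outcome == "success":
            return "success"
        if outcome == "fail":
            return "dlq"  # non-retryable: stop immediately
    return "dlq"  # retries exhausted
```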
Outbox ensures events aren't lost if delivery service crashes.
Decision guide: allow partial success or enforce atomicity?
- Money movement or core invariants: prefer atomic boundaries (single DB transaction). If cross-service, design a saga with strong compensations.
- Fan-out notifications, analytics, emails: allow eventual completion; queue side effects.
- Bulk operations: return per-item results, never all-or-nothing unless required.
- External dependency unstable: degrade gracefully (queue work, 202 Accepted) instead of blocking users.
Implementation checklist
- Define idempotency strategy (key, dedupe store, response replay window).
- Classify errors: retryable vs non-retryable; attach error codes.
- Set bounded retries with backoff + jitter; add timeout budgets.
- Use outbox for side effects; ensure consumers are idempotent.
- Add DLQ and reprocessor tooling; store enough context to replay safely.
- Document response semantics for partial results and how clients should proceed.
- Emit metrics: success_rate, retry_rate, dlq_count, compensation_count.
Common mistakes and self-check
- Mistake: Retrying non-idempotent writes. Fix: require Idempotency-Key and dedupe by business key.
- Mistake: Infinite retries. Fix: bounded retries + DLQ + alerting.
- Mistake: Hiding partial success with 500. Fix: structured per-item statuses with clear error codes.
- Mistake: Compensations that double-apply. Fix: make compensations idempotent and state-aware.
- Mistake: Publishing messages outside the DB transaction. Fix: use outbox or transactional producer.
Self-check prompts
- Can the client safely retry the exact same request without duplication?
- If the publisher crashes after DB commit, is the side effect guaranteed to be sent?
- What lands in DLQ, and how will you reprocess it?
- Which errors are retried, and for how long?
Practical projects
- Build a bulk upsert API that returns per-item status and supports idempotency keys.
- Implement an outbox table and a dispatcher that publishes to a mock queue with retry and DLQ.
- Create a saga for order -> inventory -> payment with compensations and an audit log.
- Develop a webhook sender with exponential backoff, jitter, and dedup at the receiver.
Learning path
- Before this: HTTP status codes and error design; retries/timeouts; basic messaging.
- Now: Handling Partial Failures (this lesson).
- Next: Outbox and CDC, Saga orchestration vs choreography, Idempotent consumers, Observability for workflows.
Exercises
Complete the exercise below. Try it yourself before reading the expected outcome.
Exercise ex1 — Design a robust Create Invoice flow
Your API CreateInvoice writes to the billing DB, calls an external Tax service, and enqueues a webhook to the accounting system. Design for partial failures.
- Define: idempotency strategy, retry policy, timeouts, outbox usage, and response shape on partial success.
- Decide: what to do if DB write succeeds, Tax fails now, and webhook queue is degraded.
Expected outcome structure
- Idempotency-Key header + 24h replay window.
- DB write + Outbox (InvoiceCreated) in one transaction.
- Tax call: 3 retries with backoff/jitter, 2s timeout; on final failure, mark invoice as PENDING_TAX and schedule async retry.
- Webhook enqueue via outbox consumer; if queue down, keep retrying; use DLQ after N attempts.
- API synchronous response: 202 Accepted with per-effect status (db: ok, tax: pending, webhook: pending) and a status_url/job_id.
- Idempotency defined for client retries
- Retry budget and jitter specified
- Clear states for partial completion
- Outbox + DLQ path documented
- Response tells client what to do next
Mini challenge
Your service must upsert profiles in your DB and sync them to two third-party CRMs. One CRM is down for 15 minutes every night. How do you avoid blocking user writes?
Suggested approach
- Write profile + outbox in one transaction; return 202 to caller with operation status.
- Dispatchers push to CRMs with retries/backoff; use circuit breaker to pause the flaky CRM and queue work.
- Expose per-destination status; provide a replay tool for DLQ items.
Next steps
- Refactor one of your existing APIs to accept Idempotency-Key and return per-item statuses.
- Add an outbox table and dispatcher to one service.
- Instrument retry, DLQ, and compensation metrics and set alerts.