Why this matters
As an API Engineer, you build webhooks that power real-time integrations: payments, user provisioning, audits, and alerts. Delivery guarantees determine how much partners trust your platform. Poor guarantees cause duplicate charges, missing notifications, or out-of-order updates. Solid guarantees reduce support load, increase ecosystem reliability, and make your API easy to integrate.
- Real task: Design webhook retries so partners never miss events during outages.
- Real task: Prevent double-processing (e.g., duplicate invoices) using idempotency keys.
- Real task: Provide replay endpoints for partners to recover missed events.
Concept explained simply
Webhooks are outbound HTTP calls from your system to a subscriber's endpoint when something happens. Networks fail. Subscriber apps can be slow or down. Delivery guarantees define what you promise under these conditions.
Mental model
Think of each event as a package with a unique label (event_id). You drop it at a door (subscriber endpoint). If the door doesn't open quickly with a clear "got it" (2xx), you try again later. To avoid double-charging, the receiver should ignore duplicate labels. If packages must be opened in order, you deliver per-door in sequence.
Core delivery models and patterns
- At-most-once: Try once; no retries. Low overhead; risk of loss. Rarely acceptable for critical events.
- At-least-once (most common): Retry until acknowledged (2xx). May deliver duplicates; requires idempotency on receiver and/or sender.
- Exactly-once (practical approach): Achieved through "effectively-once" semantics using idempotency keys + dedup storage + retries. True exactly-once across systems is impractical; aim for idempotent effects.
Key building blocks
- Idempotency key: A unique key per event (e.g., event_id, or a composite of event_id + subscription_id) that stays the same across every retry of that event. Receiver stores processed keys to ignore duplicates.
- Retry policy: Exponential backoff with jitter and a max retry window (e.g., 72h) and dead-lettering after exhaustion.
- Timeouts: Short request timeout (e.g., 5–10s). If the receiver is slow, fail fast and retry.
- Acknowledgment rule: Only 2xx means success. 3xx/4xx/5xx responses and timeouts trigger a retry. Consider treating 410 Gone as the one exception: a permanent unsubscribe that stops retries.
- Ordering: A per-subscription queue gives FIFO delivery. Hold the next event until the current one is acknowledged or dead-lettered; a timeout triggers a retry of the current event, not delivery of the next one. Be ready to relax ordering during replay windows if necessary.
- Security: Sign payloads (HMAC), include timestamp, and rotate secrets. Verify on receiver before acking.
- Durability: Persist events before attempting delivery; track delivery attempts with timestamps and status.
- Replay: Provide a way to request redelivery for a time range or event IDs.
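The security building block above can be sketched as follows. This is a minimal illustration, not a prescribed scheme: the "<timestamp>.<body>" signing format, the 300-second drift window, and the names sign, verify, and MAX_SKEW_SECONDS are all assumptions made for the example.

```python
import hashlib
import hmac
import time
from typing import Optional

# Assumed tolerance for clock drift between sender and receiver.
MAX_SKEW_SECONDS = 300


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Return a hex HMAC-SHA256 over '<timestamp>.<body>' (assumed format)."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Reject stale timestamps, then compare signatures in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # outside the allowed drift window
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)
```

Including the timestamp in the signed message means an attacker cannot reuse an old signature with a fresh timestamp. The receiver should call verify before acknowledging the delivery.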
Worked examples
Example 1 — Payment succeeded webhook
Goal: At-least-once delivery without double-charging.
- Sender persists event {id: evt_123, type: payment.succeeded}.
- Sender POSTs to subscriber with headers: X-Event-Id: evt_123, X-Signature: hmac, X-Delivery-Attempt: 1.
- Timeout at 8s; if not 2xx, schedule retry with backoff + jitter.
- Receiver verifies signature and checks idempotency store for evt_123. If new, applies business logic; if seen, responds 200 without side effects.
Result: Duplicate deliveries cause no duplicate charges; eventual success is achieved.
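A minimal sketch of the receiver side of this example, assuming an in-memory set as a stand-in for a durable idempotency store. The Receiver class, its handle method, and the "amount" payload field are hypothetical names for illustration; signature verification is omitted here.

```python
class Receiver:
    def __init__(self) -> None:
        self.processed = set()  # stand-in for a durable idempotency store
        self.invoices = []      # stand-in for the business side effect

    def handle(self, event_id: str, payload: dict) -> int:
        """Return the HTTP status the receiver would send back."""
        if event_id in self.processed:
            return 200  # duplicate: acknowledge without side effects
        self.invoices.append(payload["amount"])  # apply business logic once
        self.processed.add(event_id)
        return 200


receiver = Receiver()
receiver.handle("evt_123", {"amount": 500})
receiver.handle("evt_123", {"amount": 500})  # retry of the same event
print(len(receiver.invoices))  # one invoice despite two deliveries
```

In a real receiver, the side effect and the recording of the key would ideally commit in the same transaction, so a crash between the two cannot cause a duplicate or a lost event.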
Example 2 — Ordering per subscription
Events: user.updated (E1), user.deleted (E2). Both for subscription S1.
- S1 has a per-subscription queue. E1 is in-flight. E2 waits.
- If E1 times out, record the attempt and retry later. Do not send E2 until E1 is acknowledged, or until E1 exhausts its max attempts and is moved to dead-letter.
- After E1 is 2xx, send E2.
Result: Subscriber sees E1 before E2, preserving semantics.
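The head-of-line rule in this example can be sketched as below. It is an in-memory illustration under stated assumptions: SubscriptionQueues, deliver_next, and the send callback (which returns True on a 2xx) are hypothetical, and a real sender would persist queues and schedule retries with backoff rather than retrying on the next call.

```python
from collections import defaultdict, deque


class SubscriptionQueues:
    def __init__(self, max_attempts: int = 3) -> None:
        self.queues = defaultdict(deque)   # subscription id -> FIFO of event ids
        self.attempts = defaultdict(int)   # (subscription id, event id) -> attempts
        self.dead_letter = []
        self.max_attempts = max_attempts

    def enqueue(self, sub_id: str, event_id: str) -> None:
        self.queues[sub_id].append(event_id)

    def deliver_next(self, sub_id: str, send):
        """Attempt only the head event; return its id if acked, else None."""
        queue = self.queues[sub_id]
        if not queue:
            return None
        event_id = queue[0]
        key = (sub_id, event_id)
        self.attempts[key] += 1
        if send(event_id):
            queue.popleft()  # acked: the next event becomes eligible
            return event_id
        if self.attempts[key] >= self.max_attempts:
            self.dead_letter.append(key)  # give up and unblock the queue
            queue.popleft()
        return None
```

Because only the head of each queue is ever attempted, E2 cannot overtake E1 for S1, while other subscriptions keep flowing independently.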
Example 3 — Handling 410 Gone
Subscriber returns 410 Gone to signal permanent endpoint removal.
- On 410, sender stops retries for that subscription and marks as unsubscribed.
- Optionally send an internal alert and archive pending events for this subscriber.
Result: No wasted retries; clean unsubscribe flow.
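One way to express the acknowledgment rule together with the 410 exception is a small classifier like the sketch below. The Outcome enum and classify function are hypothetical names; None is assumed to represent a timeout or connection failure.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    ACKED = "acked"
    RETRY = "retry"
    UNSUBSCRIBE = "unsubscribe"


def classify(status: Optional[int]) -> Outcome:
    """Map an HTTP status (None = timeout or connection error) to an action."""
    if status is None:
        return Outcome.RETRY
    if 200 <= status < 300:
        return Outcome.ACKED
    if status == 410:
        return Outcome.UNSUBSCRIBE  # permanent removal: stop retrying
    return Outcome.RETRY            # every other 3xx/4xx/5xx retries


print(classify(204), classify(410), classify(503), classify(None))
```

Keeping this decision in one function makes the rule easy to document for partners and easy to test.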
Design checklist
- Events are durably stored before first delivery attempt.
- Clear 2xx-only ack rule; other statuses retry.
- Exponential backoff with jitter, max window defined (e.g., 72h).
- Per-subscription FIFO queues and concurrency limits.
- Stable idempotency key and dedup storage on receiver or sender.
- HMAC signatures with timestamp and allowable drift window.
- Delivery logs with attempt count, last error, and next attempt time.
- Replay endpoint and dead-letter review tooling.
Exercises
Complete these to solidify your understanding.
Exercise 1 — Define an at-least-once spec with idempotency
Write a short delivery spec for a critical webhook (e.g., order.created) that includes: ack rule, retry/backoff schedule, idempotency key format, security headers, and max retry window. Aim for 8–10 sentences or a concise bullet list.
Exercise 2 — Backoff schedule with jitter
Design a function that yields next retry timestamps for attempts 1–7 using exponential backoff (base 2), capped at 30 minutes, and adds ±20% jitter. Show a sample schedule for a failure starting at 12:00:00.
Need a nudge?
- For Exercise 1: Start from 2xx-only ack; use event_id + subscription_id as idempotency key.
- For Exercise 2: Backoff delay for attempt n = min(30m, 2^(n-1) * 15s); jitter = a random factor in [-20%, +20%] applied to that delay.
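If the Exercise 2 hint is hard to picture, here is a rough starting point rather than a full solution. It assumes each retry is scheduled relative to the previous attempt and that jitter is applied after the cap (so a jittered delay can exceed 30 minutes by up to 20%); backoff_seconds and schedule are hypothetical names.

```python
import random
from datetime import datetime, timedelta
from typing import List

BASE_SECONDS = 15        # first retry delay from the hint
CAP_SECONDS = 30 * 60    # 30-minute cap
JITTER = 0.20            # plus or minus 20%


def backoff_seconds(attempt: int, rng: random.Random) -> float:
    """Delay before retry number `attempt` (1-based), with jitter applied."""
    base = min(CAP_SECONDS, BASE_SECONDS * 2 ** (attempt - 1))
    return base * rng.uniform(1 - JITTER, 1 + JITTER)


def schedule(start: datetime, attempts: int, rng: random.Random) -> List[datetime]:
    """Retry timestamps, each offset from the previous attempt."""
    times, current = [], start
    for n in range(1, attempts + 1):
        current += timedelta(seconds=backoff_seconds(n, rng))
        times.append(current)
    return times


for t in schedule(datetime(2024, 1, 1, 12, 0, 0), 7, random.Random()):
    print(t.strftime("%H:%M:%S"))
```

Passing the random generator in as a parameter keeps the function testable: a seeded random.Random gives a reproducible schedule.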
Common mistakes and self-check
- Mistake: Treating 3xx or 4xx as success. Self-check: Only 2xx should stop retries.
- Mistake: No idempotency key. Self-check: Can the receiver safely ignore duplicates using a single stable key?
- Mistake: Infinite retries. Self-check: Do you have a clear max window with dead-letter?
- Mistake: Global ordering. Self-check: Are you enforcing ordering per subscription, not globally?
- Mistake: Long timeouts. Self-check: Are timeouts short enough to keep queues moving (e.g., 5–10s)?
- Mistake: Unverified signatures. Self-check: Do you verify HMAC and timestamp before processing?
Who this is for
- API Engineers building outbound integrations
- Backend developers responsible for event delivery
- Platform/SRE engineers improving reliability of webhooks
Prerequisites
- HTTP basics (methods, headers, status codes)
- Familiarity with message queues or background workers
- JSON and HMAC understanding
Practical projects
- Build a minimal webhook sender with: durable event store, per-subscription queues, retries with backoff, and HMAC signatures.
- Implement a receiver that verifies signatures and enforces idempotency, and add a /replay endpoint on the sender so the receiver can request redelivery of specific event IDs.
- Create a dead-letter review page to retry or discard failed deliveries safely.
Learning path
- Start: Implement basic at-least-once with retries and 2xx-only ack.
- Add: Idempotency keys and dedup store on the receiver.
- Harden: Signatures, timestamps, and clock-skew checks.
- Scale: Per-subscription FIFO queues and concurrency controls.
- Operate: Delivery logs, dead-letter queues, and replay tools.
Next steps
- Extend to multi-tenant throttling to protect subscriber endpoints.
- Introduce event versioning and schema evolution in payloads.
- Document operational runbooks for incidents and replays.
Mini challenge
Design how your system should behave if a subscriber is down for 48 hours and then comes back. Specify max retry window, replay policy, and how you prevent flooding them on recovery.