Why this matters
In production, ML predictions fail for many reasons: brief network blips, overloaded services (HTTP 429), upstream outages (5xx), or slow models. A Machine Learning Engineer must design clients and services that handle these gracefully: retry safely, cap latency with timeouts, avoid duplicate charges or actions, and provide clear error messages for users and logs for ops.
- Keep SLAs: return predictions within agreed latency or fail fast with a clear error.
- Protect systems: avoid thundering herds by using backoff and jitter.
- Preserve correctness: ensure POST requests are idempotent when retried.
- Improve debuggability: log correlation IDs, attempts, and error categories.
Concept explained simply
Two big categories of failures:
- Transient: likely to succeed if tried again soon (e.g., temporary network issues, 429 rate limit, some 5xx errors).
- Persistent: will keep failing until something changes (e.g., 4xx client errors, invalid input schema, auth failures).
Retry policy says when and how to retry: which errors, how many attempts, delays between attempts, and when to stop.
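As a sketch, a retry policy can start as a simple classification function. The exact sets below are illustrative, not a standard; tune them to your service's contract.

TRANSIENT_STATUS = {429, 502, 503, 504}        # rate limits, bad gateways, overload
PERSISTENT_STATUS = {400, 401, 403, 404, 422}  # bad input, auth, missing resource

def is_retriable(status_code: int) -> bool:
    if status_code in TRANSIENT_STATUS:
        return True
    if status_code in PERSISTENT_STATUS:
        return False
    # Unknown 5xx: lean toward retrying; unknown 4xx: don't.
    return status_code >= 500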
Mental model
Picture a traffic light:
- Green: Safe to retry (timeouts, 429 with Retry-After, 5xx like 502/503/504).
- Yellow: Maybe retry with care (intermittent 500, gRPC UNAVAILABLE). Use limits and idempotency.
- Red: Do not retry (400, 401, 403, 404, validation errors). Fix input or credentials.
Surround each request with guardrails:
- Timeouts per request to cap worst-case latency.
- Exponential backoff + jitter to spread retries and reduce load spikes.
- Idempotency for side-effecting operations so duplicates don’t cause double actions.
- Retry budget: cap total extra traffic caused by retries (a token-bucket sketch follows this list).
- Circuit breaker: if downstream is consistently failing, stop hammering it and fail fast for a short period.
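Of these guardrails, the retry budget is the least familiar. One common shape is a token bucket: each retry spends a token, and tokens refill slowly, so retries can never exceed a fixed fraction of normal traffic. A minimal sketch; the capacity and refill rate are illustrative, not recommendations:

import time

class RetryBudget:
    # Token bucket: each retry spends a token; tokens refill at a steady rate,
    # capping retry traffic at roughly `refill_per_s` extra requests/second.
    def __init__(self, capacity=10, refill_per_s=0.5):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_s = refill_per_s
        self.last = time.time()

    def try_spend(self):
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # budget exhausted: skip the retry and fail fast

Consult the budget before each retry; when try_spend() returns False, surface the error instead of sleeping and trying again.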
Key building blocks
- Status codes: Retry 429, 502, 503, 504; avoid retrying 400, 401, 403, 404 (unless the service uses 404 for eventual-consistency cases).
- Timeouts: Use per-call deadlines (e.g., 1–3s for online inference) rather than global timeouts.
- Backoff: Exponential (e.g., base=200ms: 0.2s, 0.4s, 0.8s...) plus random jitter to avoid synchronized retries (sketched in code after this list).
- Max attempts: Typically 2–5; balance success odds vs. extra load and latency.
- Idempotency keys: For POST, include a unique key (e.g., UUID) and let the server deduplicate.
- Rate limiting: Respect Retry-After if present.
- Observability: Log correlation/request IDs, attempt number, status code, latency, and final outcome.
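The backoff bullet is easiest to see in code. A minimal full-jitter helper; the 5-second cap is an assumption you should tune:

import random

def backoff_delay(attempt, base=0.2, cap=5.0):
    # Exponential ceiling per attempt: 0.2s, 0.4s, 0.8s, ... capped at `cap`.
    exp = min(cap, base * (2 ** (attempt - 1)))
    # Full jitter: sample uniformly in [0, exp] so clients don't retry in lockstep.
    return random.uniform(0, exp)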
Worked examples
Example 1: REST client with exponential backoff, jitter, and idempotency
import time, uuid, random, requests

RETRIABLE_STATUS = {429, 500, 502, 503, 504}

def predict_with_retry(url, payload, max_attempts=4, base_delay=0.2, timeout=2.0):
    idem_key = str(uuid.uuid4())  # prevent duplicate side effects on server
    headers = {"Idempotency-Key": idem_key, "Content-Type": "application/json"}
    attempt = 1
    while True:
        start = time.time()
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=timeout)
            latency = time.time() - start
            if resp.status_code < 400:
                print(f"OK attempt={attempt} latency={latency:.3f}s")
                return resp.json()
            # Decide if retriable
            if resp.status_code in RETRIABLE_STATUS:
                # Honor Retry-After if present (assumes the delta-seconds form)
                retry_after = float(resp.headers.get("Retry-After") or 0)
            else:
                # Non-retriable
                raise RuntimeError(f"Non-retriable {resp.status_code}: {resp.text}")
        except requests.exceptions.Timeout:
            latency = time.time() - start
            print(f"Timeout attempt={attempt} latency={latency:.3f}s")
            retry_after = 0
        except requests.exceptions.RequestException as e:
            # Network error - treat as transient
            print(f"Network error attempt={attempt}: {e}")
            retry_after = 0
        # Retry path
        attempt += 1
        if attempt > max_attempts:
            raise RuntimeError("Gave up after retries")
        # Exponential backoff with full jitter
        exp_delay = base_delay * (2 ** (attempt - 2))
        delay = max(retry_after, random.uniform(0, exp_delay))
        print(f"Retrying in {delay:.2f}s (attempt {attempt}/{max_attempts})")
        time.sleep(delay)
Notes: per-call timeout, selective retries, jitter, and honoring Retry-After. Idempotency key ensures safe POST retries.
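A quick usage sketch; the endpoint URL is hypothetical:

try:
    result = predict_with_retry("https://models.example.com/predict",  # hypothetical endpoint
                                {"features": [1.2, 3.4]})
except RuntimeError as e:
    print(f"Prediction failed permanently: {e}")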
Example 2: gRPC with deadlines and selective retries (conceptual)
# Pseudocode illustrating gRPC concepts
import time, random
import grpc

RETRIABLE_CODES = {grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.DEADLINE_EXCEEDED}

def grpc_call_with_retry(stub, request, max_attempts=4, base_delay=0.1, deadline_s=2.0):
    attempt = 1
    while True:
        try:
            # Per-RPC deadline
            response = stub.Predict(request, timeout=deadline_s)
            return response
        except grpc.RpcError as e:
            code = e.code()
            if code not in RETRIABLE_CODES or attempt >= max_attempts:
                raise
            attempt += 1
            delay = random.uniform(0, base_delay * (2 ** (attempt - 2)))
            time.sleep(delay)
Key ideas: per-RPC timeout (deadline), retry only on UNAVAILABLE/DEADLINE_EXCEEDED, and exponential backoff with jitter.
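Hand-rolled loops like this work anywhere, but gRPC also supports declarative retry policies through the channel's service config. A minimal sketch, assuming a reasonably recent grpcio; the service name, target, and thresholds are placeholders:

import json
import grpc

# "inference.PredictionService" is a placeholder; use your .proto's package.Service.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "inference.PredictionService"}],
        "retryPolicy": {
            "maxAttempts": 4,
            "initialBackoff": "0.1s",
            "maxBackoff": "2s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})
channel = grpc.insecure_channel("localhost:50051",
                                options=[("grpc.service_config", service_config)])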
Example 3: Batch inference with partial failures and deduplication
# Process items in chunks; retry transient failures; record successes to avoid duplicates
import time, random

RETRIABLE = {"timeout", "rate_limit", "temp_5xx"}
seen_ids = set()  # simplistic dedup store

def process_chunk(items, call_predict, max_attempts=3, base_delay=0.2):
    results = {}
    for item in items:
        if item["id"] in seen_ids:
            continue
        attempt = 1
        while True:
            ok, value_or_err = call_predict(item)
            if ok:
                results[item["id"]] = value_or_err
                seen_ids.add(item["id"])  # dedup
                break
            else:
                err = value_or_err
                if err in RETRIABLE and attempt < max_attempts:
                    delay = random.uniform(0, base_delay * (2 ** (attempt - 1)))
                    time.sleep(delay)
                    attempt += 1
                else:
                    results[item["id"]] = {"error": err}
                    break
    return results
Chunking, selective retries, and a dedup record prevent reprocessing items after success.
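A usage sketch with a simulated flaky predictor; the failure mix is made up for illustration:

import random

def flaky_predict(item):
    # Simulated backend: transient failures roughly 30% of the time.
    roll = random.random()
    if roll < 0.2:
        return False, "timeout"
    if roll < 0.3:
        return False, "rate_limit"
    return True, {"score": 0.9}

items = [{"id": i, "features": [i]} for i in range(5)]
print(process_chunk(items, flaky_predict))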
Example 4: Simple circuit breaker (conceptual)
import time

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_after_s=30):
        self.failures = 0
        self.open_until = 0
        self.fail_threshold = fail_threshold
        self.reset_after_s = reset_after_s

    def allow(self):
        return time.time() >= self.open_until

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.fail_threshold:
            self.open_until = time.time() + self.reset_after_s

cb = CircuitBreaker()

def guarded_call(fn):
    if not cb.allow():
        raise RuntimeError("Circuit open; fail fast")
    try:
        result = fn()
        cb.record_success()
        return result
    except Exception:
        cb.record_failure()
        raise
When many consecutive failures occur, the breaker opens, preventing load spikes on an unhealthy dependency.
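The breaker composes naturally with the retry client from Example 1; a sketch, again with a hypothetical URL:

prediction = guarded_call(
    lambda: predict_with_retry("https://models.example.com/predict",  # hypothetical endpoint
                               {"features": [1.2, 3.4]})
)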
Step-by-step: Build a robust prediction client
- Define error categories: Decide which status codes or exceptions are retriable.
- Set per-call timeouts: Choose a deadline that fits your SLO (e.g., 95th percentile target).
- Add exponential backoff + jitter: Start small (100–300ms), double each attempt, add randomness.
- Cap attempts and total latency: E.g., 3–4 attempts, and a maximum overall time budget (see the sketch after these steps).
- Use idempotency keys for POST; design the server to deduplicate.
- Honor rate limits: Respect Retry-After; fail fast if budget is exceeded.
- Log and tag: request ID, attempt number, latency, status, final disposition.
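The total time budget from step 4 can wrap any retrying call. A minimal sketch; fn and its timeout keyword are assumptions about your client's interface:

import time, random

def call_with_time_budget(fn, total_budget_s=1.5, base_delay=0.1):
    # Assumed interface: fn(timeout=...) raises TimeoutError on a slow call.
    deadline = time.time() + total_budget_s
    attempt = 1
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            raise RuntimeError("Total time budget exhausted")
        try:
            return fn(timeout=remaining)  # never exceed what's left of the budget
        except TimeoutError:
            attempt += 1
            delay = random.uniform(0, base_delay * (2 ** (attempt - 2)))
            # Sleeping also consumes the budget; the loop check above enforces it.
            time.sleep(min(delay, max(0, deadline - time.time())))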
Exercises
Do these before the quick test. Tip: keep your retries bounded and your logs clear.
- Exercise ex1: Implement predict_with_retry for a REST model endpoint with timeouts, exponential backoff + jitter, and idempotency keys. Retry on 429/5xx/timeouts; honor Retry-After; stop after 4 attempts.
- Exercise ex2: Given a set of error logs, decide which should be retried and what the next delay should be, assuming base delay 200ms and attempt #2 (i.e., use up to 400ms backoff with jitter).
Exercise checklist
- Per-call timeout is set and enforced for each request.
- Only transient errors are retried; 4xx (except 429) are not retried.
- Exponential backoff doubles per attempt; jitter is applied.
- Max attempts or total time budget enforced.
- Idempotency key included for POST requests.
- Logs include attempt number, status/error, and latency.
Common mistakes (and self-check)
Retrying non-idempotent POST without safeguards
Risk: duplicate side effects (double billing, double writes). Self-check: do you include an Idempotency-Key and does the server deduplicate?
No jitter in backoff
Risk: synchronized retries causing load spikes. Self-check: is the delay randomized per attempt?
Global timeout instead of per-call deadline
Risk: long hangs and unpredictable latency. Self-check: does every call specify a timeout?
Unlimited retries
Risk: cascading failures and cost blowups. Self-check: are max attempts and retry budgets set?
Ignoring 429 Retry-After
Risk: immediate rejections and wasted attempts. Self-check: do you parse and honor Retry-After?
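On that last pitfall: Retry-After can carry either a number of seconds or an HTTP date, so a small parser helps. A sketch using the standard library:

from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(header_value):
    # Returns seconds to wait, or 0 if the header is absent or unparseable.
    if not header_value:
        return 0.0
    try:
        return max(0.0, float(header_value))          # delta-seconds form
    except ValueError:
        try:
            dt = parsedate_to_datetime(header_value)  # HTTP-date form
            return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
        except (TypeError, ValueError):
            return 0.0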
Mini challenge
Your online inference SLA is P95 < 800ms. Propose a retry plan for a single POST /predict call: timeout per attempt, number of attempts, and backoff pattern. Justify choices to meet SLA while maximizing success rate. Write your plan, then compare it to the examples above.
Who this is for
- Machine Learning Engineers deploying real-time or batch inference services.
- Data/Platform Engineers integrating models into production systems.
- Backend Engineers consuming ML APIs and needing robust client logic.
Prerequisites
- Basic HTTP or gRPC knowledge (status codes, request/response).
- Comfort with Python or a similar language.
- Understanding of your model’s latency profile and SLA goals.
Learning path
- Start: Understand error categories and timeouts.
- Implement: Add exponential backoff + jitter and idempotency to POSTs.
- Harden: Respect rate limits, add retry budgets, logging, and correlation IDs.
- Scale: Introduce circuit breaking and partial-failure handling for batch/streaming.
- Validate: Load test, check P95/P99, and tune limits.
Practical projects
- Wrap your model’s REST client with retries, timeouts, and idempotency. Measure success rate and latency distribution before/after.
- Implement a simple circuit breaker and show how it reduces downstream error storms in a simulated outage.
- Batch inference pipeline: chunk inputs, retry transient failures only, deduplicate successful items, and produce a final report of successes/errors.
Next steps
- Run the exercises above and verify against the checklist.
- Take the Quick Test to check understanding. Note: anyone can take it; only logged-in users get saved progress.
- Apply these patterns to one real service you own. Start small: add timeouts and basic backoff, then iterate.