Why this matters
In production, ML predictions fail for many reasons: brief network blips, overloaded services (HTTP 429), upstream outages (5xx), or slow models. A Machine Learning Engineer must design clients and services that handle these gracefully: retry safely, cap latency with timeouts, avoid duplicate charges or actions, and provide clear error messages for users and logs for ops.
- Keep SLAs: return predictions within agreed latency or fail fast with a clear error.
- Protect systems: avoid thundering herds by using backoff and jitter.
- Preserve correctness: ensure POST requests are idempotent when retried.
- Improve debuggability: log correlation IDs, attempts, and error categories.
Concept explained simply
Two big categories of failures:
- Transient: likely to succeed if tried again soon (e.g., temporary network issues, 429 rate limit, some 5xx errors).
- Persistent: will keep failing until something changes (e.g., 4xx client errors, invalid input schema, auth failures).
Retry policy says when and how to retry: which errors, how many attempts, delays between attempts, and when to stop.
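As a sketch, a retry policy can start as a simple classification function. The exact sets below are illustrative, not a standard; tune them to your service's contract.

TRANSIENT_STATUS = {429, 502, 503, 504}        # rate limits, bad gateways, overload
PERSISTENT_STATUS = {400, 401, 403, 404, 422}  # bad input, auth, missing resource

def is_retriable(status_code: int) -> bool:
    if status_code in TRANSIENT_STATUS:
        return True
    if status_code in PERSISTENT_STATUS:
        return False
    # Unknown 5xx: lean toward retrying; unknown 4xx: don't.
    return status_code >= 500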
Mental model
Picture a traffic light:
- Green: Safe to retry (timeouts, 429 with Retry-After, 5xx like 502/503/504).
- Yellow: Maybe retry with care (intermittent 500, gRPC UNAVAILABLE). Use limits and idempotency.
- Red: Do not retry (400, 401, 403, 404, validation errors). Fix input or credentials.
Surround each request with guardrails:
- Timeouts per request to cap worst-case latency.
- Exponential backoff + jitter to spread retries and reduce load spikes.
- Idempotency for side-effecting operations so duplicates don’t cause double actions.
- Retry budget: cap total extra traffic caused by retries (a token-bucket sketch follows this list).
- Circuit breaker: if downstream is consistently failing, stop hammering it and fail fast for a short period.
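Of these guardrails, the retry budget is the least familiar. One common shape is a token bucket: each retry spends a token, and tokens refill slowly, so retries can never exceed a fixed fraction of normal traffic. A minimal sketch; the capacity and refill rate are illustrative, not recommendations:

import time

class RetryBudget:
    # Token bucket: each retry spends a token; tokens refill at a steady rate,
    # capping retry traffic at roughly `refill_per_s` extra requests/second.
    def __init__(self, capacity=10, refill_per_s=0.5):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_s = refill_per_s
        self.last = time.time()

    def try_spend(self):
        now = time.time()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # budget exhausted: skip the retry and fail fast

Consult the budget before each retry; when try_spend() returns False, surface the error instead of sleeping and trying again.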
Key building blocks
- Status codes: Retry 429, 502, 503, 504; avoid retrying 400, 401, 403, 404 (unless the service uses 404 for eventual-consistency cases).
- Timeouts: Use per-call deadlines (e.g., 1–3s for online inference) rather than global timeouts.
- Backoff: Exponential (e.g., base=200ms: 0.2s, 0.4s, 0.8s...) plus random jitter to avoid synchronized retries (sketched in code after this list).
- Max attempts: Typically 2–5; balance success odds vs. extra load and latency.
- Idempotency keys: For POST, include a unique key (e.g., UUID) and let the server deduplicate.
- Rate limiting: Respect Retry-After if present.
- Observability: Log correlation/request IDs, attempt number, status code, latency, and final outcome.
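The backoff bullet is easiest to see in code. A minimal full-jitter helper; the 5-second cap is an assumption you should tune:

import random

def backoff_delay(attempt, base=0.2, cap=5.0):
    # Exponential ceiling per attempt: 0.2s, 0.4s, 0.8s, ... capped at `cap`.
    exp = min(cap, base * (2 ** (attempt - 1)))
    # Full jitter: sample uniformly in [0, exp] so clients don't retry in lockstep.
    return random.uniform(0, exp)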
Worked examples
Example 1: REST client with exponential backoff, jitter, and idempotency
import time, uuid, random, requests

RETRIABLE_STATUS = {429, 500, 502, 503, 504}

def predict_with_retry(url, payload, max_attempts=4, base_delay=0.2, timeout=2.0):
    idem_key = str(uuid.uuid4())  # prevent duplicate side effects on server
    headers = {"Idempotency-Key": idem_key, "Content-Type": "application/json"}
    attempt = 1
    while True:
        start = time.time()
        try:
            resp = requests.post(url, json=payload, headers=headers, timeout=timeout)
            latency = time.time() - start
            if resp.status_code < 400:
                print(f"OK attempt={attempt} latency={latency:.3f}s")
                return resp.json()
            # Decide if retriable
            if resp.status_code in RETRIABLE_STATUS:
                # Honor Retry-After if present (assumes the delta-seconds form)
                retry_after = float(resp.headers.get("Retry-After") or 0)
            else:
                # Non-retriable
                raise RuntimeError(f"Non-retriable {resp.status_code}: {resp.text}")
        except requests.exceptions.Timeout:
            latency = time.time() - start
            print(f"Timeout attempt={attempt} latency={latency:.3f}s")
            retry_after = 0
        except requests.exceptions.RequestException as e:
            # Network error - treat as transient
            print(f"Network error attempt={attempt}: {e}")
            retry_after = 0
        # Retry path
        attempt += 1
        if attempt > max_attempts:
            raise RuntimeError("Gave up after retries")
        # Exponential backoff with full jitter
        exp_delay = base_delay * (2 ** (attempt - 2))
        delay = max(retry_after, random.uniform(0, exp_delay))
        print(f"Retrying in {delay:.2f}s (attempt {attempt}/{max_attempts})")
        time.sleep(delay)
Notes: per-call timeout, selective retries, jitter, and honoring Retry-After. Idempotency key ensures safe POST retries.
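A quick usage sketch; the endpoint URL is hypothetical:

try:
    result = predict_with_retry("https://models.example.com/predict",  # hypothetical endpoint
                                {"features": [1.2, 3.4]})
except RuntimeError as e:
    print(f"Prediction failed permanently: {e}")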
Example 2: gRPC with deadlines and selective retries (conceptual)
# Pseudocode illustrating gRPC concepts
import time, random
import grpc

RETRIABLE_CODES = {grpc.StatusCode.UNAVAILABLE, grpc.StatusCode.DEADLINE_EXCEEDED}

def grpc_call_with_retry(stub, request, max_attempts=4, base_delay=0.1, deadline_s=2.0):
    attempt = 1
    while True:
        try:
            # Per-RPC deadline
            response = stub.Predict(request, timeout=deadline_s)
            return response
        except grpc.RpcError as e:
            code = e.code()
            if code not in RETRIABLE_CODES or attempt >= max_attempts:
                raise
            attempt += 1
            delay = random.uniform(0, base_delay * (2 ** (attempt - 2)))
            time.sleep(delay)
Key ideas: per-RPC timeout (deadline), retry only on UNAVAILABLE/DEADLINE_EXCEEDED, and exponential backoff with jitter.
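Hand-rolled loops like this work anywhere, but gRPC also supports declarative retry policies through the channel's service config. A minimal sketch, assuming a reasonably recent grpcio; the service name, target, and thresholds are placeholders:

import json
import grpc

# "inference.PredictionService" is a placeholder; use your .proto's package.Service.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "inference.PredictionService"}],
        "retryPolicy": {
            "maxAttempts": 4,
            "initialBackoff": "0.1s",
            "maxBackoff": "2s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})
channel = grpc.insecure_channel("localhost:50051",
                                options=[("grpc.service_config", service_config)])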
Example 3: Batch inference with partial failures and deduplication
# Process items in chunks; retry transient failures; record successes to avoid duplicates
import time, random

RETRIABLE = {"timeout", "rate_limit", "temp_5xx"}
seen_ids = set()  # simplistic dedup store

def process_chunk(items, call_predict, max_attempts=3, base_delay=0.2):
    results = {}
    for item in items:
        if item["id"] in seen_ids:
            continue
        attempt = 1
        while True:
            ok, value_or_err = call_predict(item)
            if ok:
                results[item["id"]] = value_or_err
                seen_ids.add(item["id"])  # dedup
                break
            else:
                err = value_or_err
                if err in RETRIABLE and attempt < max_attempts:
                    delay = random.uniform(0, base_delay * (2 ** (attempt - 1)))
                    time.sleep(delay)
                    attempt += 1
                else:
                    results[item["id"]] = {"error": err}
                    break
    return results
Chunking, selective retries, and a dedup record prevent reprocessing items after success.
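A usage sketch with a simulated flaky predictor; the failure mix is made up for illustration:

import random

def flaky_predict(item):
    # Simulated backend: transient failures roughly 30% of the time.
    roll = random.random()
    if roll < 0.2:
        return False, "timeout"
    if roll < 0.3:
        return False, "rate_limit"
    return True, {"score": 0.9}

items = [{"id": i, "features": [i]} for i in range(5)]
print(process_chunk(items, flaky_predict))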
Example 4: Simple circuit breaker (conceptual)
import time

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_after_s=30):
        self.failures = 0
        self.open_until = 0
        self.fail_threshold = fail_threshold
        self.reset_after_s = reset_after_s

    def allow(self):
        return time.time() >= self.open_until

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.fail_threshold:
            self.open_until = time.time() + self.reset_after_s

cb = CircuitBreaker()

def guarded_call(fn):
    if not cb.allow():
        raise RuntimeError("Circuit open; fail fast")
    try:
        result = fn()
        cb.record_success()
        return result
    except Exception:
        cb.record_failure()
        raise
When many consecutive failures occur, the breaker opens, preventing load spikes on an unhealthy dependency.
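The breaker composes naturally with the retry client from Example 1; a sketch, again with a hypothetical URL:

prediction = guarded_call(
    lambda: predict_with_retry("https://models.example.com/predict",  # hypothetical endpoint
                               {"features": [1.2, 3.4]})
)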
Step-by-step: Build a robust prediction client
- Define error categories: Decide which status codes or exceptions are retriable.
- Set per-call timeouts: Choose a deadline that fits your SLO (e.g., 95th percentile target).
- Add exponential backoff + jitter: Start small (100–300ms), double each attempt, add randomness.
- Cap attempts and total latency: E.g., 3–4 attempts, and a maximum overall time budget (see the sketch after these steps).
- Use idempotency keys for POST; design the server to deduplicate.
- Honor rate limits: Respect Retry-After; fail fast if budget is exceeded.
- Log and tag: request ID, attempt number, latency, status, final disposition.
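The total time budget from step 4 can wrap any retrying call. A minimal sketch; fn and its timeout keyword are assumptions about your client's interface:

import time, random

def call_with_time_budget(fn, total_budget_s=1.5, base_delay=0.1):
    # Assumed interface: fn(timeout=...) raises TimeoutError on a slow call.
    deadline = time.time() + total_budget_s
    attempt = 1
    while True:
        remaining = deadline - time.time()
        if remaining <= 0:
            raise RuntimeError("Total time budget exhausted")
        try:
            return fn(timeout=remaining)  # never exceed what's left of the budget
        except TimeoutError:
            attempt += 1
            delay = random.uniform(0, base_delay * (2 ** (attempt - 2)))
            # Sleeping also consumes the budget; the loop check above enforces it.
            time.sleep(min(delay, max(0, deadline - time.time())))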
Exercises
Do these before the quick test. Tip: keep your retries bounded and your logs clear.
- Exercise ex1: Implement predict_with_retry for a REST model endpoint with timeouts, exponential backoff + jitter, and idempotency keys. Retry on 429/5xx/timeouts; honor Retry-After; stop after 4 attempts.
- Exercise ex2: Given a set of error logs, decide which should be retried and what the next delay should be, assuming base delay 200ms and attempt #2 (i.e., use up to 400ms backoff with jitter).
Exercise checklist
- Per-call timeout is set and enforced for each request.
- Only transient errors are retried; 4xx (except 429) are not retried.
- Exponential backoff doubles per attempt; jitter is applied.
- Max attempts or total time budget enforced.
- Idempotency key included for POST requests.
- Logs include attempt number, status/error, and latency.
Common mistakes (and self-check)
Retrying non-idempotent POST without safeguards
Risk: duplicate side effects (double billing, double writes). Self-check: do you include an Idempotency-Key and does the server deduplicate?
No jitter in backoff
Risk: synchronized retries causing load spikes. Self-check: is the delay randomized per attempt?
Global timeout instead of per-call deadline
Risk: long hangs and unpredictable latency. Self-check: does every call specify a timeout?
Unlimited retries
Risk: cascading failures and cost blowups. Self-check: are max attempts and retry budgets set?
Ignoring 429 Retry-After
Risk: immediate rejections and wasted attempts. Self-check: do you parse and honor Retry-After?
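On that last pitfall: Retry-After can carry either a number of seconds or an HTTP date, so a small parser helps. A sketch using the standard library:

from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(header_value):
    # Returns seconds to wait, or 0 if the header is absent or unparseable.
    if not header_value:
        return 0.0
    try:
        return max(0.0, float(header_value))          # delta-seconds form
    except ValueError:
        try:
            dt = parsedate_to_datetime(header_value)  # HTTP-date form
            return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
        except (TypeError, ValueError):
            return 0.0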
Mini challenge
Your online inference SLA is P95 < 800ms. Propose a retry plan for a single POST /predict call: timeout per attempt, number of attempts, and backoff pattern. Justify choices to meet SLA while maximizing success rate. Write your plan, then compare it to the examples above.
Who this is for
- Machine Learning Engineers deploying real-time or batch inference services.
- Data/Platform Engineers integrating models into production systems.
- Backend Engineers consuming ML APIs and needing robust client logic.
Prerequisites
- Basic HTTP or gRPC knowledge (status codes, request/response).
- Comfort with Python or a similar language.
- Understanding of your model’s latency profile and SLA goals.
Learning path
- Start: Understand error categories and timeouts.
- Implement: Add exponential backoff + jitter and idempotency to POSTs.
- Harden: Respect rate limits, add retry budgets, logging, and correlation IDs.
- Scale: Introduce circuit breaking and partial-failure handling for batch/streaming.
- Validate: Load test, check P95/P99, and tune limits.
Practical projects
- Wrap your model’s REST client with retries, timeouts, and idempotency. Measure success rate and latency distribution before/after.
- Implement a simple circuit breaker and show how it reduces downstream error storms in a simulated outage.
- Batch inference pipeline: chunk inputs, retry transient failures only, deduplicate successful items, and produce a final report of successes/errors.
Next steps
- Run the exercises above and verify against the checklist.
- Take the Quick Test to check understanding. Note: anyone can take it; only logged-in users get saved progress.
- Apply these patterns to one real service you own. Start small: add timeouts and basic backoff, then iterate.