Why this matters
As a Prompt Engineer, your prompts and pipelines often call model APIs. Those APIs enforce limits (requests per minute, tokens per minute, context size). Real-world apps must survive bursts of traffic, slow or flaky networks, and provider hiccups. Good rate limit strategy and error handling turn fragile prototypes into reliable systems.
- Ship assistants that keep responding during traffic spikes.
- Batch-process thousands of items without blowing quotas.
- Prevent user-visible errors and duplicated charges.
- Reduce costs by pacing tokens and caching results.
Concept explained simply
Imagine a water tap (your app) connected to a bucket with a narrow outlet (the API). The bucket empties at a steady speed (provider throughput). If you pour too fast, water overflows (429 errors). The solution is to pour steadily, and when overflow happens, pause and try again.
- Plan: Know your provider limits (RPM, TPM, max tokens per request).
- Pace: Control concurrency and add delays to stay within limits.
- Protect: Retry transient errors with backoff, use idempotency keys, and degrade gracefully.
- Prove: Log, measure, and alert so you can tune and trust your system.
Toolkit for rate limits and errors
1) Detect and classify errors
- 429 Too Many Requests: You hit a limit. Respect Retry-After header if present.
- 5xx Server Errors: Transient; safe to retry with backoff.
- 408/Deadline/Timeouts: Treat as transient; retry with backoff.
- 4xx (except 429): Usually client issues (bad params, auth). Do not blindly retry; fix input or fail fast.
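A small helper can encode these rules so the retry logic that follows stays readable. A minimal sketch, assuming the only input is the numeric status code (with None standing in for a client-side timeout):

    # Map a status code to a retry decision, following the rules above.
    def classify(status):
        if status is None or status == 408:
            return "retry"        # timeout / deadline exceeded: transient
        if status == 429:
            return "retry_after"  # rate limited: honor Retry-After if present
        if 500 <= status <= 599:
            return "retry"        # server error: transient
        if 400 <= status <= 499:
            return "fail_fast"    # other client errors: fix the request, do not retry
        return "ok"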
2) Smart retries: exponential backoff with jitter
Retry quickly at first, then wait longer on each attempt. Add random jitter so many clients do not retry in lockstep and create a thundering herd. A runnable sketch in Python, assuming call_api() returns a response object with ok, status, and headers, and is_transient() applies the classification rules above:

    import random
    import time

    MAX_RETRIES = 5
    BASE_DELAY = 0.5  # seconds

    def request_with_backoff(call_api):
        for attempt in range(1, MAX_RETRIES + 1):
            resp = call_api()
            if resp.ok:
                return resp
            retry_after = resp.headers.get("Retry-After")
            if resp.status == 429 and retry_after is not None:
                time.sleep(float(retry_after))  # the provider told us how long to wait
            elif is_transient(resp):            # 5xx, timeouts, or 429 without the header
                # Full jitter: wait a random time up to base * 2^(attempt - 1).
                time.sleep(random.uniform(0, BASE_DELAY * 2 ** (attempt - 1)))
            else:
                break                           # non-retryable 4xx: fix the input, fail fast
        raise RuntimeError("request failed after retries")
3) Respect limits via pacing
- Concurrency caps: Limit workers so the overall request rate (RPS) stays under the RPM cap.
- Token budgeting: Estimate tokens per call to keep TPM under the cap.
- Queueing: Put tasks into a queue; workers pull at a controlled pace.
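One way to enforce the cap across several workers is a shared gate that spaces calls 60/RPM seconds apart, no matter how many threads pull from the queue. A minimal sketch, assuming threaded workers (the RateGate name is illustrative):

    import threading
    import time

    class RateGate:
        """Shared pacer: callers reserve the next send slot, then sleep until it arrives."""

        def __init__(self, rpm):
            self.interval = 60.0 / rpm        # minimum seconds between requests overall
            self.lock = threading.Lock()
            self.next_slot = time.monotonic()

        def wait(self):
            with self.lock:
                now = time.monotonic()
                slot = max(self.next_slot, now)
                self.next_slot = slot + self.interval
            time.sleep(max(0.0, slot - now))  # every worker calls wait() before each request

    gate = RateGate(rpm=60)                   # combined throughput stays at or below 1 RPS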
4) Idempotency and deduplication
Attach an idempotency key to each logical operation. If a retry occurs, the provider or your own service returns the prior result instead of creating duplicates (such as duplicate messages or charges).
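A minimal sketch of the idea, assuming an in-process cache and a provider that accepts an Idempotency-Key header (both are illustrative; check what your provider actually supports):

    import hashlib

    _results = {}  # prior results keyed by idempotency key (use durable storage in production)

    def idempotency_key(user_id, payload):
        # Same logical operation -> same key, even across retries and restarts.
        return hashlib.sha256(f"{user_id}:{payload}".encode()).hexdigest()

    def send_once(call_api, user_id, payload):
        key = idempotency_key(user_id, payload)
        if key in _results:
            return _results[key]  # a retried operation returns the prior result, no duplicate
        resp = call_api(payload, headers={"Idempotency-Key": key})
        _results[key] = resp
        return resp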
5) Fallbacks and graceful degradation
- Shorter prompts or smaller batches when limits are tight.
- Cached answers or previously approved summaries for common queries.
- Queue-and-notify: Accept the request and inform the user of a slight delay.
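One way to wire these together is a fallback chain tried in order of preference. A minimal sketch; the strategy callables (cache lookup, compact prompt, and so on) are placeholders you would implement:

    def answer_with_fallbacks(query, strategies):
        # strategies: callables in order of preference, each returning a reply or None.
        for strategy in strategies:
            try:
                reply = strategy(query)
                if reply is not None:
                    return reply
            except Exception:
                continue  # a failed strategy falls through to the next one
        # Last resort: queue-and-notify.
        return "We're a bit busy right now. Your request is queued and you'll be notified shortly."

    # Example ordering when limits are tight:
    # answer_with_fallbacks(q, [lookup_cache, ask_with_compact_prompt, ask_with_full_prompt])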
6) Timeouts, circuit breaker, and budgets
- Client timeout: Don’t wait forever; bound latency.
- Circuit breaker: After repeated failures, pause calls for a cool-off period.
- Retry budget: Cap total retries per time window to avoid traffic storms.
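A circuit breaker does not need a library; a counter and a timestamp are enough. A minimal sketch (thresholds are illustrative):

    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold=5, cooloff_seconds=30):
            self.failure_threshold = failure_threshold
            self.cooloff_seconds = cooloff_seconds
            self.consecutive_failures = 0
            self.open_until = 0.0

        def allow(self):
            return time.monotonic() >= self.open_until  # False while the breaker is open

        def record_success(self):
            self.consecutive_failures = 0

        def record_failure(self):
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open_until = time.monotonic() + self.cooloff_seconds  # start cool-off
                self.consecutive_failures = 0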
7) Telemetry
- Log: status code, error type, latency, tokens used, attempt number.
- Metrics: success rate, average wait, queue length, tokens/min, RPM.
- Alerts: sustained 429s, rising latency, high error ratio.
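A single structured log line per attempt is usually enough to derive all of the metrics above later. A minimal sketch using the standard logging module:

    import json
    import logging

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_attempt(status, error_type, latency_ms, tokens_used, attempt):
        logging.info(json.dumps({
            "status": status,          # e.g., 200, 429, 503
            "error_type": error_type,  # e.g., "rate_limited", "timeout", None
            "latency_ms": round(latency_ms, 1),
            "tokens_used": tokens_used,
            "attempt": attempt,
        }))

    log_attempt(429, "rate_limited", 812.4, 0, 2)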
Worked examples
Example 1: Chat API with 60 RPM
Constraint: 60 requests/minute (1 RPS). You have 5 workers.
- Set a concurrency gate so combined RPS ≤ 1. For 5 workers, allow each to proceed only when a shared token is available.
- On 429 without Retry-After, back off with full jitter (caps of 0.5s, 1s, 2s, 4s, ...).
- Measure queue length; if backlog grows, surface a friendly delay message to users.
Result: Stable throughput, minimal 429s, predictable latency.
Example 2: Batch 10k items under 90k TPM
Estimate 1,200 tokens per request (prompt + output).
- Max requests/min = floor(90,000 / 1,200) = 75.
- Target RPS ≈ 1.25.
- Total time ≈ 10,000 / 75 = 133.33 minutes (~2h13m).
Add pacing and retries; log tokens to validate the estimate and auto-adjust if actual usage differs.
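These numbers are quick to verify, and the same few lines can recompute the plan when measured tokens per request drift from the estimate:

    import math

    tpm, tokens_per_request, total_items = 90_000, 1_200, 10_000

    requests_per_min = math.floor(tpm / tokens_per_request)  # 75
    target_rps = requests_per_min / 60                        # 1.25
    total_minutes = total_items / requests_per_min            # ~133.3 (about 2h13m)

    print(requests_per_min, round(target_rps, 2), round(total_minutes, 1))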
Example 3: Graceful degradation for a spiky UI
During traffic spikes:
- If queue length > threshold: send a short draft answer now, stream the full answer when ready.
- If repeated 429s: switch to compact prompts (e.g., trimmed context), then re-run high-quality version in background.
- Cache common responses to eliminate redundant calls.
Users see responsiveness instead of errors.
Practice exercises
Do these in a notebook or editor. Keep outputs readable. You can compare with the solutions below.
Exercise 1 — Resilient retry helper
Write a pseudocode function request_with_retries(call) that:
- Retries up to 5 times on 429/5xx/timeouts.
- Uses exponential backoff with full jitter.
- Respects Retry-After when provided.
- Accepts an idempotency_key parameter and passes it through.
- Logs attempt, status, wait_time, and idempotency_key.
Hint
Keep a base delay (e.g., 0.5s). For attempt n, wait random(0, base * 2^(n-1)) when no Retry-After header is present.
Exercise 2 — Token pacing calculator
Given TPM = 90,000 and estimated 1,200 tokens/request for 10,000 items, compute:
- Max requests per minute
- Target RPS
- Approximate total processing time
Hint
requests_per_min = floor(TPM / tokens_per_request). Time ≈ total_items / requests_per_min.
- [ ] I ran Exercise 1 without syntax errors.
- [ ] My retry logic stops on non-retryable 4xx.
- [ ] I computed the pacing numbers for Exercise 2 and sanity-checked them.
Common mistakes and self-check
- Retrying invalid requests (e.g., 400). Fix inputs; do not retry.
- Using fixed sleeps. Prefer exponential backoff with jitter.
- Ignoring Retry-After. Providers send it for a reason—respect it.
- No idempotency. Retries may create duplicates.
- Unlimited concurrency. A few fast workers can trigger mass 429s.
- No timeouts. Hanging calls waste resources and block users.
- Skipping logs/metrics. If you can’t see it, you can’t tune it.
Self-check
- Do I know my RPM and TPM and where they’re enforced?
- Can I show a log line with attempt, status, latency, and tokens used?
- Can I cap overall RPS even with many workers?
- Do I have a fallback when limits are tight?
Practical projects
- Reliable Chat Microservice: Build a small service that accepts prompts, enforces RPM=60 and TPM=20k, and exposes logs/metrics for attempts and tokens.
- Batch Summarizer: Process 5,000 documents under TPM=90k with resumable progress, retries, and a simple dashboard showing queue length and ETA.
- Cache-First FAQ Bot: Cache common Q&A, fall back to the model when cache misses, and prove rate-limit resilience with a load test.
Mini challenge
Design a plan to process 25,000 rows tonight with provider limits of 120 RPM and 180k TPM, estimated 900 tokens/request. Include:
- Max requests/min and target RPS.
- Concurrency cap and queue strategy.
- Retry and fallback policy for 429/5xx.
Possible approach
Max req/min = floor(180,000 / 900) = 200, but capped by RPM=120, so 120/min (2 RPS). Set concurrency so total RPS ≤ 2. Use exponential backoff with jitter, respect Retry-After, and switch to compact prompts under sustained 429s. Queue all rows and expose ETA.
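The same arithmetic as Example 2, plus one extra step: the effective rate is the tighter of the token-derived limit and the RPM cap. A quick check:

    import math

    rpm_limit, tpm_limit = 120, 180_000
    tokens_per_request, total_rows = 900, 25_000

    token_limited_rpm = math.floor(tpm_limit / tokens_per_request)  # 200
    effective_rpm = min(rpm_limit, token_limited_rpm)                # 120, i.e., 2 RPS
    total_minutes = total_rows / effective_rpm                       # ~208 min (about 3.5 hours)

    print(effective_rpm, effective_rpm / 60, round(total_minutes, 1))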
Who this is for
- Prompt Engineers turning prototypes into production services.
- Data/ML engineers adding LLM calls to pipelines.
- Developers responsible for reliability and cost control of LLM features.
Prerequisites
- Basic HTTP concepts (status codes, headers, timeouts).
- Familiarity with asynchronous tasks or worker queues.
- Understanding of LLM token usage (prompt + output tokens).
Learning path
- Learn your provider limits and inspect error responses.
- Implement retry with backoff and idempotency keys.
- Add pacing: concurrency caps, RPM/RPS guards, and token budgeting.
- Introduce fallbacks and graceful degradation.
- Instrument logs/metrics; tune based on real data.
Next steps
- Complete the exercises and mini challenge.
- Build one Practical Project and measure improvements.
- Take the Quick Test to check understanding. Note: the test is available to everyone; log in to save your progress.