
Rate Limits And Error Handling

Learn Rate Limits And Error Handling for free with explanations, exercises, and a quick test (for Prompt Engineers).

Published: January 8, 2026 | Updated: January 8, 2026

Why this matters

As a Prompt Engineer, your prompts and pipelines often call model APIs. Those APIs enforce limits (requests per minute, tokens per minute, context size). Real-world apps must survive bursts of traffic, slow or flaky networks, and provider hiccups. Good rate limit strategy and error handling turn fragile prototypes into reliable systems.

  • Ship assistants that keep responding during traffic spikes.
  • Batch-process thousands of items without blowing quotas.
  • Prevent user-visible errors and duplicated charges.
  • Reduce costs by pacing tokens and caching results.

Concept explained simply

Imagine a water tap (your app) connected to a bucket with a narrow outlet (the API). The bucket empties at a steady speed (provider throughput). If you pour too fast, water overflows (429 errors). The solution is to pour steadily, and when overflow happens, pause and try again.

Mental model:
  • Plan: Know your provider limits (RPM = requests/minute, TPM = tokens/minute, max tokens per request).
  • Pace: Control concurrency and add delays to stay within limits.
  • Protect: Retry transient errors with backoff, use idempotency keys, and degrade gracefully.
  • Prove: Log, measure, and alert so you can tune and trust your system.

Toolkit for rate limits and errors

1) Detect and classify errors
  • 429 Too Many Requests: You hit a limit. Respect the Retry-After header if present.
  • 5xx Server Errors: Usually transient; safe to retry with backoff.
  • 408/Deadline/Timeouts: Treat as transient; retry with backoff.
  • 4xx (except 429): Usually client issues (bad params, auth). Do not blindly retry; fix input or fail fast.
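
To make the buckets concrete, here is a tiny classifier sketch (the bucket names are illustrative, not any provider's API):

# Sketch: map an HTTP status to a retry decision
def classify(status: int) -> str:
    if status == 429:
        return "rate_limited"   # pause; honor Retry-After if present
    if status == 408 or status >= 500:
        return "transient"      # retry with exponential backoff
    if 400 <= status < 500:
        return "fatal"          # fix the request; do not retry
    return "ok"
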
2) Smart retries: exponential backoff with jitter

Retry quickly at first, then wait longer each attempt. Add random jitter to avoid thundering herds.

# Runnable Python sketch (assumes a requests-style response object)
import random
import time

MAX_RETRIES = 5
BASE = 0.5  # seconds

def call_with_retries(call_api):
    for attempt in range(1, MAX_RETRIES + 1):
        resp = call_api()
        if resp.ok:  # 2xx: success
            return resp
        if resp.status_code == 429 and "Retry-After" in resp.headers:
            time.sleep(float(resp.headers["Retry-After"]))  # assumes the delta-seconds form
        elif resp.status_code == 408 or resp.status_code >= 500:  # transient
            time.sleep(random.uniform(0, BASE * 2 ** (attempt - 1)))  # full jitter
        else:
            break  # non-retryable 4xx: fix input or fail fast
    raise RuntimeError("request failed after retries")

3) Respect limits via pacing
  • Concurrency caps: Limit workers so overall requests per second (RPS) stays within RPM.
  • Token budgeting: Estimate tokens per call to keep TPM under the cap.
  • Queueing: Put tasks into a queue; workers pull at a controlled pace.
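
A minimal pacer sketch that caps the combined request rate across all workers (the class name and numbers are illustrative):

# Sketch: thread-safe pacer; all workers share one instance
import threading
import time

class Pacer:
    def __init__(self, max_rps: float):
        self.min_interval = 1.0 / max_rps  # seconds between request starts
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait_turn(self) -> None:
        with self.lock:
            now = time.monotonic()
            if self.next_slot < now:
                self.next_slot = now  # no backlog: start immediately
            wait = self.next_slot - now
            self.next_slot += self.min_interval  # reserve the next slot
        if wait > 0:
            time.sleep(wait)  # sleep outside the lock

Each worker calls pacer.wait_turn() before its request, so five workers and one worker produce the same overall RPS.
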
4) Idempotency and deduplication

Attach an idempotency key to each logical operation. If a retry occurs, the provider or your service returns the prior result instead of creating duplicates (such as duplicate messages or charges).
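
A client-side sketch, assuming the provider accepts a key header (the Idempotency-Key header name and the in-memory store are assumptions; use a shared store such as Redis in production):

# Sketch: dedupe retries by a stable key derived from the logical operation
import hashlib

_results: dict[str, object] = {}  # assumption: replace with a shared store

def send_once(payload: dict, send) -> object:
    key = hashlib.sha256(repr(sorted(payload.items())).encode()).hexdigest()
    if key in _results:
        return _results[key]  # replay the prior result; no duplicate call
    resp = send(payload, headers={"Idempotency-Key": key})  # assumed header name
    _results[key] = resp
    return resp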

5) Fallbacks and graceful degradation
  • Shorter prompts or smaller batches when limits are tight.
  • Cached answers or previously approved summaries for common queries.
  • Queue-and-notify: Accept the request and inform the user of a slight delay.
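
For instance, a cache-first answer path that degrades under sustained rate limiting (RateLimitError, call_model, and call_model_compact are placeholder names):

# Sketch: cache first, full model next, compact prompt as the fallback
class RateLimitError(Exception):
    pass

def answer(query: str, cache: dict, call_model, call_model_compact) -> str:
    if query in cache:
        return cache[query]  # no API call at all
    try:
        result = call_model(query)  # full-quality prompt
    except RateLimitError:
        result = call_model_compact(query)  # trimmed context under pressure
    cache[query] = result
    return result
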
6) Timeouts, circuit breaker, and budgets
  • Client timeout: Don’t wait forever; bound latency.
  • Circuit breaker: After repeated failures, pause calls for a cool-off period.
  • Retry budget: Cap total retries per time window to avoid traffic storms.
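
A minimal circuit-breaker sketch (the failure threshold and cool-off period are illustrative):

# Sketch: stop calling after repeated failures, probe again after a cool-off
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, cooloff_s: float = 30.0):
        self.max_failures = max_failures
        self.cooloff_s = cooloff_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooloff_s:
            self.opened_at = None  # half-open: let one probe through
            self.failures = 0
            return True
        return False  # still cooling off; skip the call

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
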
7) Telemetry
  • Log: status code, error type, latency, tokens used, attempt number.
  • Metrics: success rate, average wait, queue length, tokens/min, RPM.
  • Alerts: sustained 429s, rising latency, high error ratio.
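
One structured log record per attempt is enough to start (field names are only a suggestion):

# Sketch: structured, machine-parseable log line per attempt
import json
import logging
import time

logging.basicConfig(level=logging.INFO)

def log_attempt(status: int, error_type: str, latency_s: float,
                tokens: int, attempt: int) -> None:
    logging.info(json.dumps({
        "ts": time.time(), "status": status, "error_type": error_type,
        "latency_s": round(latency_s, 3), "tokens": tokens, "attempt": attempt,
    }))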

Worked examples

Example 1: Chat API with 60 RPM

Constraint: 60 requests/minute (1 RPS). You have 5 workers.

  1. Set a concurrency gate so combined RPS ≤ 1. For 5 workers, allow each to proceed only when a shared token is available.
  2. On 429 without Retry-After, back off with jitter (random waits capped at 0.5s, 1s, 2s, 4s...).
  3. Measure queue length; if backlog grows, surface a friendly delay message to users.

Result: Stable throughput, minimal 429s, predictable latency.
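
Using the call_with_retries and Pacer sketches from the toolkit, the gate might look like this (tasks and call_api are placeholders for your work items and API call):

# Sketch: five workers sharing one 1-RPS gate (limits from this example)
from concurrent.futures import ThreadPoolExecutor

pacer = Pacer(max_rps=1.0)  # 60 RPM -> at most 1 request/second overall

def worker(task):
    pacer.wait_turn()  # blocks until the shared slot is free
    return call_with_retries(lambda: call_api(task))

with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(worker, tasks))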

Example 2: Batch 10k items under 90k TPM

Estimate 1,200 tokens per request (prompt + output).

  • Max requests/min = floor(90,000 / 1,200) = 75.
  • Target RPS ≈ 1.25.
  • Total time ≈ 10,000 / 75 = 133.33 minutes (~2h13m).

Add pacing and retries; log tokens to validate the estimate and auto-adjust if actual usage differs.
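
The same arithmetic as code (numbers from this example):

# Sketch: pacing math for the batch
TPM = 90_000
TOKENS_PER_REQUEST = 1_200
ITEMS = 10_000

req_per_min = TPM // TOKENS_PER_REQUEST  # 75
target_rps = req_per_min / 60            # ~1.25
total_min = ITEMS / req_per_min          # ~133.3 minutes (~2h13m)
print(req_per_min, round(target_rps, 2), round(total_min, 1))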

Example 3: Graceful degradation for a spiky UI

During traffic spikes:

  • If queue length > threshold: send a short draft answer now, stream the full answer when ready.
  • If repeated 429s: switch to compact prompts (e.g., trimmed context), then re-run the high-quality version in the background.
  • Cache common responses to eliminate redundant calls.

Users see responsiveness instead of errors.

Practice exercises

Do these in a notebook or editor. Keep outputs readable. You can compare your work with the hints below.

Exercise 1 — Resilient retry helper

Write a function request_with_retries(call, idempotency_key), in pseudocode or a real language, that:

  • Retries up to 5 times on 429/5xx/timeouts.
  • Uses exponential backoff with full jitter.
  • Respects Retry-After when provided.
  • Accepts an idempotency_key parameter and passes it through.
  • Logs attempt, status, wait_time, and idempotency_key.
Expected output: a runnable or clearly readable function that retries correctly, stops on non-retryable 4xx, respects Retry-After, and logs structured fields per attempt.

Hint

Keep a base delay (e.g., 0.5s). For attempt n, wait random(0, base * 2^(n-1)) when no Retry-After header is present.

Exercise 2 — Token pacing calculator

Given TPM = 90,000 and estimated 1,200 tokens/request for 10,000 items, compute:

  • Max requests per minute
  • Target RPS
  • Approximate total processing time
Hint

requests_per_min = floor(TPM / tokens_per_request). Time ≈ total_items / requests_per_min.

Checklist
  • [ ] I ran Exercise 1 without syntax errors.
  • [ ] My retry logic stops on non-retryable 4xx.
  • [ ] I computed the pacing numbers for Exercise 2 and sanity-checked them.

Common mistakes and self-check

  • Retrying invalid requests (e.g., 400). Fix inputs; do not retry.
  • Using fixed sleeps. Prefer exponential backoff with jitter.
  • Ignoring Retry-After. Providers send it for a reason—respect it.
  • No idempotency. Retries may create duplicates.
  • Unlimited concurrency. A few fast workers can trigger mass 429s.
  • No timeouts. Hanging calls waste resources and block users.
  • Skipping logs/metrics. If you can’t see it, you can’t tune it.
Self-check
  • Do I know my RPM and TPM and where they’re enforced?
  • Can I show a log line with attempt, status, latency, and tokens used?
  • Can I cap overall RPS even with many workers?
  • Do I have a fallback when limits are tight?

Practical projects

  • Reliable Chat Microservice: Build a small service that accepts prompts, enforces RPM=60 and TPM=20k, and exposes logs/metrics for attempts and tokens.
  • Batch Summarizer: Process 5,000 documents under TPM=90k with resumable progress, retries, and a simple dashboard showing queue length and ETA.
  • Cache-First FAQ Bot: Cache common Q&A, fall back to the model when cache misses, and prove rate-limit resilience with a load test.

Mini challenge

Design a plan to process 25,000 rows tonight with provider limits of 120 RPM and 180k TPM, estimated 900 tokens/request. Include:

  • Max requests/min and target RPS.
  • Concurrency cap and queue strategy.
  • Retry and fallback policy for 429/5xx.
Possible approach

Max req/min = floor(180,000 / 900) = 200, but capped by RPM = 120, so 120/min (2 RPS). At that pace, 25,000 rows take about 25,000 / 120 ≈ 208 minutes (~3.5 hours), which fits in one night. Set concurrency so total RPS ≤ 2. Use exponential backoff with jitter, respect Retry-After, and switch to compact prompts under sustained 429s. Queue all rows and expose an ETA.

Who this is for

  • Prompt Engineers turning prototypes into production services.
  • Data/ML engineers adding LLM calls to pipelines.
  • Developers responsible for reliability and cost control of LLM features.

Prerequisites

  • Basic HTTP concepts (status codes, headers, timeouts).
  • Familiarity with asynchronous tasks or worker queues.
  • Understanding of LLM token usage (prompt + output tokens).

Learning path

  1. Learn your provider limits and inspect error responses.
  2. Implement retry with backoff and idempotency keys.
  3. Add pacing: concurrency caps, RPM/RPS guards, and token budgeting.
  4. Introduce fallbacks and graceful degradation.
  5. Instrument logs/metrics; tune based on real data.

Next steps

  • Complete the exercises and mini challenge.
  • Build one Practical Project and measure improvements.
  • Take the Quick Test to check understanding. Note: the test is available to everyone; log in to save your progress.

Rate Limits And Error Handling — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

