
Topic 5 of 8

Rate Limits And Quotas Design

Learn Rate Limits And Quotas Design for free with explanations, exercises, and a quick test (for API engineers).

Published: January 21, 2026 | Updated: January 21, 2026

Who this is for

  • API engineers and backend developers who design, scale, and protect public or internal APIs.
  • Platform/SRE engineers building gateways, proxies, or edge protections.
  • Tech leads deciding on plans, fairness, and abuse prevention.

Prerequisites

  • HTTP basics: methods, status codes, headers.
  • Authentication and identity: API keys, OAuth tokens, service-to-service identities.
  • Familiarity with caches and distributed state (e.g., Redis) helps but is not required.

Why this matters

Rate limits and quotas protect your API and users from overload, ensure fairness across tenants, and control costs. You will be expected to:

  • Define fair usage across free and paid plans.
  • Prevent abusive spikes and cascading failures.
  • Expose clear headers so clients can auto-throttle.
  • Provide predictable bursts for good UX while keeping backends safe.

Concept explained simply

Rate limit vs quota
  • Rate limit: short-term gate (e.g., 100 requests/min). Controls bursts.
  • Quota: longer-term allowance (e.g., 1 million requests/month). Controls total consumption.
Common identities (what you limit by)
  • Per API key / token
  • Per user or per organization/tenant
  • Per IP (often for public unauthenticated endpoints)
  • Per app/client_id (for OAuth)
Units you can limit
  • Requests per time window (e.g., rps, rpm)
  • Cost-based units (points). Example: GET = 1 point, POST = 2 points, heavy export = 50 points
  • Concurrency (in-flight requests), especially for expensive operations

Mental model

Think of a faucet connected to a bucket:

  • The bucket holds tokens (capacity = burst).
  • Tokens drip in at a steady rate (refill rate = sustained throughput).
  • Each request consumes tokens. If empty, requests wait or are rejected.

This is the token bucket model. It lets you support short bursts without exceeding a safe long-term rate.
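The faucet-and-bucket model above can be sketched in a few lines. This is a minimal single-process Python sketch (no locking, no shared storage, illustrative numbers), not a production implementation:

```python
import time

class TokenBucket:
    """Token bucket: capacity = burst size, refill_rate = sustained tokens/sec."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity            # start full: an initial burst is allowed
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1.0)   # burst of 5, 1 req/sec sustained
results = [bucket.allow() for _ in range(7)]        # 7 back-to-back requests
# The first 5 pass (the burst); the rest are rejected until tokens refill.
```

Note the use of a monotonic clock: wall-clock time can jump backward (NTP corrections) and corrupt the refill math.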

Algorithms you will use

Fixed window

Count requests in discrete windows (e.g., per minute). Simple but susceptible to boundary spikes.

Sliding window

Count requests over the last N seconds continuously. Fairer than fixed window.

Leaky bucket

Enforces a nearly constant outflow rate. Smooths traffic but less flexible for bursts.

Token bucket

Allows bursts up to bucket size while maintaining average rate via token refill. Common and practical.
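Of these, the sliding-window log is the easiest to trace by hand. A minimal sketch, assuming in-memory state (a real deployment would keep the timestamps in shared storage and cap memory per key):

```python
import time
from collections import deque
from typing import Optional

class SlidingWindowLimiter:
    """Sliding-window log: allow at most `limit` requests in the last `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits = deque()   # timestamps of accepted requests, oldest first

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self.hits and self.hits[0] <= now - self.window:
            self.hits.popleft()
        if len(self.hits) < self.limit:
            self.hits.append(now)
            return True
        return False

limiter = SlidingWindowLimiter(limit=3, window=60.0)
# Passing explicit timestamps makes the behavior easy to see:
print(limiter.allow(now=0.0), limiter.allow(now=1.0),
      limiter.allow(now=2.0), limiter.allow(now=3.0))  # True True True False
print(limiter.allow(now=61.0))                         # old hits aged out: True
```

Unlike a fixed window, there is no minute boundary a client can straddle to double its effective rate.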

Design steps (use this checklist)

  • Identify actors: IP, user, token, org, or app?
  • Pick units: requests, points, or concurrency.
  • Choose limits: sustained rate + burst + longer-term quota.
  • Select algorithm: token bucket or sliding window for fairness.
  • Scope: per-endpoint, per-route group, or global per tenant?
  • Signals: status 429, Retry-After, RateLimit headers, idempotency guidance.
  • Enforcement location: gateway, service, or both (defense in depth).
  • Observability: counters, shed logs, per-tenant dashboards, alerts.
  • Grace and upgrades: soft warnings, burst credits, plan upgrades.
  • Docs for clients: how to backoff, what headers mean.

API signaling

  • Use HTTP 429 Too Many Requests when a limit is exceeded.
  • Include Retry-After to tell clients when to retry (seconds or HTTP date).
  • Expose remaining budget via headers such as:
    • RateLimit-Limit: total allowed in the window
    • RateLimit-Remaining: remaining in the window
    • RateLimit-Reset: seconds until reset

    If your stack prefers X-RateLimit-* naming, keep it consistent.

  • Document exponential backoff: e.g., 1s, 2s, 4s, up to a cap.
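The signaling above is framework-independent: it is just a status code and a set of headers. A sketch of building them in Python, with hypothetical helper names (`rate_limit_headers`, `too_many_requests`) and the RateLimit-* names from the list above:

```python
import math

def rate_limit_headers(limit: int, remaining: int, reset_seconds: float) -> dict:
    """Budget headers to attach to every response, not just rejections."""
    return {
        "RateLimit-Limit": str(limit),
        "RateLimit-Remaining": str(max(0, remaining)),
        "RateLimit-Reset": str(math.ceil(reset_seconds)),
    }

def too_many_requests(retry_after_seconds: float, limit: int):
    """Build a 429 response: status code plus headers telling the client when to retry."""
    headers = rate_limit_headers(limit, remaining=0, reset_seconds=retry_after_seconds)
    # Round up so clients never retry before the budget actually refills.
    headers["Retry-After"] = str(math.ceil(retry_after_seconds))
    return 429, headers

status, headers = too_many_requests(retry_after_seconds=7.2, limit=600)
# status == 429; headers carry Retry-After: 8 and RateLimit-Remaining: 0
```

Returning the budget headers on successful responses too lets well-behaved clients throttle themselves before ever hitting a 429.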

Worked examples

Example 1: Public read API
  • Identity: IP for unauthenticated users; token for authenticated users.
  • Limits: 60 rpm per IP; 600 rpm per token. Burst: 2x for up to 10 seconds.
  • Algorithm: token bucket with capacity sized for a 10-second burst at twice the sustained rate: 2 × (rpm / 60) × 10 ≈ 20 tokens per IP, 200 per token.
  • On exceed: 429 + Retry-After seconds until at least 1 token refills.
  • Headers: RateLimit-Limit: 600; RateLimit-Remaining: 123; RateLimit-Reset: 8
Example 2: Cost-based write endpoints
  • Identity: per organization (org_id).
  • Units: points/min. POST = 2 points, PUT = 2, DELETE = 5, BULK_IMPORT = 50.
  • Limits: 200 points/min (burst 300). Monthly quota: 1,000,000 points.
  • Reason: protects DB from heavy writes while allowing light writes to pass.
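The point accounting in Example 2 reduces to a lookup table plus a running total per window. A minimal sketch (operation names and costs taken from the example; refill/reset handling omitted):

```python
# Point costs per operation, as in Example 2.
POINT_COST = {"GET": 1, "POST": 2, "PUT": 2, "DELETE": 5, "BULK_IMPORT": 50}

class PointBudget:
    """Per-org point budget for one window (e.g., 200 points/min)."""

    def __init__(self, points_per_window: int):
        self.budget = points_per_window
        self.used = 0

    def charge(self, operation: str) -> bool:
        cost = POINT_COST.get(operation, 1)   # unknown operations default to 1 point
        if self.used + cost > self.budget:
            return False                      # would exceed the window budget: reject
        self.used += cost
        return True

org = PointBudget(points_per_window=200)
print(org.charge("BULK_IMPORT"), org.used)   # True 50
print(org.charge("DELETE"), org.used)        # True 55
```

The same structure extends naturally to the monthly quota: a second, longer-lived counter charged with the same costs.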
Example 3: Multi-region service
  • Identity: token; global limit across regions.
  • Storage: shared Redis cluster for counters; TTL keyed by rolling window.
  • Fail-safe: if Redis unavailable, fall back to local conservative limit and log.
  • Observability: per-tenant dashboards and 95th percentile wait/429 rate.

Edge cases and safeguards

  • Fan-out endpoints (one request triggers many internal calls): charge a higher cost.
  • Long-running jobs: charge on enqueue, not completion. Offer idempotency keys.
  • Webhooks: if receivers 429, implement retry with exponential backoff and jitter.
  • Clock skew: prefer monotonic counters/epochs or server-side time only.
  • Fairness: avoid rate limits that punish multi-user orgs sharing one token; consider per-user within org plus org cap.
  • Security: combine with auth; do not rely on IP limits alone for abuse prevention.

Common mistakes and self-check

  • Mistake: Only a fixed window minute limit. Self-check: Can someone send 100 requests at :59 and 100 more at :00, doubling the intended rate across the boundary? Consider sliding/token bucket.
  • Mistake: One-size-fits-all identity (just IP). Self-check: Do NATs or mobile carriers break fairness?
  • Mistake: No client guidance. Self-check: Do you return Retry-After and remaining budget headers?
  • Mistake: Ignoring writes vs reads cost. Self-check: Are heavy endpoints priced higher?
  • Mistake: Silent drops. Self-check: Always return 429 with clear reason; log for ops.
  • Mistake: Global only or local only. Self-check: Do you need a hybrid (global fairness + local protection)?

Practical projects

  • Build a token-bucket middleware that supports per-key limits and X-RateLimit-* headers.
  • Design a pricing/plan matrix: Free, Pro, Enterprise with different rate/quotas and burst credits.
  • Create a dashboard showing 429 rate by tenant, remaining quota, and top endpoints by cost.

Learning path

  • Start: HTTP status codes, headers, and caching semantics.
  • Then: Identity and auth (API keys, OAuth, service identities).
  • Now: Rate limits and quotas (this lesson).
  • Next: API reliability patterns (circuit breakers, retries, backoff, idempotency).

A mini framework to choose limits

Step 1: Define the sustained load your SLO allows (e.g., 200 rps sustained across all tenants).
Step 2: Estimate backend capacity and reserve 30–40% headroom.
Step 3: Allocate per-plan budgets that sum below safe capacity.
Step 4: Set burst caps to absorb typical spikes (1–3x sustained for 5–20s).
Step 5: Price expensive endpoints as more points.
Step 6: Document headers, 429 behavior, and backoff.
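Steps 2 and 3 are simple arithmetic worth making explicit. A sketch with illustrative numbers (the capacity, headroom, and per-plan figures below are assumptions, not recommendations):

```python
# Step 2: estimate backend capacity and reserve headroom.
backend_capacity_rps = 1000
headroom = 0.35                                          # reserve 30-40%
safe_capacity = backend_capacity_rps * (1 - headroom)    # 650 rps to hand out

# Step 3: allocate per-plan budgets that sum below safe capacity.
# plan -> (tenant count, sustained rps per tenant)
plans = {"free": (200, 0.5), "pro": (50, 6.0), "enterprise": (5, 40.0)}

allocated = sum(count * rps for count, rps in plans.values())
print(f"safe={safe_capacity:.0f} rps, allocated={allocated:.0f} rps")
assert allocated <= safe_capacity, "per-plan budgets must sum below safe capacity"
```

In practice the sum can exceed safe capacity if you accept statistical multiplexing (not every tenant peaks at once), but then you need load shedding as the backstop.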

Exercises

Work through these. You will see them referenced in the Quick Test.

Exercise 1: Design a fair multi-tenant limit

Scenario: You run a SaaS API with Free and Pro plans. Free tenants should get occasional bursts but not impact Pro tenants. Choose identities, sustained and burst rates, and headers.

  • Identity choice and why
  • Sustained limit and burst window for Free vs Pro
  • Monthly quota numbers
  • HTTP signaling (status and headers)

Write your answer as a short policy.

Exercise 2: Token bucket pseudocode

Write pseudocode for a per-token bucket with:

  • Capacity: 120 tokens, Refill: 2 tokens/sec
  • Cost: GET=1, POST=2, BULK=20
  • Output: allow/deny, remaining tokens, seconds to reset 1 token

Exercise completion checklist

  • You selected a fair identity (not just IP) and justified it.
  • You defined both sustained rate and burst.
  • You included quotas and signaling headers.
  • Your pseudocode shows atomic decrement and time-based refill.

Mini challenge

Propose rate limits for a search endpoint that is fast and a report export endpoint that is heavy. Include: identity, cost units, sustained vs burst, and headers. Keep it to 6–8 lines.

Next steps

  • Implement a prototype in your gateway or middleware with metrics.
  • Run load tests to tune burst and refill rates.
  • Add client-side retry with exponential backoff and jitter.
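For the client-side retry step, one common shape is exponential backoff with "full jitter": sleep a random amount up to the exponentially growing cap, which spreads out a herd of retrying clients. A minimal sketch with illustrative base and cap values:

```python
import random

def backoff_delays(base: float = 1.0, cap: float = 32.0, attempts: int = 6):
    """Yield one sleep duration per retry attempt: uniform random in
    [0, min(cap, base * 2**attempt)] -- exponential backoff with full jitter."""
    for attempt in range(attempts):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))

# Delays are bounded by 1, 2, 4, 8, 16, 32 seconds respectively.
for i, delay in enumerate(backoff_delays()):
    print(f"attempt {i}: sleep up to {min(32.0, 2.0 ** i):.0f}s -> {delay:.2f}s")
```

If the server returned Retry-After, honor it as a floor before applying jitter; otherwise the client is just guessing when the budget resets.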

Quick Test


Practice Exercises

2 exercises to complete

Instructions

Design limits for Free vs Pro tenants in a SaaS API.

  • Identity: choose per-token, per-user, or per-org and justify.
  • Limits: sustained rpm/rps and burst for each plan.
  • Quota: monthly requests or points.
  • Signaling: 429 usage, Retry-After, and rate limit headers.

Deliver an 8–12 line policy.

Expected Output
A concise policy stating identity, sustained and burst limits, monthly quotas, and exact headers to return on nearing/exceeding limits.

Rate Limits And Quotas Design — Quick Test

Test your knowledge with 10 questions. Pass with 70% or higher.

