Why this matters
As an API Engineer, you must protect services from abuse, spikes, and noisy neighbors while keeping good users fast and happy. Rate limiting is how you control bursts, enforce fair usage, and keep upstream systems stable. You'll use it when rolling out new endpoints, designing pricing tiers, and defending shared resources like databases and third-party APIs.
- Real tasks: implement per-tenant limits, expose rate headers, return 429 with proper retry guidance, and prevent sudden spikes during promotions.
- Release safety: ship in shadow/monitoring mode, prove no regression, then enforce gradually.
Who this is for
- API/backend engineers adding limits to existing services.
- Developers building multi-tenant or public APIs.
- Engineers preparing for on-call: preventing outages caused by traffic spikes.
Prerequisites
- Comfort with HTTP status codes and headers.
- Basic familiarity with a key-value store (e.g., Redis) and atomic operations.
- Understanding of authentication/identity to choose the right limiting key (user, token, tenant, IP).
Concept explained simply
Rate limiting decides whether to allow or reject a request based on how many recent requests the identity (e.g., user or tenant) has made. You define limits like "100 requests per minute with bursts up to 20". The limiter measures recent activity and either consumes a token (allow) or returns 429 Too Many Requests with guidance when the limit is reached.
Mental model
Think of a faucet and a bucket:
- The faucet drips tokens into a bucket at a fixed rate (average throughput).
- The bucket has a size (burst capacity).
- Each request removes one or more tokens (cost).
- If empty, you must wait until more tokens drip in (retry later).
Common algorithms (open)
- Fixed window: count requests per discrete window (e.g., per minute). Simple but has boundary burst issues.
- Sliding window (log or rolling): smoother than fixed window, tracks requests within a moving period.
- Token bucket: supports bursts with a defined average rate. Great for user-facing APIs.
- Leaky bucket: smooths traffic to a fixed outflow; good for protecting fragile downstreams.
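None of the worked examples below use a leaky bucket, so here is a minimal in-memory sketch of the idea; class and parameter names are illustrative, and the 50 r/s figure only echoes the project idea later in this guide.

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: 'water' drains at a fixed rate; each request
    adds one unit. If the bucket would overflow, the request is rejected
    (or queued by the caller) so the downstream sees a smooth outflow."""

    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec          # fixed outflow (drain) rate
        self.capacity = capacity          # how much backlog we tolerate
        self.level = 0.0                  # current water level
        self.last_drain = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Drain at the fixed rate since the last check.
        self.level = max(0.0, self.level - (now - self.last_drain) * self.rate)
        self.last_drain = now
        if self.level + cost > self.capacity:
            return False                  # overflow: reject; caller may retry later
        self.level += cost
        return True

# Illustrative: shield a fragile downstream at ~50 requests/second with a small buffer.
downstream_guard = LeakyBucket(rate_per_sec=50, capacity=10)
```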
Core building blocks
- Identity key: prioritize stable user/tenant IDs from auth; fallback to IP if anonymous. Avoid spoofable headers.
- Policy: rate (R/time), burst capacity, weights per endpoint (e.g., search=5 tokens, ping=1).
- Storage: centralized, low-latency store like Redis for counters/tokens. Use atomic ops or Lua scripts.
- Placement: gateway/proxy, service middleware, or both (edge for coarse, service for fine-grained).
- Feedback: 429 Too Many Requests; include Retry-After. Add rate headers so clients can adapt.
- Headers: expose limit/remaining/reset or average/burst info. Use consistent naming and units (seconds).
- Safety modes: monitor-only (log), soft limit (warn + slow), hard enforce (429).
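To tie the feedback, header, and safety-mode pieces together, here is a minimal, framework-agnostic sketch in Python. The decision dict shape, function name, and mode strings are illustrative assumptions, not a specific library's API.

```python
import logging

log = logging.getLogger("ratelimit")

def rate_limit_outcome(decision: dict, mode: str):
    """decision: output of whichever limiter you use, e.g.
    {"allowed": False, "limit": 100, "remaining": 0, "reset": 42, "retry_after": 3}.
    mode: "monitor" (log only), "soft" (warn but allow), or "hard" (reject with 429).
    Returns (allow, status, headers) for the caller's framework to apply."""
    headers = {
        "RateLimit-Limit": str(decision["limit"]),
        "RateLimit-Remaining": str(max(0, decision["remaining"])),
        "RateLimit-Reset": str(decision["reset"]),    # seconds until the window resets
    }
    if decision["allowed"]:
        return True, 200, headers
    if mode == "monitor":
        log.info("would-be 429 (shadow mode): %s", decision)
        return True, 200, headers
    if mode == "soft":
        log.warning("soft limit exceeded: %s", decision)
        return True, 200, headers         # optionally slow the request down here
    headers["Retry-After"] = str(decision["retry_after"])   # seconds, so clients can back off
    return False, 429, headers            # hard enforce
```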
Worked examples
Example 1: Fixed window in Redis (simple tier)
Goal: 100 requests per minute per user.
- Key: rate:{userId}:{currentMinuteEpoch}.
- INCR the key; if value == 1, set EXPIRE 60.
- If value > 100, return 429 with Retry-After: seconds to next minute.
Why this works / tradeoffs
Very simple and fast. But a user can send 100 requests at the end of minute N and another 100 at the start of minute N+1 (a burst at the window boundary). Good for dashboards or admin endpoints where a slight burst is acceptable.
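A minimal sketch of these steps using the redis-py client; the key layout and numbers mirror the steps above, and note that the INCR-then-EXPIRE pair is not atomic, a pitfall revisited under common mistakes below.

```python
import time
import redis

r = redis.Redis()

LIMIT = 100     # requests per minute
WINDOW = 60     # window length in seconds

def allow_request(user_id: str) -> tuple[bool, int]:
    """Returns (allowed, retry_after_seconds)."""
    minute = int(time.time()) // WINDOW
    key = f"rate:{user_id}:{minute}"
    count = r.incr(key)
    if count == 1:
        # Not atomic with INCR: if the process dies right here, the key never
        # expires. A Lua script (sketched under common mistakes) closes the gap.
        r.expire(key, WINDOW)
    if count > LIMIT:
        retry_after = WINDOW - int(time.time()) % WINDOW   # seconds until the next minute
        return False, retry_after
    return True, 0
```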
Example 2: Token bucket (smooth with bursts)
Goal: average 10 r/s with burst up to 20.
- Store: last_refill_ts and tokens for each user.
- On request: refill = (now - last_refill_ts) * 10; tokens = min(20, tokens + refill); then set last_refill_ts = now.
- If tokens < cost: reject 429 and compute Retry-After = (cost - tokens)/10.
- Else: tokens -= cost; allow.
When to pick
Use when you want consistent average rate but allow short bursts (e.g., UI bursts from page loads).
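A minimal in-memory sketch of the refill math above, assuming 10 tokens/second with a burst of 20; a production limiter would keep this state in a shared store such as Redis rather than process memory.

```python
import time

class TokenBucket:
    def __init__(self, rate: float = 10.0, burst: float = 20.0):
        self.rate = rate                  # tokens added per second (average rate)
        self.burst = burst                # bucket size (burst capacity)
        self.tokens = burst               # start full
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> tuple[bool, float]:
        """Returns (allowed, retry_after_seconds)."""
        now = time.monotonic()
        # refill = elapsed seconds * rate, capped at the burst capacity
        self.tokens = min(self.burst, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens < cost:
            return False, (cost - self.tokens) / self.rate   # matches the Retry-After math above
        self.tokens -= cost
        return True, 0.0
```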
Example 3: Sliding window log (fairness across boundaries)
Goal: at most 300 requests in any rolling 5-minute window.
- Use Redis ZSET per user: score=timestamp, value=requestId.
- Remove entries older than now-300s.
- Check ZCARD; if >= 300, 429 with Retry-After equal to the time until the oldest relevant entry expires.
- Else ZADD new entry and allow.
Tradeoffs
More precise and fair than fixed window; costs more memory and CPU. Good for public APIs with strict SLAs.
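A hedged redis-py sketch of the ZSET approach above; member and key names are illustrative, and because the commands run separately it is not strictly atomic, so a pipeline or Lua script would tighten it under heavy concurrency.

```python
import time
import uuid
import redis

r = redis.Redis()

LIMIT = 300      # max requests in any rolling window
WINDOW = 300     # window length in seconds (5 minutes)

def allow_request(user_id: str) -> tuple[bool, float]:
    """Returns (allowed, retry_after_seconds)."""
    key = f"rate:sliding:{user_id}"
    now = time.time()
    r.zremrangebyscore(key, 0, now - WINDOW)           # drop entries older than the window
    if r.zcard(key) >= LIMIT:
        oldest = r.zrange(key, 0, 0, withscores=True)  # oldest remaining (member, timestamp)
        retry_after = max(0.0, oldest[0][1] + WINDOW - now)
        return False, retry_after
    r.zadd(key, {str(uuid.uuid4()): now})              # record this request
    r.expire(key, WINDOW)                              # housekeeping so idle keys disappear
    return True, 0.0
```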
Design choices you'll make
- Granularity: per API key, per user, per tenant, per IP, per route, or combinations.
- Weights: heavier endpoints consume more tokens (a policy sketch follows this list).
- Global vs regional: apply both local and global caps to avoid regional hot spots.
- Shadow first: deploy monitor-only to collect violation stats before enforcing.
- Client feedback: choose headers and clear error messages to help clients auto-throttle.
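One way to capture these choices is a small declarative policy object, sketched below; the tier numbers are placeholders and the search/ping weights only echo the earlier example, so substitute your own product decisions.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    rate_per_sec: float                    # average rate
    burst: int                             # bucket size (burst capacity)
    route_weights: dict[str, int] = field(default_factory=dict)  # token cost per route
    shadow: bool = False                   # monitor-only: log would-be 429s, never block

    def cost(self, route: str) -> int:
        return self.route_weights.get(route, 1)   # default cost of 1 token

# Illustrative placeholder tiers; real numbers come from your pricing and capacity work.
POLICIES = {
    "free": Policy(rate_per_sec=1, burst=20, route_weights={"/search": 5, "/ping": 1}),
    "pro":  Policy(rate_per_sec=10, burst=120, route_weights={"/search": 5, "/ping": 1}),
}
```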
Suggested headers format
- RateLimit-Limit: "10;w=1, 600;w=60" (example: 10/sec and 600/min)
- RateLimit-Remaining: remaining tokens for the most constrained window
- RateLimit-Reset: seconds until the most constrained window resets
- Retry-After: seconds until next allowed request (on 429)
Hands-on exercises
Do these after reading the examples. They match the exercises below.
- ex1: Implement a token bucket using a single Redis Lua script for atomicity. Include headers and 429 handling.
- ex2: Design tiered limits (Free/Pro/Enterprise) with endpoint weights and a fallback IP limit.
- Deliverables: working limiter or a precise design doc with data structures, keys, and headers.
- Timebox: 45–60 minutes each.
Checklist (self-check)
- Limits are enforced atomically under concurrency.
- Identity choice is stable (user/tenant) with safe IP fallback.
- 429 responses include Retry-After and consistent rate headers.
- Supports burst and steady rate as intended by policy.
- Shadow/monitor mode is possible via a flag or config.
- Metrics and logs show allows, throttles, and near-limit events.
Common mistakes & how to self-check
- Non-atomic counters: using INCR then EXPIRE can race. Self-check: run a high-concurrency test; counts should never exceed policy (an atomic Lua fix is sketched after this list).
- Wrong identity: limiting by IP for authenticated users causes unfair throttling behind NATs. Self-check: verify you use a stable user/tenant key first.
- Missing feedback: 429 without Retry-After makes clients retry blindly. Self-check: inspect responses; ensure headers are present and correct.
- Window mismatch: mixing seconds and milliseconds. Self-check: unit tests asserting exact Retry-After math.
- Monolithic key: one global key throttles everyone. Self-check: keys include identity and optionally route.
- No weight model: heavy endpoints treated equal. Self-check: define token cost per endpoint and test it.
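For the non-atomic counter pitfall, a common fix is to fold INCR and EXPIRE into one Redis Lua script so they execute atomically on the server; here is a minimal redis-py sketch, with the script text and key naming as illustrative assumptions.

```python
import time
import redis

r = redis.Redis()

# Increment the counter and set its expiry in a single atomic server-side step.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""
fixed_window = r.register_script(FIXED_WINDOW_LUA)

def allow_request(user_id: str, limit: int = 100, window: int = 60) -> bool:
    key = f"rate:{user_id}:{int(time.time()) // window}"
    count = fixed_window(keys=[key], args=[window])
    return count <= limit
```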
Practical projects
- Build a gateway middleware that enforces both a per-second token bucket and a per-day quota for each tenant.
- Add rate headers to an existing API and create a small client that adapts using Retry-After and jittered backoff (a starting sketch follows this list).
- Protect a fragile downstream (e.g., search) with a leaky bucket at 50 r/s and measure tail latency improvements.
- Implement a shadow limiter mode that logs would-be 429s for a week, then gradually ramp to hard enforcement.
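For the adaptive-client project, a minimal sketch of honoring Retry-After with jittered backoff; the requests library, plain GET call, and jitter factor are assumptions for illustration.

```python
import random
import time
import requests

def call_with_backoff(url: str, max_attempts: int = 5):
    delay = 1.0                                    # base delay when no Retry-After is given
    for attempt in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # Prefer the server's hint; otherwise fall back to exponential backoff.
        retry_after = float(resp.headers.get("Retry-After", delay))
        time.sleep(retry_after + random.uniform(0, 0.5 * retry_after))  # add jitter
        delay = min(delay * 2, 30.0)
    return resp
```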
Learning path
- Start: implement a fixed window on one endpoint.
- Next: switch to token bucket with bursts and weights.
- Then: add distributed atomicity via Redis Lua and introduce shadow mode.
- Finally: add quotas (daily/monthly), dashboards, and alerts.
Next steps
- Instrument: expose metrics for allows, throttles, and near-limit events.
- Prepare runbooks: what SRE/on-call does when throttling spikes.
- Plan tenant tiers and document client guidance with examples.
Mini challenge
Design a policy for a multi-tenant API:
- Free: 60/min, burst 30
- Pro: 600/min, burst 120
- Enterprise: 3000/min, burst 600
Make the /export endpoint cost 10 tokens and /ping cost 1. How do headers look for a Pro user who just consumed 450 tokens within the last minute? Provide RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, and sample 429 Retry-After if they exceed burst.