Rate Limiting Implementation

Published: January 21, 2026 | Updated: January 21, 2026

Why this matters

As an API Engineer, you must protect services from abuse, spikes, and noisy neighbors while keeping good users fast and happy. Rate limiting is how you control bursts, enforce fair usage, and keep upstream systems stable. You’ll use it when rolling out new endpoints, designing pricing tiers, and defending shared resources like databases and third-party APIs.

  • Real tasks: implement per-tenant limits, expose rate headers, return 429 with proper retry guidance, and prevent sudden spikes during promotions.
  • Release safety: ship in shadow/monitoring mode, prove no regression, then enforce gradually.

Who this is for

  • API/backend engineers adding limits to existing services.
  • Developers building multi-tenant or public APIs.
  • Engineers preparing for on-call: preventing outages caused by traffic spikes.

Prerequisites

  • Comfort with HTTP status codes and headers.
  • Basic familiarity with a key-value store (e.g., Redis) and atomic operations.
  • Understanding of authentication/identity to choose the right limiting key (user, token, tenant, IP).

Concept explained simply

Rate limiting decides whether to allow or reject a request based on how many recent requests the identity (e.g., user or tenant) has made. You define limits like “100 requests per minute with bursts up to 20”. The limiter measures recent activity and either consumes a token (allow) or returns 429 Too Many Requests with guidance when the limit is reached.

Mental model

Think of a faucet and a bucket:

  • The faucet drips tokens into a bucket at a fixed rate (average throughput).
  • The bucket has a size (burst capacity).
  • Each request removes one or more tokens (cost).
  • If empty, you must wait until more tokens drip in (retry later).

Common algorithms
  • Fixed window: count requests per discrete window (e.g., per minute). Simple but has boundary burst issues.
  • Sliding window (log or rolling): smoother than fixed window, tracks requests within a moving period.
  • Token bucket: supports bursts with a defined average rate. Great for user-facing APIs.
  • Leaky bucket: smooths traffic to a fixed outflow; good for protecting fragile downstreams.
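
Of these, the token bucket gets a full worked example below. For contrast, here is a minimal single-process leaky-bucket sketch (class and parameter names are illustrative, not a standard API):

  import time

  class LeakyBucket:
      # Sketch only: single process, not distributed. Incoming requests fill
      # the bucket; it drains at a fixed rate, smoothing traffic downstream.
      def __init__(self, drain_rate=50.0, capacity=50.0):
          self.drain_rate = drain_rate      # fixed outflow, requests/second
          self.capacity = capacity          # how much backlog is tolerated
          self.level = 0.0
          self.last_drain = time.monotonic()

      def try_add(self, cost=1.0):
          now = time.monotonic()
          # Drain in proportion to elapsed time, never below empty.
          self.level = max(0.0, self.level - (now - self.last_drain) * self.drain_rate)
          self.last_drain = now
          if self.level + cost > self.capacity:
              return False                  # bucket full: shed or queue
          self.level += cost
          return True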

Core building blocks

  • Identity key: prefer stable user/tenant IDs from auth; fall back to IP only for anonymous traffic. Avoid spoofable headers.
  • Policy: rate (requests per unit of time), burst capacity, weights per endpoint (e.g., search=5 tokens, ping=1).
  • Storage: centralized, low-latency store like Redis for counters/tokens. Use atomic ops or Lua scripts.
  • Placement: gateway/proxy, service middleware, or both (edge for coarse, service for fine-grained).
  • Feedback: 429 Too Many Requests; include Retry-After. Add rate headers so clients can adapt.
  • Headers: expose limit/remaining/reset or average/burst info. Use consistent naming and units (seconds).
  • Safety modes: monitor-only (log), soft limit (warn + slow), hard enforce (429).
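
To make "policy" concrete, one possible shape for tiers and endpoint weights (a sketch; names and numbers are illustrative and echo the tiers used in the mini challenge below):

  # Illustrative policy table: average rate, burst capacity, endpoint cost.
  POLICIES = {
      "free":       {"per_minute": 60,   "burst": 30},
      "pro":        {"per_minute": 600,  "burst": 120},
      "enterprise": {"per_minute": 3000, "burst": 600},
  }

  ENDPOINT_COST = {"/search": 5, "/export": 10, "/ping": 1}  # tokens per call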

Worked examples

Example 1: Fixed window in Redis (simple tier)

Goal: 100 requests per minute per user.

  1. Key: rate:{userId}:{currentMinuteEpoch}.
  2. INCR the key; if the value is 1, set EXPIRE 60. Do both in one atomic step (e.g., a Lua script, as in the sketch below) so a crash between the two commands cannot leave a counter without a TTL.
  3. If value > 100, return 429 with Retry-After: seconds to next minute.
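
A sketch of steps 1–3 in Python with redis-py, using a small Lua script so the INCR and EXPIRE run atomically (key layout as above; function and variable names are illustrative):

  import time
  import redis

  r = redis.Redis()

  # INCR and EXPIRE in one atomic script: a crash between the two commands
  # can never leave a counter without a TTL.
  FIXED_WINDOW = r.register_script("""
  local n = redis.call('INCR', KEYS[1])
  if n == 1 then
      redis.call('EXPIRE', KEYS[1], tonumber(ARGV[1]))
  end
  return n
  """)

  def allow(user_id, limit=100, window=60):
      now = int(time.time())
      key = "rate:%s:%d" % (user_id, now // window)
      count = FIXED_WINDOW(keys=[key], args=[window])
      retry_after = window - now % window   # seconds until the next window
      return (count <= limit), retry_after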

Why this works / tradeoffs

Very simple and fast. But a user can send 100 requests at the end of minute N and another 100 at the start of minute N+1 (a burst at the boundary). Good for dashboards or admin endpoints where a slight burst is acceptable.

Example 2: Token bucket (smooth with bursts)

Goal: average 10 r/s with burst up to 20.

  1. Store: last_refill_ts and tokens for each user.
  2. On request: refill = (now - last_refill_ts) * 10; tokens = min(20, tokens + refill).
  3. If tokens < cost: reject 429 and compute Retry-After = (cost - tokens)/10.
  4. Else: tokens -= cost; allow.
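
A single-process sketch of that refill math (in production the state would live in a shared store such as Redis and be updated atomically; names are illustrative):

  import time

  class TokenBucket:
      def __init__(self, rate=10.0, capacity=20.0):
          self.rate = rate                  # tokens refilled per second
          self.capacity = capacity          # burst size
          self.tokens = capacity
          self.last_refill = time.monotonic()

      def try_acquire(self, cost=1.0):
          now = time.monotonic()
          # Step 2: refill in proportion to elapsed time, capped at capacity.
          self.tokens = min(self.capacity,
                            self.tokens + (now - self.last_refill) * self.rate)
          self.last_refill = now
          if self.tokens < cost:
              # Step 3: not enough tokens; report how long until there are.
              return False, (cost - self.tokens) / self.rate
          self.tokens -= cost               # step 4: consume and allow
          return True, 0.0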

When to pick

Use when you want consistent average rate but allow short bursts (e.g., UI bursts from page loads).

Example 3: Sliding window log (fairness across boundaries)

Goal: at most 300 requests in any rolling 5-minute window.

  1. Use Redis ZSET per user: score=timestamp, value=requestId.
  2. Remove entries older than now-300s.
  3. Check ZCARD; if >= 300, 429 with Retry-After equal to the time until the oldest relevant entry expires.
  4. Else ZADD new entry and allow.
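
Steps 1–4 in Python with redis-py (a transactional pipeline keeps the trim and count together; for strict atomicity under heavy concurrency, fold the whole sequence into one Lua script):

  import time
  import uuid
  import redis

  r = redis.Redis()

  def allow_sliding(user_id, limit=300, window=300):
      key = "rate:sliding:%s" % user_id
      now = time.time()
      pipe = r.pipeline()
      pipe.zremrangebyscore(key, 0, now - window)   # step 2: drop old entries
      pipe.zcard(key)                               # step 3: count the rest
      _, count = pipe.execute()
      if count >= limit:
          oldest = r.zrange(key, 0, 0, withscores=True)
          retry_after = (oldest[0][1] + window - now) if oldest else 1.0
          return False, max(0.0, retry_after)
      r.zadd(key, {str(uuid.uuid4()): now})         # step 4: record and allow
      r.expire(key, window)                         # clean up idle keys
      return True, 0.0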

Tradeoffs

More precise and fair than fixed window; costs more memory and CPU. Good for public APIs with strict SLAs.

Design choices you’ll make

  • Granularity: per API key, per user, per tenant, per IP, per route, or combinations.
  • Weights: heavier endpoints consume more tokens.
  • Global vs regional: apply both local and global caps to avoid regional hot spots.
  • Shadow first: deploy monitor-only to collect violation stats before enforcing (see the sketch after this list).
  • Client feedback: choose headers and clear error messages to help clients auto-throttle.
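
To make "shadow first" concrete, a sketch of how the final decision might branch on a mode flag (mode names are illustrative):

  import logging

  log = logging.getLogger("ratelimit")

  def respond(allowed, mode):
      # Map a limiter decision to an HTTP status; None means pass through.
      if allowed:
          return None
      if mode == "monitor":
          log.info("would_throttle")        # collect stats, still allow
          return None
      if mode == "soft":
          log.warning("near_limit")         # warn the client, still allow
          return None
      return 429                            # "enforce": hard rejection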

Suggested headers format
  • RateLimit-Limit: "10;w=1, 600;w=60" (example: 10/sec and 600/min)
  • RateLimit-Remaining: remaining tokens for the most constrained window
  • RateLimit-Reset: seconds until the most constrained window resets
  • Retry-After: seconds until next allowed request (on 429)
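
Put together, a throttled response under this scheme might look like this (all values illustrative):

  HTTP/1.1 429 Too Many Requests
  RateLimit-Limit: 10;w=1, 600;w=60
  RateLimit-Remaining: 0
  RateLimit-Reset: 1
  Retry-After: 1
  Content-Type: application/json

  {"error": "rate_limited", "message": "Retry after 1 second."}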

Hands-on exercises

Do these after reading the examples. They match the exercises below.

  1. ex1: Implement a token bucket using a single Redis Lua script for atomicity. Include headers and 429 handling.
  2. ex2: Design tiered limits (Free/Pro/Enterprise) with endpoint weights and a fallback IP limit.
  • Deliverables: working limiter or a precise design doc with data structures, keys, and headers.
  • Timebox: 45–60 minutes each.

Checklist (self-check)

  • Limits are enforced atomically under concurrency.
  • Identity choice is stable (user/tenant) with safe IP fallback.
  • 429 responses include Retry-After and consistent rate headers.
  • Supports burst and steady rate as intended by policy.
  • Shadow/monitor mode is possible via a flag or config.
  • Metrics and logs show allows, throttles, and near-limit events.

Common mistakes & how to self-check

  • Non-atomic counters: using INCR then EXPIRE can race. Self-check: run a high-concurrency test; counts should never exceed policy.
  • Wrong identity: limiting by IP for authenticated users causes unfair throttling behind NATs. Self-check: verify you use a stable user/tenant key first.
  • Missing feedback: 429 without Retry-After makes clients retry blindly. Self-check: inspect responses; ensure headers are present and correct.
  • Window mismatch: mixing seconds and milliseconds. Self-check: unit tests asserting exact Retry-After math (example after this list).
  • Monolithic key: one global key throttles everyone. Self-check: keys include identity and optionally route.
  • No weight model: heavy endpoints treated the same as cheap ones. Self-check: define a token cost per endpoint and test it.
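
For instance, a unit test pinning the Retry-After math against the TokenBucket sketch from Example 2 (test name is illustrative):

  import math

  def test_retry_after_math():
      bucket = TokenBucket(rate=10.0, capacity=20.0)
      for _ in range(20):                   # drain the full burst
          assert bucket.try_acquire()[0]
      ok, retry_after = bucket.try_acquire()
      assert not ok
      assert 0 < retry_after <= 0.1         # one token drips in within 1/10 s
      assert math.ceil(retry_after) == 1    # the value you'd send as Retry-After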

Practical projects

  • Build a gateway middleware that enforces both a per-second token bucket and a per-day quota for each tenant.
  • Add rate headers to an existing API and create a small client that adapts using Retry-After and jittered backoff.
  • Protect a fragile downstream (e.g., search) with a leaky bucket at 50 r/s and measure tail latency improvements.
  • Implement a shadow limiter mode that logs would-be 429s for a week, then gradually ramp to hard enforcement.

Learning path

  • Start: implement a fixed window on one endpoint.
  • Next: switch to token bucket with bursts and weights.
  • Then: add distributed atomicity via Redis Lua and introduce shadow mode.
  • Finally: add quotas (daily/monthly), dashboards, and alerts.

Next steps

  • Instrument: expose metrics for allows, throttles, and near-limit events.
  • Prepare runbooks: what SRE/on-call does when throttling spikes.
  • Plan tenant tiers and document client guidance with examples.

Mini challenge

Design a policy for a multi-tenant API:

  • Free: 60/min, burst 30
  • Pro: 600/min, burst 120
  • Enterprise: 3000/min, burst 600

Make the /export endpoint cost 10 tokens and /ping cost 1. What should the headers look like for a Pro user who has just consumed 450 tokens within the last minute? Provide RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset, and a sample Retry-After for a 429 if they exceed the burst.

Assess your knowledge


Practice Exercises

2 exercises to complete

Instructions

Implement a token bucket limiter as a single Redis Lua script to ensure atomicity. Policy: 10 tokens per second (refill rate), capacity 20 tokens (burst). Each request costs 1 token.

  1. Inputs: key (user), now (ms). Stored fields: last_refill_ms, tokens.
  2. Refill tokens based on elapsed time; cap at capacity.
  3. If tokens < 1, return a value that signals throttling and compute Retry-After in seconds (ceil).
  4. On allow, decrement tokens and return remaining plus time-to-full or reset seconds.
  5. Set an expiry slightly larger than the time to refill capacity (e.g., capacity/refill_rate * 2) to avoid orphan keys.

Deliverables
  • Lua script body
  • Example response mapping to HTTP: 200 or 429, with RateLimit headers and Retry-After

Expected Output
A working Lua script that atomically refills and decrements tokens; when 25 requests arrive instantly, ~20 allowed and ~5 rejected with Retry-After ≈ 0–1s. Headers include RateLimit-Limit, RateLimit-Remaining, RateLimit-Reset.
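
One way to check that behavior, assuming your limiter is exposed as a function try_acquire(user_id) returning (allowed, retry_after_seconds) (a hypothetical entry point; adjust to your design):

  allowed = rejected = 0
  for _ in range(25):                       # 25 back-to-back requests
      ok, retry_after = try_acquire("user-123")   # your limiter entry point
      if ok:
          allowed += 1
      else:
          rejected += 1
          assert 0 <= retry_after <= 1      # next token arrives within a second
  print(allowed, rejected)                  # expect roughly 20 / 5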

Rate Limiting Implementation — Quick Test

Test your knowledge with 8 questions. Pass with 70% or higher.

