# Rate Limiting Implementation Patterns
Rate limiting controls how many requests a client can make within a time period. It protects services from overload, ensures fair usage across clients, prevents abuse, and provides a mechanism for graceful degradation under load. Every production API needs rate limiting at some layer.
## Algorithm Comparison
### Fixed Window
The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals) and count requests per window. When the count exceeds the limit, reject requests until the next window starts.
```
Window: 12:00:00 - 12:00:59 | Limit: 100 requests
Request at 12:00:30: counter = 57, allowed
Request at 12:00:45: counter = 101, rejected (429)
Window resets at 12:01:00: counter = 0
```
Implementation with Redis:
```python
import redis
import time

r = redis.Redis()

def fixed_window_check(client_id: str, limit: int, window_seconds: int) -> bool:
    # Key is scoped to the current window index, e.g. ratelimit:<client>:28477201
    key = f"ratelimit:{client_id}:{int(time.time()) // window_seconds}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    count, _ = pipe.execute()
    return count <= limit
```
Problem: the boundary burst. A client can send 100 requests at 12:00:59 and 100 more at 12:01:00, effectively making 200 requests in two seconds while staying within the 100/minute limit in both windows. This is the primary weakness of the fixed window algorithm.
### Sliding Window Log
Maintain a log of every request timestamp. When a new request arrives, remove entries older than the window duration and count the remaining entries. If the count exceeds the limit, reject the request.
```python
def sliding_window_log_check(client_id: str, limit: int, window_seconds: int) -> bool:
    key = f"ratelimit:{client_id}"
    now = time.time()
    window_start = now - window_seconds
    pipe = r.pipeline()
    # Remove entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Add current request
    pipe.zadd(key, {str(now): now})
    # Count entries in window
    pipe.zcard(key)
    # Set expiry on the whole key
    pipe.expire(key, window_seconds)
    _, _, count, _ = pipe.execute()
    return count <= limit
```
Advantage: No boundary burst problem. The window slides smoothly with time.
Disadvantage: Memory usage scales with request volume. For a high-traffic API, storing every timestamp is expensive. A client making 10,000 requests per minute requires 10,000 sorted set entries per client.
### Sliding Window Counter
A compromise between fixed window and sliding window log. Use two fixed windows and weight the counts based on how far into the current window the request arrives.
```
Previous window (12:00 - 12:01): 85 requests
Current window (12:01 - 12:02): 20 requests so far
Current time: 12:01:15 (25% into current window)
Estimated count = (85 * 0.75) + 20 = 83.75
Limit: 100 -> allowed
```
```python
def sliding_window_counter_check(
    client_id: str, limit: int, window_seconds: int
) -> bool:
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
    elapsed = (now % window_seconds) / window_seconds
    prev_key = f"ratelimit:{client_id}:{previous_window}"
    curr_key = f"ratelimit:{client_id}:{current_window}"
    pipe = r.pipeline()
    pipe.get(prev_key)
    pipe.incr(curr_key)
    pipe.expire(curr_key, window_seconds * 2)
    prev_count, curr_count, _ = pipe.execute()
    prev_count = int(prev_count or 0)
    # Weight the previous window by the fraction of it still inside the sliding window
    weighted = prev_count * (1 - elapsed) + curr_count
    return weighted <= limit
```
Advantage: Smooths the boundary burst, uses only two counters per client (constant memory), and is simple to implement.
Disadvantage: The count is an approximation, since it assumes requests in the previous window were evenly distributed over time. In practice, the approximation is close enough for rate limiting purposes.
### Token Bucket
The token bucket adds tokens at a fixed rate up to a maximum capacity. Each request consumes one token. If no tokens are available, the request is rejected. This allows bursts up to the bucket capacity while enforcing an average rate.
```
Bucket capacity: 10 tokens
Refill rate: 2 tokens/second
t=0: bucket=10, request costs 1 -> bucket=9, allowed
t=0: 9 more requests -> bucket=0, allowed
t=0: request -> bucket=0, rejected (no tokens)
t=1: 2 tokens added -> bucket=2
t=1: request -> bucket=1, allowed
```
```python
def token_bucket_check(
    client_id: str, capacity: int, refill_rate: float
) -> bool:
    key = f"ratelimit:tb:{client_id}"
    now = time.time()
    # Atomic Lua script for token bucket
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])
    local data = redis.call('hmget', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or capacity
    local last_refill = tonumber(data[2]) or now
    -- Add tokens based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)
    if tokens >= 1 then
        tokens = tokens - 1
        redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
        redis.call('expire', key, math.ceil(capacity / refill_rate) * 2)
        return 1
    else
        redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
        redis.call('expire', key, math.ceil(capacity / refill_rate) * 2)
        return 0
    end
    """
    result = r.eval(lua_script, 1, key, capacity, refill_rate, now)
    return result == 1
```
Advantage: Allows controlled bursts. A client that has been idle accumulates tokens and can burst up to the bucket capacity. This matches real-world traffic patterns better than strict windowed counting.
Disadvantage: Slightly more complex to implement correctly, especially in distributed systems. The Lua script approach ensures atomicity in Redis.
## Algorithm Selection Guide
| Algorithm | Burst Handling | Memory | Accuracy | Use Case |
|---|---|---|---|---|
| Fixed Window | Poor (boundary burst) | O(1) per client | Approximate | Simple internal rate limiting |
| Sliding Window Log | Excellent | O(n) per client | Exact | Low-volume, high-accuracy needs |
| Sliding Window Counter | Good | O(1) per client | Approximate | Production APIs (recommended default) |
| Token Bucket | Configurable burst | O(1) per client | Exact | APIs with burst allowance |
For most production APIs: Use the sliding window counter or token bucket. The sliding window counter is simpler and works well when you want a strict rate. The token bucket is better when you want to allow bursts.
## Distributed Rate Limiting with Redis
In a distributed system with multiple API server instances, rate limiting must be centralized. Redis is the standard choice because it is fast (sub-millisecond operations), supports atomic operations (Lua scripts), and is widely deployed.
### Redis Considerations
Latency: Each rate limit check adds a Redis round-trip (typically 0.5-2ms on the same network). This is acceptable for most APIs but can be significant for ultra-low-latency paths.
Failure mode: When Redis is unavailable, you have two options:
- Fail open: Allow all requests through. Protects availability but loses rate limiting during Redis outages.
- Fail closed: Reject all requests. Protects against abuse but causes total outage during Redis failures.
For most services, fail open is the right choice. Rate limiting is a protection mechanism, not a core business function.
```python
import logging

log = logging.getLogger(__name__)

def rate_limit_with_fallback(client_id: str, limit: int) -> bool:
    try:
        return sliding_window_counter_check(client_id, limit, 60)
    except redis.ConnectionError:
        # Fail open: allow the request
        log.warning(f"Redis unavailable, skipping rate limit for {client_id}")
        return True
```
Redis Cluster: For high availability, use Redis Sentinel or Redis Cluster. With Redis Cluster, ensure the rate limit keys for the same client hash to the same shard by using hash tags, e.g. `{client_id}:ratelimit`.
## API Gateway Rate Limiting
API gateways handle rate limiting before requests reach your application, reducing load on application servers.
### NGINX
```nginx
# Define rate limiting zones
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
limit_req_zone $http_x_api_key zone=per_key:10m rate=100r/s;

server {
    location /api/ {
        # Allow burst of 20, process excess with delay
        limit_req zone=per_ip burst=20 delay=10;
        limit_req zone=per_key burst=50 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}
```
The `burst` parameter defines a queue. With `delay=10`, the first 10 excess requests are processed immediately, the next 10 are delayed, and anything beyond 20 excess requests is rejected. With `nodelay`, all burst requests are processed immediately, but the burst allowance refills only at the base rate.
### Kong
```yaml
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limit
plugin: rate-limiting
config:
  minute: 100
  hour: 5000
  policy: redis
  redis_host: redis.default.svc
  redis_port: 6379
  redis_database: 0
  fault_tolerant: true  # fail open on Redis errors
  hide_client_headers: false
```
### Envoy (Istio)
Envoy supports local rate limiting (per-pod) and global rate limiting (via an external rate limit service).
```yaml
# Istio EnvoyFilter for local rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: local-ratelimit
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.filters.http.local_ratelimit
        typed_config:
          "@type": type.googleapis.com/udpa.type.v1.TypedStruct
          type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
          value:
            stat_prefix: http_local_rate_limiter
            token_bucket:
              max_tokens: 100
              tokens_per_fill: 100
              fill_interval: 60s
            filter_enabled:
              runtime_key: local_rate_limit_enabled
              default_value:
                numerator: 100
                denominator: HUNDRED
```
## Rate Limit Response Headers
The IETF RateLimit header fields (draft-ietf-httpapi-ratelimit-headers) standardize how servers communicate rate limit status to clients. Implement these headers on every rate-limited endpoint.
```
HTTP/1.1 200 OK
RateLimit-Limit: 100
RateLimit-Remaining: 57
RateLimit-Reset: 1708632000

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1708632000
Retry-After: 30
```
- RateLimit-Limit: The maximum number of requests allowed in the window.
- RateLimit-Remaining: How many requests the client can still make in this window.
- RateLimit-Reset: Unix timestamp when the window resets (or seconds until reset, depending on implementation).
- Retry-After: Included on 429 responses. Tells the client how many seconds to wait before retrying.
```python
import time

from flask import Flask, jsonify, make_response, request

app = Flask(__name__)

@app.route('/api/data')
def get_data():
    client_id = request.headers.get('X-API-Key', request.remote_addr)
    limit = 100
    window = 60
    # check_rate_limit wraps one of the algorithms above and returns
    # (allowed, remaining, reset_at)
    allowed, remaining, reset_at = check_rate_limit(client_id, limit, window)
    if not allowed:
        resp = make_response(jsonify({'error': 'Rate limit exceeded'}), 429)
        resp.headers['Retry-After'] = str(int(reset_at - time.time()))
    else:
        resp = make_response(jsonify({'data': 'value'}))
    resp.headers['RateLimit-Limit'] = str(limit)
    resp.headers['RateLimit-Remaining'] = str(max(0, remaining))
    resp.headers['RateLimit-Reset'] = str(int(reset_at))
    return resp
```
## Graceful Degradation
Rate limiting should not be a binary allow/deny. Implement tiered degradation that maintains service quality for well-behaved clients while protecting the system.
### Priority-Based Rate Limiting
Assign different limits based on client tier:
```python
RATE_LIMITS = {
    'enterprise': {'requests_per_minute': 10000, 'burst': 500},
    'pro': {'requests_per_minute': 1000, 'burst': 100},
    'free': {'requests_per_minute': 60, 'burst': 10},
    'anonymous': {'requests_per_minute': 10, 'burst': 5},
}
```
### Endpoint-Specific Limits
Not all endpoints cost the same. A search endpoint that hits the database is more expensive than a cached GET endpoint.
```python
ENDPOINT_COSTS = {
    'GET /api/users/{id}': 1,   # cheap: cached
    'GET /api/search': 5,       # expensive: database query
    'POST /api/reports': 20,    # very expensive: background job
}
```
Deduct the endpoint cost from the token bucket instead of always deducting 1. This prevents expensive endpoints from being called as frequently as cheap ones.
### Shedding Strategy Under Global Overload
When the system is under global load (not just one client exceeding limits), implement progressive shedding:
- Level 1 (80% capacity): Reduce limits for anonymous/free-tier clients by 50%.
- Level 2 (90% capacity): Reject all anonymous requests. Reduce free-tier to 10% of normal.
- Level 3 (95% capacity): Reject all free-tier requests. Reduce pro-tier by 50%.
- Level 4 (99% capacity): Serve only enterprise clients and health checks.
```python
def adjusted_rate_limit(client_tier: str, system_load: float) -> int:
    base_limit = RATE_LIMITS[client_tier]['requests_per_minute']
    if system_load < 0.8:
        return base_limit
    elif system_load < 0.9:
        if client_tier == 'anonymous':
            return base_limit // 2
        return base_limit
    elif system_load < 0.95:
        if client_tier == 'anonymous':
            return 0
        if client_tier == 'free':
            return base_limit // 10
        return base_limit
    else:
        # Level 4 (>= 99% capacity): serve only enterprise clients
        if system_load >= 0.99:
            return base_limit if client_tier == 'enterprise' else 0
        if client_tier in ('anonymous', 'free'):
            return 0
        if client_tier == 'pro':
            return base_limit // 2
        return base_limit
```
## Practical Checklist for Agents
When implementing rate limiting for a service:
- Choose the algorithm. Sliding window counter for strict rate enforcement, token bucket for burst-friendly APIs.
- Choose the enforcement layer. API gateway for external-facing rate limits, application-level for business logic rate limits (e.g., per-user actions).
- Use Redis for distributed rate limiting with Lua scripts for atomicity.
- Always return RateLimit headers on every response, not just 429s. Clients need to know their remaining quota.
- Include Retry-After on 429 responses so well-behaved clients back off correctly.
- Fail open when Redis is unavailable unless the service handles payments or security-critical operations.
- Log rate limit events with client ID, endpoint, and current count for debugging and capacity planning.
- Test with load generators (k6, wrk, vegeta) to verify limits work under actual concurrent traffic.