Rate Limiting Implementation Patterns#

Rate limiting controls how many requests a client can make within a time period. It protects services from overload, ensures fair usage across clients, prevents abuse, and provides a mechanism for graceful degradation under load. Every production API needs rate limiting at some layer.

Algorithm Comparison#

Fixed Window#

The simplest algorithm. Divide time into fixed windows (e.g., 1-minute intervals) and count requests per window. When the count exceeds the limit, reject requests until the next window starts.

Window: 12:00:00 - 12:00:59 | Limit: 100 requests
Request at 12:00:30: counter = 57, allowed
Request at 12:00:45: counter = 101, rejected (429)
Window resets at 12:01:00: counter = 0

Implementation with Redis:

import redis
import time

r = redis.Redis()

def fixed_window_check(client_id: str, limit: int, window_seconds: int) -> bool:
    key = f"ratelimit:{client_id}:{int(time.time()) // window_seconds}"
    pipe = r.pipeline()
    pipe.incr(key)
    pipe.expire(key, window_seconds)
    count, _ = pipe.execute()
    return count <= limit

Problem: The boundary burst. A client can send 100 requests at 12:00:59 and 100 more at 12:01:00, making 200 requests in roughly two seconds while staying within the 100/minute limit in both windows. This is the primary weakness of the fixed window algorithm.
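
To make the failure mode concrete, here is a minimal sketch using the key scheme from fixed_window_check above: timestamps one second apart on either side of a window boundary map to different keys, so each burst starts from a fresh counter.

import time

WINDOW_SECONDS = 60

# Pick the boundary between two adjacent fixed windows.
boundary = (int(time.time()) // WINDOW_SECONDS) * WINDOW_SECONDS

# Same key scheme as fixed_window_check above.
key_last_second = f"ratelimit:client-a:{(boundary - 1) // WINDOW_SECONDS}"
key_first_second = f"ratelimit:client-a:{boundary // WINDOW_SECONDS}"

# The keys differ, so 100 requests in the last second of one window and
# 100 more in the first second of the next are counted separately:
# 200 requests in ~2 seconds pass a 100-requests-per-minute limit.
assert key_last_second != key_first_second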

Sliding Window Log#

Maintain a log of every request timestamp. When a new request arrives, remove entries older than the window duration and count the remaining entries. If the count exceeds the limit, reject the request.

import uuid

def sliding_window_log_check(client_id: str, limit: int, window_seconds: int) -> bool:
    key = f"ratelimit:{client_id}"
    now = time.time()
    window_start = now - window_seconds

    pipe = r.pipeline()
    # Remove entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Add current request; a unique member avoids overwriting an entry when
    # two requests share the same timestamp
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    # Count entries in window (includes the request just added)
    pipe.zcard(key)
    # Set expiry on the whole key
    pipe.expire(key, window_seconds)
    _, _, count, _ = pipe.execute()

    return count <= limit

Advantage: No boundary burst problem. The window slides smoothly with time.

Disadvantage: Memory usage scales with request volume. For a high-traffic API, storing every timestamp is expensive. A client making 10,000 requests per minute requires 10,000 sorted set entries per client.

Sliding Window Counter#

A compromise between fixed window and sliding window log. Use two fixed windows and weight the counts based on how far into the current window the request arrives.

Previous window (12:00 - 12:01): 85 requests
Current window  (12:01 - 12:02): 20 requests so far
Current time: 12:01:15 (25% into current window)

Estimated count = (85 * 0.75) + 20 = 83.75
Limit: 100 -> allowed

def sliding_window_counter_check(
    client_id: str, limit: int, window_seconds: int
) -> bool:
    now = time.time()
    current_window = int(now) // window_seconds
    previous_window = current_window - 1
    elapsed = (now % window_seconds) / window_seconds

    prev_key = f"ratelimit:{client_id}:{previous_window}"
    curr_key = f"ratelimit:{client_id}:{current_window}"

    pipe = r.pipeline()
    pipe.get(prev_key)
    pipe.incr(curr_key)
    pipe.expire(curr_key, window_seconds * 2)
    prev_count, curr_count, _ = pipe.execute()

    prev_count = int(prev_count or 0)
    weighted = prev_count * (1 - elapsed) + curr_count

    return weighted <= limit

Advantage: Smooths the boundary burst, uses only two counters per client (constant memory), and is simple to implement.

Disadvantage: The count is an approximation: it assumes requests in the previous window were spread evenly across it. In practice, that is close enough for rate limiting purposes.

Token Bucket#

The token bucket adds tokens at a fixed rate up to a maximum capacity. Each request consumes one token. If no tokens are available, the request is rejected. This allows bursts up to the bucket capacity while enforcing an average rate.

Bucket capacity: 10 tokens
Refill rate: 2 tokens/second

t=0: bucket=10, request costs 1 -> bucket=9, allowed
t=0: 9 more requests -> bucket=0, allowed
t=0: request -> bucket=0, rejected (no tokens)
t=1: 2 tokens added -> bucket=2
t=1: request -> bucket=1, allowed

def token_bucket_check(
    client_id: str, capacity: int, refill_rate: float
) -> bool:
    key = f"ratelimit:tb:{client_id}"
    now = time.time()

    # Atomic Lua script for token bucket
    lua_script = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local data = redis.call('hmget', key, 'tokens', 'last_refill')
    local tokens = tonumber(data[1]) or capacity
    local last_refill = tonumber(data[2]) or now

    -- Add tokens based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + elapsed * refill_rate)

    if tokens >= 1 then
        tokens = tokens - 1
        redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
        redis.call('expire', key, math.ceil(capacity / refill_rate) * 2)
        return 1
    else
        redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
        redis.call('expire', key, math.ceil(capacity / refill_rate) * 2)
        return 0
    end
    """
    result = r.eval(lua_script, 1, key, capacity, refill_rate, now)
    return result == 1

Advantage: Allows controlled bursts. A client that has been idle accumulates tokens and can burst up to the bucket capacity. This matches real-world traffic patterns better than strict windowed counting.

Disadvantage: Slightly more complex to implement correctly, especially in distributed systems. The Lua script approach ensures atomicity in Redis.

Algorithm Selection Guide#

Algorithm               Burst Handling         Memory           Accuracy     Use Case
Fixed Window            Poor (boundary burst)  O(1) per client  Approximate  Simple internal rate limiting
Sliding Window Log      Excellent              O(n) per client  Exact        Low-volume, high-accuracy needs
Sliding Window Counter  Good                   O(1) per client  Approximate  Production APIs (recommended default)
Token Bucket            Configurable burst     O(1) per client  Exact        APIs with burst allowance

For most production APIs: Use the sliding window counter or token bucket. The sliding window counter is simpler and works well when you want a strict rate. The token bucket is better when you want to allow bursts.

Distributed Rate Limiting with Redis#

In a distributed system with multiple API server instances, rate limiting must be centralized. Redis is the standard choice because it is fast (sub-millisecond operations), supports atomic operations (Lua scripts), and is widely deployed.

Redis Considerations#

Latency: Each rate limit check adds a Redis round-trip (typically 0.5-2ms on the same network). This is acceptable for most APIs but can be significant for ultra-low-latency paths.

Failure mode: When Redis is unavailable, you have two options:

  • Fail open: Allow all requests through. Protects availability but loses rate limiting during Redis outages.
  • Fail closed: Reject all requests. Protects against abuse but causes total outage during Redis failures.

For most services, fail open is the right choice. Rate limiting is a protection mechanism, not a core business function.

import logging

log = logging.getLogger(__name__)

def rate_limit_with_fallback(client_id: str, limit: int) -> bool:
    try:
        return sliding_window_counter_check(client_id, limit, 60)
    except (redis.ConnectionError, redis.TimeoutError):
        # Fail open: allow the request
        log.warning(f"Redis unavailable, skipping rate limit for {client_id}")
        return True

Redis Cluster: For high availability, use Redis Sentinel or Redis Cluster. Redis Cluster partitions keys across shards by hash slot, and operations that touch multiple keys atomically (Lua scripts, MULTI/EXEC transactions) require all keys to hash to the same slot. Keep all of a client's rate limit keys together by using a hash tag: wrap the client ID in braces, e.g. ratelimit:{client_id}:window, so the slot is computed only from the part inside the braces.
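
As a sketch, the sliding-window-counter keys could be built like this (the braces are literal characters in the key; only the text inside them determines the hash slot, so both windows for one client always land on the same shard):

def window_keys(client_id: str, window_seconds: int, now: float) -> tuple[str, str]:
    # Build previous/current window keys that share a Redis Cluster hash tag.
    # The braces are part of the key itself; Redis hashes only the text
    # between them.
    current_window = int(now) // window_seconds
    prev_key = f"ratelimit:{{{client_id}}}:{current_window - 1}"
    curr_key = f"ratelimit:{{{client_id}}}:{current_window}"
    return prev_key, curr_key

# window_keys("client-a", 60, 1708632005.0)
# -> ('ratelimit:{client-a}:28477199', 'ratelimit:{client-a}:28477200')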

API Gateway Rate Limiting#

API gateways handle rate limiting before requests reach your application, reducing load on application servers.

NGINX#

# Define rate limiting zones
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;
limit_req_zone $http_x_api_key    zone=per_key:10m rate=100r/s;

server {
    location /api/ {
        # Allow burst of 20, process excess with delay
        limit_req zone=per_ip burst=20 delay=10;
        limit_req zone=per_key burst=50 nodelay;

        limit_req_status 429;

        proxy_pass http://backend;
    }
}

The burst parameter defines a queue for excess requests. With delay=10, the first 10 excess requests are forwarded immediately, the next 10 are delayed so they drain at the configured rate, and anything beyond 20 excess requests is rejected. With nodelay, all queued burst requests are forwarded immediately, but burst slots are only freed at the configured rate, so a sustained overflow is still rejected.

Kong#

apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: rate-limit
plugin: rate-limiting
config:
  minute: 100
  hour: 5000
  policy: redis
  redis_host: redis.default.svc
  redis_port: 6379
  redis_database: 0
  fault_tolerant: true    # fail open on Redis errors
  hide_client_headers: false

Envoy (Istio)#

Envoy supports local rate limiting (per-pod) and global rate limiting (via an external rate limit service).

# Istio EnvoyFilter for local rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: local-ratelimit
spec:
  workloadSelector:
    labels:
      app: my-service
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            "@type": type.googleapis.com/udpa.type.v1.TypedStruct
            type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
            value:
              stat_prefix: http_local_rate_limiter
              token_bucket:
                max_tokens: 100
                tokens_per_fill: 100
                fill_interval: 60s
              filter_enabled:
                runtime_key: local_rate_limit_enabled
                default_value:
                  numerator: 100
                  denominator: HUNDRED

Rate Limit Response Headers#

The IETF RateLimit header fields (draft-ietf-httpapi-ratelimit-headers) standardize how servers communicate rate limit status to clients. Implement these headers on every rate-limited endpoint.

HTTP/1.1 200 OK
RateLimit-Limit: 100
RateLimit-Remaining: 57
RateLimit-Reset: 1708632000

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1708632000
Retry-After: 30

  • RateLimit-Limit: The maximum number of requests allowed in the window.
  • RateLimit-Remaining: How many requests the client can still make in this window.
  • RateLimit-Reset: Unix timestamp when the window resets (or seconds until reset, depending on implementation).
  • Retry-After: Included on 429 responses. Tells the client how many seconds to wait before retrying.

A minimal Flask handler that sets these headers (check_rate_limit is assumed to be a helper that wraps one of the checks above and returns the allow decision, the remaining quota, and the window reset time):

import time

from flask import Flask, jsonify, make_response, request

app = Flask(__name__)

@app.route('/api/data')
def get_data():
    client_id = request.headers.get('X-API-Key', request.remote_addr)
    limit = 100
    window = 60

    allowed, remaining, reset_at = check_rate_limit(client_id, limit, window)

    if not allowed:
        resp = make_response(jsonify({'error': 'Rate limit exceeded'}), 429)
        resp.headers['Retry-After'] = str(int(reset_at - time.time()))
    else:
        resp = make_response(jsonify({'data': 'value'}))

    resp.headers['RateLimit-Limit'] = str(limit)
    resp.headers['RateLimit-Remaining'] = str(max(0, remaining))
    resp.headers['RateLimit-Reset'] = str(int(reset_at))
    return resp

Graceful Degradation#

Rate limiting should not be a binary allow/deny. Implement tiered degradation that maintains service quality for well-behaved clients while protecting the system.

Priority-Based Rate Limiting#

Assign different limits based on client tier:

RATE_LIMITS = {
    'enterprise': {'requests_per_minute': 10000, 'burst': 500},
    'pro':        {'requests_per_minute': 1000,  'burst': 100},
    'free':       {'requests_per_minute': 60,    'burst': 10},
    'anonymous':  {'requests_per_minute': 10,    'burst': 5},
}
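
One way to wire these tiers into the token bucket check from earlier is a thin lookup wrapper. This is a sketch: check_client and the mapping of burst to bucket capacity are assumptions, not part of the original code.

def check_client(client_id: str, tier: str) -> bool:
    # Map the tier's fields onto token bucket parameters: burst becomes the
    # bucket capacity, and the per-minute rate becomes the refill rate.
    limits = RATE_LIMITS.get(tier, RATE_LIMITS['anonymous'])  # unknown tiers treated as anonymous
    capacity = limits['burst']
    refill_rate = limits['requests_per_minute'] / 60.0  # tokens per second
    return token_bucket_check(client_id, capacity, refill_rate)

# Example: a pro client averages 1000 requests/minute with bursts up to 100.
# check_client("client-a", "pro")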

Endpoint-Specific Limits#

Not all endpoints cost the same. A search endpoint that hits the database is more expensive than a cached GET endpoint.

ENDPOINT_COSTS = {
    'GET /api/users/{id}':   1,   # cheap: cached
    'GET /api/search':       5,   # expensive: database query
    'POST /api/reports':     20,  # very expensive: background job
}

Deduct the endpoint cost from the token bucket instead of always deducting 1. This prevents expensive endpoints from being called as frequently as cheap ones.
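
A sketch of how the token bucket Lua script above could be adapted to deduct a per-request cost. COST_AWARE_BUCKET and cost_aware_check are hypothetical names; unknown endpoints default to a cost of 1.

COST_AWARE_BUCKET = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])

local data = redis.call('hmget', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Refill based on elapsed time, capped at capacity
tokens = math.min(capacity, tokens + (now - last_refill) * refill_rate)

-- Allow only if the bucket holds at least `cost` tokens
local allowed = 0
if tokens >= cost then
    tokens = tokens - cost
    allowed = 1
end
redis.call('hmset', key, 'tokens', tokens, 'last_refill', now)
redis.call('expire', key, math.ceil(capacity / refill_rate) * 2)
return allowed
"""

def cost_aware_check(client_id: str, endpoint: str, capacity: int, refill_rate: float) -> bool:
    cost = ENDPOINT_COSTS.get(endpoint, 1)  # unknown endpoints cost 1
    key = f"ratelimit:tb:{client_id}"
    result = r.eval(COST_AWARE_BUCKET, 1, key, capacity, refill_rate, time.time(), cost)
    return result == 1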

Shedding Strategy Under Global Overload#

When the system is under global load (not just one client exceeding limits), implement progressive shedding:

  1. Level 1 (80% capacity): Reduce limits for anonymous/free-tier clients by 50%.
  2. Level 2 (90% capacity): Reject all anonymous requests. Reduce free-tier to 10% of normal.
  3. Level 3 (95% capacity): Reject all free-tier requests. Reduce pro-tier by 50%.
  4. Level 4 (99% capacity): Serve only enterprise clients and health checks.

def adjusted_rate_limit(client_tier: str, system_load: float) -> int:
    base_limit = RATE_LIMITS[client_tier]['requests_per_minute']

    if system_load < 0.8:
        return base_limit
    elif system_load < 0.9:
        # Level 1: halve anonymous and free-tier limits
        if client_tier in ('anonymous', 'free'):
            return base_limit // 2
        return base_limit
    elif system_load < 0.95:
        # Level 2: reject anonymous, free tier at 10% of normal
        if client_tier == 'anonymous':
            return 0
        if client_tier == 'free':
            return base_limit // 10
        return base_limit
    elif system_load < 0.99:
        # Level 3: reject anonymous and free, halve pro
        if client_tier in ('anonymous', 'free'):
            return 0
        if client_tier == 'pro':
            return base_limit // 2
        return base_limit
    else:
        # Level 4: enterprise only (health checks are handled separately)
        return base_limit if client_tier == 'enterprise' else 0

Practical Checklist for Agents#

When implementing rate limiting for a service:

  1. Choose the algorithm. Sliding window counter for strict rate enforcement, token bucket for burst-friendly APIs.
  2. Choose the enforcement layer. API gateway for external-facing rate limits, application-level for business logic rate limits (e.g., per-user actions).
  3. Use Redis for distributed rate limiting with Lua scripts for atomicity.
  4. Always return RateLimit headers on every response, not just 429s. Clients need to know their remaining quota.
  5. Include Retry-After on 429 responses so well-behaved clients back off correctly.
  6. Fail open when Redis is unavailable unless the service handles payments or security-critical operations.
  7. Log rate limit events with client ID, endpoint, and current count for debugging and capacity planning.
  8. Test with load generators (k6, wrk, vegeta) to verify limits work under actual concurrent traffic.