Infrastructure•Intermediate•36 min read

Design a Rate Limiter

Asked at:GoogleAmazonStripeCloudflareNetflix

Tech:ShardingConsistencyAvailabilityPub/subAPI gatewayRate limiting

The Protector That Must Outrun the Flood

Every system has a breaking point. A database that handles 10,000 queries per second collapses at 50,000. An API server that serves 200 ms responses at normal load starts timing out at 3× traffic. The rate limiter exists to prevent that collapse — to stand at the gate and decide, for every single request, whether to let it through or turn it away.

Here is the paradox: the system protecting you from overload must itself handle more load than anything it protects. If your API processes 200,000 requests per second on average and peaks at 1 million, the rate limiter must make 1 million decisions per second — each in under 5 milliseconds — without becoming the bottleneck it was designed to prevent. If the limiter is slow, every request is slow. If the limiter is down, you choose between two bad options: let everything through (no protection) or block everything (total outage).

This post walks through designing a distributed rate limiter from first principles. We will start with the algorithm (how to decide allow or deny), build the storage layer (where to keep state), add distributed coordination (how to enforce limits across multiple servers), and harden for failure (what happens when the limiter itself breaks). Every decision will be driven by the numbers: 1 million decisions per second, 5 ms latency budget, 10 million active keys.

Let us begin.

1. Requirements — What the Limiter Must Enforce

A rate limiter is infrastructure, not a product. It sits in the request path — typically in the API gateway or as a middleware — and makes a binary decision for every request: allow or deny. The decision must be fast, accurate, and consistent across all servers handling traffic for the same client.

Functional Requirements

Enforce per-key rate limits. Each request is identified by a key — a user ID, an API key, an IP address, or a composite — and the limiter enforces a configured maximum request rate for that key. Different keys can have different limits.
Support both burst tolerance and sustained-rate control. Real clients are bursty. A mobile app loading a screen might fire 15 requests in 200 milliseconds, then go quiet. The limiter should allow short bursts while enforcing an average rate over time — not punish a legitimate page load.
Return allow/deny decisions in real time. Every request needs a decision before it reaches the downstream service. The decision must include enough metadata (remaining budget, reset time) for clients to implement intelligent backoff.

Non-Functional Requirements

Decision latency p95 under 5 ms on the server side. The limiter adds overhead to every request. At 5 ms, it is invisible inside a 200 ms API response. At 50 ms, it doubles a fast endpoint's latency.
High availability. The limiter must not be a single point of failure. If the limiter goes down, the system must have a defined fallback behavior — not an undefined one.
Accuracy close to configured policy. If the policy says 100 requests per minute, actual enforcement should be within a small margin — not 200 because of distributed counting errors.
Horizontal scalability to 1 million decisions per second. The limiter must scale with the traffic it protects.
Tenant isolation. One noisy client hammering the limiter should not degrade decision quality or latency for other clients.

Scope Control

In scope: rate limiting algorithms, distributed state management, failure modes, deployment topology.

Out of scope: billing integration, user-facing analytics dashboards, long-term usage reporting.

Now that we know what the limiter must do, we need to understand the scale it must handle. These numbers will determine whether a simple counter suffices or we need a distributed, sharded state store.

Login to continue reading

You reached the preview limit. Sign in to unlock the remaining sections.

2. Capacity Estimation — The Numbers That Determine Everything

Throughput

Protected API traffic: 200,000 requests/s average
Peak multiplier: 5× (traffic spikes, batch jobs, viral events)
Peak decisions per second: 1,000,000/s
Unique active keys in a peak minute: ~10 million (user IDs, API keys, IPs)
Common policy example: 100 requests per minute per key

Every single API request requires one rate limit check. The limiter is the single highest-QPS component in the entire system.

State Storage

The state we need to store depends on the algorithm. Let us compare two approaches:

Sliding window log (per-request timestamps): stores the timestamp of every request within the window. For a 1-minute window at 1M requests/s: 60 million timestamps × 16 bytes each = ~960 MB per minute of history. For multi-minute windows or multiple policies, this quickly becomes gigabytes. At 1M lookups/s, scanning and pruning these timestamp lists adds significant CPU overhead.

Token bucket (counter-based): stores one small state object per active key: current token count (8 bytes), last refill timestamp (8 bytes), policy reference (8 bytes), key identifier (32 bytes), plus overhead (~8 bytes). Total: ~64 bytes per key.

Let us break that down: key hash (32 bytes for a SHA-256 of the composite key), tokens remaining (8 bytes float64), lastRefillTs (8 bytes epoch millis), policyId reference (8 bytes), padding/overhead (8 bytes). Total: 64 bytes per active key.

For 10 million active keys: 10M × 64 bytes = ~640 MB of working state

Decision: counter/state-based algorithms (token bucket) over per-request log algorithms. 640 MB is a comfortable working set for an in-memory store. 960 MB per minute of sliding log data, growing with window size, is impractical at this scale.

Redis Shard Sizing

If we use Redis as our shared state store (we will justify this shortly), we need to know how many shards we need. A single Redis instance executing Lua scripts (which our token bucket implementation requires for atomicity) can safely sustain roughly 100,000-150,000 script operations per second at sub-5ms p95 latency. This range depends on script complexity, payload size, and the specific hardware — it is a planning estimate, not a universal constant, and must be validated with load testing against your actual command mix.

Using 120,000 ops/s as our planning baseline:

Peak decisions: 1,000,000/s
Shards needed: 1,000,000 ÷ 120,000 ≈ 9 shards minimum
With headroom for failover and hot keys: 12 shards

With these numbers in hand, we can design the data model.

3. Core Entities — What the Limiter Stores

A rate limiter's data model is unusually small — just three entities — but each one is on the hottest possible path. The RateState entity is read and written on every single API request.

RatePolicy

code

RatePolicy(policyId, keyType, algorithm, ratePerSec, burst, windowSec, failMode)

This is the configuration that defines a rate limit rule. It is set by operators and changes infrequently.

keyType determines how the request is identified: userId, apiKey, ip, or a composite like userId:endpoint. Different endpoints need different fairness models — a login endpoint might limit by IP (to prevent credential stuffing), while a data API limits by API key (to enforce tier-based quotas).

ratePerSec and burst define the throttle contract for token bucket policies. ratePerSec: 1.67 (100 per minute) means tokens refill at 1.67 per second. burst: 20 means the bucket can hold at most 20 tokens, allowing a short spike of 20 rapid requests before throttling kicks in. Why both? Without burst, a client sending 5 requests in one second to load a dashboard would be denied 3 of them even though their average rate is well within limits. Burst accommodates legitimate bursty behavior.

failMode specifies what happens when the limiter backend is unavailable: OPEN (allow the request — prioritize availability) or CLOSED (deny the request — prioritize safety). An auth endpoint might use CLOSED (better to briefly block logins than let a credential-stuffing attack through). A read-only search endpoint might use OPEN (better to serve potentially over-limit searches than return errors).

RateState

code

RateState(key, policyId, tokens, lastRefillTs)

This is the mutable runtime state — one record per active key. It lives in Redis and is read+updated atomically on every rate limit check.

tokens is the current token count in the bucket. When a request arrives, the limiter first refills tokens based on elapsed time since lastRefillTs, then attempts to consume one token. If tokens >= 1, the request is allowed and tokens is decremented. If tokens < 1, the request is denied.

lastRefillTs anchors the refill calculation. Tokens are not refilled on a timer — they are refilled lazily when the next request arrives. The refill amount is (now - lastRefillTs) × ratePerSec, capped at burst. This lazy refill means idle keys cost zero processing — no background timers, no wasted work.

DecisionSample

code

DecisionSample(key, policyId, ts, allowed, remaining)

At 1 million decisions per second, logging every decision would generate ~64 GB/day of log data (64 bytes per log × 1M/s × 86,400s). That is impractical and unnecessary. Instead, we sample: log 1% of decisions (10,000/s) which provides enough data for debugging deny spikes, tuning policies, and identifying hot keys — without the storage cost.

Connecting Entities to Requirements

FR1 (enforce per-key limits) → RatePolicy defines the limit; RateState tracks consumption per key.
FR2 (burst + sustained control) → RatePolicy.burst caps immediate availability; RatePolicy.ratePerSec controls refill; RateState.tokens tracks remaining budget.
FR3 (real-time decisions) → RateState is the hot-path data; the response includes remaining (current tokens) and resetAt (when tokens refill to a useful level).

With entities defined, let us design the API.

4. API Design — One Decision Per Request

The rate limiter's API is the smallest and most latency-sensitive of any system we have designed. It is called on every protected request, so every byte of payload and every millisecond of latency matters.

The Decision API

code

isAllowed(key: string, policyId: string, now: timestamp)
  → { allowed: bool, remaining: int, retryAfter: seconds, resetAt: timestamp }

This is an internal RPC call, not an HTTP endpoint exposed to end users. The limiter runs as a middleware in the API gateway or as a sidecar service — it is infrastructure, not a user-facing API.

key identifies the throttle subject. The caller (gateway/middleware) constructs the key from the request context: for a per-user limit, the key might be user:u789; for a per-endpoint limit, user:u789:/v1/search. The limiter does not parse request headers to derive keys — the caller has the auth context and routing information needed to build the composite key.

policyId selects which rate limit policy to apply. A single user might have different limits on different endpoints: 100/min on /v1/search, 10/min on /v1/export. The caller knows which endpoint was hit and which policy applies.

now is an explicit timestamp parameter. In production, this is the current server time. But passing it explicitly enables deterministic unit testing (advance the clock to test refill behavior) and time-simulation in load tests.

Response Fields and HTTP Headers

When the limiter denies a request, the gateway returns 429 Too Many Requests (as defined by RFC 6585) with standard rate limit headers:

code

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1710412400
Retry-After: 12

X-RateLimit-Limit — the configured limit for this key/policy. Tells the client what their budget is.

X-RateLimit-Remaining — how many requests remain in the current window. Enables clients to self-throttle before hitting the limit.

X-RateLimit-Reset — Unix timestamp when the budget resets (or meaningful tokens become available). Enables precise backoff.

Retry-After — seconds until a retry is likely to succeed. Prevents clients from hammering the 429 endpoint.

These headers are returned on every response, not just 429s. This gives well-behaved clients continuous visibility into their budget, reducing the number of requests that actually hit the limit.

Management APIs (Control Plane)

code

PUT  /v1/policies/{policyId}   — create or update a rate limit policy
GET  /v1/policies/{policyId}   — read the current policy configuration

These are separate from the decision API — they are control plane operations, called by operators, not by every request. Policy changes propagate to limiter nodes via a push mechanism (we will design this in Step D).

Mapping APIs to Requirements

FR1 → isAllowed(key, policyId, now) applies the correct policy to the correct key.
FR2 → response remaining and resetAt expose burst and sustained-rate state to clients.
FR3 → the entire decision API is designed for the hot path: minimal payload, sub-5ms latency target.

With the API defined, we can build the system that implements it.

5. High-Level Design — Building the Limiter Step by Step

Step A: The Token Bucket Algorithm

Before we build any infrastructure, we need to understand the algorithm that powers every decision. The token bucket is the intellectual core of this system — if you understand it, the rest of the design follows naturally.

How it works: imagine a bucket that holds tokens. The bucket has a maximum capacity (the burst parameter). Tokens are added to the bucket at a steady rate (the ratePerSec parameter). When a request arrives, it must take one token from the bucket. If a token is available, the request is allowed and the token is consumed. If the bucket is empty, the request is denied.

Concrete example: a policy of 100 requests/minute with burst: 20.

The refill rate is 100 ÷ 60 ≈ 1.67 tokens per second. The bucket capacity is 20 tokens. At startup, the bucket is full (20 tokens). A client fires 15 requests in rapid succession — all 15 are allowed (15 tokens consumed, 5 remaining). The client pauses for 6 seconds. During those 6 seconds, 6 × 1.67 ≈ 10 tokens are refilled, bringing the bucket to 15 tokens. The client fires another burst of 12 — all allowed (15 - 12 = 3 remaining). The client then sustains a steady rate of 2 requests per second. Since refill is 1.67/s and consumption is 2/s, the bucket slowly drains. After about 9 seconds, the bucket empties and the client starts receiving 429s until the rate drops below 1.67/s.

Why not fixed window? A fixed-window counter (e.g., "100 requests allowed between 12:00 and 12:01") has a well-known boundary exploit: a client sends 100 requests at 12:00:59 and 100 more at 12:01:00. Both are within their respective windows, but the client effectively sent 200 requests in 2 seconds — double the intended rate. Token bucket has no window boundaries to exploit; tokens refill continuously.

Why not sliding window log? A sliding window log stores the timestamp of every request within the window and counts them on each check. This gives perfect accuracy but costs O(n) memory per key (where n is the number of requests in the window). At 1M requests/s with a 60-second window, that is 60M timestamps in memory — nearly 1 GB just for one minute of data. Token bucket uses O(1) memory per key: 64 bytes regardless of traffic volume.

Why not sliding window counter? The sliding window counter is an elegant approximation: it keeps a count for the current fixed window and the previous fixed window, then computes a weighted combination based on how far through the current window we are. For example, if 30 seconds into a 60-second window, the effective count is previousWindow × 0.5 + currentWindow × 1.0. This gives better boundary behavior than fixed windows with O(1) memory. It is a valid choice — but it only controls sustained rate, not burst. Token bucket gives us both burst tolerance (via the bucket capacity) and sustained-rate control (via the refill rate) in a single mechanism.

The atomic Lua script: in a distributed system where multiple API gateway instances check the same key simultaneously, the token bucket update must be atomic — read the current state, compute refill, check token availability, consume a token, and write back — all as one indivisible operation. If two requests read the same state concurrently, both might see 1 token remaining and both consume it, allowing 2 requests when only 1 should pass.

Redis Lua scripts solve this. Redis executes Lua scripts atomically — no other command can interleave during script execution. Here is the logic:

code

-- Lua script executing atomically in Redis
local tokens = tonumber(redis.call('HGET', key, 'tokens') or burst)
local lastRefill = tonumber(redis.call('HGET', key, 'lastRefillTs') or now)
local elapsed = now - lastRefill
local refill = math.min(burst, tokens + elapsed * ratePerSec)
if refill >= 1 then
    redis.call('HSET', key, 'tokens', refill - 1, 'lastRefillTs', now)
    return {1, math.floor(refill - 1)}  -- allowed, remaining
else
    redis.call('HSET', key, 'lastRefillTs', now)
    return {0, 0}  -- denied, remaining=0
end

This script runs in ~0.1 ms per execution inside Redis — fast enough for 120,000+ executions per second per shard.

Why Redis and not Memcached? Memcached is a fast in-memory store, but it does not support server-side scripting. The token bucket update requires an atomic read-compute-write cycle — read current tokens, compute refill, decide allow/deny, write updated tokens. In Memcached, the client would read, compute locally, and write — but between the read and write, another instance could read the same stale value. Memcached's incr/decr commands are atomic but only support simple counter operations, not the conditional refill logic our algorithm needs. Redis Lua scripts provide exactly the atomicity we require.

Why not DynamoDB or Postgres? Both support atomic conditional updates, but at 1 million operations per second, both add latency that violates our 5 ms budget. DynamoDB conditional writes average 5-10 ms; Postgres transactions involve disk I/O. Redis operations complete in ~0.1 ms because everything is in memory with no disk persistence on the hot path. For a component that must add less than 2 ms to every API call, in-memory is not optional — it is mandatory.

code

┌──────────┐     ┌──────────────┐     ┌─────────────────┐
│  Client   │────▶│  API Gateway │────▶│  Rate Limiter   │
│           │◀────│  (middleware) │◀────│  (Lua script    │
└──────────┘     └──────────────┘     │   in Redis)     │
                  allow → forward                        │
                  deny  → 429        └─────────────────┘

What this step gives us: a correct, atomic rate limiting algorithm that handles burst and sustained rate with O(1) memory per key. But it runs on a single Redis instance — which cannot handle 1M decisions/s. That is next.

Step B: Distributed State Across Redis Shards

A single Redis instance handles ~120,000 Lua script operations per second. We need 1,000,000. The solution is sharding: distribute keys across multiple Redis instances so each handles a fraction of the total load.

code

┌──────────────┐
│  API Gateway │
│  (any of N   │
│   instances) │
└──────┬───────┘
       │ isAllowed(key, policyId, now)
       ▼
┌──────────────┐     ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│  Limiter     │────▶│Redis 1│ │Redis 2│ │Redis 3│ │...    │
│  Client      │     │(shard)│ │(shard)│ │(shard)│ │Redis12│
│  (key →      │     └───────┘ └───────┘ └───────┘ └───────┘
│   shard map) │
└──────────────┘
   key hash → shard assignment

Key-to-shard mapping: the limiter client computes a consistent hash of the key and maps it to one of 12 shards. All gateway instances use the same hash function, so the same key always hits the same shard — ensuring that token state is centralized per key.

Why consistent hashing and not simple hash-mod? If we use hash(key) % 12 and later add a 13th shard, nearly every key remaps to a different shard, losing its rate state. With consistent hashing, adding a shard only remaps ~1/13th of keys — minimal disruption, and the affected keys simply start with fresh token buckets (a brief period of reduced enforcement accuracy, not a correctness failure).

12 shards across availability zones: we deploy shards across multiple availability zones so that a single AZ failure does not take down the limiter. Each shard has a replica in a different AZ for failover. During normal operation, only the primary serves traffic. If a primary fails, the replica promotes within seconds.

What this step gives us: distributed rate limiting that scales to 1M decisions/s across 12 shards. But every decision still requires a network round-trip to Redis. At 1M decisions/s, that is 1 million network calls per second from the gateway fleet. Can we reduce this? Yes — with a local tier.

Step C: Two-Tier Limiting — Local + Global

The most impactful optimization for a rate limiter is adding a local in-process counter in each API gateway instance. This local tier absorbs bursts and reduces Redis traffic dramatically.

code

┌──────────────────────────────────────────┐
│               API Gateway                 │
│                                           │
│  ┌─────────────┐     ┌────────────────┐  │
│  │ Local Rate  │────▶│ Global Rate    │  │
│  │ Limiter     │     │ Limiter        │  │
│  │ (in-process │     │ (Redis call)   │  │
│  │  counter)   │     │                │  │
│  └─────────────┘     └────────────────┘  │
│   Fast path:          Slow path:          │
│   ~0.01ms             ~1-2ms              │
│   Catches obvious     Enforces exact      │
│   over-limit          distributed limit   │
└──────────────────────────────────────────┘

How two-tier works: each gateway instance maintains a local token bucket per key with a fraction of the global limit. If the global limit is 100/min across all instances, and there are 20 gateway instances, each local bucket gets a quota of 5/min. When a request arrives, the local limiter checks first. If the local bucket has tokens, the request is allowed without touching Redis — a decision in ~0.01 ms (in-process memory access) instead of ~1-2 ms (network round-trip to Redis).

If the local bucket is empty, the gateway calls the global Redis limiter for the definitive decision. This two-tier approach means the vast majority of requests — especially from normal-rate clients — are decided locally, and only edge cases (near-limit clients, new keys, bucket refills) hit Redis.

How much does this reduce Redis load? In practice, 60-80% of decisions are absorbed by the local tier. At 1M peak decisions/s, this reduces Redis traffic to 200,000-400,000 ops/s — meaning we might need only 4-6 shards instead of 12. We keep 12 for headroom, which gives us generous capacity for hot keys and growth.

The accuracy tradeoff: local counters are not globally synchronized in real-time. If a key's traffic is spread across 20 gateway instances, each instance might allow up to its local quota before the global check catches the overflow. The effective over-admission is bounded: in the worst case, a key can exceed its limit by localQuota × instanceCount before global enforcement kicks in. For a limit of 100/min with 20 instances each holding 5/min local quota, the worst-case over-admission is 100 extra requests (each instance allows 5 before reporting to global). In practice, the over-admission is much smaller because traffic is not perfectly distributed, and the global check happens frequently.

This level of inaccuracy is acceptable for most rate limiting use cases. For the rare cases where exact enforcement matters (financial API rate limits, for example), the local tier can be bypassed and every request goes directly to the global Redis check.

What this step gives us: a two-tier limiter that decides most requests in ~0.01 ms (local), falls back to Redis for ~1-2 ms decisions, and scales to 1M+ decisions/s with modest Redis load. But we have not addressed what happens when Redis itself is unavailable. That is next.

Step D: Failure Modes, Policy Propagation, and Observability

The rate limiter sits in the critical request path. If Redis is slow or down, every API request is affected. The system must have a defined, deliberate behavior for every failure mode — not an accidental one.

Fail-open vs. fail-closed per endpoint:

When the Redis shard is unreachable, the gateway must decide: allow the request (fail-open) or deny it (fail-closed). This decision is endpoint-specific, configured in the RatePolicy.failMode field:

Fail-open (availability priority): read-only endpoints, search, product listings. Better to serve some over-limit requests than to return errors for all users during a Redis hiccup. The local tier still provides rough rate limiting during the outage.
Fail-closed (safety priority): authentication endpoints, payment APIs, admin operations. Better to briefly block all requests than to allow an unchecked flood against security-sensitive surfaces.

The gateway enforces a 2 ms timeout on the Redis call. If Redis does not respond within 2 ms, the gateway applies the failMode immediately — no retries, no waiting. At a 5 ms total latency budget, 2 ms for the Redis call leaves 3 ms for local processing, network overhead, and response formatting.

Circuit breaker: if Redis fails for a sustained period (error rate >50% over 10 seconds), the gateway opens a circuit breaker and stops calling Redis entirely. All decisions are made by the local tier (with its coarser accuracy) until the breaker's probe detects Redis recovery.

Policy propagation: when an operator updates a policy via PUT /v1/policies/{policyId}, the change must reach all gateway instances quickly. The flow: the management API writes the updated policy to Postgres (durable storage) and publishes an invalidation event via Redis Pub/Sub. Each gateway instance subscribes to the policy channel and updates its local policy cache within seconds. If Pub/Sub delivery fails, the local cache has a short TTL (60 seconds), so stale policies are refreshed within a minute at worst.

Observability: at 1M decisions/s, full logging is impractical (~64 GB/day). Instead:

Metrics: allow/deny rate per key type, per policy, per shard. P95 and p99 decision latency. Redis shard saturation (ops/s vs capacity). These are emitted as counters/histograms to a metrics system (Prometheus, Datadog) at aggregation intervals.
Hot key detection: a background process tracks the top-N keys by request volume per shard. If any key exceeds a threshold (e.g., >5% of a shard's traffic), it is flagged as a hot key. Hot keys can be pre-allocated to a dedicated shard or handled with a local-only aggressive rate limit to protect the shard.
Sampled decision logs: 1% of decisions are logged with full context (key, policy, allowed/denied, remaining, latency). At 1M/s, 1% sampling gives 10,000 logs/s — enough for debugging deny spikes, identifying misconfigured policies, and tuning thresholds.

code

FINAL ARCHITECTURE:

┌──────────────────────────────────────────────────────┐
│                    API Gateway Fleet                  │
│  ┌─────────────┐     ┌──────────────────────────┐   │
│  │ Local Tier  │────▶│ Global Tier (Redis call)  │   │
│  │ (in-process)│     │ (2ms timeout + failMode)  │   │
│  └─────────────┘     └────────────┬─────────────┘   │
└──────────────────────────────────────────────────────┘
                                    │
                   ┌────────────────┼────────────────┐
                   ▼                ▼                ▼
            ┌──────────┐    ┌──────────┐    ┌──────────┐
            │ Redis 1  │    │ Redis 2  │    │ ...      │
            │ (shard,  │    │ (shard,  │    │ Redis 12 │
            │  AZ-a)   │    │  AZ-b)   │    │ (AZ-c)  │
            │  + replica│    │ + replica│    │+ replica │
            └──────────┘    └──────────┘    └──────────┘
                   ▲
                   │ Pub/Sub policy invalidation
            ┌──────────┐    ┌──────────┐
            │ Postgres │    │  Mgmt    │
            │ (policies│◀───│  API     │
            │  durable)│    │          │
            └──────────┘    └──────────┘

The Life of a Rate Check — One Request, Start to Finish

A client sends GET /v1/search?q=shoes to the API. The request hits the API gateway. Here is exactly what happens:

Gateway extracts the rate limit key. From the auth token: user:u789. From the route: /v1/search. Composite key: user:u789:/v1/search. The applicable policy: policy:search-standard (100 req/min, burst 20).

Local tier check. The gateway checks its in-process token bucket for this key. The local bucket (quota: 5/min for this instance) has 3 tokens. Local decision: allow. Token consumed. No Redis call needed. Latency: 0.01 ms.

Gateway forwards the request to the downstream search service with rate limit headers: X-RateLimit-Remaining: 42 (estimate from last global sync).

Search service processes and responds. 150 ms later, the client receives search results.

Now a different scenario — the same user has been sending requests rapidly and the local bucket is empty:

Local tier check. Local bucket has 0 tokens. Falls through to global tier.

Global tier check. Gateway sends Lua script to Redis shard #7 (determined by consistent hash of key). Redis executes atomically: refills tokens based on elapsed time, checks availability. Result: 2 tokens remaining. Allow. Token consumed. Latency: 1.5 ms.

Gateway forwards the request with updated headers: X-RateLimit-Remaining: 1.

And the denial scenario — the user has exhausted their limit:

Local tier: empty. Falls through to global.

Global tier: Redis returns 0 tokens. Decision: deny. Redis returns resetAt (when the next token will be available).

Gateway returns 429 with headers: X-RateLimit-Remaining: 0, Retry-After: 8, X-RateLimit-Reset: 1710412408.

Client implements backoff. The well-behaved client waits 8 seconds before retrying.

Now that the full system is built, the data model needs a minor update.

6. Core Entities (v2) — What the Architecture Revealed

code

RatePolicy(policyId, keyType, algorithm, ratePerSec, burst, windowSec, failMode, localQuotaFraction)
RateState(key, policyId, tokens, lastRefillTs)
HotKeyStat(key, shardId, requestRate, denyRate, minuteBucket)
DecisionSample(key, policyId, ts, allowed, remaining, tier, latencyMs)

RatePolicy gained localQuotaFraction. The two-tier architecture requires configuring what fraction of the global limit each local instance gets. Default: 1 / instanceCount, but some policies might want more aggressive local enforcement (smaller fraction) or more lenient local behavior (larger fraction). Making this configurable per policy gives operators precise control.

HotKeyStat is new. It tracks per-key request rate and deny rate per shard, aggregated into minute buckets. This powers the hot key detection mechanism in Step D: when a key's request rate exceeds 5% of its shard's capacity, the system flags it and can take action (dedicated shard, local-only enforcement, or operator alert).

DecisionSample gained tier and latencyMs. Knowing whether a decision was made by the local tier or the global tier (and how long it took) is critical for performance debugging. If 40% of decisions are hitting Redis when they should be handled locally, the local quota configuration needs adjustment.

7. Deep Dives — Breaking the Limiter on Purpose

Deep Dive 1: The Fixed-Window Boundary Exploit

It is 11:59:58 PM. A client has sent 2 requests in the current minute — 98 remaining in their 100/min budget. They send 98 requests in the next 2 seconds (still within the window). At 12:00:00, the window resets. They immediately send another 100 requests. In a 4-second span straddling the window boundary, the client sent 198 requests — nearly double their rate. The fixed-window counter saw nothing wrong: 100 in one window, 100 in the next.

Why this matters: for APIs protecting expensive operations (database queries, payment processing), a 2× burst at the boundary can overwhelm the downstream system exactly when the rate limiter is supposed to prevent it. If 10,000 clients independently exploit the boundary, the API receives 2× peak load for several seconds every minute.

The sliding window counter approximation addresses this elegantly. Instead of a hard boundary, it blends two windows: the current window's count and the previous window's count, weighted by position in the current window. At 30 seconds into a 60-second window, the effective count is: previousWindowCount × 0.5 + currentWindowCount × 1.0. If the previous window had 80 requests and the current has 60, the effective count is 80 × 0.5 + 60 = 100 — denied. This prevents the boundary exploit while using only two counters per key (O(1) memory, same as token bucket).

Why we still prefer token bucket: the sliding window counter prevents the boundary exploit but does not model burst explicitly. Token bucket gives us both continuous enforcement (no boundaries at all) and configurable burst tolerance (the bucket capacity). For systems that need only sustained-rate control, sliding window counter is a valid and simpler choice. For systems that need burst + sustained control (which is most real-world APIs), token bucket is the better default.

Deep Dive 2: Distributed Consistency — When Local Counters Lie

You have 20 API gateway instances, each with a local rate limiter. A client's limit is 100/min. Each local instance allows 5/min (100 ÷ 20). The client sends requests uniformly across all 20 instances. Each instance counts 5 requests, checks locally — all pass. The client has sent 100 requests, exactly at the limit. This works perfectly.

Now the client sends 6 requests to each instance. Each local limiter has a 5/min quota, so each denies the 6th request. The client sent 120 total but was denied 20 — effective admission was 100. The local tier caught it.

But what if traffic is not uniform? If a load balancer routes all of a client's requests to 3 instances (sticky sessions, geographic routing), those 3 instances each see 33 requests against a local quota of 5. They deny 28 each — the client successfully sent only 15 requests despite a 100/min limit. The local tier is too restrictive.

The fix: the global tier is always the authority. When the local bucket empties, the request goes to Redis. Redis has the real token count (100/min, shared across all instances). The global tier allows the request if the global budget permits. The local tier is a fast pre-filter, not the source of truth.

Accuracy guarantee: with the two-tier model, the worst-case over-admission is bounded by the total local quota across all instances. If each of 20 instances has a local quota of 5/min, the maximum over-admission is 100 requests (every instance allows 5 locally before the first global check). For a 100/min limit, that is a 2× over-admission in the worst pathological case. In practice, the global tier is consulted frequently enough that over-admission stays under 10%.

For policies where even 10% over-admission is unacceptable, set localQuotaFraction: 0 — every request goes directly to the global Redis check at the cost of higher latency (~1-2 ms per request instead of ~0.01 ms).

Deep Dive 3: The Hot Key Problem

A viral marketing campaign causes one API key to generate 200,000 requests per second — all mapped to a single Redis shard by consistent hashing. That shard's capacity is 120,000 ops/s. The shard's command queue grows. Latency spikes from 1 ms to 50 ms. Other keys on the same shard — belonging to completely unrelated tenants — are affected. One noisy client has degraded rate limiting for thousands of others.

The naive approach: no hot key detection. One tenant's traffic spike breaks isolation for everyone on the same shard. This violates the tenant isolation requirement.

The good approach: detect and shed. The hot key detection mechanism (tracking per-key request volume per shard) identifies the hot key within one minute bucket. Once detected, the gateway applies a local-only aggressive limiter for that specific key — rate-checking locally at a strict cap (e.g., 1,000/min) without calling Redis. The hot key's traffic never reaches the shard. Other tenants' experience is restored.

The great approach: pre-allocated hot-key sharding. For known high-traffic keys (enterprise API keys with published rate limits of 10,000+/min), assign them to dedicated Redis instances from the start. The consistent hash ring has weighted entries where premium keys map to their own shard. This provides predictable isolation without waiting for detection.

The tradeoff: more operational complexity. The number of dedicated shards scales with the number of high-traffic tenants. But for the top 50 API keys that generate 60% of traffic (a common distribution), 3-5 dedicated shards completely eliminate the hot key problem.

Deep Dive 4: Latency Budget Protection Under Redis Failures

Your API has a 50 ms p95 latency SLA. The rate limiter has a 5 ms budget within that. Redis shard #4 starts experiencing elevated latency — network jitter pushes its p95 from 1.5 ms to 25 ms. Without protection, every request routed to shard #4 adds 25 ms to the API response time, pushing the API's p95 from 50 ms to 75 ms — SLA violated.

The 2 ms timeout: the gateway sets a hard timeout of 2 ms on the Redis call. If shard #4 does not respond in 2 ms, the gateway applies failMode immediately. For a fail-open search endpoint, the request passes (relying on the local tier for rough rate limiting). For a fail-closed payment endpoint, the request is denied. Either way, the API's latency budget is preserved.

Circuit breaker escalation: if shard #4's timeout rate exceeds 50% for 10 seconds, the gateway opens the circuit breaker for that shard. For the next 30 seconds, no traffic is sent to shard #4 — all decisions for keys on that shard are made by the local tier (or fail-mode policy). Every 30 seconds, a probe (1% of traffic) tests whether shard #4 has recovered. When the probe succeeds, the breaker closes and global decisions resume.

The key insight: the 2 ms timeout and circuit breaker protect the API's latency SLA regardless of what happens to Redis. The rate limiter degrades in accuracy (local-only enforcement is less precise) but never degrades in latency. This is the correct tradeoff: a rate limiter that adds 25 ms to every request is doing more harm than no rate limiter at all.

Deep Dive 5: Multi-Tenancy and Noisy Neighbor Isolation

A SaaS platform uses the rate limiter to enforce per-tenant API quotas. Tenant A has a 10,000/min premium plan. Tenant B has a 100/min free plan. Both tenants' keys hash to the same Redis shard. Tenant A's legitimate high-volume traffic consumes 10,000 ops/min on the shard, leaving less headroom for Tenant B's checks. In extreme cases, Tenant A's traffic could push the shard's latency high enough that Tenant B's 2 ms timeout fires, and Tenant B gets fail-mode behavior instead of accurate rate limiting.

The solution is layered isolation:

Shard-level isolation for premium tenants (as described in Deep Dive 3): Tenant A's keys are routed to a dedicated shard. Tenant B's keys stay on the shared cluster. No cross-tenant impact.

Fair queuing within shared shards: Redis processes commands FIFO, so there is no built-in fairness between keys. However, the local tier provides implicit fairness: Tenant B's requests are mostly handled locally (their 100/min limit means the local tier rarely overflows), so they rarely hit Redis. Tenant A's high volume means they are the ones primarily consuming Redis capacity — but on their dedicated shard, this is fine.

Quota enforcement at the gateway level: even before the rate limiter runs, the gateway can check a per-tenant aggregate counter (updated lazily every few seconds) to detect tenants whose total request volume is far exceeding their plan. These tenants can be preemptively throttled at the gateway layer, before their requests even reach the per-key rate limiter.

8. Tying It All Together

We started with a question — "should this request be allowed?" — and built a system to answer it 1 million times per second.

Step	Problem Solved	Component Added	Number That Forced It
A	Correct, atomic rate decisions	Token bucket algorithm + Redis Lua script	1M decisions/s; 10M keys × 64B = 640 MB state
B	Scale beyond single Redis	12 shards with consistent hashing	120k ops/s per shard; 1M/s ÷ 120k = ~9 minimum
C	Reduce Redis pressure + latency	Local in-process tier per gateway	60-80% of decisions resolved in 0.01 ms
D	Survive Redis failures + policy ops	Fail-open/closed + circuit breaker + Pub/Sub	2 ms timeout inside 5 ms budget; 50 ms API SLA

Entity	In v1?	In v2?	What Triggered the Change
RatePolicy	Yes	Yes (+ failMode, localQuotaFraction)	Failure handling; two-tier local quota config
RateState	Yes	Yes	Unchanged — original design was right
HotKeyStat	No	Yes	Hot key detection in Step D
DecisionSample	Yes	Yes (+ tier, latencyMs)	Two-tier debugging; latency attribution

9. Common Mistakes

Choosing fixed window without discussing boundary burst. The 2× boundary exploit is the most well-known rate limiter flaw. If you pick fixed window without acknowledging it, the interviewer assumes you do not know.

Ignoring distributed over-admission. 20 instances × local limit = up to 20× over-admission without a global tier. This is the most common design gap.

Not quantifying Redis shard throughput. "We use Redis" is not a design. "We use 12 Redis shards, each handling ~120k Lua script ops/s, because 1M peak ÷ 120k = ~9 minimum" is a design.

No fallback when Redis is down. If your design's only answer to "what happens when Redis is slow?" is "requests wait," you have made the limiter a SPOF that is worse than having no limiter.

Logging every decision. At 1M decisions/s × 64 bytes/log, full logging generates ~64 GB/day. Sample.

10. What the Interviewer Is Actually Evaluating

Did you explain the algorithm mechanically? Not just "we use token bucket" but how tokens refill, how burst works, why it avoids boundary exploits, and why O(1) memory matters at 10M keys.

Did you map QPS to shard needs with numbers? 1M decisions/s ÷ 120k ops/s/shard = ~9 shards. This math, not the choice of Redis, is what the interviewer wants to see.

Did you prevent the limiter from becoming a bottleneck? The two-tier model (local + global), the 2 ms timeout, the circuit breaker, and the fail-mode configuration demonstrate that you designed the limiter to fail gracefully, not catastrophically.

Did you address the accuracy-availability tradeoff? Local counters trade accuracy for speed. Global counters trade speed for accuracy. The two-tier model gives you knobs to tune this tradeoff per policy. Showing awareness of this tradeoff — and giving operators control over it — is the signal the interviewer is looking for.

Final Thought

A rate limiter is a small system with outsized consequences. It has no UI, no user-facing features, no exciting product story. It is pure infrastructure — a function that returns true or false, called a million times per second.

But get it wrong, and your API collapses under a traffic spike that the limiter was supposed to prevent. Get it right, and nobody notices it is there. That invisibility is the measure of success.

We started with a token bucket and a single Redis instance. We ended with a two-tier distributed limiter with consistent-hash sharding, local burst absorption, per-endpoint fail-mode policies, circuit breakers, hot key isolation, and sampled observability. Not because we drew all those boxes on day one, but because each problem — throughput limits, distributed counting, Redis failures, hot keys — demanded exactly one targeted addition.

The same principle that has guided every system in this series: start simple, let numbers force decisions, and always know what breaks first.

Reading Progress

On This Page

Continue Learning

Infrastructure

PRO

Design an API Gateway

An API gateway looks like a simple reverse proxy. Route requests to backends. Add some authentication. Maybe rate limit. You could configure nginx in an afternoon.

Infrastructure

PRO

Design a Distributed Job Scheduler

A job scheduler looks like a simple queue with a clock. Jobs go in, the scheduler waits until the right time, jobs come out. You could build a working prototype with cron and a Redis list in an aft...

Infrastructure

PRO

Design a Feature Flag Platform

A feature flag platform looks simple. Store some boolean values, return them when asked. You could build a working prototype with a config file and an environment variable in minutes.

🎉 Launch Sale!

30% off annual plans with code LAUNCH30

View Pricing