Design a Rate Limiter (Interviewer Walkthrough)
This problem is about protecting downstream services under load while preserving fair access. The key is not just "add Redis"; it is justifying the algorithm and storage choices with concrete traffic numbers and limits.
---
1) Requirements (~5 min)
Functional requirements (top 3)
- Enforce per-key limits (user, API key, IP).
- Support burst handling and sustained-rate control.
- Return allow/deny decision in real time for each request.
Non-functional requirements
- Decision latency p95 < 5ms (service-side).
- High availability: limiter must not become SPOF.
- Accuracy close to configured policy.
- Horizontal scalability to high QPS.
- Tenant isolation (one noisy client should not affect others).
Scope control
- In scope: fixed-window, sliding-window, token-bucket style policies.
- Out of scope: billing, long-term user-facing analytics dashboards.
---
2) Capacity Estimation (only what influences design)
Assumptions
- Protected API traffic: 200k req/s average
- Peak multiplier: 5x -> 1M req/s peak
- Unique active keys at peak minute: 10M
- Common policy example: 100 requests/min/key
Immediate conclusion
- Every API call needs one limit check.
- Limiter itself must handle up to 1M decisions/s at peak.
---
2.1) Capacity Estimation (Storage + Ops)
If using sliding-window logs naïvely:
- 1M req/s * 60s = 60M events/min
- Even tiny per-event memory (16 bytes) => ~960MB/min for one minute of history.
- Multi-minute windows quickly become expensive.
If using token bucket counters:
- One small state object per active key (tokens, lastRefillTs).
- For 10M keys and ~64 bytes/key => ~640MB working state (before overhead/replication).
Architecture implication:
- Counter/state-based limiter is more practical than storing per-request logs at this traffic level.
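The arithmetic above can be checked with a quick script (the inputs are this section's planning assumptions, not measurements):

```python
# Back-of-envelope check of the two storage strategies above.
# Assumed inputs (from this section): 1M req/s peak, 10M active keys,
# 16 bytes per sliding-log event, 64 bytes per token-bucket state record.

PEAK_RPS = 1_000_000
ACTIVE_KEYS = 10_000_000

# Sliding-window log: one timestamped event per request.
events_per_min = PEAK_RPS * 60            # 60M events/min
log_bytes_per_min = events_per_min * 16   # ~960MB for one minute of history

# Token bucket: one small state record per active key.
state_bytes = ACTIVE_KEYS * 64            # ~640MB working state

print(f"sliding log:  {log_bytes_per_min / 1e6:.0f} MB per minute of history")
print(f"token bucket: {state_bytes / 1e6:.0f} MB total working state")
```

The counter approach also stays flat as window length grows, while log storage scales linearly with it.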
---
3) Core Entities (~2 min)
Core Entities v1
- `RatePolicy(keyType, limit, windowSec, burst, algorithm)`
- `RateState(key, tokens/currentCount, lastRefillTs/windowStart)`
- `DecisionLog(key, ts, allowed, policyId)` (optional sampled logs)
Core entities are v1 draft; refine after algorithm choice and query patterns.
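As an illustration, the v1 entities could be sketched as dataclasses; the field types here are assumptions, not a committed schema:

```python
# Sketch of the v1 entities. Field names mirror the entity list above;
# types are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class RatePolicy:
    key_type: str    # "user" | "apiKey" | "ip"
    limit: int       # requests allowed per window
    window_sec: int  # window length in seconds
    burst: int       # extra capacity for short spikes
    algorithm: str   # "token_bucket" | "fixed_window" | "sliding_window"


@dataclass
class RateState:
    key: str              # identity being limited
    tokens: float         # remaining budget (or current count, by algorithm)
    last_refill_ts: float  # anchor for refill / window-reset math


@dataclass
class DecisionLog:  # optional, sampled
    key: str
    ts: float
    allowed: bool
    policy_id: str
```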
Thought process for selecting entities
- Start from critical query path:
  - Query A: "For this request key, should I allow or deny now?"
  - Query B: "What policy applies to this key/route/tenant?"
  - Query C: "How do I debug why requests were denied?"
- Map each query to one entity:
  - `RatePolicy` for configuration lookup.
  - `RateState` for mutable runtime counter/token state.
  - `DecisionLog` (sampled) for diagnostics and tuning.
- Keep v1 small:
  - Store only fields required for decision correctness and observability.
  - Add advanced dimensions only when policy complexity demands them.
Functional requirement traceability:
- FR1 (enforce per-key limits) -> `RatePolicy` + `RateState`
- FR2 (burst + sustained control) -> `RatePolicy.burst/rate/window` + `RateState.tokens/lastRefillTs`
- FR3 (real-time allow/deny decisions) -> `RateState` hot path + optional `DecisionLog` for diagnostics
Why this mapping matters:
- It ties state model directly to limiter behavior, instead of generic policy tables.
Why these attributes in `RatePolicy`
`keyType`:
- Why: policy cardinality depends on identity boundary (userId, apiKey, ip).
- What it does: determines how request identity is derived.
- Why not hardcode one key type: different endpoints need different fairness models.
`limit` and `windowSec`:
- Why: core throttle contract (e.g., 100 req / 60 sec).
- What they do: define the sustained request budget.
- Why both: a limit without a window is ambiguous.
`burst`:
- Why: supports short spikes without immediate denial.
- What it does: caps immediate token availability in the token-bucket model.
- Why not omit burst: many real clients are bursty (mobile retries, page fanout).
`algorithm`:
- Why: policies may vary by endpoint behavior.
- What it does: enables a token-bucket default with optional fixed/sliding variants.
- Why not a single hardcoded algorithm: reduces adaptability and experimentation.
Why these attributes in `RateState`
`key`:
- Why: runtime state must be partitioned by the limited identity.
- What it does: unique state record per entity being limited.
`tokens` / `currentCount`:
- Why: algorithm state needed to compute allow/deny.
- What it does: tracks remaining budget.
- Why one or the other: depends on algorithm choice.
`lastRefillTs` / `windowStart`:
- Why: time-aware refill/reset logic requires a timestamp anchor.
- What it does: computes replenishment and resets deterministically.
- Why not stateless math only: the current decision needs prior time context.
Why `DecisionLog` is optional and sampled
- Why optional: limiter correctness does not depend on full request logs.
- Why sampled: at 1M decisions/s, full logging is too expensive.
- What it does: provides enough evidence for debugging false throttles and hot keys.
---
4) API / Interface (~5 min)
- Internal call: `isAllowed(key, policyId, now)`
- Response: `{ allowed: true|false, remaining, resetAt }`
Optional management APIs:
- `PUT /v1/policies/{id}`
- `GET /v1/policies/{id}`
Why internal RPC style:
- Limiter is usually infra-sidecar/gateway middleware concern, not end-user API.
Thought process for API design
- Keep the decision API extremely small: it is called on every protected request, so payload and compute must be minimal.
- Return enough metadata for client behavior: allow/deny + remaining budget + reset hint.
- Separate management APIs from the runtime decision API: policy CRUD is control plane; `isAllowed` is data plane.
`isAllowed(key, policyId, now)` parameter reasoning
`key`:
- Why: identifies the throttle subject (user/apiKey/IP composite).
- What it does: locates `RateState`.
- Why not always compute the key inside the limiter: the caller often has the route/auth context needed for composite keys.
`policyId`:
- Why: endpoint/tenant may have different limits.
- What it does: deterministic policy selection and versioning.
- Why not infer from the key alone: one key can hit different routes with different budgets.
`now`:
- Why: deterministic testing and distributed time handling.
- What it does: time input for refill/window math.
- Why not rely only on the server clock: you can still use server time in prod, but an explicit parameter improves simulation/replay and unit tests.
Response field reasoning
`allowed`:
- Why: primary decision output.
- What it does: the caller either forwards the request or returns 429.
`remaining`:
- Why: enables the client/gateway to expose budget hints and adaptive behavior.
- What it does: remaining tokens/requests in the current context.
`resetAt`:
- Why: supports deterministic backoff behavior.
- What it does: tells the client when a meaningful retry is likely.
- Why not just generic retry advice: explicit reset guidance reduces wasteful immediate retries.
Management APIs thought process
`PUT /v1/policies/{id}`:
- Why: policy updates must be explicit and versionable.
- What it does: control-plane mutation.
`GET /v1/policies/{id}`:
- Why: debugging and auditability for active rules.
- What it does: reads back the effective config.
Error/response semantics (recommended)
- `429 Too Many Requests` on the deny path.
- Include retry guidance (`Retry-After` or an equivalent reset hint).
- Why: standard semantics simplify client backoff and interoperability.
Functional requirement to API mapping:
- FR1 -> `isAllowed(key, policyId, now)` applies the correct key policy.
- FR2 -> response `remaining` and `resetAt` expose burst/sustained state clearly.
- FR3 -> the low-latency decision API supports per-request enforcement in the gateway/middleware path.
Why this mapping matters:
- The single decision API is justified by all functional requirements and remains minimal.
---
5) High-Level Design (~10-15 min, progressive build)
Step A: Single-node limiter (correctness first)
For every protected request, the gateway/middleware calls isAllowed(key, policyId, now) and receives {allowed, remaining, resetAt}. The service must look up the policy, check or update the rate state, and return a decision. The question is: where do we store rate state so that all app instances see consistent limits?
Components:
- Limiter service
- In-memory policy cache
- Redis for shared state
Decision: Redis for state store
- Need: atomic updates at very high QPS across many app instances.
- Why Redis: single-threaded command execution + Lua scripts for atomic multi-step updates.
- Why not local memory only: inconsistent limits across app instances.
- Why not SQL row updates: lock contention and write amplification at 1M req/s peak checks.
How with numbers:
- Peak checks: 1M/s.
- If one Redis shard sustains ~120k ops/s safely for script-based ops, we need around 9 shards minimum; choose 12 for headroom.
- This 120k ops/s is a planning assumption, not a universal constant; in real deployments you benchmark your command mix and payload size, then resize shards from measured p95 latency and saturation.
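The shard-count arithmetic is small enough to script; `safe_ops_per_shard` is the planning assumption called out above:

```python
# Shard sizing from the numbers above: 1M checks/s peak against an
# assumed ~120k safe script ops/s per Redis shard.
import math

peak_checks_per_sec = 1_000_000
safe_ops_per_shard = 120_000  # benchmark-derived planning number, not universal

min_shards = math.ceil(peak_checks_per_sec / safe_ops_per_shard)  # -> 9
with_headroom = math.ceil(min_shards * 1.3)                       # -> 12

print(min_shards, with_headroom)
```

The ~30% headroom factor is itself a judgment call; it buys room for failover traffic and uneven key distribution across shards.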
Step B: Choose algorithm by policy behavior
isAllowed must compute whether the current request fits within the policy's limit, windowSec, and burst parameters. Different algorithms produce different behavior at window boundaries and under bursty traffic. The question is: which algorithm gives the right burst tolerance and sustained-rate control?
Decision: token bucket for burst + sustained control (default)
- Need: allow short bursts but enforce average rate.
- Why token bucket: naturally models refill rate + burst capacity.
- Why not fixed window only: boundary spikes allow 2x bursts at window edges.
- Why not a sliding log window at this scale: memory/CPU heavy due to per-request timestamps.
Example:
- Policy: 100 req/min, burst 20.
- User sends 20 immediate requests -> allowed (bucket burst).
- Then sustained requests are throttled to refill pace.
How technically:
- State ops per check: read+compute+write atomically.
- Redis Lua script keeps operation linear and race-safe.
Step C: Prevent limiter from becoming SPOF
Every protected API call depends on isAllowed returning quickly. If the limiter backend (Redis) is slow or unavailable, the caller must decide whether to allow or deny the request. The question is: how do we keep the API available when the limiter itself is degraded?
Add:
- Client-side fallback policy
- Multi-AZ Redis deployment
- Timeout budget and circuit breaker in caller
Decision: fail-open vs fail-closed by endpoint criticality
- Need: availability + safety tradeoff.
- Why mixed mode:
  - Auth/payment endpoints can fail-closed.
  - Read endpoints may fail-open with conservative local caps.
- Why not one global behavior: different business risk across endpoints.
How with numbers:
- Limiter timeout target: 2ms budget inside a 50ms API SLA.
- If the limiter dependency crosses 2ms p95 consistently, the caller degrades to local fallback.
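One way the caller-side degradation could look, assuming hypothetical `remote_check` / `local_fallback` callables (the per-call executor is for illustration only; a real client would reuse one and add a circuit breaker):

```python
# Caller-side timeout budget with local fallback. If the limiter backend
# does not answer within the budget (or errors), degrade to a local
# policy, which may fail open or closed per endpoint criticality.
import concurrent.futures

TIMEOUT_SEC = 0.002  # 2ms budget inside a 50ms API SLA


def check_with_fallback(remote_check, local_fallback, key: str,
                        timeout: float = TIMEOUT_SEC) -> bool:
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(remote_check, key)
        try:
            return future.result(timeout=timeout)
        except (concurrent.futures.TimeoutError, OSError):
            # Degraded path: conservative local decision.
            return local_fallback(key)
```

The fallback itself encodes the fail-open vs fail-closed choice: for a payment endpoint it would return `False`, for a cheap read endpoint it might apply a small local bucket and return `True`.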
Step D: Add observability + sampled logs
isAllowed decisions happen at 1M/s peak; operators need visibility into deny rates, hot keys, and limiter health without logging every decision. The question is: how do we make the system debuggable and tunable at this scale?
Add:
- Metrics: allow/deny rate, hot keys, Redis latency, script errors.
- Sampled decision logs (not full logs).
Why sampled logs:
- At 1M decisions/s, full logs are too costly.
- Example: 1% sampling still gives 10k logs/s, enough for debugging trends.
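A sampling predicate along these lines could look as follows; hash-based sampling is an assumption here, chosen so a given key is always sampled or never, which keeps per-key decision histories complete:

```python
# Deterministic 1% sampling by key hash: sampled keys get complete
# histories, unsampled keys get none, and the predicate is cheap enough
# for the hot path.
import zlib

SAMPLE_PCT = 1  # sample ~1% of keys


def should_sample(key: str) -> bool:
    # CRC32 is used only as a fast, well-mixed hash, not for security.
    return zlib.crc32(key.encode()) % 100 < SAMPLE_PCT
```

Random per-decision sampling would also hit the volume target, but deterministic per-key sampling makes "why was this key throttled" investigations far easier.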
---
Core Entities v2 (after evolution)
- `RatePolicy(policyId, keyType, algorithm, ratePerSec, burst, failMode)`
- `RateState(key, policyId, tokens, lastRefillTs, updatedAt)`
- `HotKeyStat(key, denyRate, shardId, minuteBucket)`
- `DecisionSample(key, policyId, ts, allowed, remaining)` (sampled)
Why changed:
- Added `failMode` because endpoint risk differs.
- Added `HotKeyStat` for skew detection and shard tuning.
---
6) Deep Dives (numeric)
Deep Dive 1: Fixed window boundary issue
Bad:
- Fixed window only (100/min).
- Boundary exploit: a user can send 100 requests at 12:00:59 and another 100 at 12:01:00 => effectively 200 in ~2 seconds.
Good:
- Sliding window counter approximation.
Great:
- Token bucket with explicit burst control.
- Provides smoother enforcement and predictable behavior under bursty clients.
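The boundary artifact is easy to reproduce with a toy fixed-window counter (a minimal sketch, not production code):

```python
# Demonstrates the fixed-window boundary exploit: a 100/min policy admits
# 200 requests within ~2 seconds when they straddle a minute boundary.

def fixed_window_allowed(timestamps, limit=100, window=60):
    """Count admitted requests under a naive fixed-window counter."""
    counts: dict[int, int] = {}
    admitted = 0
    for ts in timestamps:
        w = int(ts // window)  # window index for this timestamp
        if counts.get(w, 0) < limit:
            counts[w] = counts.get(w, 0) + 1
            admitted += 1
    return admitted


# 100 requests at second 59 (12:00:59) and 100 at second 60 (12:01:00):
# each batch lands in a different window, so both are fully admitted.
burst = [59.0] * 100 + [60.0] * 100
print(fixed_window_allowed(burst))  # -> 200
```

A token bucket with burst 20 and a 100/min refill would admit only ~20 of the same 200 requests, which is the smoother enforcement described above.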
Deep Dive 2: Distributed consistency
Bad:
- Per-instance local counters only.
- With 20 app instances, each might allow up to limit independently (massive over-admission).
Good:
- Shared Redis state with atomic script.
Great:
- Shard by key + hot key mitigation + replication/failover.
Deep Dive 3: Hot key amplification
Bad:
- A single hot API key hammered at 200k req/s lands on one shard.
- If the shard's safe rate is 120k ops/s, overload leads to command queue growth and latency spikes.
Good:
- Detect hot keys and apply stricter local pre-check/backoff.
Great:
- Dedicated hot-key partitioning or token pre-allocation strategies for extreme tenants.
Deep Dive 4: Latency budget protection
Bad:
- No timeout/circuit breaker to limiter backend.
- App thread pool blocks on Redis latency spikes.
Good:
- 2ms timeout with bounded retries.
Great:
- Endpoint-specific fallback (fail-open vs fail-closed) + adaptive local guardrails.
Deep Dive 5: Policy update propagation
Bad:
- Policy updates only in DB, no cache invalidation strategy.
Good:
- Short TTL policy cache.
Great:
- Push invalidation (pub/sub) to limiter nodes for near-real-time policy consistency.
---
7) Research-backed architecture notes (industry alignment)
- Token bucket is widely used in managed API gateways
- AWS API Gateway documents throttling with token-bucket behavior (rate + burst), which matches the burst + sustained control model used in this blog.
- Why this matters: validates the default algorithm choice for real API traffic profiles.
- Use standard HTTP semantics for throttling responses
- RFC 6585 defines 429 Too Many Requests and allows Retry-After.
- Why this matters: clients can implement predictable backoff behavior.
- Edge/gateway rate limiting is common in production
- Cloudflare and NGINX both document edge-layer/request-layer limiting patterns.
- Why this matters: a limiter near ingress protects expensive downstream compute/DB.
- Global + local limiter pattern is common
- Envoy documents global rate limiting service integration and combining local + global controls.
- Why this matters: the local limiter absorbs bursts; the global limiter enforces cross-instance fairness.
- Atomic counter updates are required
- Redis rate-limiting guides recommend Lua scripting for atomic multi-step updates.
- Why this matters: prevents race conditions when multiple app instances check/update the same key concurrently.
References
- AWS API Gateway throttling docs: https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html
- RFC 6585 (429 Too Many Requests): https://www.rfc-editor.org/rfc/rfc6585
- Envoy global rate limiting overview: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/other_features/global_rate_limiting
- NGINX request limiting module: https://nginx.org/en/docs/http/ngx_http_limit_req_module.html
- Cloudflare rate limiting docs/overview: https://developers.cloudflare.com/waf/rate-limiting-rules/request-rate
- Redis rate limiting patterns: https://redis.io/learn/howtos/ratelimiting
---
8) Common mistakes candidates make
- Choosing fixed window without discussing boundary burst artifact.
- Ignoring distributed over-admission when using local counters.
- Not quantifying Redis/shard throughput needs.
- Logging every decision at peak scale.
- No fallback strategy when limiter backend is slow/down.
---
9) What interviewer is actually evaluating
- Did you pick algorithm based on behavior requirements (not popularity)?
- Did you map QPS to shard/latency limits numerically?
- Did you prevent limiter from becoming a new bottleneck?
- Did you explain endpoint-specific risk for fail-open/fail-closed?
Key Takeaways
1. Enforce per-key limits (user, API key, IP).
2. Support burst handling and sustained-rate control.
3. Return allow/deny decisions in real time for each request.
4. Keep decision latency p95 < 5ms (service-side).
5. High availability: the limiter must not become a SPOF.