Design a URL Shortener (Interviewer Walkthrough)
Most candidates fail this problem not because URL shortening is hard, but because they over-design too early or present choices without proving them. The goal is simple: build a working system first, then harden it with measurable reasoning.
---
1) Requirements (~5 min)
Functional requirements (top 3)
- Users can create short URLs from long URLs.
- A short URL redirects users to the original URL.
- Users can view a basic click count.
Non-functional requirements
- Redirect p95 < 50 ms.
- High availability on redirect path.
- Durable link mappings.
- Collision-safe short code generation.
- Abuse resistance.
Scope control
- In scope: shorten, redirect, basic analytics.
- Out of scope: billing, advanced campaign BI.
---
2) Capacity Estimation (only what influences design)
- Base62 space:
  - 7-char: 62^7 = 3.5T
  - 8-char: 62^8 = 218T
- Decision: 8 chars default.
Why with numbers: with 50M new links/day, 8-char space keeps collision retries negligible for years; 6-char/7-char pressure rises earlier.
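The keyspace claim can be sanity-checked with a few lines of arithmetic (Python used purely as a calculator here):

```python
# Back-of-envelope check of the Base62 keyspace claims above.
ALPHABET_SIZE = 62  # [0-9][a-z][A-Z]

space_7 = ALPHABET_SIZE ** 7   # ~3.5 trillion codes
space_8 = ALPHABET_SIZE ** 8   # ~218 trillion codes

links_per_day = 50_000_000
years_to_fill_8 = space_8 / links_per_day / 365  # ~12,000 years at 50M/day

print(f"7-char space: {space_7:.2e}")
print(f"8-char space: {space_8:.2e}")
print(f"years to exhaust 8-char space at 50M/day: {years_to_fill_8:,.0f}")
```

Even at 10x growth, the 8-char space outlives any realistic product horizon, which is why collision retries stay negligible.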
---
2.1) Capacity Estimation (Storage + Ops)
Assumptions
- New links/day: 50M
- Redirects/day: 5B
- Peak multiplier: 10x
- Link record size (URL + metadata + index overhead): ~500B
- Raw click event compressed: ~120B
Throughput
- Create writes: 50M/day = 579 writes/s avg, ~5.8k writes/s peak
- Redirect reads: 5B/day = 57.9k reads/s avg, ~579k reads/s peak
Storage
- Link mapping: 50M * 500B = 25GB/day -> ~9.1TB/year (single copy)
- With 3x replication: ~27TB/year
- Raw click events: 5B * 120B = 600GB/day
- 30-day raw retention: ~18TB (single copy), ~54TB at 3x replication
Architecture forced by these numbers
- ~579k peak reads/s => cache is mandatory.
- 600GB/day of raw clicks => async pipeline + aggregation + TTL mandatory.
- ~5.8k peak creates/s => uniqueness + write distribution needed, not heavy joins.
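These figures are simple enough to reproduce in a short script (constants mirror the assumptions above; outputs are rounded):

```python
# Reproduce the capacity numbers that drive the architecture decisions.
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 10

links_per_day = 50_000_000
redirects_per_day = 5_000_000_000
link_record_bytes = 500
click_event_bytes = 120

write_avg = links_per_day / SECONDS_PER_DAY          # ~579 writes/s
read_avg = redirects_per_day / SECONDS_PER_DAY       # ~57.9k reads/s
write_peak = write_avg * PEAK_MULTIPLIER             # ~5.8k writes/s
read_peak = read_avg * PEAK_MULTIPLIER               # ~579k reads/s

mapping_gb_per_day = links_per_day * link_record_bytes / 1e9     # 25 GB/day
clicks_gb_per_day = redirects_per_day * click_event_bytes / 1e9  # 600 GB/day
mapping_tb_per_year = mapping_gb_per_day * 365 / 1e3             # ~9.1 TB/year
clicks_tb_30d = clicks_gb_per_day * 30 / 1e3                     # 18 TB / 30 days

print(f"writes avg/peak: {write_avg:.0f}/s, {write_peak:.0f}/s")
print(f"reads avg/peak: {read_avg:.0f}/s, {read_peak:.0f}/s")
print(f"mapping: {mapping_gb_per_day:.0f} GB/day, {mapping_tb_per_year:.1f} TB/year")
print(f"clicks: {clicks_gb_per_day:.0f} GB/day, {clicks_tb_30d:.0f} TB per 30 days")
```

Being able to regenerate these numbers live is exactly what "architecture forced by the numbers" means in the interview.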
---
3) Core Entities (~2 min)
Core Entities v1
- Link(shortCode, longUrl, createdAt, expiresAt, status, ownerId?)
- ClickEvent(eventId, shortCode, ts, ipHash, ua, referrer?)
- ClickAggregate(shortCode, minuteBucket, clicks)
These entities are a v1 draft; re-finalize them after the high-level design and deep dives, based on query patterns.
Thought process for selecting entities
- Start from primary queries, not from tables:
  - Query A (redirect): "Given shortCode, fetch active destination quickly."
  - Query B (create): "Store mapping safely with uniqueness guarantees."
  - Query C (stats): "Show clicks quickly without scanning raw logs."
- Map each dominant query to an entity:
  - Link serves Query A + B.
  - ClickEvent captures immutable raw activity for replay/audit.
  - ClickAggregate serves Query C at low latency.
- Keep v1 minimal:
  - Add only attributes required by current requirements.
  - Delay extra attributes until a concrete query demands them.
Functional requirement traceability:
- FR1 (create short URL) -> Link(shortCode, longUrl, createdAt, status, expiresAt?)
- FR2 (redirect) -> Link(shortCode -> longUrl) and validity fields (status, expiresAt)
- FR3 (basic click stats) -> ClickEvent (ingest) and ClickAggregate (serve)
Why this mapping matters:
- Every entity directly maps to at least one requirement, keeping schema design intentional.
Why these attributes in `Link`
shortCode:
- Why: primary lookup key for the redirect path.
- What it does: enables O(1)-style key access in KV-oriented storage.
- Why not generated at read time: redirect must resolve deterministically.
longUrl:
- Why: final destination needed for redirect.
- What it does: directly feeds the Location header response.
- Why not a split table: extra joins/network calls on the hottest path.
createdAt:
- Why: auditing, lifecycle policy, and operational debugging.
- What it does: supports retention/age-based analysis.
expiresAt:
- Why: campaign/temporary links requirement.
- What it does: single-read validity decision in the redirect path.
- Why not an external policy service: adds an extra hop to the critical path.
status:
- Why: immediate kill switch for abusive/disabled links.
- What it does: fast allow/block decision without deleting data.
- Why not hard delete only: deletes lose forensic/ops context.
ownerId?:
- Why: optional ownership for a future dashboard/governance.
- What it does: enables per-owner listing/quota policies.
- Why optional in v1: not needed to make the redirect system work.
Why these attributes in `ClickEvent`
eventId:
- Why: idempotency key for replay-safe processing.
- What it does: prevents double counting on consumer retries.
shortCode:
- Why: aggregation dimension and join-free linkage to the link.
- What it does: bucketization key for stats workers.
ts:
- Why: time-window aggregates and trend charts.
- What it does: minute/day partitioning input.
ipHash:
- Why: abuse heuristics with privacy-conscious storage.
- What it does: rough uniqueness/anomaly detection without raw PII.
- Why a hash, not the raw IP: lower privacy and compliance risk.
ua, referrer?:
- Why: basic source/device analytics and fraud signals.
- What it does: high-value dimensions for segmentation.
- Why optional referrer: often unavailable or inconsistent.
Why `ClickAggregate(shortCode, minuteBucket, clicks)`
- Why: stats API needs fast reads at scale.
- What it does: avoids scanning the raw 600GB/day event stream.
- Why minute buckets:
  - fine enough for trending,
  - compact enough for efficient storage and query.
- Why not only daily buckets: too coarse for near-real-time charts.
- Why not raw-only: expensive p95 and high compute cost.
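A minimal sketch of the aggregation step (an in-memory dict stands in for the ClickAggregate store; field names follow the entities above):

```python
from collections import defaultdict
from datetime import datetime, timezone

def minute_bucket(ts: float) -> str:
    """Truncate a unix timestamp to its UTC minute bucket, e.g. '2024-06-01T12:07'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")

def aggregate(events):
    """Fold raw click events into (shortCode, minuteBucket) -> clicks counts."""
    agg = defaultdict(int)
    for e in events:
        agg[(e["shortCode"], minute_bucket(e["ts"]))] += 1
    return dict(agg)

events = [
    {"eventId": "e1", "shortCode": "Ab3x9K7Q", "ts": 1_700_000_000},
    {"eventId": "e2", "shortCode": "Ab3x9K7Q", "ts": 1_700_000_030},  # same minute
    {"eventId": "e3", "shortCode": "Zz1", "ts": 1_700_000_000},
]
print(aggregate(events))
```

The stats API then reads these compact rows instead of scanning raw events.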
---
4) API / Interface (~5 min)
- POST /v1/links
- GET /{shortCode}
- GET /v1/links/{shortCode}/stats
Thought process for API design
- API endpoints map 1:1 to core user actions:
  - create,
  - resolve,
  - observe.
- Keep contract minimal in v1:
- avoid optional features unless they materially affect architecture.
- Protect hot path:
- redirect endpoint must stay lightweight and dependency-minimal.
`POST /v1/links` request parameters (why each exists)
longUrl(required):
- Why: essential input to the system.
- What it does: destination stored in Link.longUrl.
- Validation: allow only safe schemes (http/https) and URL length guardrails.
customAlias(optional):
- Why: vanity links are a common product need.
- What it does: user-provided shortCode candidate.
- Why optional: not required for core shortening; avoids blocking the basic flow.
- Guardrail: uniqueness reservation required to prevent alias races.
expiresAt(optional):
- Why: campaigns/temporary links.
- What it does: sets the redirect validity boundary.
- Why a request-level parameter: expiry intent belongs to link creation semantics.
`POST /v1/links` response fields (why each exists)
shortCode:
- Why: machine-friendly handle for future API calls.
- What it does: direct key for stats/admin operations.
shortUrl:
- Why: user-friendly shareable output.
- What it does: ready-to-use final link without client-side assembly.
`GET /{shortCode}` (redirect endpoint)
- Why path parameter:
- short code is canonical identifier in URL shortener products.
- What it does:
- resolves mapping and returns redirect response.
- Why no request body/query here:
- keeps cacheability and standard link behavior across browsers and bots.
Redirect status choice thought process
- Stable immutable links: usually permanent redirect semantics.
- Mutable/temporary links: temporary redirect semantics may be safer.
- Why policy-driven:
- status code influences caching and client behavior; choose consistently per product rule.
`GET /v1/links/{shortCode}/stats` parameters/response
shortCode in path:
- Why: the aggregation key is the link identity.
- What it does: fetches compact aggregates, not raw events.
Response field totalClicks:
- Why: minimal required metric in scope.
- What it does: gives a quick campaign health signal.
- Why not return a full breakdown by default:
  - adds query cost and payload bloat;
  - should be opt-in once the product requires richer analytics.
Functional requirement to API mapping:
- FR1 -> POST /v1/links
- FR2 -> GET /{shortCode}
- FR3 -> GET /v1/links/{shortCode}/stats
Why this mapping matters:
- API surface is requirement-driven; no extra endpoints without clear functional purpose.
Error contract thought process (recommended)
- 400: invalid URL / invalid alias format.
- 409: alias already taken.
- 404 or 410: link missing or expired (policy-dependent).
- 429: create endpoint throttled (abuse protection).
Why this set:
- clear client behavior,
- maps to standard semantics,
- avoids ambiguous retry patterns.
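A sketch of how create-endpoint validation could map onto this contract (the alias regex, length limit, and scheme allow-list are assumed policy values, not part of the spec above):

```python
import re
from typing import Optional
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}
ALIAS_RE = re.compile(r"^[0-9A-Za-z_-]{4,32}$")  # assumed vanity-alias policy
MAX_URL_LEN = 2048                               # assumed guardrail

def validate_create(long_url: str, custom_alias: Optional[str], taken: set) -> int:
    """Return the HTTP status for a create request: 201 on success, else 4xx."""
    parsed = urlparse(long_url)
    if (parsed.scheme not in ALLOWED_SCHEMES
            or not parsed.netloc
            or len(long_url) > MAX_URL_LEN):
        return 400  # invalid URL
    if custom_alias is not None:
        if not ALIAS_RE.match(custom_alias):
            return 400  # invalid alias format
        if custom_alias in taken:
            return 409  # alias already taken
    return 201

taken = {"promo2024"}
print(validate_create("https://example.com/page", None, taken))      # 201
print(validate_create("ftp://example.com/file", None, taken))        # 400
print(validate_create("https://example.com/x", "promo2024", taken))  # 409
```

In production the "alias already taken" check would be an atomic reservation write, not a set lookup, to avoid the race discussed under customAlias.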
---
5) High-Level Design (~10-15 min, progressive build)
Step A: Create + Redirect baseline
Client calls POST /v1/links with longUrl. The service must generate a unique shortCode, persist the mapping durably, and return shortCode + shortUrl. On redirect, client hits GET /{shortCode}, and the service must resolve the mapping and return a redirect response. At ~579k peak reads/s and ~5.8k peak writes/s, the storage layer must handle high-throughput key lookups with low latency.
Components:
- API service
- Link service
- DynamoDB table (PK = shortCode)
Tech decision: DynamoDB for Link mapping
- Need: single-key low-latency reads + horizontal scale + managed ops.
- Why DynamoDB: primary path is key lookup; no joins needed.
- Why not Postgres as primary: at ~579k peak reads/s, scaling reads with hot-key skew is harder operationally.
- How with numbers:
  - Peak create writes: ~5.8k writes/s.
  - DynamoDB partition write guidance is around ~1k WCU/partition, so writes naturally need to spread; randomized shortCode distribution helps.
  - The redirect path is overwhelmingly key-value lookups, which matches DynamoDB's access model directly.
Example where this step is enough:
- An early product launch with 10k-30k rps of redirect traffic can run reliably on this base path before introducing additional complexity.
What can still break at this stage:
- Hot keys and global latency are not solved yet.
- Analytics is not yet decoupled.
Step B: Redirect latency optimization
When GET /{shortCode} is called, the service must resolve the mapping and return a redirect. At 579k peak reads/s, hitting DynamoDB for every request creates hot-key pressure and latency variance. The question is: how do we serve the redirect path faster while protecting the database?
Add:
- Redis cache (clustered)
- Optional CDN edge cache
Tech decision: Redis
- Need: protect DB from peak read load + hot links.
- Why Redis: in-memory reads and high QPS for hot keys.
- Why not DB-only: hotspot scenarios fail even with high total table throughput.
- Why not only CDN: app-layer validation/state checks still need controlled caching.
- How with numbers:
  - Peak redirect load = 579k rps.
  - If the cache hit rate is 99%, the DB sees only ~5.79k rps.
  - If the cache hit rate is 95%, the DB sees ~28.95k rps.
Hotspot proof:
- Viral code at 100k rps.
- DynamoDB per-partition read guidance is roughly ~3k strong / ~6k eventual reads/s.
- Without cache: 100k > 6k -> throttling + retries.
- With 99% cache hits: the DB sees ~1k rps for that key.
Operational behavior to call out in interview:
- Cache miss storms can happen if a hot key expires simultaneously across nodes.
- Add TTL jitter and single-flight refill so one miss refills cache and others wait briefly.
Example:
- A cricket final link expires from cache at 8:00:00 PM while traffic is 80k rps.
- Without single-flight, thousands of concurrent misses hammer DynamoDB.
- With single-flight, only one backend refill is issued per cache node/key.
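The single-flight + TTL-jitter behavior can be sketched in-process (a real deployment would coalesce requests at the Redis client or use a short-lived distributed lock, but the mechanism is the same):

```python
import random
import threading
import time

class SingleFlightCache:
    """Cache where concurrent misses on the same key trigger only one backend load."""

    def __init__(self, loader, ttl: float = 60.0, jitter: float = 10.0):
        self._loader = loader
        self._ttl, self._jitter = ttl, jitter
        self._data = {}    # key -> (value, expires_at)
        self._locks = {}   # key -> per-key refill lock
        self._guard = threading.Lock()
        self.backend_calls = 0

    def _key_lock(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        entry = self._data.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        with self._key_lock(key):              # only one caller refills...
            entry = self._data.get(key)
            if entry and entry[1] > time.monotonic():
                return entry[0]                # ...others find the fresh value here
            value = self._loader(key)
            self.backend_calls += 1
            expires = time.monotonic() + self._ttl + random.uniform(0, self._jitter)
            self._data[key] = (value, expires)  # jitter spreads expirations
            return value

cache = SingleFlightCache(loader=lambda code: f"https://example.com/{code}")
threads = [threading.Thread(target=cache.get, args=("Ab3x9K7Q",)) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(cache.backend_calls)  # 1: fifty concurrent misses, a single backend load
```

Jitter prevents a wall of keys from expiring at the same instant; single-flight prevents one hot key's miss from fanning out into many backend reads.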
Step C: Analytics decoupling
Every GET /{shortCode} redirect generates a click event that feeds GET /v1/links/{shortCode}/stats. If we write analytics synchronously in the redirect path, latency spikes in the analytics store directly impact redirect SLA. The question is: how do we capture click data without coupling redirect latency to analytics health?
Add:
- Kafka
- Aggregation workers
- ClickHouse aggregate table
Tech decision: Kafka
- Need: bursty stream ingestion + replay + multi-consumer support.
- Why Kafka: partitioned log and offset replay.
- Why not SQS here: good queueing, less natural for replay-heavy streaming analytics.
- How with numbers:
  - Peak events ~= 579k events/s.
  - If one partition handles ~5k events/s, we need at least 116 partitions.
  - Use 128 partitions for headroom.
  - At 128 partitions, the target average is ~4.5k events/s/partition at peak, leaving safety margin.
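The partition sizing above is simple arithmetic (the ~5k events/s/partition figure is the assumed per-partition capacity from the text):

```python
import math

peak_events_per_s = 579_000    # one click event per redirect at peak
partition_capacity = 5_000     # assumed sustainable events/s per partition

min_partitions = math.ceil(peak_events_per_s / partition_capacity)  # 116
chosen = 128                   # round up for headroom
avg_at_peak = peak_events_per_s / chosen                            # ~4.5k/s

print(min_partitions, chosen, round(avg_at_peak))
```

Choosing 128 rather than 116 leaves room for traffic growth and uneven key distribution across partitions.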
Tech decision: ClickHouse aggregates
- Need: fast stats reads without scanning raw events.
- Why ClickHouse aggregates: OLAP-friendly reads on precomputed buckets.
- Why not querying raw logs: 600GB/day of raw events means expensive online scans.
How this helps redirect SLA:
- Redirect request returns as soon as mapping is resolved.
- Analytics delay is acceptable (eventual consistency) while redirect latency stays stable.
Example:
- Kafka consumer outage for 7 minutes causes analytics lag.
- Redirect success rate remains unchanged because redirect path does not depend on ClickHouse.
Step D: Product controls + lifecycle
POST /v1/links accepts optional expiresAt and customAlias, and GET /{shortCode} must validate link status before redirecting. The redirect path must handle expired, disabled, and abuse-flagged links gracefully. The question is: how do we enforce lifecycle policies and abuse controls without adding complexity to the hot redirect path?
- Expiration check
- Disabled status
- Alias reservation
- Retention jobs
How with numbers:
- Raw events at 600GB/day with 30-day retention => 18TB single copy.
- TTL for raw data is mandatory to keep storage and cost bounded.
Additional policy details:
- status=DISABLED gives an immediate emergency kill switch for abusive links.
- expiresAt prevents stale campaign links from lingering indefinitely.
- Custom aliases should use reservation records to avoid race conflicts on vanity names.
Example:
- Security team flags a phishing alias at 11:07 AM.
- Setting status=DISABLED blocks redirects immediately without waiting for TTL.
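The lifecycle checks in this step can be sketched as a single validity function on the redirect path (field names follow the Link entity; the status values and the 302/404/410 choices are illustrative policy assumptions):

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Link:
    shortCode: str
    longUrl: str
    status: str                  # assumed values: "ACTIVE" | "DISABLED"
    expiresAt: Optional[float]   # unix seconds; None = never expires

def resolve(link: Optional[Link], now: Optional[float] = None) -> Tuple[int, Optional[str]]:
    """Return (http_status, location) for the redirect path."""
    now = time.time() if now is None else now
    if link is None:
        return 404, None                  # unknown code
    if link.status == "DISABLED":
        return 410, None                  # kill switch: blocked immediately
    if link.expiresAt is not None and link.expiresAt <= now:
        return 410, None                  # expired campaign link
    return 302, link.longUrl              # temporary redirect (policy choice)

live = Link("Ab3x9K7Q", "https://example.com/sale", "ACTIVE", expiresAt=None)
dead = Link("Bad1", "https://phish.example", "DISABLED", expiresAt=None)
print(resolve(live))   # (302, 'https://example.com/sale')
print(resolve(dead))   # (410, None)
```

Because status and expiresAt live on the Link record itself, this decision costs zero extra reads on the hot path.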
---
Core Entities v2 (after evolution)
- Link(shortCode, longUrl, createdAt, expiresAt, status, ownerId, riskScore?)
- AliasReservation(alias, ownerId, status, createdAt)
- ClickEvent(eventId, shortCode, ts, ua, ipHash, referrer)
- ClickAggregate(shortCode, minuteBucket, clicks, uniqueApprox)
- AbuseDecision(entity, decision, reason, updatedAt)
Why changed: alias lifecycle, idempotent events, and abuse workflow become explicit once deeper requirements are addressed.
---
6) Deep Dives (numeric + mechanism)
Deep Dive 1: Code generation and collisions
Bad:
- Random code without DB uniqueness.
- Why this fails: app-level existence checks are race-prone under concurrent writes.
- Example: two create requests for different long URLs both generate Ab3x9K7Q.
- How: both app nodes check "not exists" before either write commits; the last writer wins or one mapping is lost.
Good:
- Random Base62 + unique constraint + retry.
- If the collision rate is 0.01% at 5.8k writes/s, retries are ~0.58/s (manageable).
- Why this works: the storage layer serializes uniqueness, so collisions become safe retries, not data corruption.
Great:
- Monotonic ID -> Base62 (+ obfuscation).
- Near-zero collision retries and predictable create latency.
- Tradeoff: sequential IDs can leak creation volume patterns if not obfuscated.
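One possible sketch of the "Great" approach: encode a monotonic ID in Base62 after multiplying by a constant that is odd and coprime to 62^8 (the specific constant here is an arbitrary assumption). Because the multiplier is coprime to the modulus, the mapping is a bijection, so sequential IDs yield non-sequential-looking codes with zero collision risk:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int, width: int = 8) -> str:
    """Encode a non-negative integer as a fixed-width Base62 string."""
    out = []
    for _ in range(width):
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

MODULUS = 62 ** 8
# Assumed constant: odd and not divisible by 31, hence coprime to 62^8 = 2^8 * 31^8.
MULTIPLIER = 25_214_903_917

def code_for(seq_id: int) -> str:
    """Obfuscated Base62 code for a monotonic sequence ID."""
    return to_base62((seq_id * MULTIPLIER) % MODULUS)

print(code_for(1), code_for(2), code_for(3))  # adjacent IDs, scattered codes
```

Note this is obfuscation, not encryption: it hides ordering from casual observation but is reversible by anyone who learns the constant. Stronger hiding would use a keyed permutation.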
Deep Dive 2: Hot keys
Bad:
- DB-only reads.
- Viral key at 100k rps vs ~6k eventual reads/s per-partition guidance -> retry storm.
- Why this fails: hotspot traffic is concentrated on one partition key path.
- Example: major event link posted by large influencer account.
- How:
  - 100k rps incoming for one code.
  - DynamoDB serves roughly ~6k rps for that hot path (eventual reads).
  - ~94k rps get throttled/retried, causing exponential pressure and client timeout amplification.
Good:
- Redis cache.
- A 99% hit rate on the hotspot => the DB sees ~1k rps.
- Why this works: repeated identical lookups are served from memory in microseconds to low milliseconds.
Great:
- CDN + Redis + single-flight refill to avoid thundering herd.
- Tradeoff: invalidation complexity increases when links are mutable.
Deep Dive 3: Analytics path
Bad:
- Sync analytics write.
- An extra 40ms of p95 in analytics can violate the <50ms redirect target.
- Why this fails: a non-critical system sits directly in the critical request path.
- Example: ClickHouse insert latency spikes during compaction.
- How: redirect p95 becomes mapping_lookup + analytics_write + network jitter, easily exceeding the SLA.
Good:
- Async publish + worker aggregation.
- Why this works: redirect response is no longer blocked by analytics health.
Great:
- Idempotent consumers keyed by eventId for replay safety.
- Example: a worker restarts and reprocesses batch offsets.
- How: dedupe on eventId prevents double-counting.
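A minimal sketch of idempotent consumption (a Python set stands in for the dedupe store, which in production would be a keyed table or a Redis set with TTL):

```python
def process_batch(events, seen: set, counts: dict) -> None:
    """Apply a batch of click events safely under at-least-once delivery."""
    for e in events:
        if e["eventId"] in seen:
            continue                      # already counted: replay-safe
        seen.add(e["eventId"])
        counts[e["shortCode"]] = counts.get(e["shortCode"], 0) + 1

seen, counts = set(), {}
batch = [
    {"eventId": "e1", "shortCode": "Ab3x9K7Q"},
    {"eventId": "e2", "shortCode": "Ab3x9K7Q"},
]
process_batch(batch, seen, counts)
process_batch(batch, seen, counts)   # consumer restart: same offsets replayed
print(counts)                        # {'Ab3x9K7Q': 2}, not 4
```

The dedupe key must outlive the longest plausible replay window, which is why a TTL-bounded store works well here.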
Deep Dive 4: Abuse controls
Bad:
- No validation/rate limit.
- A bot can create 10k links/min.
- Why this fails: short domains are high-value targets for phishing.
- Example: attacker creates thousands of lookalike login links in minutes.
- How: abuse campaigns raise reports, domain reputation drops, and major providers may block all links on your domain.
Good:
- Token bucket + scheme/domain validation.
- Why this works: blocks obvious unsafe schemes and high-rate bot behavior early.
Great:
- Risk score + quarantine workflow.
- Tradeoff: false positives require moderation tooling and review ops.
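The "Good" token-bucket control can be sketched as follows (rates are illustrative; production limiters usually live in a shared store like Redis so limits apply across app nodes):

```python
import time
from typing import Optional

class TokenBucket:
    """Classic token bucket: `rate` tokens/s refill, up to `capacity` burst."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller responds 429

# Per-IP limit of 5 creates/s with a burst of 10 (illustrative numbers).
bucket = TokenBucket(rate=5, capacity=10, now=0.0)
results = [bucket.allow(now=0.0) for _ in range(12)]
print(results.count(True), results.count(False))  # 10 allowed, 2 throttled
```

Scheme/domain validation runs before the bucket check, so obviously unsafe requests never consume limiter state.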
Deep Dive 5: Reliability
Bad:
- Single region.
- Why this fails: one region outage becomes global outage.
- Example: region-level networking issue or control-plane incident.
- How: if all redirects depend on one region, success rate can collapse for all geographies.
Good:
- Multi-region reads + replica/failover.
- Why this works: users are served from nearest healthy region, reducing RTT and regional blast radius.
- Example: APAC users avoid cross-ocean round trip when local region is healthy.
- Numeric effect: regional routing often saves 100ms+ of RTT versus cross-ocean requests.
Great:
- Graceful degradation: redirect path remains healthy even if analytics components fail.
- Why this is critical: redirect is the core product contract; analytics is secondary.
- Example: Kafka brokers degraded for 15 minutes.
- How:
  - The redirect path uses the cache + mapping store only.
  - The analytics path can backlog and recover later.
  - The error budget is preserved for the core path by isolating non-critical dependencies.
---
7) Common mistakes candidates make
- Not choosing explicit DB/cache/bus.
- "How" without numbers.
- Ignoring hot partition math.
- Querying raw analytics for online APIs.
- Never revisiting entities after deep dives.
- Coupling redirect success to analytics or moderation pipelines.
- Forgetting explicit failure policy for cache miss storms and regional failover.
---
8) What interviewer is actually evaluating
- Did you map estimated load to system limits?
- Did you justify tech choices with why/why-not?
- Did you evolve data model as access patterns became clearer?
- Did you isolate critical path from non-critical failures?
- Did you explain what breaks first and how your design recovers?
---
9) Research-backed notes (industry alignment)
- Permanent redirects are commonly used in shorteners when destination is stable.
- Open redirect and malicious destination handling is a real security concern for short-link products.
- Event-driven analytics separation is common for high-QPS user-facing systems.
- Global traffic systems typically combine edge caching and regional backends to reduce latency and improve resilience.
These patterns align with common production practice and reduce interview risk by grounding decisions in real-world behavior, not just textbook architecture.
Key Takeaways
1. Users can create short URLs from long URLs.
2. A short URL redirects users to the original URL.
3. Users can view a basic click count.
4. Redirect p95 < 50 ms.
5. High availability on the redirect path.