Design a URL Shortener
The Deceptive Simplicity of a Short Link
A URL shortener looks trivial. Accept a long URL, return a short one, redirect anyone who clicks it. You could build a working prototype in an afternoon with a hash map and a web server.
But the moment you ask "what happens when 500,000 people click that link in the same second?", the simplicity evaporates. Suddenly you are dealing with hot-key database pressure, cache stampedes, abuse vectors that can get your entire domain blacklisted, and an analytics firehose producing 600 GB of click data every day.
This post walks through designing a production URL shortener from first principles. Every decision, from the database to the cache layer to the analytics pipeline, will be driven by numbers, not instinct. We will start with a minimal working system and progressively harden it, exactly the way you would in a real engineering design review or a system design interview.
If you are reading this for the first time, think of it as a story. Each section builds on the last. The requirements tell us what to build. The capacity math tells us how hard it will be. The entities and APIs give us the vocabulary. The high-level design assembles the pieces. And the deep dives stress-test the weak points. Skip a chapter, and the next one will feel like you walked into a movie halfway through.
Let us begin.
1. Requirements: What Exactly Are We Building?
Before we draw a single box on an architecture diagram, we need to agree on what the system must do and, equally important, what it must not do. Scope creep is where system designs go to die.
Functional Requirements
Our URL shortener needs to do three things well:
- Create short URLs. A user submits a long URL and receives a compact, shareable short link in return.
- Redirect. When someone clicks that short link, the system resolves it to the original URL and sends them there, fast.
- Basic click analytics. The link creator can see how many times their link has been clicked.
That is it. Three user journeys. Everything in our design exists to serve one of these three.
Non-Functional Requirements
The functional requirements tell us what to build. The non-functional requirements tell us how well it must work:
- Redirect latency: p95 under 50 ms. The redirect path is the product. If clicking a short link feels slow, nobody will use the service. Fifty milliseconds at the 95th percentile means even under load, the vast majority of clicks resolve almost instantly.
- High availability on the redirect path. A link that does not resolve is worse than no link at all. The redirect path must stay healthy even when other parts of the system (analytics, admin dashboards) are degraded.
- Durable link mappings. Once a short link is created, the mapping must not be lost. A broken link that worked yesterday destroys user trust permanently.
- Collision-safe code generation. Two different long URLs must never produce the same short code. This sounds obvious, but at scale, it becomes a real engineering challenge.
- Abuse resistance. Short URL domains are high-value targets for phishing campaigns. Without controls, an attacker can generate thousands of malicious links in minutes.
Scope Control
To keep this design focused, we are explicitly drawing a boundary:
In scope: shorten, redirect, basic click count.
Out of scope: billing, advanced campaign analytics, A/B test bucketing, team management.
We can always add these later. But if we try to design for them now, we will over-engineer the core and under-deliver on what matters.
Now that we know what we are building and how well it needs to work, the next question is: how much load are we actually talking about? Without numbers, every architecture decision is just an opinion.
2. Capacity Estimation: The Numbers That Shape Every Decision
Capacity estimation is not a ritual you perform to impress an interviewer. It is the tool that tells you which problems are real and which are imaginary. A system handling 100 requests per second has very different needs than one handling 500,000. The numbers we calculate here will directly force specific architecture choices, and we will call those out explicitly as we go.
How Much Short Code Space Do We Need?
A short URL is only useful if its code is compact. Let us think about how much space different code lengths give us, using Base62 encoding (lowercase + uppercase letters + digits):
Why Base62 specifically? A natural first instinct is to hash the long URL (MD5, SHA-256) and truncate the hash to the desired length. The problem is that truncated hashes have no collision guarantees: you are chopping a uniformly distributed hash into a small window, and two completely different URLs can produce the same truncated prefix. You still need a uniqueness check and retry logic, but now you have also lost the ability to control the code space precisely. Worse, hashing the URL means the same long URL always produces the same short code, which sounds like a feature until two different users submit the same URL and expect independent links with separate analytics, separate expiry dates, and separate ownership. Generating codes independently from the URL (random or sequential) avoids this coupling entirely.
Why not Base64? Base64 includes + and /, which are special characters in URLs and need percent-encoding. A short URL like short.ly/Ab+x/K7Q breaks in many contexts: chat apps, markdown, CSV files. Base62 (a-z, A-Z, 0-9) is URL-safe by construction, with no encoding surprises.
- 7 characters: 62^7 = roughly 3.5 trillion possible codes
- 8 characters: 62^8 = roughly 218 trillion possible codes
Decision: we will use 8-character codes by default.
Why not 7? With 50 million new links created per day (we will justify this number shortly), a 7-character space still offers trillions of codes, plenty in theory. But a smaller code space fills up faster, so randomly generated codes start colliding with existing ones sooner, which means more retry logic, more database contention, and more operational headaches. The extra character costs us almost nothing in URL length but buys us years of breathing room before collision probability becomes worth worrying about.
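To make the 7-versus-8 comparison concrete, here is a minimal Python sketch of the code-space arithmetic. The function names and the alphabet ordering are our own illustrative choices, and `collision_chance` assumes codes are drawn uniformly at random, so the chance that a fresh draw hits an existing code is simply the fraction of the space already used.

```python
import secrets
import string

# Base62 alphabet; the ordering is an arbitrary choice for this sketch.
BASE62_ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def code_space(length: int) -> int:
    """Number of distinct Base62 codes of the given length."""
    return len(BASE62_ALPHABET) ** length


def random_code(length: int = 8) -> str:
    """Draw a code uniformly at random. Uniqueness is still enforced at write time."""
    return "".join(secrets.choice(BASE62_ALPHABET) for _ in range(length))


def collision_chance(length: int, existing_links: int) -> float:
    """Probability that one fresh random code collides with an existing one."""
    return existing_links / code_space(length)


links_after_one_year = 365 * 50_000_000
print(code_space(7), code_space(8))
print(collision_chance(7, links_after_one_year))  # roughly half a percent per draw
print(collision_chance(8, links_after_one_year))  # roughly one in twelve thousand
```

Under these assumptions, after one year of traffic a 7-character code needs a retry on about one create in two hundred, while an 8-character code almost never does.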
Throughput β How Many Operations Per Second?
Here are our traffic assumptions:
- New links created per day: 50 million
- Redirects per day: 5 billion (a 100:1 read-to-write ratio, which is typical for link shorteners: links are created once and clicked many times)
- Peak multiplier: 10× average (major events, viral content, marketing campaigns can spike traffic dramatically)
Let us convert these to per-second numbers, since that is what our infrastructure actually needs to handle:
Create (write) path:
- Average: 50M ÷ 86,400 seconds ≈ 579 writes/s
- Peak: 579 × 10 = ~5,800 writes/s
Redirect (read) path:
- Average: 5B ÷ 86,400 ≈ 57,900 reads/s
- Peak: 57,900 × 10 = ~579,000 reads/s
Pause on that peak redirect number. Nearly 580,000 read operations per second. This is not a "maybe we need a cache" situation; this is a "the design does not work without a cache" situation. We will come back to this.
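The conversion above is simple enough to check in a few lines of Python. This is only a sketch of the arithmetic; the variable names are ours, and the inputs are the traffic assumptions stated in this section.

```python
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 10  # assumed peak-to-average ratio from the traffic assumptions


def per_second(events_per_day: int) -> float:
    """Average rate per second for a daily volume."""
    return events_per_day / SECONDS_PER_DAY


avg_writes = per_second(50_000_000)       # new links per day
avg_reads = per_second(5_000_000_000)     # redirects per day
peak_writes = avg_writes * PEAK_MULTIPLIER
peak_reads = avg_reads * PEAK_MULTIPLIER

print(f"writes: avg {avg_writes:,.0f}/s, peak {peak_writes:,.0f}/s")
print(f"reads:  avg {avg_reads:,.0f}/s, peak {peak_reads:,.0f}/s")
```

The unrounded results land slightly below the rounded figures used in the text (for example, about 578,700 peak reads/s rather than 579,000), which does not change any of the conclusions.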
Storage β How Much Data Per Day, Per Year?
Every link mapping needs to store the short code, the destination URL, timestamps, status flags, and index overhead. Let us estimate the size: shortCode (8 bytes), longUrl (average ~200 bytes for a typical destination with query parameters), two timestamps (createdAt + expiresAt, 16 bytes), status and flags (~10 bytes), ownerId (36 bytes for a UUID), plus DynamoDB item overhead, attribute names, and secondary index entries (~230 bytes). That gives us roughly 500 bytes per link record, a number we will use for all storage calculations.
Link mapping storage:
- Per day: 50M links × 500 bytes = 25 GB/day
- Per year: 25 GB × 365 = ~9.1 TB/year (single copy)
- With 3× replication: ~27 TB/year
Every redirect also generates a click event: who clicked, when, from where, what browser. Compressed, each event is roughly 120 bytes.
Click event storage:
- Per day: 5B events × 120 bytes = 600 GB/day
- 30-day raw retention: ~18 TB (single copy), ~54 TB with replication
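The storage figures can be reproduced the same way. The sketch below uses decimal units (1 GB = 10^9 bytes) and the per-record sizes estimated above; the constant names are ours.

```python
GB = 10**9
TB = 10**12

# Per-field estimate from the text: code, URL, timestamps, flags, owner, overhead.
LINK_RECORD_BYTES = 8 + 200 + 16 + 10 + 36 + 230
CLICK_EVENT_BYTES = 120  # compressed size per event, as assumed in the text
REPLICATION_FACTOR = 3

links_per_day = 50_000_000
clicks_per_day = 5_000_000_000

link_gb_per_day = links_per_day * LINK_RECORD_BYTES / GB
link_tb_per_year = links_per_day * LINK_RECORD_BYTES * 365 / TB
click_gb_per_day = clicks_per_day * CLICK_EVENT_BYTES / GB
click_tb_30_days = clicks_per_day * CLICK_EVENT_BYTES * 30 / TB

print(LINK_RECORD_BYTES, link_gb_per_day, link_tb_per_year)
print(click_gb_per_day, click_tb_30_days, click_tb_30_days * REPLICATION_FACTOR)
```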
What These Numbers Force Us To Build
These are not just interesting facts. They are architecture mandates:
579,000 peak reads per second means a cache layer is mandatory. No single database, no matter how well-tuned, should be the sole line of defense against that read volume, especially when viral links concentrate traffic on a handful of keys.
600 GB of raw click events per day means we cannot store analytics in our primary database and query it on the fly. We need an asynchronous pipeline that ingests events, aggregates them, and serves pre-computed statistics. Trying to scan 600 GB of raw events to answer "how many clicks did this link get?" would be absurdly expensive.
5,800 peak writes per second means our write path needs good key distribution and uniqueness guarantees, but does not need the kind of heavy sharding you would design for millions of writes per second.
With these numbers in hand, we now know the shape of our problem. The next step is to define the data we need to store, not by guessing at tables, but by working backward from the queries these numbers imply.
3. Core Entities (v1): Modeling Data Around Access Patterns
Most engineers start data modeling by thinking about tables. That instinct is backwards. We should start from the queries our system needs to answer, and then design entities that serve those queries efficiently.
Our three functional requirements imply three dominant query patterns:
- Query A (Redirect): "Given a short code, fetch the active destination URL as fast as possible." This is the hottest path in the system: 579,000 times per second at peak.
- Query B (Create): "Store a new short-code-to-URL mapping with uniqueness guarantees." This must be safe against collisions.
- Query C (Stats): "For a given short code, show click counts quickly without scanning raw logs."
Each of these queries maps to an entity. Let us define them.
Link
Link(shortCode, longUrl, createdAt, expiresAt, status, ownerId?)
This is the heart of the system. Every redirect starts here. Let us walk through why each field exists:
shortCode is the primary lookup key. When someone clicks https://short.ly/Ab3x9K7Q, the system extracts Ab3x9K7Q and uses it to find the destination. This field needs to support fast, single-key lookups with O(1)-style access. This is a strong hint that our storage layer should be key-value oriented, not relational.
longUrl is the destination. When the redirect service resolves a short code, this is what goes into the HTTP Location header. We store it alongside the short code rather than in a separate table because adding a join to the hottest path in the system (579,000 reads/s at peak) would be reckless.
createdAt supports auditing, retention policies, and operational debugging. When something goes wrong at 3 AM, "when was this link created?" is always one of the first questions.
expiresAt enables campaign and temporary links. Rather than running a complex external policy service, the redirect path can check this single field to decide if a link is still valid: one read, one comparison, done.
status is an immediate kill switch. If the security team flags a phishing link, setting status=DISABLED blocks all redirects for that code instantly, without deleting the record (which would lose forensic evidence). This is simpler and faster than routing through an external moderation service on every redirect.
ownerId (optional) exists for future features like dashboards and per-owner quotas. It is optional because the core redirect path does not need it. We include it now to avoid a painful migration later, but we do not let it complicate the v1 design.
ClickEvent
ClickEvent(eventId, shortCode, ts, ipHash, ua, referrer?)
Every redirect generates one of these. They are the raw material for analytics.
eventId is an idempotency key. If a downstream consumer crashes and replays a batch of events, this ID lets us deduplicate and avoid double-counting. Without it, every consumer restart risks corrupting our statistics.
shortCode ties the event to its link. This is also the key we will use to aggregate events into per-link statistics.
ts (timestamp) enables time-windowed aggregations: clicks per minute, per hour, per day. Without it, we can count total clicks but cannot show trends.
ipHash stores a hashed version of the client IP. We hash rather than store raw IPs because privacy regulations make raw IP storage a compliance headache. The hash is sufficient for basic abuse detection (spotting suspiciously high click volumes from a single source) without the legal exposure.
ua (user agent) and referrer provide basic device and source analytics. Referrer is optional because many browsers and privacy tools strip it.
ClickAggregate
ClickAggregate(shortCode, minuteBucket, clicks)
This entity exists for one reason: the stats API needs to be fast, and scanning 600 GB of raw events per day to answer "how many clicks?" is not fast. By pre-aggregating clicks into minute-level buckets, the stats endpoint can read a compact summary table instead of crunching raw logs.
Why minute buckets? They are fine-grained enough to power near-real-time trend charts, but compact enough that storage and query costs stay reasonable. Daily buckets would be too coarse for live dashboards. Raw-only storage would be too expensive to query online.
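For readers who think in code, the three entities can be sketched as Python dataclasses. This is an illustration, not a schema: the snake_case field names mirror the fields above, the "ACTIVE" status value is our assumption (the text only names DISABLED), and `minute_bucket` is a hypothetical helper showing how a click timestamp maps to its bucket.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Link:
    short_code: str                        # primary lookup key
    long_url: str                          # redirect destination
    created_at: datetime
    status: str = "ACTIVE"                 # assumed default; "DISABLED" is the kill switch
    expires_at: Optional[datetime] = None  # None means the link never expires
    owner_id: Optional[str] = None         # optional, unused on the redirect path


@dataclass(frozen=True)
class ClickEvent:
    event_id: str                          # idempotency key for deduplication
    short_code: str
    ts: datetime
    ip_hash: str                           # hashed client IP, never the raw address
    ua: str
    referrer: Optional[str] = None


@dataclass
class ClickAggregate:
    short_code: str
    minute_bucket: datetime
    clicks: int = 0


def minute_bucket(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


click_time = datetime(2024, 1, 1, 12, 34, 56, tzinfo=timezone.utc)
print(minute_bucket(click_time))  # 2024-01-01 12:34:00+00:00
```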
Connecting Entities to Requirements
Every entity should trace back to a requirement. If it does not, it should not exist yet.
- FR1 (create short URL) → Link stores the mapping with shortCode, longUrl, createdAt, status, and optional expiresAt.
- FR2 (redirect) → Link provides the lookup via shortCode → longUrl, with status and expiresAt for validity checks.
- FR3 (basic click stats) → ClickEvent captures raw activity; ClickAggregate serves pre-computed counts.
No orphan entities, no speculative fields. Everything here earns its place by serving a specific query pattern. That said, these entities are a v1 draft. As we build out the high-level design and encounter new requirements (alias reservations, abuse workflows), the model will evolve. We will revisit it explicitly after the architecture is in place.
With our data model defined, we are ready to design the contracts that external clients will use to interact with the system.
4. API Design: The Contracts Between Users and System
Now that we know what data the system stores and which queries it must answer, we can define the interface that exposes these capabilities to the outside world. A well-designed API maps cleanly to user actions: create, resolve, observe. Ours has exactly three endpoints, one per functional requirement.
POST /v1/links: Create a Short URL
This endpoint accepts a long URL and returns a short one. Here is what the request looks like:
Request parameters:
longUrl (required): the destination URL. This is the essential input to the entire system; without it, there is nothing to shorten. We validate that the scheme is http or https (blocking javascript:, data:, and other potentially dangerous schemes) and enforce a maximum URL length to prevent abuse.
customAlias (optional): a user-provided vanity code like promo2024. This is optional because the core shortening flow generates codes automatically; vanity links are a product convenience, not a core requirement. When provided, the system must reserve the alias atomically to prevent two users from claiming the same vanity code in a race condition.
expiresAt (optional): a timestamp after which the link should stop working. This supports campaign and temporary links. We accept it at creation time because expiry is a property of the link itself, not something that should be managed by a separate policy system.
Response fields:
shortCode: the machine-friendly identifier (e.g., Ab3x9K7Q). Clients use this for subsequent API calls like fetching stats.
shortUrl: the full, ready-to-share URL (e.g., https://short.ly/Ab3x9K7Q). We return this so clients do not need to assemble the URL themselves, reducing integration errors.
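The input checks described for this endpoint might look roughly like the sketch below. The length limit and alias pattern are placeholder values we picked for illustration; the text only says that a maximum length and an alias format exist, not what they are.

```python
import re
from typing import List, Optional
from urllib.parse import urlsplit

MAX_URL_LENGTH = 2048                                # assumed limit, not from the text
ALIAS_PATTERN = re.compile(r"^[A-Za-z0-9]{4,32}$")   # assumed alias format


def validate_create_request(long_url: str, custom_alias: Optional[str] = None) -> List[str]:
    """Return a list of validation errors; an empty list means the request is acceptable."""
    errors = []
    if len(long_url) > MAX_URL_LENGTH:
        errors.append("longUrl too long")
    try:
        parts = urlsplit(long_url)
    except ValueError:
        return errors + ["longUrl is malformed"]
    if parts.scheme not in ("http", "https"):
        # Rejects javascript:, data:, and scheme-less input alike.
        errors.append("scheme must be http or https")
    elif not parts.netloc:
        errors.append("longUrl has no host")
    if custom_alias is not None and not ALIAS_PATTERN.fullmatch(custom_alias):
        errors.append("customAlias format rejected")
    return errors


print(validate_create_request("https://example.com/a?b=1", "promo2024"))  # []
print(validate_create_request("javascript:alert(1)"))
```

Any non-empty result would map to a 400 response. Alias uniqueness is a separate concern that has to be settled atomically at write time.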
GET /{shortCode}: Redirect
This is the hottest endpoint in the system. When someone clicks a short link, their browser sends a GET request with the short code in the path. The system resolves the mapping and returns a redirect response.
The short code lives in the URL path (not as a query parameter) because this is the universal convention for URL shorteners: it keeps links clean and compatible with every browser, bot, and social media platform.
There is no request body and no query parameters. This is intentional: a redirect must be as lightweight and cacheable as possible. Any complexity here directly threatens our 50 ms p95 latency target.
Redirect status code decision: 302 (temporary redirect) by default.
This is a choice you must commit to, not defer. We choose 302 for a specific reason: our system supports link expiration (expiresAt), link disabling (status=DISABLED), and potential future destination updates. A 301 (permanent redirect) tells the browser "this destination will never change, so cache it forever and never ask me again." That means if we disable a phishing link with 301 semantics, browsers that have already cached the redirect will continue sending users to the phishing page and never check back with our server. With 302, every click goes through our service, giving us a live enforcement point for expiration, abuse controls, and status checks.
The tradeoff is measurable: 302 means browsers always hit our servers, which increases redirect traffic. But that traffic is exactly what our caching layer (Redis + optional CDN) is designed to absorb. The control we gain, the ability to disable any link instantly for every user, is worth far more than saving a few cache hits.
When would 301 make sense? For a product that only serves permanent, immutable links with no expiration or abuse workflow, a rare scenario. For most shorteners, 302 is the safer default.
GET /v1/links/{shortCode}/stats: View Click Statistics
This endpoint returns aggregated click data for a given link. The short code in the path identifies which link we are asking about; it is the natural key and the same identifier used everywhere else in the system.
The response includes totalClicks, the minimum viable metric that satisfies our FR3 requirement. We deliberately do not return a full breakdown by device, geography, or referrer in the default response. That additional detail increases query cost and payload size, and our requirements say "basic click count," not "full analytics dashboard." When the product needs richer analytics, we can add optional query parameters to request specific dimensions.
Error Semantics
A good API communicates failure as clearly as success. Here is our error contract:
- 400: the submitted URL is invalid (bad scheme, too long, malformed) or the custom alias format is rejected.
- 409: the requested custom alias is already taken. The client knows to try a different alias without guessing.
- 404 or 410: the short code does not exist, or the link has expired. Whether we return 404 (not found) or 410 (gone) for expired links is a product decision; 410 explicitly tells clients "this existed but is permanently unavailable," which can be helpful for debugging.
- 429: rate limit exceeded on the create endpoint. This is our first line of defense against abuse bots.
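To tie the redirect endpoint to this error contract, here is a small sketch of the decision the redirect handler makes once it has looked up a code. Two choices in it are our assumptions rather than decisions made in the text: expired links return 410, and disabled links return 404 so that a blocked phishing link looks the same as one that never existed. The `StoredLink` type is a hypothetical stand-in for the stored record.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass
class StoredLink:
    long_url: str
    status: str = "ACTIVE"                 # assumed value; "DISABLED" blocks redirects
    expires_at: Optional[datetime] = None


def resolve(link: Optional[StoredLink], now: datetime) -> Tuple[int, Optional[str]]:
    """Map a lookup result to (HTTP status, Location header value or None)."""
    if link is None:
        return 404, None                   # unknown short code
    if link.status == "DISABLED":
        return 404, None                   # assumption: hide disabled links entirely
    if link.expires_at is not None and now >= link.expires_at:
        return 410, None                   # assumption: expired links are "gone"
    return 302, link.long_url              # temporary redirect, re-checked on every click


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(resolve(StoredLink("https://example.com"), now))  # (302, 'https://example.com')
print(resolve(None, now))                                # (404, None)
```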
Mapping APIs to Requirements
Just as we traced entities to requirements, let us confirm that each endpoint earns its place:
- FR1 (create short URL) → POST /v1/links: accepts long URL, returns short URL.
- FR2 (redirect) → GET /{shortCode}: resolves mapping, returns redirect.
- FR3 (basic click stats) → GET /v1/links/{shortCode}/stats: returns aggregated click count.
Three requirements, three endpoints. No extra surface area without a clear purpose.
With contracts defined, we know exactly what the system must do at each boundary. Now we can build the architecture that implements these contracts, starting with the simplest possible working system and adding complexity only when the numbers demand it.
5. High-Level Design: Building the Architecture Step by Step
This is where most system designs go wrong. Engineers draw the final, fully-optimized architecture on the first pass (CDN, cache, message queue, analytics warehouse, the works) without explaining why each component exists. The result looks impressive but teaches nothing.
We are going to do the opposite. We will build four progressive versions of the architecture, adding a component only when a specific, measurable problem forces us to. By the end, you will not just know what the architecture looks like; you will understand the pain that each piece was designed to solve.
Step A: The Minimal Working System
Let us start with the smallest architecture that satisfies our three functional requirements: create a short URL, redirect to the original, and record click data.
+----------+      +--------------+      +-----------+
|  Client  |----->| API Service  |----->| DynamoDB  |
|          |<-----| (Link svc)   |<-----| (Links)   |
+----------+      +--------------+      +-----------+
Create:   Client --POST /v1/links--> API --put--> DynamoDB
Redirect: Client --GET /{code}--> API --get--> DynamoDB --302--> Client
The flow is straightforward. A client calls POST /v1/links with a long URL. The API service receives the request, passes it to the Link service, which generates a unique 8-character Base62 short code, writes the mapping to storage, and returns the short URL. On redirect, a client hits GET /{shortCode}, the service looks up the mapping in storage, and returns a redirect response.
One important design principle underpins everything that follows: the API service is stateless. It holds no link data in memory, no session state, no local caches that differ between instances. All state lives in DynamoDB. This means we can run ten instances or a hundred instances behind a load balancer, and any instance can serve any request. Horizontal scaling on the compute layer is just a matter of adding more instances: no coordination, no sticky sessions, no distributed locking between app servers.
For storage, we need a system that excels at single-key lookups, because that is what the redirect path does billions of times a day. We are not joining tables, not running complex queries, not aggregating across relationships. We are asking one question: "given this short code, what is the destination URL?"
Tech decision: DynamoDB for the link mapping.
DynamoDB fits this access pattern naturally. Its primary path is key-value lookup, exactly what our redirect needs. It scales horizontally as traffic grows, and as a managed service, it removes the operational burden of running and maintaining database clusters.
You might wonder: why not PostgreSQL? Postgres is an excellent database, and it could handle our write volume (5,800 peak writes/s) without breaking a sweat. The problem is the read pattern. At 579,000 peak reads per second, with viral links concentrating traffic on a handful of keys, scaling Postgres read replicas becomes an operational challenge. Connection pool exhaustion, buffer cache churn on hot rows, and replica lag all become real concerns. DynamoDB's partition-level scaling model spreads this load more naturally because short codes are hashed across partitions, although a single viral code still lands on one partition (the problem Step B addresses).
One important detail: DynamoDB's write guidance is roughly 1,000 write capacity units per partition. At 5,800 peak writes/s, we need writes to distribute across partitions. Our randomly generated Base62 codes provide this distribution naturally: random codes spread uniformly across DynamoDB's hash space.
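Random codes spread writes well, but they do not guarantee uniqueness on their own; the create path still needs a put-if-absent write with a retry on collision. The sketch below shows that loop against an in-memory stand-in. `InMemoryLinkStore` and `create_link` are hypothetical names, and in DynamoDB the equivalent of `put_if_absent` would be a conditional write (a condition expression such as `attribute_not_exists(shortCode)`), which we do not call here so the example stays runnable without AWS.

```python
import secrets
import string
from typing import Callable, Dict, Optional

ALPHABET = string.ascii_letters + string.digits  # Base62


def random_short_code(length: int = 8) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


class InMemoryLinkStore:
    """Stand-in for the link table; only models put-if-absent semantics."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def put_if_absent(self, short_code: str, long_url: str) -> bool:
        if short_code in self._items:
            return False               # a conditional write would be rejected here
        self._items[short_code] = long_url
        return True

    def get(self, short_code: str) -> Optional[str]:
        return self._items.get(short_code)


def create_link(store: InMemoryLinkStore, long_url: str, max_attempts: int = 5,
                generate: Callable[[], str] = random_short_code) -> str:
    """Generate codes until one is written successfully, or give up."""
    for _ in range(max_attempts):
        code = generate()
        if store.put_if_absent(code, long_url):
            return code
    raise RuntimeError("could not allocate a unique short code")


store = InMemoryLinkStore()
code = create_link(store, "https://example.com/some/long/path")
print(code, store.get(code))
```

With an 8-character code space the retry branch is rarely taken, but it is what turns "collisions are unlikely" into "collisions are harmless".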
What this step gives us: a working system that can handle moderate traffic (say, 10,000 to 30,000 requests per second) reliably. For an early product launch, this is enough.
What can still break: we have no protection against hot keys at peak load. We have no decoupled analytics. And we have no lifecycle controls. Let us fix the most dangerous gap first.
Step B: Protecting the Redirect Path with Caching
Our minimal system works, but it has a fragile dependency: every single redirect hits DynamoDB directly. At average load, this is fine. But remember our peak numbers: 579,000 reads per second. And viral links make this worse: a single popular short code might receive 100,000 requests per second, all hitting the same DynamoDB partition.
DynamoDB's per-partition read guidance is roughly 3,000 strongly consistent reads/s or 6,000 eventually consistent reads/s. A viral link at 100,000 rps exceeds that by more than 16×. The result? Throttling, retries, exponential backoff, and cascading latency spikes. The redirect path, our core product, becomes unreliable exactly when it matters most.
The fix is a caching layer that absorbs the read load before it reaches the database.
+----------+      +--------------+      +---------+      +-----------+
|  Client  |----->| API Service  |----->|  Redis  |----->| DynamoDB  |
|          |<-----| (stateless)  |<-----| (cache) |      | (Links)   |
+----------+      +--------------+      +---------+      +-----------+
HIT:  return immediately
MISS: fetch from DynamoDB, fill cache, return
Optional: CDN edge cache sits in front of API Service for popular links
Tech decision: clustered Redis cache.
Redis serves in-memory reads at sub-millisecond latency and can handle hundreds of thousands of operations per second per node. By placing Redis between the API service and DynamoDB, we turn the vast majority of redirect lookups into cache hits that never touch the database.
Why Redis over Memcached? Both are fast in-memory stores, and for pure key-value caching, Memcached would work. But we already know from Step D (coming shortly) that we will need Pub/Sub for cache invalidation when links are disabled. Redis gives us the cache and the invalidation channel in one system. Memcached has no pub/sub mechanism, so choosing it would mean running a second system (like a separate Redis instance or SNS) just for invalidation: more operational surface for no performance benefit on the caching side.
How effective does the cache need to be? Let us do the math:
- At 99% cache hit rate: DynamoDB sees only 1% of peak traffic = ~5,790 reads/s. Comfortable.
- At 95% cache hit rate: DynamoDB sees 5% = ~28,950 reads/s. Still manageable but with less headroom.
- Without cache: DynamoDB absorbs all 579,000 reads/s. Unsustainable.
For our viral link scenario: at 100,000 rps for a single code, a 99% cache hit rate means DynamoDB sees ~1,000 rps for that key, well within partition limits. Without the cache, 100,000 rps against a single partition exceeds guidance by more than 16×, leading to throttling and cascading failures.
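The hit-rate arithmetic generalizes to a one-line formula: the database sees total traffic times the miss rate. The sketch below also inverts it to ask what hit rate a single hot key needs in order to stay under the per-partition guidance quoted earlier. Function names are ours.

```python
def db_reads_per_second(total_rps: float, hit_rate: float) -> float:
    """Reads that fall through the cache to the database."""
    return total_rps * (1.0 - hit_rate)


def min_hit_rate(key_rps: float, partition_limit_rps: float) -> float:
    """Smallest hit rate that keeps one key's misses under the partition limit."""
    return max(0.0, 1.0 - partition_limit_rps / key_rps)


print(round(db_reads_per_second(579_000, 0.99)))   # 5790
print(round(db_reads_per_second(579_000, 0.95)))   # 28950
print(round(db_reads_per_second(100_000, 0.99)))   # 1000
print(round(min_hit_rate(100_000, 3_000), 2))      # 0.97
```

In other words, for a 100,000 rps viral key, the cache has to absorb at least about 97% of reads before strongly consistent reads fit within a single partition's guidance.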
But caching introduces a new risk: the thundering herd.
Imagine a cricket World Cup final link cached across multiple Redis nodes. The cache entry expires at exactly 8:00:00 PM while traffic is at 80,000 rps. In the milliseconds before any node refills its cache, thousands of simultaneous cache misses all hit DynamoDB at once, a sudden spike that can overwhelm the partition.
The solution is two-fold. First, TTL jitter: instead of setting every cache entry to expire at exactly the same time, we add a small random offset (e.g., ±30 seconds) so entries expire gradually rather than all at once. Second, single-flight refill: when a cache miss occurs, only one request per cache node is allowed to fetch from the database; all other concurrent misses for the same key wait briefly for that one fetch to complete. This collapses thousands of potential database hits into a single read per node.
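Both techniques fit in a few dozen lines. The sketch below is one possible in-process version using Python threads: `jittered_ttl` spreads expirations around an assumed 300-second base TTL, and `SingleFlight` lets the first caller for a key run the loader while concurrent callers for the same key wait for its result. The class and function names are hypothetical, and the scope here is one process rather than one cache node.

```python
import random
import threading

BASE_TTL_SECONDS = 300   # assumed base TTL; the text only specifies the jitter
JITTER_SECONDS = 30


def jittered_ttl(base: float = BASE_TTL_SECONDS, jitter: float = JITTER_SECONDS) -> float:
    """TTL with a random offset so entries do not all expire together."""
    return base + random.uniform(-jitter, jitter)


class SingleFlight:
    """Collapse concurrent loads of the same key into one loader call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight = {}  # key -> shared state for the load in progress

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = entry
        if leader:
            try:
                entry["value"] = loader(key)      # the only database read
            except Exception as exc:
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                entry["done"].set()               # wake up the waiting callers
        else:
            entry["done"].wait()                  # followers wait for the leader
        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]


flights = SingleFlight()
print(flights.do("Ab3x9K7Q", lambda code: "https://example.com/" + code))
print(jittered_ttl())
```

The caller would write the loaded value back to the cache with `jittered_ttl()` as its expiry; that step is omitted here because it depends on the cache client in use.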
We can also add an optional CDN edge cache for the most popular links, which pushes resolution even closer to users geographically. But the CDN alone is not sufficient: application-layer validation (checking link status, expiration) still needs the Redis + DynamoDB path. The CDN is an optimization on top, not a replacement.
What this step gives us: a redirect path that stays fast and stable even during viral traffic spikes. The database is protected. Latency is dominated by cache hits.
What is still missing: every redirect generates a click event, and we have not decided where those events go. If we write them synchronously in the redirect path, we have coupled our 50 ms latency target to the health of the analytics system. That is the next problem to solve.
Step C: Decoupling Analytics from the Critical Path
Here is the tension: every redirect must record a click event (for our stats endpoint), but writing that event to a database takes time. If the analytics store is slow (maybe ClickHouse is running a compaction, or network latency spikes), that slowness directly increases redirect latency. Our redirect SLA (p95 < 50 ms) is now held hostage by a system that serves a secondary requirement.
This is a classic design mistake: coupling the critical path to non-critical dependencies. The redirect must succeed in under 50 ms. The click count can be a few seconds (or even minutes) behind real-time and nobody will notice.
The solution is to make analytics recording asynchronous. The redirect service fires a click event into a durable message stream and immediately returns the redirect response. A separate set of workers consumes events from the stream, aggregates them, and writes the results to an analytics store.
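The essence of this decoupling is that the redirect handler only ever performs a non-blocking hand-off. The sketch below models that with a bounded in-memory buffer standing in for a producer's send buffer; it is not a Kafka client, and `ClickPublisher`, `handle_redirect`, and the drop-when-full policy are all assumptions made for illustration.

```python
import queue
from typing import Callable, Dict, Tuple


class ClickPublisher:
    """Bounded hand-off buffer; a background sender would drain it to the stream."""

    def __init__(self, max_buffered: int = 10_000) -> None:
        self._buffer: "queue.Queue[dict]" = queue.Queue(maxsize=max_buffered)
        self.dropped = 0

    def publish(self, event: dict) -> bool:
        """Never blocks: if the buffer is full, count the drop and move on."""
        try:
            self._buffer.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, send: Callable[[dict], None]) -> int:
        """Ship everything currently buffered; returns the number of events sent."""
        sent = 0
        while True:
            try:
                event = self._buffer.get_nowait()
            except queue.Empty:
                return sent
            send(event)
            sent += 1


def handle_redirect(long_url: str, event: dict,
                    publisher: ClickPublisher) -> Tuple[int, Dict[str, str]]:
    publisher.publish(event)                  # analytics can fail without failing the redirect
    return 302, {"Location": long_url}


publisher = ClickPublisher(max_buffered=2)
print(handle_redirect("https://example.com", {"eventId": "e1"}, publisher))
```

Dropping events under back-pressure trades a small amount of analytics accuracy for a redirect path whose latency no longer depends on the analytics system; a real deployment would monitor the `dropped` counter.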
REDIRECT PATH (latency-critical):
+----------+      +--------------+      +---------+      +-----------+
|  Client  |----->| API Service  |----->|  Redis  |----->| DynamoDB  |
|          |<-302-|              |      +---------+      +-----------+
+----------+      +------+-------+
                         |  Response sent
                         |  THEN publish (non-blocking, post-response)
                         |  async
                         v
ANALYTICS PATH (latency-tolerant):
+--------------+      +--------------+      +-------------+
|    Kafka     |----->| Aggregation  |----->| ClickHouse  |
| (128 parts)  |      |   Workers    |      | (agg table) |
+--------------+      +--------------+      +------+------+
                                                   |
                                                   v
                                     GET /v1/links/{code}/stats
Tech decision: Kafka for the event stream.
At peak, we are generating roughly 579,000 click events per second. This is a firehose of data that needs to be ingested reliably, consumed by multiple downstream processors, and replayed if anything goes wrong.
Kafka is purpose-built for this pattern. Its partitioned log model distributes load across brokers, and its offset-based consumption model means a consumer that crashes can resume from its last committed offset; replay is a first-class feature, not an afterthought.
Why not something simpler like SQS? SQS is excellent for job queues, but our analytics pipeline benefits from Kafka's specific strengths: ordered processing within partitions, consumer group rebalancing for scale-out, and offset replay for reprocessing. When your event volume is 600 GB/day and you need to re-aggregate a corrupted hour of data, Kafka's replay capability saves you from building a separate recovery system.
Another alternative worth addressing: why not skip the network hop entirely and write click events to a local disk buffer (a write-ahead log on each API instance), then have a local agent ship them to the analytics store? This approach reduces network dependencies: if the analytics backend is down, events accumulate safely on local disk. The problem is operational: our API service instances are stateless by design. Local disk buffers introduce state on the compute tier, which means instance termination (auto-scaling, deployments, spot reclamation) risks losing buffered events. It also means every API instance now runs a shipping agent that needs monitoring, and aggregation must merge streams from dozens of instances instead of consuming from one centralized log. Kafka centralizes the stream, provides a single replay point, and keeps compute instances stateless, which is worth the network hop.
How many Kafka partitions do we need? If a single partition comfortably handles roughly 5,000 events/s, then at 579,000 peak events/s we need at least 116 partitions (579,000 / 5,000 = 115.8, rounded up). We round up to 128 partitions for headroom, giving us an average of ~4,500 events/s per partition at peak, a comfortable margin below the per-partition limit.
But what happens when a broker fails? If our 128 partitions are spread across 8 brokers (16 partitions each) and one broker goes down, leadership for its 16 partitions moves to replicas on the remaining 7 brokers. Those brokers now handle ~18.3 partitions each. Peak load per partition stays around ~4,500 events/s, still within limits, but per-broker load rises from roughly 72,000 to roughly 83,000 events/s. If two brokers fail simultaneously, the remaining 6 brokers carry ~21 partitions each, or roughly 96,500 events/s per broker. This is tighter but still workable, provided the brokers are sized with that margin in mind. The 128-partition choice adds its own ~10% headroom over the minimum of 116, so no partition is already at its limit when the failover load arrives. This is exactly why we rounded up: the headroom is not wasted capacity, it is broker-failure budget.
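The partition and broker arithmetic above can be checked with a few lines. This is a minimal sketch: the 5,000 events/s per-partition figure is the assumption stated in the text, and the function names are made up for illustration.

```python
import math

PEAK_EVENTS_PER_S = 579_000   # peak click rate from the capacity estimate
PER_PARTITION_LIMIT = 5_000   # assumed comfortable throughput of one partition


def min_partitions(peak: int, per_partition: int) -> int:
    """Smallest partition count that keeps every partition under the limit."""
    return math.ceil(peak / per_partition)


def per_broker_load(peak: int, healthy_brokers: int) -> float:
    """Events/s each broker carries if partitions are spread evenly."""
    return peak / healthy_brokers


if __name__ == "__main__":
    print("minimum partitions:", min_partitions(PEAK_EVENTS_PER_S, PER_PARTITION_LIMIT))
    print("per-partition load at 128:", round(PEAK_EVENTS_PER_S / 128))
    for healthy in (8, 7, 6):
        print(healthy, "brokers ->", round(per_broker_load(PEAK_EVENTS_PER_S, healthy)), "events/s each")
```

The useful number for capacity planning is the last column: broker sizing has to cover the two-failure case, not the steady state.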
Tech decision: ClickHouse for aggregated analytics.
The aggregation workers consume events from Kafka, bucket them into minute-level aggregates by short code, and write the results to ClickHouse, a column-oriented OLAP database designed for exactly this kind of fast analytical read on pre-aggregated data.
Why ClickHouse and not Postgres? Postgres could certainly store aggregates and serve stats queries. But our aggregation pattern (writing millions of small rows per day, then reading narrow column slices across time ranges) is exactly where columnar storage shines and row-oriented storage struggles. A ClickHouse query scanning a week of minute-level aggregates for one short code touches a compact column chunk; the same query in Postgres scans full rows including fields the query does not need. At 50 million new links per day, each potentially accumulating aggregates, the performance gap compounds over time.
What about Elasticsearch or Druid? Both are capable analytics stores, but they add significant operational complexity (cluster management, shard rebalancing, JVM tuning). ClickHouse gives us the read performance we need with simpler operations: a single binary with no JVM, straightforward replication, and native support for time-series aggregation patterns that match our minute-bucket schema exactly.
Why not just query the raw events? Because the raw event stream produces 600 GB per day. Running an online query across that volume every time someone checks their link stats would be painfully slow and expensive. Pre-aggregating into minute buckets means the stats endpoint reads a few compact rows instead of scanning millions of raw events.
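To make the minute-bucket idea concrete, here is a minimal sketch of the aggregation step. It assumes events arrive as (shortCode, ts) pairs with ts in epoch seconds; the function names are illustrative, and a real worker would flush these counts to ClickHouse rather than keep them in memory.

```python
from collections import Counter
from typing import Iterable, Tuple


def minute_bucket(ts: float) -> int:
    """Floor an epoch timestamp (seconds) to the start of its minute."""
    return int(ts) // 60 * 60


def aggregate(events: Iterable[Tuple[str, float]]) -> Counter:
    """Count clicks per (shortCode, minuteBucket)."""
    counts: Counter = Counter()
    for short_code, ts in events:
        counts[(short_code, minute_bucket(ts))] += 1
    return counts


events = [
    ("Ab3x9K7Q", 120.0),
    ("Ab3x9K7Q", 150.5),   # same minute as the first click
    ("Ab3x9K7Q", 185.0),   # next minute
    ("Zz9", 121.0),
]
totals = aggregate(events)
```

Four raw events collapse into three aggregate rows here; at production volume the reduction is what lets the stats endpoint read a handful of rows per query.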
How this protects the redirect path: the key implementation detail is when the Kafka publish happens relative to the HTTP response. The redirect service resolves the mapping, sends the 302 response back to the client, and then publishes the click event to Kafka in a background async context (a separate goroutine, a non-blocking thread, or a local buffer that flushes in batch). The HTTP response is never blocked by the Kafka write. The client gets their redirect in ~3-5 ms regardless of Kafka's health.
But what if Kafka itself is unreachable (a broker outage, a network partition)? The redirect still succeeds. The click event is either dropped or buffered briefly in a small in-memory queue on the API instance (with a bounded size to prevent memory pressure). If the buffer fills before Kafka recovers, the oldest events are dropped. This means click counts may be slightly undercounted during a Kafka outage, a minor accuracy tradeoff that protects the core product. We explicitly accept this: a redirect that works with slightly inaccurate analytics is infinitely better than a redirect that fails because an analytics dependency is down.
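The bounded, drop-oldest buffer described above might look like the following. This is a sketch under stated assumptions: ClickBuffer and the publish callable are hypothetical names, and a real service would call flush from a background task after the HTTP response has been written.

```python
from collections import deque


class ClickBuffer:
    """Bounded in-memory queue; when full, the oldest event is discarded."""

    def __init__(self, max_events: int) -> None:
        # A deque with maxlen silently drops from the left when it overflows.
        self._events = deque(maxlen=max_events)
        self.dropped = 0

    def add(self, event: dict) -> None:
        if len(self._events) == self._events.maxlen:
            self.dropped += 1   # count what the deque is about to evict
        self._events.append(event)

    def flush(self, publish) -> int:
        """Hand events to `publish` oldest-first; stop at the first failure."""
        sent = 0
        while self._events:
            try:
                publish(self._events[0])
            except Exception:
                break           # broker still unavailable; keep the rest buffered
            self._events.popleft()
            sent += 1
        return sent
```

Tracking the dropped counter is worth the extra line: it is the number an operator needs to judge how far click counts drifted during an outage.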
Here is a concrete example: suppose Kafka consumers go down for 7 minutes due to a deployment issue. During those 7 minutes, click events accumulate in Kafka's partitions (which are designed to buffer exactly this kind of backlog). The stats dashboard shows slightly stale numbers. But the redirect success rate? Unchanged. Zero impact. When the consumers recover, they process the backlog and analytics catch up. This is graceful degradation: the core product stays healthy while secondary features absorb the impact.
What this step gives us: a complete data path from link creation through redirect to analytics, with the critical redirect path fully isolated from analytics system health.
What is still missing: link lifecycle. We have not addressed expiration, abuse controls, or custom alias management. These are product controls that affect both the create and redirect paths.
Step D: Product Controls and Lifecycle Management
Our system can now create links, redirect clicks efficiently, and track analytics asynchronously. But it has no way to expire a link, disable a malicious one, or manage vanity aliases safely. These are not nice-to-have features; they are operational necessities.
Expiration checks happen in the redirect path. When the service resolves a short code, it also checks the expiresAt field. If the current time is past expiration, the service returns a 410 instead of redirecting. This is a single timestamp comparison, essentially free in terms of performance, and it prevents stale campaign links from living forever.
Disabled status is the security team's emergency brake. When a phishing link is reported at 11:07 AM, setting status=DISABLED on that link's record blocks all future redirects immediately.
But there is a subtlety we need to address: when a link's status changes to DISABLED in DynamoDB, the Redis cache might still hold the old ACTIVE record. Without a cache invalidation mechanism, the disabled link continues to redirect users until the cache TTL expires, which could be minutes. For a phishing link, minutes of continued redirects is unacceptable.
Cache invalidation on status change: when the admin service sets status=DISABLED, it also publishes an invalidation event (via a lightweight pub/sub channel; Redis Pub/Sub itself works well here, since our cache nodes are already Redis). Each API service instance subscribes to the invalidation channel. On receiving an invalidation for a specific shortCode, the instance deletes that key from its local Redis shard. The next redirect request for that code triggers a cache miss, fetches the now-disabled record from DynamoDB, and returns a 410. Total propagation time: typically under 100 ms across all cache nodes. The phishing link is effectively killed system-wide within a fraction of a second of the admin action.
Why not just set very short cache TTLs? Because short TTLs (say, 5 seconds) would dramatically reduce our cache hit rate. At 579,000 peak reads/s, even a 95% hit rate (vs 99%) means DynamoDB absorbs 5× more traffic: roughly 29,000 reads/s instead of 5,800. Targeted invalidation gives us the best of both worlds: long TTLs for high cache efficiency, plus instant invalidation for the rare emergency case.
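The hit-rate arithmetic is a one-liner, shown here as a small sketch so the sensitivity is easy to see. The function name is illustrative.

```python
def db_reads_per_s(peak_reads: int, hit_rate: float) -> float:
    """Reads that fall through the cache and reach the database."""
    return peak_reads * (1.0 - hit_rate)


if __name__ == "__main__":
    for hit_rate in (0.99, 0.95, 0.90):
        print(f"{hit_rate:.0%} hit rate ->", round(db_reads_per_s(579_000, hit_rate)), "DB reads/s")
```

Each percentage point of hit rate lost adds about 5,800 reads/s to the database at peak, which is why the TTL cannot simply be shortened to solve staleness.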
Alias reservation prevents a subtle but dangerous race condition. Imagine two users simultaneously try to claim the alias promo2024. Without an atomic reservation mechanism, both might pass the "does this alias exist?" check before either writes, and one mapping gets silently overwritten. We solve this by introducing an alias reservation record that is claimed atomically: if the reservation already exists, the second request fails immediately with a 409 Conflict.
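A minimal sketch of the atomic claim, using an in-memory stand-in for the store. The class and method names are hypothetical. With DynamoDB, the same role would typically be played by a conditional put (for example a condition such as attribute_not_exists on the alias key); that mapping is an assumption here, not something this design has specified.

```python
import threading
from typing import Dict, Optional


class AliasTaken(Exception):
    """Raised when the alias is already reserved (maps to HTTP 409)."""


class AliasReservations:
    """In-memory stand-in for a store with an atomic put-if-absent."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self._lock = threading.Lock()

    def reserve(self, alias: str, owner_id: str) -> None:
        # The existence check and the write happen under one lock, so two
        # concurrent callers can never both succeed for the same alias.
        with self._lock:
            if alias in self._owners:
                raise AliasTaken(alias)
            self._owners[alias] = owner_id

    def owner_of(self, alias: str) -> Optional[str]:
        return self._owners.get(alias)
```

The point is that "check" and "write" are a single operation from the caller's perspective; the loser of the race gets an explicit failure instead of silently overwriting the winner.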
Retention jobs manage the lifecycle of raw click events. At 600 GB/day, raw events with 30-day retention consume roughly 18 TB (single copy) or 54 TB with replication. TTL-based expiration on raw event data is mandatory to keep storage costs bounded. Aggregated data (the minute-level buckets in ClickHouse) can be retained much longer since it is orders of magnitude smaller.
With these controls in place, our system handles the full product lifecycle: links are created, accessed, tracked, and eventually expire or get disabled, all without compromising the performance of the redirect path.
The Life of a Redirect: One Request, Start to Finish
We have described many pieces across four steps. Let us walk through exactly what happens when a user clicks https://short.ly/Ab3x9K7Q, in order, to make sure the full picture is crisp:
- DNS + CDN: The browser resolves short.ly and hits the CDN edge. If the CDN has a cached redirect for this code, it returns 302 immediately. The request never reaches our servers. (This only applies to popular links with stable destinations.)
- Load balancer → API instance: On a CDN miss, the request reaches our load balancer and routes to any available stateless API instance. No sticky sessions: any instance can serve any request.
- Local hot-key cache check: The API instance checks its small in-process LRU. If this code is trending (>500 rps on this instance), the mapping is served from local memory. Latency: <0.1 ms.
- Redis lookup: On a local cache miss, the instance queries Redis. This is the common path for most requests. Latency: ~1-2 ms. On a cache hit, we have the full Link record.
- DynamoDB fallback: On a Redis miss (roughly 1% of requests), the instance fetches from DynamoDB, fills the Redis cache with TTL jitter, and proceeds. Latency: ~5-10 ms.
- Validity checks: The instance inspects status and expiresAt. If status=DISABLED, return 410. If status=QUARANTINED, do not redirect; serve the warning interstitial described in Deep Dive 4. If expiresAt is in the past, return 410. These are in-memory comparisons on the already-fetched record, essentially zero cost.
- Send 302 response: The instance returns 302 Found with the Location header set to longUrl. The user's browser begins navigating to the destination. The HTTP response is now complete.
- Async analytics publish: After the response is sent, a background task publishes a ClickEvent to Kafka. If Kafka is unreachable, the event is buffered briefly or dropped. The user never waits for this step.
Total latency for the common case (Redis hit + valid link): ~3-5 ms. Even the worst case (CDN miss + Redis miss + DynamoDB fetch): ~10-15 ms. Both are well within our 50 ms p95 target.
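The lookup order and validity checks in the walkthrough above can be sketched as a single function. Plain dicts stand in for the local LRU, Redis, and DynamoDB; the 404 for an unknown code and the field names are assumptions for illustration, and handling of QUARANTINED links is deliberately left out of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Link:
    short_code: str
    long_url: str
    status: str = "ACTIVE"
    expires_at: Optional[float] = None   # epoch seconds; None means no expiry


def resolve(code: str, now: float, local: Dict[str, Link],
            redis: Dict[str, Link], db: Dict[str, Link]) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location) using local -> Redis -> DB lookup order."""
    link = local.get(code) or redis.get(code)
    if link is None:
        link = db.get(code)
        if link is None:
            return 404, None              # assumed response for an unknown code
        redis[code] = link                # cache fill; TTL jitter omitted here
    if link.status == "DISABLED":
        return 410, None
    if link.expires_at is not None and now >= link.expires_at:
        return 410, None
    return 302, link.long_url
```

Note that nothing in this function touches Kafka or ClickHouse; the analytics publish would be scheduled by the caller after the response is written.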
Notice the dependency chain for a successful redirect: CDN (optional) → Redis → DynamoDB (on miss). Kafka, ClickHouse, aggregation workers, and the moderation system are not in this chain. That isolation is the single most important architectural property of this design.
Here is the final architecture after all four steps:
┌──────────┐     ┌─────────────┐
│  Client  │────▶│  CDN Edge   │──HIT──▶ 302 redirect
│          │     │  (optional) │
└──────────┘     └──────┬──────┘
                        │ MISS
                        ▼
                 ┌──────────────┐     ┌─────────┐     ┌───────────┐
                 │ API Service  │────▶│  Redis  │────▶│ DynamoDB  │
                 │ (stateless,  │◀────│ (cache) │     │  (Links)  │
                 │ N instances) │     │         │     └───────────┘
                 └──────┬───────┘     │         │
                        │             │ Pub/Sub │◀─── invalidation on disable
                        │ async       └─────────┘
                        ▼
                 ┌──────────────┐     ┌──────────────┐     ┌────────────┐
                 │    Kafka     │────▶│ Aggregation  │────▶│ ClickHouse │
                 │ (128 parts)  │     │   Workers    │     │ (agg table)│
                 └──────────────┘     └──────────────┘     └────────────┘

Now that the architecture has evolved through four progressive stages, our data model needs to catch up.
6. Core Entities (v2): How the Architecture Changed Our Data
When we first defined our entities in Section 3, we were working from requirements alone. Now we have built an architecture with caching, async analytics, abuse controls, and alias management. New workflows have revealed new data needs.
Here is the updated entity set:
Link(shortCode, longUrl, createdAt, expiresAt, status, ownerId, riskScore?)
AliasReservation(alias, ownerId, status, createdAt)
ClickEvent(eventId, shortCode, ts, ua, ipHash, referrer)
ClickAggregate(shortCode, minuteBucket, clicks, uniqueApprox)
AbuseDecision(entity, decision, reason, updatedAt)

Let us walk through what changed and why.
Link gained an optional riskScore field (written riskScore? above). When we added abuse controls in Step D, we realized the redirect path benefits from a lightweight risk signal stored directly on the link record. A risk score lets the system apply graduated responses (quarantine, warning interstitial, or outright block) without querying an external moderation service on the hot path.
AliasReservation is new. This entity did not exist in v1 because we had not yet encountered the vanity alias race condition. It emerged from Step D's lifecycle requirements: atomic reservation prevents two users from claiming the same alias.
ClickAggregate gained uniqueApprox. When we built the analytics pipeline in Step C, approximate unique visitor counts (via HyperLogLog or similar sketches) became a natural addition to minute-level aggregates: cheap to compute during aggregation and valuable for basic analytics.
AbuseDecision is new. It captures moderation actions with an audit trail: what entity was flagged, what decision was made, why, and when. This supports the abuse workflow introduced in Step D and gives the security team an operational history.
This is how data models should evolve: not by guessing upfront, but by responding to the queries and workflows that the architecture actually requires. The v1 model got us started. The v2 model reflects what we learned by building the system.
With the full architecture in place and the data model updated, we are ready to stress-test the design's weak points.
7. Deep Dives: Stress-Testing Where Things Break
An architecture diagram shows how a system works when everything goes right. Deep dives show what happens when things go wrong, and how the design recovers. We will examine five critical areas, each presented as a progression from a naive approach through a good solution to a great one.
Deep Dive 1: Code Generation and Collisions
At its core, a URL shortener has one critical invariant: every short code must map to exactly one destination URL. If two different long URLs end up with the same short code, one mapping silently overwrites the other, and a user's link starts pointing to someone else's content. This is not a theoretical concern: at 5,800 peak creates per second, collision safety is an engineering requirement, not a nice-to-have.
The naive approach: random code, check-then-write.
Generate a random Base62 code, query the database to see if it exists, and if it does not, write the mapping. This seems logical but has a fatal flaw: the check and the write are two separate operations. Under concurrent load, two application servers can both check the same code, both find it does not exist, and both write. The last writer wins, and the first writer's mapping is silently destroyed.
At 5,800 writes/s across multiple app servers, this is not unlikely. It is virtually guaranteed to happen eventually.
The good approach: random code + database uniqueness constraint + retry.
Instead of relying on an application-level existence check, we set a uniqueness constraint on the shortCode key in the database. If two concurrent writes try to insert the same code, the database rejects the second one, and the application retries with a new random code.
How frequent are retries? With an 8-character Base62 space (62^8, about 218 trillion possible codes) and even billions of existing links, the probability of a random collision is negligibly small, well under 0.01% (10 billion existing codes out of 218 trillion is about 0.005%). At 5,800 writes/s, a 0.01% collision rate means roughly 0.58 retries per second across the entire system. That is noise, not a bottleneck.
This approach is safe because the database serializes uniqueness. Collisions do not cause data corruption; they cause a retry, which is cheap.
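A minimal sketch of the insert-then-retry loop. LinkStore is an in-memory, single-threaded stand-in for a table with a uniqueness constraint on shortCode (a real database enforces this atomically), and every name here is illustrative.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # Base62


class DuplicateKey(Exception):
    """Stand-in for the database's uniqueness violation error."""


class LinkStore:
    def __init__(self) -> None:
        self.rows = {}

    def insert_unique(self, code: str, long_url: str) -> None:
        if code in self.rows:
            raise DuplicateKey(code)
        self.rows[code] = long_url


def random_code(length: int = 8) -> str:
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(existing_codes: int, length: int = 8) -> float:
    """Chance that one fresh random code hits an existing one."""
    return existing_codes / 62 ** length


def create_link(store: LinkStore, long_url: str, gen=random_code, max_attempts: int = 5) -> str:
    for _ in range(max_attempts):
        code = gen()
        try:
            store.insert_unique(code, long_url)
            return code
        except DuplicateKey:
            continue   # collision: draw a new code and try again
    raise RuntimeError("could not allocate a unique short code")
```

There is no "does it exist?" query anywhere in create_link; the write itself is the check, which is what closes the race in the naive approach.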
The great approach: monotonic ID → Base62 encoding with obfuscation.
Instead of generating random codes and hoping they do not collide, we assign each new link a sequential numeric ID (from a distributed ID generator like Snowflake), encode it as Base62, and optionally obfuscate it so that sequential IDs do not produce guessable sequential short codes.
This eliminates collision retries entirely: every ID is unique by construction. Create latency becomes fully predictable because there is never a retry loop.
The tradeoff: sequential IDs can reveal your system's creation volume. If someone decodes Ab3x9K7Q and gets ID 7,294,531, they know roughly how many links exist. Obfuscation (a simple XOR or Feistel cipher on the ID before Base62 encoding) addresses this, but adds a small layer of complexity to the code generation path.
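Here is one way the Feistel-plus-Base62 idea could be sketched. Everything specific is an assumption for illustration: the 46-bit ID width (chosen because 2^46 is below 62^8, so every result fits in 8 characters), the four rounds, the keyed BLAKE2b round function, and the digit-first alphabet. It is an obfuscation sketch, not vetted cryptography.

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
HALF_BITS = 23                       # two 23-bit halves -> 46-bit IDs
HALF_MASK = (1 << HALF_BITS) - 1


def _round(value: int, key: bytes, i: int) -> int:
    """Keyed round function: hash one half plus the round number to 23 bits."""
    data = value.to_bytes(4, "big") + bytes([i])
    digest = hashlib.blake2b(data, key=key, digest_size=4).digest()
    return int.from_bytes(digest, "big") & HALF_MASK


def obfuscate(n: int, key: bytes, rounds: int = 4) -> int:
    """Permute a 46-bit ID with a balanced Feistel network."""
    left, right = n >> HALF_BITS, n & HALF_MASK
    for i in range(rounds):
        left, right = right, left ^ _round(right, key, i)
    return (left << HALF_BITS) | right


def deobfuscate(n: int, key: bytes, rounds: int = 4) -> int:
    """Inverse of obfuscate: run the rounds backwards."""
    left, right = n >> HALF_BITS, n & HALF_MASK
    for i in reversed(range(rounds)):
        left, right = right ^ _round(left, key, i), left
    return (left << HALF_BITS) | right


def base62(n: int, width: int = 8) -> str:
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars)).rjust(width, ALPHABET[0])
```

Because a Feistel network is a permutation of its domain, distinct IDs always map to distinct codes, so the collision-free property of the sequential ID survives the obfuscation step.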
Deep Dive 2: Hot Keys Under Extreme Conditions
We introduced Redis caching in Step B to protect DynamoDB from peak read load. That solved the steady-state problem. But caching introduces its own failure modes, and URL shorteners, where a single link can dominate all traffic, are uniquely vulnerable to them. Let us go beyond the basic cache setup and examine what happens when caching itself is pushed to its limits.
The scenario: a major news event breaks, and a single short link receives 100,000 requests per second, sustained for hours, not seconds.
The naive approach: database-only reads.
Without any cache, all 100,000 rps hit DynamoDB directly. DynamoDB's per-partition eventual-consistency read guidance is roughly 6,000 reads/s. That means 94,000 requests per second are throttled or retried. The result is exponential backoff pressure, connection pool exhaustion on the application servers, and timeout cascading across the entire redirect path, not just for the viral link but for every link sharing those overloaded app servers. One hot key has brought down the entire system.
The good approach: Redis cache with TTL jitter and single-flight refill.
As we designed in Step B, Redis absorbs the vast majority of reads. At a 99% cache hit rate, DynamoDB sees roughly 1,000 rps for the viral key, which is comfortable. TTL jitter prevents synchronized expiration, and single-flight refill ensures only one request per cache node fetches from DynamoDB on a miss. This handles most real-world viral events without issue.
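Both techniques are small enough to sketch. The 10% jitter spread and the SingleFlight class are assumptions for illustration; the loader callable stands in for the DynamoDB fetch.

```python
import random
import threading


def ttl_with_jitter(base_ttl_s: float, spread: float = 0.1, rng=random) -> float:
    """Base TTL plus or minus up to `spread` (10% by default)."""
    return base_ttl_s * (1.0 + rng.uniform(-spread, spread))


class SingleFlight:
    """Collapse concurrent loads of the same key into one call to `loader`."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight = {}   # key -> (done event, result holder)

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = loader(key)
            except Exception as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()      # wake every follower waiting on this key
        else:
            done.wait()
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

The first caller for a key becomes the leader and performs the fetch; everyone who arrives while it is in flight waits and shares the result, so a cache miss on a viral key costs one database read per instance instead of thousands.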
But what if the Redis node responsible for this hot key fails entirely?
The great approach: multi-tier caching with local hot-key promotion.
When a single key dominates traffic, even a healthy Redis cluster can become a bottleneck: every request for that key routes to the same Redis shard (consistent hashing concentrates a single key on one node). If that shard is handling 100,000 rps for one key plus its normal load for thousands of other keys, it can hit CPU or network limits.
The solution is a thin local in-process cache on each API service instance: a small LRU cache (perhaps 1,000 entries) that promotes the hottest keys automatically based on access frequency. A key receiving more than a threshold (say, 500 rps on a single instance) gets cached locally with a very short TTL (2-5 seconds). Now 100,000 rps across 50 API instances means each instance handles ~2,000 rps for that key from its own memory, and Redis sees only the occasional refill when a local entry expires.
This local cache is not for general use (it would reduce consistency for normal keys). It is specifically a hot-key pressure valve: it activates only for keys that exceed a frequency threshold, and its TTL is short enough that staleness is bounded to seconds.
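A minimal sketch of such a pressure valve. The class name, the one-second counting window, and the defaults are assumptions for illustration, and the sketch is not thread-safe; a real implementation would add locking or use per-worker instances.

```python
import time
from collections import OrderedDict


class HotKeyCache:
    """Tiny in-process cache that only admits keys seen often in the current second."""

    def __init__(self, threshold_rps=500, ttl_s=3.0, max_entries=1000, clock=time.monotonic):
        self.threshold_rps = threshold_rps
        self.ttl_s = ttl_s
        self.max_entries = max_entries
        self._clock = clock
        self._window_start = clock()
        self._hits = {}                 # key -> requests seen in the current 1 s window
        self._entries = OrderedDict()   # key -> (value, expires_at), least recent first

    def get(self, key):
        """Return the cached value, or None if absent or expired. Counts the request."""
        now = self._clock()
        if now - self._window_start >= 1.0:
            self._hits.clear()          # start a fresh counting window
            self._window_start = now
        self._hits[key] = self._hits.get(key, 0) + 1
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._entries[key]
            return None
        self._entries.move_to_end(key)
        return value

    def offer(self, key, value) -> bool:
        """Cache `value` only if `key` has crossed the hot threshold."""
        if self._hits.get(key, 0) <= self.threshold_rps:
            return False
        self._entries[key] = (value, self._clock() + self.ttl_s)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)   # evict the least recently used
        return True

    def invalidate(self, key) -> None:
        self._entries.pop(key, None)
```

The caller would try get first, fall back to Redis on None, and then offer the fetched record; ordinary keys never cross the threshold, so they keep going to Redis and keep its consistency guarantees.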
The tradeoff: the cache invalidation mechanism we designed in Step D (Redis Pub/Sub for disabling links) now needs to also clear local caches. The API service's invalidation subscriber must evict from both Redis and the local LRU. This adds a small amount of implementation complexity, but the payoff (absorbing extreme hot-key load without any single component becoming a bottleneck) is significant.
Deep Dive 3: Analytics Accuracy Under Failure
In Step C, we established why analytics must be asynchronous: synchronous writes in the redirect path would blow our 50 ms SLA. That architectural decision is settled. The question now is different: given an async pipeline with Kafka, aggregation workers, and ClickHouse, how accurate are the click counts, and what degrades that accuracy?
The naive approach: treating click counts as exact.
If we promise users an exact click count and deliver something that is occasionally off, we create a trust problem. The reality is that any async pipeline introduces sources of inaccuracy. Understanding them, and bounding them, is the difference between a system users trust and one they doubt.
There are three distinct failure modes that affect accuracy:
Event loss during Kafka publish failures. As we discussed in Step C, if Kafka is unreachable, the API service drops or buffers events. During a 5-minute Kafka outage at peak traffic (579,000 events/s), we could lose up to ~174 million events. The local buffer only rides out momentary blips: a 500-event buffer per instance × 50 instances = 25,000 events of headroom, which is ~43 ms of peak traffic. For anything longer, events are lost unless the buffer is sized up (covering even 5 seconds at peak would take roughly 58,000 events per instance). Mitigation: the stats API should communicate that counts are approximate, using language like "~1.2M clicks" rather than "exactly 1,200,000 clicks."
Double counting during consumer replays. If a Kafka consumer crashes and reprocesses a batch, the eventId deduplication we designed prevents double counts, but only if the deduplication window is wide enough. A typical approach stores seen eventId values for a rolling window (e.g., 2 hours). Events replayed from beyond that window could theoretically be double-counted. At our volume, keeping a 2-hour deduplication window requires storing roughly 4.2 billion eventIds (579k/s × 7200s, assuming peak rate throughout). Using a Bloom filter with a 0.01% false positive rate, this costs about 10 GB of memory per consumer group (roughly 19 bits per eventId); relaxing the rate to 1% halves that to about 5 GB. Either is manageable. Note that a false positive here means a genuine event is mistaken for a duplicate and skipped, so the filter trades a small undercount for protection against double counting.
Late-arriving events that miss their minute bucket. If a click event arrives at the aggregation worker 3 minutes after it was generated (due to Kafka consumer lag), which minute bucket does it land in? The event's ts field determines the bucket, not the processing time. So the minute bucket for 10:05 might receive new events at 10:08. This means the stats API serving data for "clicks in the last 5 minutes" might show incomplete counts for the most recent buckets. Mitigation: apply a watermark. The stats API considers any bucket within the last 2 minutes as "still filling" and marks it accordingly in the response.
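The memory cost of the deduplication filter follows from the standard Bloom filter sizing formulas, sketched here so the trade-off between false positive rate and memory can be checked. The function names are illustrative; the window size is the peak-rate figure used in the text.

```python
import math


def bloom_bits(n_items: int, fp_rate: float) -> int:
    """Optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def bloom_hashes(n_items: int, m_bits: int) -> int:
    """Optimal number of hash functions: k = (m / n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))


WINDOW_IDS = 579_000 * 7_200   # peak events/s times a 2-hour window

if __name__ == "__main__":
    for p in (0.01, 0.0001):
        bits = bloom_bits(WINDOW_IDS, p)
        print(f"p={p}: {bits / 8 / 1e9:.1f} GB, {bloom_hashes(WINDOW_IDS, bits)} hash functions")
```

Each factor of 100 in the false positive rate costs roughly another 9.6 bits per item, so accuracy targets for deduplication translate directly into memory per consumer group.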
The good approach: bounded inaccuracy with transparent communication.
Accept that counts are approximate. Design the stats API response to include a confidence field or a lastUpdatedAt timestamp so consumers know how fresh the data is. Under normal operation, analytics lag is under 5 seconds. Under Kafka consumer lag, it could stretch to minutes, and the API communicates this honestly rather than serving stale data silently.
The great approach: end-to-end accuracy monitoring with reconciliation.
Run a periodic reconciliation job that compares Kafka offsets (how many events were produced) against ClickHouse aggregate sums (how many events were counted). The delta reveals the system's drift. If drift exceeds a threshold (say, 0.1% over any 1-hour window), the job triggers an alert and optionally replays the affected Kafka offsets to recompute the aggregates. This gives operators a concrete accuracy SLA: "click counts are accurate to within 0.1% under normal operation and self-heal within 1 hour of any pipeline disruption."
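The core of the reconciliation check is a ratio and a threshold. A minimal sketch, assuming the produced count comes from Kafka offset deltas and the counted total from summing the aggregate table for the same window; both function names are illustrative.

```python
def drift_ratio(produced: int, counted: int) -> float:
    """Relative gap between events produced and events counted."""
    if produced == 0:
        return 0.0
    return abs(produced - counted) / produced


def needs_replay(produced: int, counted: int, threshold: float = 0.001) -> bool:
    """True when drift exceeds the threshold (0.1% by default)."""
    return drift_ratio(produced, counted) > threshold
```

Using the absolute difference means the same check catches both lost events (counted too low) and double counting (counted too high).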
Deep Dive 4: Abuse Controls (Your Domain's Reputation Is at Stake)
URL shorteners are a gift to phishing campaigns. A short link masks the true destination, making it easy to trick users into clicking malicious URLs. If your short domain gets a reputation for hosting phishing links, email providers and browsers will start blocking all links on your domain, not just the malicious ones. Abuse resistance is not a feature; it is existential.
The naive approach: no validation, no rate limiting.
Without controls, a bot can create 10,000 phishing links per minute using a simple script. At our system's peak capacity of 5,800 creates/s, even a modest botnet consuming 5% of that capacity mints 290 malicious links per second, roughly 17,400 per minute. Each link looks legitimate (it is on your domain, after all) and redirects to a convincing fake login page. Within hours, your domain reputation is damaged. Major email providers flag your links as suspicious. Legitimate users find their short links being blocked. The damage compounds: once a domain is flagged, it can take weeks to restore its reputation, even after the malicious links are removed.
The good approach: rate limiting + URL validation.
A token bucket rate limiter on the create endpoint caps how many links any single client can create. We set the bucket at 10 links per minute per API key for standard users and 100 per minute for verified business accounts. This means even a compromised API key can create at most 10 malicious links before the bucket empties, and only about 10 per minute after that as it refills, a manageable blast radius compared to 17,400.
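A minimal token bucket sketch, with an injectable clock so the behavior can be tested deterministically. One bucket per API key is assumed, and the class is illustrative rather than a production limiter (it is not thread-safe and keeps state in process memory).

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled continuously at `refill_per_s`."""

    def __init__(self, capacity: int, refill_per_s: float, clock=time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._clock = clock
        self._tokens = float(capacity)   # start full
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Add tokens for the time elapsed since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.refill_per_s)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

For the standard tier in the text, this would be constructed as TokenBucket(10, 10 / 60): a burst of 10 creates, then one more roughly every 6 seconds.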
URL validation rejects dangerous schemes (javascript:, data:) and checks the destination domain against known blocklists. At 5,800 peak creates/s, this validation adds roughly 1-2 ms per request (blocklist lookup from a local Bloom filter or in-memory set), which is negligible.
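A sketch of the validation step using only the standard library. Note one deliberate substitution: instead of listing dangerous schemes to reject, it allows only http and https, which also rejects javascript: and data: URLs. The blocked_hosts set stands in for whatever blocklist lookup is actually used.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_acceptable(url: str, blocked_hosts: set) -> bool:
    """True if the URL uses an allowed scheme and its host is not blocklisted."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        return False                      # malformed URL (for example a bad IPv6 literal)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False
    host = (parts.hostname or "").lower()
    return bool(host) and host not in blocked_hosts
```

An allowlist fails closed: a scheme nobody thought to ban is rejected by default rather than accepted.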
This stops the obvious attacks: bulk creation bots and clearly dangerous URLs. But sophisticated attackers use rotating IPs and API keys to bypass per-client rate limits, and newly registered phishing domains are not in any blocklist yet.
The great approach: risk scoring + quarantine workflow.
In addition to rate limiting, the system assigns a risk score (0-100) to each new link at creation time, based on signals: destination domain age (domains less than 7 days old score +30), URL pattern similarity to known phishing templates (+20 for login/password/verify in path), creation velocity from the originating account (+15 if more than 50 links in the last hour), and geographic anomalies (+10 if the creator's IP country mismatches the account's historical pattern).
Links scoring above 60 enter a quarantine state: they are created in the database with status=QUARANTINED and the redirect path returns a warning interstitial page instead of a direct redirect. A moderator (or an automated review system using more expensive deep analysis) evaluates them within a target SLA of 5 minutes. Links scoring above 85 are blocked outright and require manual review to activate. Note that the four example signals above sum to at most 75, so the 85 threshold is only reachable once further signals, not enumerated here, contribute to the score.
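The scoring rules and thresholds translate directly into code. This sketch uses exactly the four signals and weights listed above; the Signals field names and the BLOCKED label are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    domain_age_days: int
    path: str
    links_last_hour: int
    country_mismatch: bool


def risk_score(s: Signals) -> int:
    """Sum the example signal weights, capped at 100."""
    score = 0
    if s.domain_age_days < 7:
        score += 30
    if any(word in s.path.lower() for word in ("login", "password", "verify")):
        score += 20
    if s.links_last_hour > 50:
        score += 15
    if s.country_mismatch:
        score += 10
    return min(score, 100)


def decision(score: int) -> str:
    """Map a score to the link's initial state."""
    if score > 85:
        return "BLOCKED"
    if score > 60:
        return "QUARANTINED"
    return "ACTIVE"
```

With only these four signals, a link that trips every rule scores 75 and lands in quarantine; a new domain with a login-style path alone scores 50 and stays active.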
How fast does quarantine need to be? The critical metric is mean time to first click: on average, a phishing link receives its first victim click within 4-7 minutes of being shared. If quarantine triggers at creation time (before the link is ever shared), even a 5-minute review SLA means most phishing links are caught before they cause any damage.
The tradeoff: false positives. At a score threshold of 60, expect roughly 2-5% of legitimate links to trigger quarantine, particularly links to new domains or links created in bulk for marketing campaigns. This requires moderation tooling (a review dashboard, an appeal workflow) and review staffing. But the alternative, a blacklisted domain where every legitimate link is blocked by email providers, is orders of magnitude more expensive.
Deep Dive 5: Reliability and Graceful Degradation
Our system has a clear hierarchy of importance: redirects must work, always. Analytics can lag. Admin features can be temporarily unavailable. The architecture should enforce this hierarchy, not just hope it holds.
The naive approach: single region, tightly coupled.
Everything runs in one cloud region. If that region has a networking issue or a control-plane incident, the entire service, redirects included, goes offline. All users, worldwide, lose access.
The good approach: multi-region read replicas with failover.
Deploying read replicas (for the cache and database) across multiple regions provides two benefits. First, users are served from the nearest healthy region, reducing round-trip latency: an APAC user hitting a local replica saves 100+ ms compared to a cross-ocean request. Second, if one region fails, traffic automatically routes to healthy regions.
For DynamoDB, this is achieved with Global Tables: multi-region, active-active replication where a write in any region propagates to all others with typical replication lag under one second. This means link creation can happen in any region, and the mapping becomes available globally almost immediately. For Redis, we replicate the cache per region independently (each region's cache warms from its local DynamoDB replica).
The write path can remain in a primary region (link creation is far less latency-sensitive than redirect), or with Global Tables, writes can be accepted in any region for geographic write affinity. The read-heavy redirect path is served from the nearest region regardless.
The great approach: failure isolation between critical and non-critical paths.
Beyond multi-region, the greatest reliability win comes from ensuring that non-critical system failures cannot propagate to the critical redirect path. As we walked through in "The Life of a Redirect," the dependency chain for a successful redirect is:
Redirect path dependencies (in the HTTP response path): local LRU (optional) → Redis cache → DynamoDB (on cache miss). That is it. The Kafka publish happens after the response is sent and is non-blocking, so a Kafka failure does not affect redirect success.
Not in the redirect response path: Kafka, aggregation workers, ClickHouse, the moderation/abuse service, the stats API, retention jobs. None of these can affect redirect latency or availability.
Consider a concrete scenario: Kafka brokers degrade for 15 minutes. In a tightly coupled system, this could backpressure the redirect service, causing it to slow down or fail. In our design, the redirect path only touches the cache and the mapping store; it has no dependency on Kafka, ClickHouse, or the aggregation workers. Analytics lag for 15 minutes and catch up later. Redirect success rate? Unchanged. Zero impact.
Now consider a worse scenario: Redis fails entirely in one region. The redirect path falls back to DynamoDB directly. At this point, DynamoDB absorbs 100% of redirect traffic for that region. If the region handles 100,000 rps, this exceeds comfortable DynamoDB limits. Because we have a multi-region deployment, DNS-based failover shifts traffic to healthy regions as soon as health checks trip and DNS TTLs expire (typically tens of seconds rather than instantly). The degraded region serves errors briefly while traffic drains, and the healthy regions absorb the additional load. No data is lost because DynamoDB is the source of truth and Redis was only a cache.
This is graceful degradation in action: the system sheds non-critical work under stress to preserve the core product. The error budget is spent on secondary features, not on the user-facing experience.
8. Tying It All Together: From One Box to Production
Look at how far we have come. We started with the simplest thing that could work: one API server and one database table. Then we hit a wall (579,000 reads per second would crush any single database) and we added a cache. That cache created a new problem (thundering herds), so we added jitter and single-flight refill. Analytics in the redirect path threatened our latency SLA, so we decoupled it into Kafka and ClickHouse. Phishing links threatened our domain reputation, so we added rate limiting, risk scoring, and cache invalidation.
At no point did we add a component because "best practice says so." Every single addition was forced by a specific number or a specific failure scenario. That is the discipline: complexity must be earned by evidence, never assumed by convention.
Here is the final ledger:
| Step | Problem Solved | Component Added | Number That Forced It |
|---|---|---|---|
| A | Basic create + redirect | API Service + DynamoDB | 5,800 peak writes/s (manageable by one store) |
| B | Hot keys + peak reads | Redis cache + CDN | 579,000 peak reads/s vs 6,000/partition limit |
| C | Analytics coupling | Kafka + Workers + ClickHouse | 600 GB/day raw events; 50 ms redirect SLA |
| D | Lifecycle + abuse | Pub/Sub invalidation + Reservation + Risk scoring | Phishing links need <100 ms kill; 18 TB/30d retention |
The data model grew alongside the architecture:
| Entity | In v1? | In v2? | What Triggered the Change |
|---|---|---|---|
| Link | Yes | Yes (+ riskScore) | Abuse controls needed a hot-path risk signal |
| ClickEvent | Yes | Yes | Unchanged; original design was right |
| ClickAggregate | Yes | Yes (+ uniqueApprox) | Analytics pipeline made unique counts cheap |
| AliasReservation | No | Yes | Vanity alias race condition in Step D |
| AbuseDecision | No | Yes | Moderation audit trail in Step D |
Every component and every entity field traces back to a specific number or a specific failure scenario. Nothing is there because "best practice says so." Everything is there because the math demanded it.
9. Common Mistakes: What Gets Candidates Rejected
Having walked through the full design, here are the patterns that consistently trip people up in interviews and design reviews:
Not committing to specific technologies. Saying "we will use a database" or "some kind of cache" shows you are avoiding decisions. Interviewers want to hear "DynamoDB because our access pattern is single-key lookup at 579k peak reads/s": a specific choice tied to a specific reason.
Presenting architecture without numbers. "We need a cache for performance" is a vague assertion. "579,000 peak reads/s against a database with 6,000 reads/s per hot partition means we need a 99% cache hit rate to keep DB load under 6k rps" is an engineering argument.
Ignoring hot partition math. Average throughput is not what breaks systems; skewed traffic is. If you do not calculate what happens when one key receives 100,000 rps, you have not designed for reality.
Querying raw event data for online APIs. Scanning 600 GB/day of raw click events to serve a stats endpoint is a design flaw, not a feature. Pre-aggregate.
Never revising the data model. If your entities look the same at the end as they did at the beginning, you either guessed perfectly (unlikely) or you did not let the architecture inform the model.
Coupling redirect success to analytics pipelines. The moment your redirect latency depends on ClickHouse insert speed, you have violated the most important non-functional requirement. Decouple.
Waving hands at cache invalidation. Saying "we will use a cache" without explaining what happens when cached data becomes stale, especially for disabled or expired links, leaves a critical gap. The interviewer will probe it.
No failure story. If you cannot explain what breaks first and how the system recovers, the design is incomplete. Every interviewer wants to hear about cache stampedes, consumer crashes, and regional failover, not just the happy path.
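The cache hit rate argument from the list above is worth being able to reproduce on a whiteboard. The sketch below simply replays this post's own figures (579,000 peak reads/s, 6,000 reads/s per hot partition, a 100,000 rps hot key); the function names are hypothetical, and treating the 6,000 reads/s figure as the total database budget is the same simplification the quoted argument makes.

```python
def required_hit_rate(peak_reads_per_s: float, db_budget_per_s: float) -> float:
    """Smallest cache hit rate that keeps database reads within the budget."""
    if peak_reads_per_s <= db_budget_per_s:
        return 0.0  # the database can absorb the load without a cache
    return 1.0 - db_budget_per_s / peak_reads_per_s


def db_load(peak_reads_per_s: float, hit_rate: float) -> float:
    """Reads per second that miss the cache and reach the database."""
    return peak_reads_per_s * (1.0 - hit_rate)


# Whole system: 579,000 peak reads/s against a 6,000 reads/s budget.
print(f"{required_hit_rate(579_000, 6_000):.4f}")  # 0.9896
print(round(db_load(579_000, 0.99)))               # 5790

# One hot key at 100,000 rps against the same per-partition limit.
print(f"{required_hit_rate(100_000, 6_000):.2f}")  # 0.94
```

The first two lines show where "99%" comes from: the minimum is just under 99%, and at exactly 99% about 5,790 reads/s reach the database. The last line is the hot-key version of the same calculation: the cache has to absorb at least 94% of reads for that single key.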
10. What the Interviewer Is Actually Evaluating
Behind every question and follow-up, the interviewer is checking a short list of skills:
Did you connect estimated load to system limits? Not just "we need a cache" but "here is the math that proves we need a cache, and here is the hit rate that makes the math work."
Did you justify technology choices with why and why-not? Picking DynamoDB is fine. Picking DynamoDB and explaining why Postgres struggles with the specific hot-key read pattern at this scale is better.
Did you let the data model evolve? Starting with v1 entities and refining them after the architecture reveals new access patterns shows maturity. Designing the final schema upfront suggests you are reciting, not reasoning.
Did you isolate the critical path? The redirect must survive analytics outages, moderation pipeline failures, and cache node restarts. Demonstrating this isolation, with concrete failure scenarios, is what separates good designs from great ones.
Did you explain what breaks and how you recover? Systems fail. The question is not whether your design prevents all failure, but whether it fails gracefully, shedding non-critical work to protect the core experience.
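One way to make the critical-path isolation concrete is to show that the redirect never waits on analytics. The sketch below is a minimal in-process illustration under assumptions: `ClickBuffer` and `handle_redirect` are hypothetical names, the bounded queue stands in for the hand-off to a Kafka producer, and the 302 status code is an assumption rather than something this section specifies.

```python
import queue
from typing import Callable, Dict, List, Optional, Tuple


class ClickBuffer:
    """Bounded buffer for click events; drops events instead of blocking when full."""

    def __init__(self, capacity: int) -> None:
        self._events: "queue.Queue[Dict[str, str]]" = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def emit(self, event: Dict[str, str]) -> bool:
        try:
            self._events.put_nowait(event)  # never waits on a slow consumer
            return True
        except queue.Full:
            self.dropped += 1  # shed analytics work, keep the redirect fast
            return False

    def drain(self, max_items: int) -> List[Dict[str, str]]:
        """Called by a background worker that ships events to the pipeline."""
        batch: List[Dict[str, str]] = []
        while len(batch) < max_items:
            try:
                batch.append(self._events.get_nowait())
            except queue.Empty:
                break
        return batch


def handle_redirect(
    code: str,
    resolve: Callable[[str], Optional[str]],
    clicks: ClickBuffer,
) -> Tuple[int, Optional[str]]:
    """Resolve first, record the click second, and never fail on the second step."""
    url = resolve(code)
    if url is None:
        return 404, None
    clicks.emit({"code": code})
    return 302, url
```

If the analytics consumer stalls, the buffer fills, `dropped` climbs, and redirects keep returning at full speed. The dropped counter is the metric to alert on: it tells you how much of the error budget is being spent on the secondary feature.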
Final Thought
A URL shortener is a deceptively simple system. The product is one redirect, resolved in under 50 milliseconds. But behind that redirect is a web of decisions about key distribution, cache invalidation, pipeline isolation, abuse resistance, and failure boundaries that separate a toy prototype from a production system serving billions of requests.
We started with a single API server and a database. We ended with a multi-region, multi-tier architecture with async analytics, targeted cache invalidation, abuse scoring, and graceful degradation. Not because we planned it all upfront, but because we let the numbers tell us what to build next.
The goal of this walkthrough was not just to design a URL shortener. It was to demonstrate a way of thinking: start with requirements, let numbers drive decisions, build progressively, and always know what breaks first. That approach works for URL shorteners, and it works for every system design problem you will ever face.