Design a Twitter/X Home Timeline (Interviewer Walkthrough)
Goal: fast home feed reads at massive scale while handling write amplification and celebrity skew.
---
0) Pre-Design Research Inputs
Key research:
- Feed systems typically compare fanout-on-write vs fanout-on-read.
- Redis sorted structures are common for fast recency-ordered retrieval.
- Ordering and queue partitioning decisions must align with consistency needs.
Design implications:
- Hybrid feed strategy is required (not one-size-fits-all).
- Celebrity accounts need special handling.
- Timeline cache/store should optimize top-N recent reads.
---
1) Requirements
Now that we have problem context, let us pin down what this feed system must deliver before debating fanout strategy.
Functional
- User can post tweet.
- User can follow/unfollow.
- User can load home timeline quickly.
Non-functional
- Home timeline p95 < 200ms.
- Very high read scale.
- Eventual consistency acceptable for timeline updates.
- System handles skew (celebrity followers).
---
2) Capacity Estimation
With requirements fixed, we quantify read/write asymmetry, because feed architecture is primarily decided by this ratio.
Assumptions:
- DAU: 100M
- Avg tweets/day: 500M
- Home feed reads/day: 20B
- Peak factor: 8x
Numbers:
- Tweet writes avg: ~5.8k/s, peak ~46k/s
- Feed reads avg: ~231k/s, peak ~1.85M/s
Implication:
- Read path dominates by huge margin.
- Precomputation/caching for home feed is essential.
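The rates above follow from quick arithmetic; a short sketch using the section's own assumptions:

```python
# Back-of-envelope capacity math using the assumptions stated above.
DAU = 100_000_000
TWEETS_PER_DAY = 500_000_000
FEED_READS_PER_DAY = 20_000_000_000
PEAK_FACTOR = 8
SECONDS_PER_DAY = 86_400

tweet_writes_avg = TWEETS_PER_DAY / SECONDS_PER_DAY    # ~5.8k/s
feed_reads_avg = FEED_READS_PER_DAY / SECONDS_PER_DAY  # ~231k/s
tweet_writes_peak = tweet_writes_avg * PEAK_FACTOR     # ~46k/s
feed_reads_peak = feed_reads_avg * PEAK_FACTOR         # ~1.85M/s

print(f"writes avg {tweet_writes_avg:,.0f}/s, peak {tweet_writes_peak:,.0f}/s")
print(f"reads  avg {feed_reads_avg:,.0f}/s, peak {feed_reads_peak:,.0f}/s")
print(f"read:write ratio {feed_reads_avg / tweet_writes_avg:.0f}:1")
```

The 40:1 read-to-write ratio is what pushes the design toward precomputation.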
---
2.1) Storage + Ops
Now let us estimate storage and write amplification impact, especially for fanout-heavy designs.
Assume timeline entry ~80B (tweetId + authorId + score/timestamp + metadata):
- If fanout-on-write for all users, storage duplication is massive.
Example:
- Avg followers 300 => one tweet can generate ~300 timeline writes.
- At 46k tweets/s peak, naive fanout writes can approach ~13.8M timeline writes/s.
Implication:
- Full fanout-on-write for everyone is too expensive under follower skew.
- Need hybrid strategy.
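The write-amplification figure can be checked directly from the section's assumptions:

```python
# Naive full fanout-on-write amplification under the assumptions above.
AVG_FOLLOWERS = 300
PEAK_TWEETS_PER_S = 46_000
ENTRY_BYTES = 80  # tweetId + authorId + score/timestamp + metadata

timeline_writes_per_s = PEAK_TWEETS_PER_S * AVG_FOLLOWERS  # 13.8M writes/s
bytes_per_s = timeline_writes_per_s * ENTRY_BYTES          # ~1.1 GB/s of timeline entries

print(f"{timeline_writes_per_s:,} timeline writes/s, {bytes_per_s / 1e9:.2f} GB/s")
```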
---
3) Core Entities v1
With scale pressure visible, we define entities that separate source-of-truth data from serving-model data.
- User(userId, ...)
- Tweet(tweetId, authorId, createdAt, text, mediaRef?)
- FollowEdge(followerId, followeeId, createdAt)
- HomeTimeline(userId, scoreTs, tweetId, authorId) (materialized)
Thought process:
Tweet is the source of truth. HomeTimeline is the serving model optimized for read latency. FollowEdge drives both write fanout and read assembly.
Functional requirement traceability:
- FR1 (post tweet) -> Tweet persists canonical content.
- FR2 (follow/unfollow) -> FollowEdge controls feed eligibility and fanout edges.
- FR3 (load home timeline) -> HomeTimeline serves the low-latency read path.
Why this mapping matters:
- Feed systems fail when entities are generic; this mapping keeps model tied to product behavior.
---
4) API / Interface
Now that entities are set, we define minimal feed contracts and parameters that support low-latency pagination.
- POST /v1/tweets
- POST /v1/follows
- DELETE /v1/follows/{followeeId}
- GET /v1/home?cursor=...&limit=...
Parameter reasoning:
- cursor supports pagination without expensive deep offsets.
- limit is bounded to protect tail latency and cache efficiency.
Functional requirement to API mapping:
- FR1 -> POST /v1/tweets
- FR2 -> POST /v1/follows, DELETE /v1/follows/{followeeId}
- FR3 -> GET /v1/home?cursor=...&limit=...
Why this mapping matters:
- API shape is justified by requirements; no unused endpoints in interview scope.
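A minimal sketch of the cursor and limit handling (the encoding and helper names are illustrative assumptions, not from the source):

```python
import base64
import json

MAX_LIMIT = 100  # hard cap protects tail latency and cache efficiency

def encode_cursor(score_ts: int, tweet_id: int) -> str:
    """Opaque cursor: the caller's position in the (scoreTs, tweetId) sort order."""
    raw = json.dumps([score_ts, tweet_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> tuple:
    """Recover the (scoreTs, tweetId) position; no offset scan is ever needed."""
    score_ts, tweet_id = json.loads(base64.urlsafe_b64decode(cursor))
    return (score_ts, tweet_id)

def clamp_limit(requested: int, default: int = 50) -> int:
    """Bound the page size so one request cannot blow the latency budget."""
    return min(requested or default, MAX_LIMIT)
```

Because the cursor names a position rather than an offset, `GET /v1/home` can seek directly into the timeline store regardless of how deep the user has scrolled.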
---
5) High-Level Design (progressive)
Now we build the architecture incrementally: first correctness, then read optimization, then skew handling, then resilience.
Step A: Correct baseline
User calls POST /v1/tweets to create a tweet, which is persisted in the tweet store. User calls GET /v1/home?cursor=...&limit=... to load their home timeline. The feed service must fetch the user's followees from the follow graph, retrieve recent tweets from each, merge-sort by time, and return paginated results. The question is: what baseline flow gives us correct timeline assembly before optimization?
Components:
- Tweet service + tweet store
- Follow graph store
- Feed read service
Baseline behavior:
- On read, fetch recent tweets from followees and merge-sort.
Why baseline:
- Simple correctness demonstration before optimization.
Decision details (data stores in baseline):
- Tweet store choice: Cassandra-style append-optimized store keyed by author/time.
- Follow graph choice: relational/graph-friendly store keyed by followerId -> followeeId.
- Why this split:
  - tweet writes are high-volume appends;
  - the follow graph needs edge queries and consistency for follow/unfollow semantics.
- Why not one store for both:
- access patterns differ significantly; single-model compromise hurts either write throughput or edge-query efficiency.
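The baseline read path described above can be sketched as a k-way merge over newest-first followee streams (in-memory dicts stand in for the tweet store and follow graph):

```python
import heapq
from itertools import islice

# In-memory stand-ins for the follow graph and per-author tweet store.
follows = {"alice": ["bob", "carol"]}
tweets_by_author = {
    "bob":   [(105, "b2"), (101, "b1")],  # (createdAt, tweetId), newest first
    "carol": [(103, "c1"), (100, "c0")],
}

def home_timeline(user: str, limit: int = 10):
    """Baseline fanout-on-read: merge each followee's newest-first stream."""
    streams = [tweets_by_author.get(f, []) for f in follows.get(user, [])]
    # heapq.merge preserves newest-first order given newest-first inputs.
    merged = heapq.merge(*streams, key=lambda t: t[0], reverse=True)
    return [tweet_id for _, tweet_id in islice(merged, limit)]
```

This is correct but does the full merge on every request, which is exactly the cost the later steps remove.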
Step B: Fanout-on-write for normal users
When a user calls POST /v1/tweets, fanout workers push the tweet entry into each follower's HomeTimeline store. When a follower calls GET /v1/home, the feed service reads precomputed entries instead of assembling from scratch. The question is: how do we make reads fast by precomputing timelines without exploding write costs?
Components added:
- Fanout workers
- Home timeline cache/store (e.g., Redis + persistent backing)
Decision:
- Push tweets into followers' home timelines for non-celebrity authors.
Why:
- Read volume is much higher than write; precompute to reduce read-time joins.
Choice details:
- Use async fanout workers reading tweet events.
- Materialize into the HomeTimeline(userId, scoreTs, tweetId) serving model.
- Keep the timeline in Redis for hot reads, with a persistent backing store for recovery.
Why this solves:
- Converts expensive read-time merges into cheap top-N reads for most users.
- Keeps home endpoint p95 stable under heavy read load.
Why not fanout-on-read only:
- At ~1.85M/s peak reads, a per-request merge of many followees is too expensive.
How with numbers:
- Reads peak at ~1.85M/s; serving from a prebuilt timeline keeps each feed request bounded to a top-N lookup.
- Example:
  - if each read merged even 200 followee streams, backend query fanout would be enormous;
  - a precomputed timeline reduces this to one or a few key-range reads.
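A minimal in-memory sketch of the fanout-on-write path (plain Python structures stand in for Redis sorted sets; in a real system these would be ZADD on write and a reverse range read on the read path):

```python
from collections import defaultdict

followers = {"bob": ["alice", "dave"]}  # author -> followers
timelines = defaultdict(list)           # userId -> [(scoreTs, tweetId)]

def fanout_on_write(author: str, tweet_id: str, score_ts: int) -> None:
    """Push one timeline entry per follower at post time (async in practice)."""
    for follower in followers.get(author, []):
        timelines[follower].append((score_ts, tweet_id))

def read_top_n(user: str, n: int = 50):
    """Cheap top-N read: no cross-followee merge at request time."""
    return [tid for _, tid in sorted(timelines[user], reverse=True)[:n]]
```

The read path is now a single bounded key lookup per user, which is what keeps p95 stable at ~1.85M reads/s.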
Step C: Celebrity hybrid strategy
When a celebrity with millions of followers calls POST /v1/tweets, the fanout-on-write approach from Step B would generate millions of HomeTimeline writes. A single tweet from an account with 20M followers produces 20M timeline inserts, creating massive write amplification and potential fanout worker backlogs that degrade feed freshness for everyone.
Decision:
- For high-follower accounts, skip full fanout-on-write; fetch on read (fanout-on-read for celebrity edges).
Why:
- Prevent write explosion from celebrity posts.
Choice details:
- Define dynamic threshold (e.g., follower count or historical fanout cost) for "celebrity mode."
- For celebrity tweets, store canonical tweet and inject at read time for followers.
Why this solves:
- caps extreme write amplification from rare high-fanout authors.
Why not fully fanout-on-write:
- one celebrity tweet can produce tens of millions of writes.
- with burst posting, fanout workers backlog and degrade feed freshness for everyone.
How with numbers:
- If an account has 20M followers, one tweet => 20M timeline writes if fully fanned out.
- Hybrid fanout avoids this spike and shifts a controlled merge cost to readers who follow celebrities.
- Tradeoff: the feed assembly path becomes heterogeneous (push + pull merge).
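The heterogeneous push + pull merge can be sketched as follows (data shapes are illustrative; `precomputed` comes from fanout-on-write, `celeb_tweets` is pulled at read time):

```python
def assemble_home(precomputed, celeb_follows, celeb_tweets, n=50):
    """Merge the materialized timeline with read-time pulls for celebrity edges."""
    candidates = list(precomputed)       # (scoreTs, tweetId) pushed at write time
    for celeb in celeb_follows:          # fanout-on-read only for celebrity edges
        candidates.extend(celeb_tweets.get(celeb, []))
    candidates.sort(reverse=True)        # newest first by scoreTs
    return [tweet_id for _, tweet_id in candidates[:n]]
```

Only the handful of celebrity edges a user follows are merged at read time, so the pull cost stays small while write amplification is capped.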
Step D: Ranking + caching + freshness
When GET /v1/home is called, the feed service may optionally call a ranking service to reorder tweets by engagement and personalization features before returning results. Cached timeline segments improve latency, but stale caches hurt freshness. The question is: how do we add ranking without making it a hard dependency that blocks feed availability?
Components added:
- Ranking service (lightweight in interview scope)
- Cache invalidation/refresh strategy
Decision:
- Use recency-first baseline with optional ranking features.
Why:
- Guarantees deterministic fallback if ranking service degrades.
Choice details:
- Recency order is always available from timeline store.
- Ranking service enriches top window only (not entire feed).
- Cache ranked segments with short TTL.
Why this solves:
- personalization improves quality when healthy;
- service remains available when ranking/feature pipelines fail.
Why not ranking-hard-dependency:
- adds failure coupling between feed availability and ML/feature services.
How with numbers:
- keep home feed p95 target (<200ms) by capping ranking calls and falling back to recency when ranking latency breaches budget.
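One way to keep ranking a soft dependency is a budgeted call with a recency fallback; a sketch under the assumptions above (the 50ms budget is an illustrative share of the 200ms p95 target):

```python
import time

RANKING_BUDGET_MS = 50  # illustrative share of the 200ms p95 budget

def serve_home(recency_entries, rank_fn, budget_ms=RANKING_BUDGET_MS):
    """Rank only the top window; fall back to pure recency on failure or overrun."""
    start = time.monotonic()
    try:
        ranked_window = rank_fn(recency_entries[:50])
    except Exception:
        return recency_entries  # ranking outage -> deterministic recency fallback
    if (time.monotonic() - start) * 1000 > budget_ms:
        return recency_entries  # latency budget breached -> fallback
    return ranked_window + recency_entries[50:]
```

Because the recency-ordered list is always available from the timeline store, the feed endpoint never blocks on the ranking service.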
---
Core Entities v2
After these architecture changes, we refine entities so hybrid fanout and ranking controls are explicit in the data model.
- HomeTimeline(..., sourceType) where sourceType = materialized | pull
- UserFeedConfig(userId, rankingMode, language, contentPrefs)
- TweetEngagement(tweetId, likes, replies, retweets) for ranking features
Why changed:
- Added source-type to support hybrid feed merge.
- Added config/engagement to evolve ranking without changing core tweet store.
---
6) Deep Dives (numeric + mechanism)
With the complete design in place, let us stress-test each high-risk tradeoff in the same order we introduced it.
Deep Dive 1: Fanout strategy
Now that we implemented hybrid fanout, we first verify why this tradeoff beats one-size-fits-all approaches. Bad: full fanout-on-read only.
- Why bad: every home read becomes expensive multi-source merge at huge read QPS.
- Example: peak home reads ~1.85M/s.
- How technically:
  - each request must fetch recent tweets from many followees and merge-sort them;
  - backend query fanout and CPU per request increase sharply, hurting p95.
Good: full fanout-on-write for everyone.
- Why good: home reads become cheap top-N lookups.
- Why it breaks: celebrity posts cause write explosion.
- Example: one user with 20M followers posts once -> 20M timeline inserts.
Great: hybrid fanout.
- Strategy: fanout-on-write for normal accounts, fanout-on-read for celebrity edges.
- Why great: balances read latency with write amplification control.
- Tradeoff: feed assembly logic is more complex for users following celebrities.
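The hybrid policy reduces to a per-author decision at post time; a sketch with an illustrative threshold (real systems may use historical fanout cost rather than a raw follower count):

```python
FANOUT_THRESHOLD = 10_000  # illustrative cutoff for "celebrity mode"

def fanout_mode(follower_count: int) -> str:
    """Hybrid policy: push for normal accounts, pull for celebrity edges."""
    if follower_count >= FANOUT_THRESHOLD:
        return "fanout_on_read"   # store canonical tweet only; inject at read time
    return "fanout_on_write"      # materialize into each follower's timeline
```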
Deep Dive 2: Timeline storage choice
Next, we validate storage-layer choices because feed latency is highly sensitive to serving model design. Bad: query canonical tweet store on every home request.
- Why bad: tweet store optimized for writes/history, not low-latency per-user timeline assembly.
- Example: 500M tweets/day with high read fanout.
- How: repeated wide reads + merge at request time increase DB and service load.
Good: materialized home timeline entries.
- Why good: serving layer precomputes per-user candidates.
- Example: read path becomes "get top 50 by scoreTs" instead of multi-followee merge each time.
Great: in-memory timeline cache (Redis) + durable backing store.
- Why great: p95 is stabilized by cache; durable store enables recovery and backfill.
- Tradeoff: invalidation and warmup complexity.
Deep Dive 3: Cache invalidation
With serving model selected, we now examine freshness and invalidation behavior under updates/deletes. Bad: no invalidation strategy.
- Why bad: stale or missing tweets persist in home timeline.
- Example: tweet delete/unfollow not reflected quickly.
- How: cache entries stay outdated until manual refresh, hurting trust.
Good: TTL-based refresh.
- Why good: simple and bounded staleness.
- Limitation: short TTL increases backend load; long TTL increases staleness.
Great: event-driven partial invalidation + cursor/version checks.
- Example:
  - a delete event removes the specific tweet from affected caches;
  - a follow/unfollow event invalidates only the impacted segments.
- Tradeoff: event plumbing and cache key design complexity.
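The event handlers can be sketched against the HomeTimeline entry shape defined earlier (in-memory dicts stand in for the timeline cache; entry tuples are (scoreTs, tweetId, authorId)):

```python
from collections import defaultdict

timelines = defaultdict(list)        # userId -> [(scoreTs, tweetId, authorId)]
followers_of = {"bob": ["alice"]}    # author -> followers, for targeted invalidation

def on_tweet_deleted(author: str, tweet_id: str) -> None:
    """Delete event: strip that one tweet from each affected follower's cache."""
    for user in followers_of.get(author, []):
        timelines[user] = [e for e in timelines[user] if e[1] != tweet_id]

def on_unfollow(user: str, author: str) -> None:
    """Unfollow event: invalidate only that author's entries for this user."""
    timelines[user] = [e for e in timelines[user] if e[2] != author]
```

Each event touches only the affected keys, avoiding the blanket refresh cost of a short TTL.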
Deep Dive 4: Follow churn
Now that invalidation is addressed, we test follow/unfollow churn correctness in precomputed timelines. Bad: ignore unfollow effects.
- Why bad: user continues seeing content from accounts they unfollowed.
- Example: user unfollows account at 10:00, still sees posts at 10:05.
- How: stale precomputed entries remain unless explicitly filtered.
Good: read-time follow filter.
- Why good: correctness at display time even if precomputed timeline is stale.
- Tradeoff: extra check on read path.
Great: background cleanup + read-time safety filter.
- Why great: cleanup reduces stale data while safety filter guarantees correctness.
- Example: async worker removes obsolete entries; read path still validates follow edge for zero-leak behavior.
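The read-time safety filter is a one-line check against the current follow edges; a sketch assuming the same (scoreTs, tweetId, authorId) entry shape:

```python
def render_timeline(user, entries, follow_edges):
    """Read-time safety filter: drop entries whose follow edge no longer exists."""
    still_followed = follow_edges.get(user, set())
    # entries are (scoreTs, tweetId, authorId); filter on the author field
    return [e for e in entries if e[2] in still_followed]
```

Even if the background cleanup worker lags, stale entries never reach the user, which is the zero-leak guarantee described above.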
Deep Dive 5: Reliability
Finally, we verify that the feed still works when ranking dependencies degrade. Bad: ranking service is hard dependency for every feed read.
- Why bad: ranking outage takes down home feed availability.
- Example: ranking feature store latency spike.
- How:
  - the feed request waits on the ranking call;
  - p95 breaches the SLA and timeouts increase.
Good: fallback to recency-only feed.
- Why good: preserves core product behavior even without personalization.
- Example: disable ranking enrichment toggle during incident.
Great: graceful degradation tiers.
- Tier 1: full ranking.
- Tier 2: lightweight ranking with cached features.
- Tier 3: pure recency.
- Why great: keeps service available while controlling latency/cost during incidents.
- Tradeoff: temporary relevance drop in feed quality.
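The tier selection above can be expressed as a simple health-driven switch (signal names are illustrative):

```python
def pick_tier(ranking_healthy: bool, feature_cache_warm: bool) -> str:
    """Graceful degradation: pick the richest tier the current health allows."""
    if ranking_healthy:
        return "full_ranking"     # Tier 1: full ranking
    if feature_cache_warm:
        return "cached_ranking"   # Tier 2: lightweight ranking, cached features
    return "recency_only"         # Tier 3: pure recency from the timeline store
```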
---
7) Common mistakes
- Picking only one fanout model for all users.
- No handling of celebrity skew.
- No cursor design for feed pagination.
- Tight coupling between ranking and feed availability.
---
8) Interviewer signals
- Did you quantify read/write skew?
- Did you recognize and solve celebrity write amplification?
- Did you design fallback path when ranking/deep services fail?
---
9) References
- Fanout strategy overview (industry discussion): https://www.systemdesignsandbox.com/learn/fan-out-strategies
- Redis sorted set patterns: https://redis.io/learn/howtos/leaderboard
Key Takeaways
- User can follow/unfollow.
- User can load home timeline quickly.
- Home timeline p95 < 200ms.
- Very high read scale.
- Eventual consistency acceptable for timeline updates.