Design a WhatsApp-Style Chat System (Interviewer Walkthrough)
This is a real-time system design problem where correctness, latency, and failure handling matter more than drawing many boxes. The goal is to deliver a working chat system first, then harden it with scale and reliability decisions backed by research and numbers.
---
0) Pre-Design Research Inputs (done before architecture)
Primary references used to shape decisions:
- RFC 6455 (WebSocket): persistent bidirectional communication over a single upgraded TCP connection.
- Kafka ordering model: ordering guarantees are per partition, not across partitions.
- Redis Pub/Sub docs: at-most-once delivery semantics (no guaranteed delivery).
- Cassandra modeling docs: query-driven design, partition-key-first schema.
- FCM/APNs docs: push delivery is best effort; must not be your primary message durability mechanism.
How this research changes design:
- We choose WebSocket for real-time session channel.
- We choose per-chat partition key where ordering matters.
- We do not rely on Redis Pub/Sub as durable delivery.
- We keep durable messages in DB and use push only as wakeup/notification.
---
1) Requirements (~5 min)
Now that we have aligned on research constraints, let us define exactly what this system must do within interview scope before choosing any architecture.
Functional requirements (top 4)
- Users can create chats (1:1 and small groups).
- Users can send and receive messages in near real-time.
- Offline users can fetch missed messages when they reconnect.
- Users can send media (metadata in chat, blob in object storage).
Non-functional requirements
- Message delivery p95 to online recipient < 500 ms (in-region target).
- Durable message storage (no message loss after server ack).
- Scale to very high throughput and many concurrent connections.
- Tolerate partial failures (chat server/node/message-bus failures).
- Basic abuse protection and rate limits.
Scope control
- In scope: text/media metadata messaging, offline sync, ordering strategy, delivery ack.
- Out of scope: full E2EE protocol details, voice/video calling, spam ML models.
---
2) Capacity Estimation (decision-driving)
With requirements fixed, we now quantify traffic so design choices are forced by numbers rather than intuition.
Assumptions
- DAU: 200M
- Avg messages per DAU/day: 20
- Total messages/day: 4B
- Avg msg/s: 4B / 86400 ~= 46.3k msg/s
- Peak factor: 10x -> ~463k msg/s peak
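The arithmetic above can be verified with a quick script (the inputs are the stated assumptions, not measurements):

```python
# Back-of-envelope capacity math using the assumptions above.
DAU = 200_000_000            # daily active users (assumption)
MSGS_PER_USER_PER_DAY = 20   # average sends per user per day (assumption)
PEAK_FACTOR = 10             # peak-to-average multiplier (assumption)

total_msgs_per_day = DAU * MSGS_PER_USER_PER_DAY      # 4 billion
avg_msgs_per_sec = total_msgs_per_day / 86_400        # ~46.3k msg/s
peak_msgs_per_sec = avg_msgs_per_sec * PEAK_FACTOR    # ~463k msg/s

print(f"{total_msgs_per_day:,} msgs/day")
print(f"~{avg_msgs_per_sec/1000:.1f}k msg/s avg, ~{peak_msgs_per_sec/1000:.0f}k msg/s peak")
```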
Why this matters
- This is a high-write, always-on system.
- Any design that writes once per recipient can explode write amplification.
---
2.1) Storage + Ops Estimation
Based on throughput, let us now estimate storage and operational pressure, especially because chat systems accumulate data very quickly.
Message storage
Assume message row (IDs, chatId, senderId, type, timestamp, small content/meta pointer) ~500 bytes average:
- Daily storage: 4B * 500B = 2TB/day
- 30-day hot retention: ~60TB (single copy)
- 3x replication: ~180TB
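The storage figures follow directly from the row-size assumption and can be checked the same way:

```python
# Storage math for the stated 500-byte average row (assumption).
MSGS_PER_DAY = 4_000_000_000
ROW_BYTES = 500          # avg message row size (assumption)
HOT_DAYS = 30
REPLICATION = 3

daily_bytes = MSGS_PER_DAY * ROW_BYTES          # 2 TB/day
hot_bytes = daily_bytes * HOT_DAYS              # 60 TB single copy
replicated_bytes = hot_bytes * REPLICATION      # 180 TB replicated

TB = 10**12
print(f"{daily_bytes/TB:.0f} TB/day, {hot_bytes/TB:.0f} TB hot, {replicated_bytes/TB:.0f} TB replicated")
```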
Attachment strategy impact
If media binary were stored in chat DB (bad idea), storage and IO explode quickly. Decision: store media in object storage; keep only metadata + URL/token in message row.
Architecture forced by numbers
- Separate metadata DB from media blob storage.
- Partition message table carefully.
- Strong lifecycle policies (tiering, TTL for old delivery metadata).
---
3) Core Entities (~2 min)
Now that load and storage shape are clear, we can define the first draft of entities around the exact query patterns we must serve.
Core Entities v1
- User(userId, ...)
- Chat(chatId, type, createdAt, createdBy)
- ChatParticipant(chatId, participantId, role, joinedAt)
- Message(chatId, messageId, senderId, serverTs, type, body/meta, mediaRef?)
- ClientSession(userId, deviceId, serverId, lastHeartbeatTs)
- DeliveryCursor(chatId, userId, lastDeliveredSeq, lastReadSeq)
Core entities are v1 draft; we evolve them after HLD and deep dives.
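As a concrete reference, a few of the v1 entities can be sketched as Python dataclasses. Field names follow the draft above; the concrete types are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chat:
    chat_id: str
    type: str            # "direct" or "group" (assumed enum values)
    created_at: int      # epoch seconds
    created_by: str

@dataclass
class Message:
    chat_id: str         # partition/routing key
    message_id: str
    sender_id: str
    server_ts: int       # canonical server-side timestamp
    type: str            # "text", "media", ...
    body: Optional[str] = None
    media_ref: Optional[str] = None   # pointer into object storage, never the blob

@dataclass
class DeliveryCursor:
    chat_id: str
    user_id: str
    last_delivered_seq: int
    last_read_seq: int
```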
Thought process for entity selection
Primary queries:
- "Get latest N messages in chat."
- "Send message to all participants."
- "After reconnect, get messages since last seen/read."
Why each entity exists:
- Chat, ChatParticipant: membership and authorization.
- Message: durable history and replay source.
- ClientSession: routing for online delivery.
- DeliveryCursor: avoids per-recipient inbox write amplification.
Functional requirement traceability:
- FR1 (create chats) -> Chat + ChatParticipant.
- FR2 (send/receive realtime) -> Message + ClientSession for routing.
- FR3 (offline fetch) -> Message + DeliveryCursor for replay windows.
- FR4 (media support) -> Message.mediaRef + MediaObject in v2.
Why this mapping matters:
- It makes entity choices requirement-driven and prevents schema bloat.
Why DeliveryCursor over inbox-per-recipient for every message:
- At ~463k msg/s peak, per-recipient inbox writes can exceed DB limits quickly for groups.
- Cursor-based sync writes once per message + cheap cursor updates.
---
4) API / System Interface (~5 min)
With entities defined, let us define the realtime contract that clients and servers use, and explain why each command field exists.
Realtime socket commands (WebSocket over TLS)
- createChat
- sendMessage
- ackDelivered
- ackRead
- syncSince
Example command contracts
sendMessage
```json
{
  "chatId": "c123",
  "clientMsgId": "uuid-123",
  "type": "text",
  "body": "hello",
  "mediaRef": null
}
```

Server response:

```json
{
  "status": "SUCCESS",
  "chatId": "c123",
  "messageId": "m789",
  "serverTs": 1710412345
}
```

Why these parameters
- clientMsgId: idempotency key for retries (network loss, duplicate send tap).
- chatId: partition/routing key and authorization boundary.
- serverTs: canonical ordering hint from server side.
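The clientMsgId idempotency path can be sketched as follows. This is a minimal in-memory illustration with hypothetical names; a real implementation would enforce uniqueness in the durable message table instead of a dict:

```python
import uuid

# (chat_id, client_msg_id) -> the ack we already returned for that send.
_seen: dict = {}

def handle_send_message(chat_id: str, client_msg_id: str, body: str) -> dict:
    key = (chat_id, client_msg_id)
    if key in _seen:
        # Retry of an already-applied send: return the original ack, write nothing.
        return _seen[key]
    response = {
        "status": "SUCCESS",
        "chatId": chat_id,
        "messageId": f"m-{uuid.uuid4().hex[:8]}",
    }
    _seen[key] = response   # persist message + remember the ack
    return response

first = handle_send_message("c123", "uuid-123", "hello")
retry = handle_send_message("c123", "uuid-123", "hello")
assert first["messageId"] == retry["messageId"]  # duplicate tap collapses to one message
```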
Why WebSocket:
- RFC 6455 supports bidirectional, persistent channel.
- Lower overhead than repeated HTTP polling for high-frequency chat updates.
Functional requirement to API mapping:
- FR1 -> createChat
- FR2 -> sendMessage, ackDelivered, realtime newMessage events
- FR3 -> syncSince, ackRead
- FR4 -> media upload token flow + sendMessage.mediaRef
Why this mapping matters:
- Each command exists for a requirement; this avoids protocol noise and makes interview explanation crisp.
---
5) High-Level Design (~10-15 min, progressive build)
Now that contracts are clear, we can build the architecture step-by-step, starting from a minimal working flow and then hardening it for scale and failures.
Step A: Minimal working chat (single region)
Client sends sendMessage over WebSocket with chatId, clientMsgId, and body. The chat service validates membership, persists the message durably, assigns serverTs, and returns acknowledgment. Recipients fetch history via syncSince or receive push on their WebSocket. The question is: what storage model supports high-write chat history with efficient per-chat retrieval?
Components:
- WebSocket gateway + chat service
- Membership store
- Message store (Cassandra)
Decision: Cassandra for message history
- Need: very high write throughput, query-by-chat access pattern, horizontal scale.
- Research says: Cassandra modeling is query-driven; partition key is central.
- Choice: messages_by_chat ((chatId, dayBucket), serverTs, messageId, senderId, body/meta...)
- Why this works:
  - writes append naturally;
  - reads of the latest N messages for a chat are efficient.
- Why not SQL joins for core path:
  - heavy write scale + a join-heavy read path is harder to sustain at this throughput.
- Tradeoff:
  - denormalization and careful partition sizing required.
How with numbers:
- Peak writes ~463k msg/s.
- Partitioning by (chatId, dayBucket) keeps partitions bounded and avoids unbounded "single chat forever" partition growth.
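The (chatId, dayBucket) composite key can be sketched in a few lines. Day-granularity bucketing is an assumption here; the right bucket width depends on per-chat message rates:

```python
from datetime import datetime, timezone

def partition_key(chat_id: str, server_ts: int) -> tuple:
    # Bucket by UTC day so a long-lived chat never grows one partition forever.
    day_bucket = datetime.fromtimestamp(server_ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return (chat_id, day_bucket)

# Two messages a day apart in the same chat land in different partitions.
k1 = partition_key("c123", 1710412345)
k2 = partition_key("c123", 1710412345 + 86_400)
print(k1, k2)
```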
Step B: Realtime fanout to online users
After sendMessage persists, the server must push the new message to all online participants. Each participant may be connected to a different chat server via WebSocket. The system looks up participant sessions and routes the message to their socket hosts. The question is: how do we discover where each user is connected and deliver messages across chat servers efficiently?
Components added:
- Chat server cluster
- Session registry (Redis): userId:deviceId -> serverId/socket
- Kafka topic chat_messages partitioned by chatId
Decision: Redis for session registry
- Need: fast lookup of where a user/device is currently connected.
- Choice: keep ephemeral connection map in Redis with heartbeat-updated TTL.
- Why this solves:
- routing layer can find target websocket host in sub-millisecond lookup.
- Why not store sessions in primary message DB:
- session data is high-churn ephemeral state; mixing with durable history increases write noise.
- How with numbers:
- at millions of concurrent sessions, heartbeat updates are frequent and short-lived; in-memory TTL store is a better fit than durable DB writes.
- Tradeoff:
- ephemeral registry can lose entries on outage; reconnect/sync logic must tolerate it.
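The heartbeat-refreshed TTL behavior can be sketched with an in-memory stand-in for Redis (the 30-second TTL is an assumption; real deployments tune it against heartbeat interval):

```python
SESSION_TTL_S = 30  # entry is considered dead without a heartbeat in this window (assumption)

# (user_id, device_id) -> (server_id, expiry_time)
_sessions: dict = {}

def heartbeat(user_id: str, device_id: str, server_id: str, now: float) -> None:
    # Each heartbeat re-registers the connection and pushes the expiry forward.
    _sessions[(user_id, device_id)] = (server_id, now + SESSION_TTL_S)

def lookup(user_id: str, device_id: str, now: float):
    entry = _sessions.get((user_id, device_id))
    if entry is None or entry[1] < now:
        return None  # missing or stale -> treat device as offline, fall back to push + sync
    return entry[0]

heartbeat("u1", "d1", "ws-host-7", now=100.0)
assert lookup("u1", "d1", now=110.0) == "ws-host-7"   # fresh entry routes the message
assert lookup("u1", "d1", now=200.0) is None          # TTL expired -> offline path
```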
Decision: Kafka for inter-service message stream
- Need: high-throughput fanout backbone + replay + per-chat ordering.
- Research says: ordering guaranteed within partition only.
- Choice: key messages by chatId so each chat is ordered in one partition.
- Why this works:
- preserves in-chat order while allowing parallelism across chats.
- Why not global single partition:
- would serialize all chats and kill throughput.
- Tradeoff:
- no global total order across all chats (not needed).
How with numbers:
- Peak ~463k msg/s.
- Planning assumption: ~5k msg/s per partition as a safe target -> need ~93 partitions, choose 128 for headroom.
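Keying by chatId means every message in one chat hashes to the same partition, which is the entire ordering argument. A sketch (Kafka's default partitioner actually uses murmur2; sha256 here just keeps the illustration deterministic):

```python
import hashlib

NUM_PARTITIONS = 128  # 463k / ~5k per partition ~= 93, rounded up for headroom

def partition_for(chat_id: str) -> int:
    # Stable hash of the message key -> same chat always lands on the same partition.
    h = int.from_bytes(hashlib.sha256(chat_id.encode()).digest()[:8], "big")
    return h % NUM_PARTITIONS

# Per-chat order is preserved inside one partition; unrelated chats spread across 128.
assert partition_for("c123") == partition_for("c123")
print(partition_for("c123"), partition_for("c456"))
```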
Step C: Offline sync and durability guarantees
When a recipient is offline, sendMessage still persists durably, but the WebSocket push fails. On reconnect, the client calls syncSince(chatId, lastDeliveredSeq) to fetch missed messages. Push notifications (FCM/APNs) wake the app, but delivery through push is best-effort. The question is: how do we guarantee no message loss for offline users without relying on push delivery?
Components added:
- DeliveryCursor store per user/chat/device
- syncSince(chatId, lastSeq/ts) API
- Push notification worker (FCM/APNs)
Decision: DB is source of truth, push is wakeup only
- Need: no message loss when user offline.
- Research says:
- FCM/APNs are notification channels, delivery is best effort; cannot replace durable chat store.
- Choice:
  - write message durably first;
  - notify connected devices via WS;
  - send push for offline wakeup;
  - on reconnect, client syncs from durable store using cursor.
- Why this works:
- message durability does not depend on push service behavior.
- Why not "push-only offline delivery":
- dropped/throttled pushes would lose messages.
How with numbers:
- Even if push delivery success is not perfect, reconnect sync closes gaps.
- Cursor update is small metadata write, cheaper than per-message inbox rows for every recipient.
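The reconnect path reduces to a range read over the durable log. A minimal sketch, with the log represented as in-memory rows for illustration:

```python
# (seq, body) rows as they would come back from the message store, ordered by seq.
chat_log = [
    (1, "hi"), (2, "are you there?"), (3, "ping"), (4, "hello?"),
]

def sync_since(last_delivered_seq: int, limit: int = 100):
    # Replay everything the durable store has past the client's cursor.
    return [m for m in chat_log if m[0] > last_delivered_seq][:limit]

# Client went offline after seq 2; reconnect closes the gap regardless of push behavior.
missed = sync_since(last_delivered_seq=2)
assert [seq for seq, _ in missed] == [3, 4]
```

After the client acknowledges the replayed messages, it advances lastDeliveredSeq: one small cursor write instead of per-message inbox rows.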
Step D: Media flow separation
When sendMessage includes media, the client first uploads the binary via a separate upload token flow, then sends the message with mediaRef pointing to the stored object. Recipients fetch media via CDN using the reference. The question is: how do we handle large media uploads without blocking or degrading the core text-message path?
Components added:
- Media service
- Object storage + CDN
Flow:
- Client requests upload token.
- Client uploads media directly to object storage.
- Chat message stores mediaRef metadata only.
Decision: object storage + CDN for media blobs
- Need: handle large binary payloads without hurting message path latency.
- Choice: direct-to-object-store upload, persist only reference metadata in message row.
- Why this solves:
- keeps durable chat DB focused on small metadata rows and ordered retrieval.
- Why not store binary in message DB:
- blob-heavy rows bloat compaction, storage, and read amplification for normal text retrieval.
- Tradeoff:
- added media service complexity (signed URLs, lifecycle policies, CDN invalidation).
How with numbers:
- With only message metadata in DB (~500B row), message path stays predictable.
- Binary growth is handled by blob store designed for large objects.
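The upload-token step can be sketched as an HMAC-signed grant. This is a hypothetical illustration (key, payload layout, and expiry handling are all assumptions), not a specific object-store API:

```python
import hashlib
import hmac

SECRET = b"media-service-signing-key"  # placeholder signing key (assumption)

def issue_upload_token(user_id: str, media_id: str, expires_at: int) -> str:
    # Grant: "this user may upload this object until expires_at", signed by the media service.
    payload = f"{user_id}:{media_id}:{expires_at}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_upload_token(token: str, now: int) -> bool:
    payload, _, sig = token.rpartition(":")
    expires_at = int(payload.rsplit(":", 1)[1])
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and now < expires_at

tok = issue_upload_token("u1", "media-42", expires_at=1_000)
assert verify_upload_token(tok, now=500)       # within window, signature valid
assert not verify_upload_token(tok, now=2_000) # expired token is rejected
```

The client then uploads directly to object storage with the token; only the resulting mediaRef enters the message row.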
---
Core Entities v2 (after evolution)
After the architecture evolves, we now refine the data model from v1 to v2 so it reflects the final access patterns, not the initial guess.
- Message(chatId, dayBucket, serverTs, messageId, senderId, type, body, mediaRef, status)
- UserChats(userId, chatId, lastActivityTs) (for listing user chats efficiently)
- DeliveryCursor(chatId, userId, deviceId, lastDeliveredSeq, lastReadSeq, updatedAt)
- ClientSession(userId, deviceId, serverId, connState, lastHeartbeatTs)
- MediaObject(mediaId, ownerId, size, mime, storageKey, checksum)
Why changed:
- Added UserChats because a "list all chats for user" query appears.
- Added deviceId in cursor for multi-device consistency.
- Added checksum/mime in media metadata for integrity and validation.
---
6) Deep Dives (numeric + mechanism)
With the full design in place, let us now stress-test the most failure-prone parts one by one.
Deep Dive 1: Ordering strategy
Now that we have a realtime stream, the first deep dive is ordering scope: what order we promise, and at what cost.
Bad:
- Assume global ordering for all chats.
- Why bad: impossible at scale without serializing throughput.
Good:
- Per-chat ordering only.
Great:
- Per-chat ordered stream via Kafka partition key = chatId.
- Example: messages in one chat maintain sequence, while unrelated chats process in parallel.
- How: partition-level order guarantee from Kafka model.
Deep Dive 2: Redis Pub/Sub vs durable bus
Next, after ordering, we validate delivery semantics. The key question is whether the chosen transport can recover from disconnects and consumer downtime.
Bad:
- Use Redis Pub/Sub as only delivery path.
- Why bad: Redis Pub/Sub is at-most-once; disconnected subscribers miss events.
Good:
- Redis Pub/Sub only for ephemeral presence.
Great:
- Kafka for durable stream + DB-backed sync for missed messages.
- Example: chat server restart misses transient events but reconnect sync recovers from DB.
Deep Dive 3: Per-recipient inbox write amplification
Now let us examine write amplification, because group messaging can multiply writes dramatically if modeled naively.
Bad:
- Write one inbox row per recipient for every message.
- Example: a 100-member group message generates 99 extra writes.
- How with numbers:
- at 463k msg/s peak, worst-case recipient fanout can explode writes into multi-million writes/s.
Good:
- Write message once, push to online recipients.
Great:
- Cursor-based replay (syncSince) for offline recipients.
- Tradeoff: slightly more complex client sync logic, much lower server write amplification.
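The amplification gap is easy to quantify. An illustrative worst case where every peak message targets a 100-member group:

```python
# Write amplification: per-recipient inbox rows vs write-once + cursor replay.
peak_msgs_per_sec = 463_000
group_size = 100  # illustrative group fanout (assumption)

inbox_writes = peak_msgs_per_sec * group_size   # one inbox row per recipient
cursor_writes = peak_msgs_per_sec               # one message row; cursors update lazily

print(f"{inbox_writes:,} writes/s vs {cursor_writes:,} writes/s")
```

That is a 100x difference in peak write load on the hot path, which is why the cursor model wins for groups.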
Deep Dive 4: WebSocket connection failures
With throughput concerns addressed, the next topic is connection reliability and how we detect delivery uncertainty quickly.
Bad:
- Rely on TCP keepalive default only.
- Why bad: dead connections can take too long to detect.
Good:
- App-level heartbeat every few seconds.
Great:
- Heartbeat + delivery ACK timeout + reconnect with cursor sync.
- Example:
- if ackDelivered not received in N seconds, mark device uncertain and rely on replay.
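The "mark uncertain and rely on replay" rule can be stated as a tiny state function (the 10-second timeout is an assumption; the point is the three-way outcome, not the constant):

```python
ACK_TIMEOUT_S = 10  # give up waiting for ackDelivered after this window (assumption)

def delivery_state(sent_at: float, acked_at, now: float) -> str:
    if acked_at is not None:
        return "delivered"
    if now - sent_at > ACK_TIMEOUT_S:
        # Stop trusting the socket: the device may be gone; cursor replay will recover.
        return "uncertain"
    return "pending"

assert delivery_state(sent_at=0.0, acked_at=None, now=5.0) == "pending"
assert delivery_state(sent_at=0.0, acked_at=None, now=15.0) == "uncertain"
assert delivery_state(sent_at=0.0, acked_at=4.0, now=15.0) == "delivered"
```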
Deep Dive 5: Multi-region and latency
Now that single-region behavior is clear, we evaluate what changes when users and traffic are globally distributed.
Bad:
- Single-region only.
- Why bad: high RTT for far users + region outage blast radius.
Good:
- Regional WS gateways with nearest-region routing.
Great:
- Region-local realtime serving + replicated durable message store + async cross-region propagation.
- Numeric effect:
- regional routing can save 100ms+ RTT compared with cross-ocean round trips.
- Tradeoff:
- cross-region consistency lag for non-critical metadata is accepted.
Deep Dive 6: Abuse and rate limits
Finally, we add abuse controls so system health is protected under malicious or automated traffic patterns.
Bad:
- No per-user/per-chat send limits.
- Example: bot floods a large group, harming latency for everyone.
Good:
- Per-user token bucket (e.g., 20 msgs/10s) + per-chat anti-spam caps.
Great:
- Adaptive limits by trust score + device reputation + burst dampening.
- Return explicit throttle response (429 semantics on HTTP APIs; equivalent throttle code on WS commands).
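A token bucket matching the example limit (20 messages per 10 seconds) is a few lines; this is a single-node sketch, while a real deployment would keep the bucket state in a shared store:

```python
class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity          # start full so a fresh user can burst
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller sends the throttle response (HTTP 429 equivalent)

bucket = TokenBucket(capacity=20, refill_per_sec=2.0)  # 20 msgs / 10 s
burst = [bucket.allow(now=0.0) for _ in range(25)]
assert burst.count(True) == 20 and burst.count(False) == 5  # burst capped at 20
```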
---
7) Common mistakes candidates make
- Treating push notification channels as durable message delivery.
- Not defining ordering scope (global vs per-chat).
- Ignoring write amplification from group fanout.
- Storing media binaries in chat DB.
- Not handling reconnect sync explicitly.
- Choosing tech by name without explaining why-not alternatives.
---
8) What interviewer is actually evaluating
- Did you make ordering guarantees explicit and realistic?
- Did you protect durability independent of realtime channel health?
- Did you quantify throughput and choose partition strategy accordingly?
- Did you show how online delivery, offline recovery, and multi-device consistency work together?
- Did you ground choices in research, not generic statements?
---
9) Source notes (research used before design)
- RFC 6455 WebSocket Protocol: https://www.rfc-editor.org/rfc/rfc6455
- Kafka ordering discussion (per partition): https://kafka.apache.org/21/getting-started/introduction/
- Redis Pub/Sub semantics: https://redis.io/docs/latest/develop/pubsub/
- Cassandra data modeling docs: https://cassandra.apache.org/doc/stable/cassandra/data_modeling/intro.html
- Firebase Cloud Messaging docs: https://firebase.google.com/docs/cloud-messaging
- Apple APNs docs: https://developer.apple.com/documentation/usernotifications/setting_up_a_remote_notification_server/sending_notification_requests_to_apns
Key Takeaways
1. Users can create chats (1:1 and small groups).
2. Users can send and receive messages in near real-time.
3. Offline users can fetch missed messages when they reconnect.
4. Users can send media (metadata in chat, blob in object storage).
5. Message delivery p95 to online recipient < 500 ms (in-region target).