Real-Time Messaging•Intermediate•45 min read

Design a WhatsApp-Style Chat System

Asked at:MetaGoogleMicrosoftSlack

Tech:Horizontal scalingShardingReplicationConsistencyCDNRate limiting

Why Chat Is Harder Than It Looks

Sending a message feels instant. You type "hello," tap send, and it appears on your friend's screen. Behind that simplicity is one of the hardest real-time systems to build correctly.

The challenge is not sending a single message — it is sending 463,000 messages per second at peak, guaranteeing that every message is delivered exactly once even when phones go offline, ensuring that two people in the same group chat always see messages in the same order, and doing all of this while a user seamlessly switches between their phone, tablet, and web browser without missing a single message.

This post walks through designing a WhatsApp-style chat system from first principles. Every decision — from the database to the message bus to the delivery protocol — will be grounded in specific research (RFCs, official docs) and justified by capacity math. We will build progressively: a minimal working chat first, then real-time fanout, offline recovery, and media handling — adding complexity only when a measurable problem demands it.

If you are reading this for the first time, follow the story in order. The requirements tell us what to build. The capacity math tells us how much pressure the system faces. The entities and APIs define the vocabulary. The high-level design assembles the machinery. And the deep dives break things on purpose to see how the design recovers.

Let us begin.

1. Requirements — Defining the Conversation

Before choosing any technology, we need to agree on exactly what this chat system does and — just as importantly — what it does not do. Chat systems are feature-rich products, and trying to design for everything at once guarantees you design nothing well.

Functional Requirements

Our system needs to support four core user actions:

Create chats. Users can start one-on-one conversations and small group chats (up to 256 members, matching WhatsApp's group limit — a number we will use later when calculating write amplification).
Send and receive messages in near real-time. When both sender and recipient are online, messages should appear almost instantly.
Offline message recovery. When a user is offline and comes back, they must see every message they missed — no gaps, no losses. This is the hardest requirement to get right.
Media messages. Users can send images, videos, and documents. The system handles the metadata; the binary blobs live elsewhere.

Non-Functional Requirements

Message delivery p95 under 500 ms for online, in-region recipients. Half a second feels instant in a conversation. Anything slower feels broken.
Durable message storage. Once the server acknowledges a message, it must not be lost — not during server crashes, not during deployments, not during datacenter failovers.
Scale to ~463,000 messages per second at peak with hundreds of millions of concurrent connections.
Tolerate partial failures. A single chat server crashing must not lose messages or leave users in a broken state.
Basic abuse protection. Rate limits prevent a single user or bot from degrading the system for everyone.

Scope Control

In scope: text messaging, media metadata, offline sync, message ordering, delivery acknowledgment, multi-device support.

Out of scope: end-to-end encryption protocol details, voice/video calling, spam ML models, stories/status features.

Now that we know what we are building, the next question is: how much load does this system actually face? The answer will determine whether we need one database or a distributed fleet, whether we can afford per-recipient writes or need a cursor-based model, and whether our message bus needs 16 partitions or 128.

Login to continue reading

You reached the preview limit. Sign in to unlock the remaining sections.

2. Capacity Estimation — The Numbers That Force Our Hand

Chat systems are deceptive. The per-message payload is tiny — a few hundred bytes of text. But multiply that by billions of messages per day, fan it out to group members, add delivery tracking metadata, and suddenly you are looking at terabytes of daily writes and hundreds of thousands of operations per second.

Throughput

Here are our traffic assumptions:

Daily active users (DAU): 200 million
Average messages sent per user per day: 20
Total messages per day: 200M × 20 = 4 billion
Average messages per second: 4B ÷ 86,400 ≈ 46,300 msg/s
Peak multiplier: 10× (New Year's Eve, major events, regional prime time)
Peak messages per second: ≈ 463,000 msg/s

Concurrent Connections

Not all 200M DAU are online simultaneously. A typical concurrency ratio for a messaging app is around 10-15% of DAU during peak hours. That gives us:

Peak concurrent connections: 200M × 0.15 = ~30 million simultaneous WebSocket connections

This number directly sizes our WebSocket gateway fleet. If each gateway server handles roughly 50,000 concurrent connections (a reasonable limit given memory for per-connection buffers and file descriptor limits), we need at least 600 gateway instances at peak. We will round to a fleet of ~750 instances for deployment headroom and rolling updates.

Message Storage

Each message row stores IDs (chatId, messageId, senderId), a timestamp, message type, the text body or a media metadata pointer, and index overhead. Let us break that down: chatId (16 bytes UUID), messageId (16 bytes), senderId (16 bytes), serverTs (8 bytes), type flag (2 bytes), text body (average ~200 bytes for a typical message), mediaRef pointer (32 bytes when present, null otherwise), plus Cassandra storage overhead per row (column names, tombstone markers, SSTable metadata — roughly ~210 bytes). That gives us approximately 500 bytes per message row.

Daily storage: 4B messages × 500 bytes = 2 TB/day
30-day hot retention: ~60 TB (single copy)
With 3× replication: ~180 TB

Media Strategy Impact

If we stored media binaries (images average 200 KB, videos several MB) inside the chat database, storage would explode from 2 TB/day to hundreds of terabytes per day, and read amplification for normal text message retrieval would be devastating — every page read from disk would pull in megabytes of binary data alongside the 500-byte text rows you actually want.

Decision: media binaries go to object storage. The message row stores only a metadata reference. We will design the media flow explicitly in Step D.

What These Numbers Force

463,000 peak writes per second means our message store must handle massive sustained write throughput with horizontal scaling. This eliminates any architecture that funnels writes through a single node or requires coordination locks on write.

30 million concurrent WebSocket connections means our gateway layer must be a horizontally scaled fleet with a session registry that can locate any user's connection in sub-millisecond time.

4 billion messages per day with group fanout means we must be extremely careful about write amplification. If a 256-member group generates a per-recipient inbox row for every message, a single group message creates 255 additional writes. At peak, this could explode write volume by orders of magnitude. We will need a smarter model.

With these numbers in hand, we know the shape and pressure of the system. The next step is to define the data we need to store — designed around the exact queries these numbers imply.

3. Core Entities (v1) — Modeling Data Around Query Patterns

In a chat system, the temptation is to model data like a relational schema: users, chats, messages, all normalized with foreign keys. But our access patterns are very specific, and our write volume (463,000 msg/s peak) does not tolerate the overhead of joins on the hot path.

Let us start from the three queries the system must answer efficiently:

Query A — Chat history: "Get the latest N messages in a specific chat." This is the bread-and-butter read — every time a user opens a conversation.
Query B — Send message: "Persist a new message to a chat and make it available for all participants." This is the highest-volume write.
Query C — Offline sync: "After reconnecting, get all messages in my chats since I was last online." This is the reliability query — it must not miss anything.

Each query maps to one or more entities.

Chat and ChatParticipant

code

Chat(chatId, type, createdAt, createdBy)
ChatParticipant(chatId, participantId, role, joinedAt)

These serve membership and authorization. Before a message is persisted, the system checks: is this sender a member of this chat? Before a recipient receives a message, the system verifies: is this user in this chat? For group chats, ChatParticipant also drives fanout — we need to know who to deliver to.

Message

code

Message(chatId, messageId, senderId, serverTs, type, body, mediaRef?)

This is the durable record of every message. It serves Query A (chat history) and Query C (offline sync). The chatId is the partition key — all messages in a conversation live together, making "get latest N in this chat" a single-partition read.

messageId deserves special attention. We generate message IDs using a Snowflake-style scheme: a timestamp component (millisecond precision) combined with a machine ID and a sequence counter. This gives us time-ordered, globally unique IDs without coordination between servers. Time-ordering matters because it means message IDs are naturally sorted — a range scan on messageId within a chat partition returns messages in chronological order. Why not UUIDs? Random UUIDs (v4) have no time ordering, which means retrieving "the latest 50 messages" requires sorting by a separate timestamp field rather than leveraging the natural key order. Why not auto-increment? Auto-increment requires a centralized counter, which becomes a bottleneck at 463,000 writes/s across hundreds of chat server instances.

serverTs is the canonical timestamp assigned by the server when the message is persisted. Clients may have inaccurate clocks, so we never trust client timestamps for ordering. The server timestamp, combined with the Snowflake message ID, gives us a consistent ordering that all clients can agree on.

ClientSession

code

ClientSession(userId, deviceId, serverId, lastHeartbeatTs)

This tracks where each user's device is currently connected. When a message needs to be delivered in real-time, the system asks: "Which WebSocket gateway server is this user connected to?" The answer comes from this entity. It is ephemeral — sessions appear when users connect and vanish (via TTL) when they disconnect or stop heartbeating.

DeliveryCursor

code

DeliveryCursor(chatId, userId, lastDeliveredSeq, lastReadSeq)

This is the most important entity in the system for avoiding write amplification, and it deserves a careful explanation.

The naive approach to offline delivery is an inbox model: when a message is sent to a group, write a copy of that message (or a pointer to it) into every recipient's personal inbox. For a 256-member group, that means 255 additional writes per message. At 463,000 peak messages/s, if even 10% are group messages with an average of 50 recipients, the inbox model generates: 46,300 group msgs/s × 50 recipient writes = 2.3 million additional writes per second — just for inbox pointers. That would roughly 6× our total write volume.

The cursor model avoids this entirely. We write the message once to the chat's message table. Each user maintains a lightweight cursor tracking the last message they received and the last message they read. On reconnect, the client says "give me everything in chat X after sequence Y" — a single range scan on an already-partitioned table. The cursor update is one small metadata write per user per chat, not one write per message per recipient.

Connecting Entities to Requirements

FR1 (create chats) → Chat + ChatParticipant define the conversation and its members.
FR2 (send/receive real-time) → Message stores the content; ClientSession routes delivery to the right gateway.
FR3 (offline recovery) → Message is the durable history; DeliveryCursor tracks each user's position so sync can replay exactly the missed window.
FR4 (media support) → Message.mediaRef points to a binary stored elsewhere. (We will add a MediaObject entity in v2.)

These entities are a v1 draft. As the architecture reveals new query patterns — listing a user's chats, multi-device cursor tracking — the model will evolve. We will revisit it after the high-level design.

With entities defined, we are ready to design the protocol that clients and servers use to communicate.

4. API Design — The Real-Time Protocol

Chat is fundamentally different from request-response systems like URL shorteners. A user does not "request" messages — messages arrive at any time, pushed by the server. This bidirectional, always-on communication pattern shapes everything about our API design.

Why WebSocket

Our protocol runs over WebSocket (RFC 6455), which provides a persistent, bidirectional channel over a single TCP connection. Once the connection is established, both client and server can send frames at any time without the overhead of a new HTTP request for each message.

Why not HTTP long-polling? Long-polling works by having the client hold an open HTTP request that the server completes when new data arrives. The client then immediately opens a new request. This creates a constant cycle of connection setup and teardown, each carrying HTTP headers (~500-800 bytes of overhead per request). At 30 million concurrent users, each potentially receiving several messages per second, long-polling would generate billions of unnecessary HTTP round-trips per day. WebSocket eliminates this overhead entirely — one connection, held open, with tiny frame headers (2-6 bytes).

Why not Server-Sent Events (SSE)? SSE provides server-to-client push, which covers the "receive messages" direction. But chat requires client-to-server communication too — sending messages, acknowledgments, read receipts. SSE is unidirectional; we would need a separate HTTP channel for client-to-server commands, effectively maintaining two connections per user instead of one. WebSocket handles both directions on a single connection.

Command Structure

With WebSocket as our transport, the client-server protocol uses typed commands. Here are the core commands:

sendMessage — the client submits a message to a chat:

json

{
  "cmd": "sendMessage",
  "chatId": "c123",
  "clientMsgId": "uuid-abc-123",
  "type": "text",
  "body": "hello",
  "mediaRef": null
}

clientMsgId is critical. Networks are unreliable — a user might tap send, lose connectivity for a moment, and tap send again. Without an idempotency key, the system records two messages. With clientMsgId, the server detects the duplicate and responds with the original message's ID and timestamp. This is a client-generated UUID, unique per send attempt.

chatId serves triple duty: it is the authorization boundary (is this user in this chat?), the partition key (messages are stored per chat), and the Kafka routing key (ensuring per-chat ordering). One field, three critical functions.

The server responds with an acknowledgment:

json

{
  "cmd": "sendMessageAck",
  "status": "SUCCESS",
  "chatId": "c123",
  "messageId": "m789",
  "serverTs": 1710412345
}

The serverTs is the canonical ordering timestamp. The client uses this, not its own clock, to position the message in the conversation.

ackDelivered / ackRead — the client confirms it has received or displayed a message. These update the DeliveryCursor and drive the delivery indicators (single tick, double tick, blue tick) that users expect.

syncSince — on reconnect, the client requests missed messages:

json

{
  "cmd": "syncSince",
  "chatId": "c123",
  "lastDeliveredSeq": "m750"
}

The server returns all messages in that chat after the cursor position. This is the mechanism that makes offline recovery work — the client knows its last position, and the server replays everything since.

newMessage — a server-pushed event when a new message arrives for the user:

json

{
  "cmd": "newMessage",
  "chatId": "c123",
  "messageId": "m789",
  "senderId": "u456",
  "serverTs": 1710412345,
  "type": "text",
  "body": "hello"
}

Mapping APIs to Requirements

FR1 (create chats) → createChat command (creates Chat + ChatParticipant records).
FR2 (send/receive real-time) → sendMessage from client + newMessage pushed to recipients + ackDelivered feedback loop.
FR3 (offline recovery) → syncSince replays missed messages using the DeliveryCursor.
FR4 (media support) → media upload token flow (HTTP) + sendMessage with mediaRef pointing to the uploaded object.

With the protocol defined, we know exactly what the system must do at every boundary. Now we can build the architecture that makes it work — starting with the simplest possible flow and adding machinery only when the numbers demand it.

5. High-Level Design — Building the Chat System Step by Step

We will build four progressive versions of the architecture. Each step introduces a component only when a specific, measurable problem forces us to. By the end, you will understand not just what the architecture looks like, but the exact pain that each piece was designed to solve.

Step A: Minimal Working Chat

Let us start with the simplest thing that delivers a message. A client connects via WebSocket to a chat service. The client sends a sendMessage command. The chat service validates membership, persists the message durably, assigns a serverTs, and returns an acknowledgment. For now, recipients fetch messages by calling syncSince — pure pull, no push.

code

┌──────────┐         ┌──────────────────┐         ┌─────────────┐
│  Client   │◀──WS──▶│  Chat Service    │────────▶│  Cassandra  │
│           │         │  (validate,      │         │ (messages)  │
└──────────┘         │   persist, ack)  │         └─────────────┘
                     └──────────────────┘

The critical question: what storage system supports 463,000 writes per second of append-only chat messages with efficient per-chat retrieval?

Tech decision: Apache Cassandra for message storage.

Our write pattern is almost entirely append-only — new messages are written, rarely updated, never joined with other tables on the hot path. Our read pattern is partition-scoped — "get latest N messages in chat X" is a single-partition scan. Cassandra is built for exactly this: high sustained write throughput across a distributed cluster, with a data model that is designed partition-first.

Why not PostgreSQL? Postgres is excellent for transactional workloads with complex queries, but our access pattern does not need transactions or joins. At 463,000 writes/s, a single Postgres instance maxes out quickly (typical ceiling is 10,000-50,000 writes/s depending on row size and indexes). We would need extensive sharding, which Postgres does not provide natively — we would be building a distributed system on top of a database that was not designed for it. Cassandra distributes writes automatically across nodes with consistent hashing, and adding capacity means adding nodes, not redesigning the shard map.

Why not DynamoDB? DynamoDB could handle our write throughput and provides a key-value access pattern. But our query requires efficient range scans within a partition (get messages between timestamp X and Y in chat Z), and DynamoDB's provisioned throughput model requires careful capacity planning for hot partitions. Cassandra's approach — where all replicas accept writes and reads are tunable per-query — gives us more operational flexibility for a write-heavy workload with predictable partition access.

Partition strategy: (chatId, dayBucket).

We partition messages by chat ID and day. Why not just chatId alone? Because a very active group chat could accumulate millions of messages over months, creating an unbounded partition that degrades read performance. Day bucketing caps partition size — a group sending 10,000 messages/day produces a ~5 MB partition, well within Cassandra's recommended 100 MB limit.

Why day buckets and not hour buckets? Most chats do not generate enough messages per hour to justify the additional partition overhead. Day buckets are the sweet spot — bounded enough to prevent growth problems, coarse enough that "get latest 50 messages" rarely needs to cross more than one or two partition boundaries. For extraordinarily active chats (10,000+ messages/day), we can introduce sub-day bucketing, but this is an optimization, not the default.

What this step gives us: a working chat system where messages are durably stored and retrievable. Users can send messages and pull history. But there is no real-time push — recipients only see new messages when they actively request them. That is the next problem to solve.

Step B: Real-Time Fanout to Online Users

Our minimal system stores messages durably, but a chat application where you have to manually refresh to see new messages is not a chat application. After a message is persisted, the server must push it to all online participants in near real-time.

The challenge: each participant may be connected to a different WebSocket gateway server. In a fleet of 750 gateway instances with 30 million concurrent connections, the sender's gateway needs to know which gateway hosts each recipient's connection — and route the message there.

This requires two new components: a way to discover where each user is connected, and a way to move messages between chat servers efficiently.

code

┌──────────┐       ┌──────────────┐       ┌─────────┐       ┌─────────────┐
│  Client   │◀─WS─▶│  WS Gateway  │──────▶│  Redis  │       │  Cassandra  │
│ (sender)  │       │  (N instances)│      │(sessions)│       │ (messages)  │
└──────────┘       └──────┬───────┘       └─────────┘       └─────────────┘
                          │ persist + publish
                          ▼
                   ┌──────────────┐
                   │    Kafka     │
                   │(128 partitions│
                   │ key=chatId)  │
                   └──────┬───────┘
                          │ consume
                          ▼
                   ┌──────────────┐       ┌─────────┐
                   │  Fanout      │──────▶│  Redis  │──lookup──▶ target WS Gateway
                   │  Workers     │       │(sessions)│             │
                   └──────────────┘       └─────────┘             ▼
                                                           ┌──────────┐
                                                           │  Client   │
                                                           │(recipient)│
                                                           └──────────┘

Tech decision: Redis for the session registry.

When a user connects to a WebSocket gateway, the gateway writes a session record to Redis: userId:deviceId → serverId. When the user disconnects (or stops heartbeating), the record expires via TTL. When a fanout worker needs to deliver a message, it looks up each recipient's session in Redis to find which gateway to route to.

Why Redis and not Memcached? Both are fast in-memory stores. But our session registry needs TTL-based expiration (for automatic cleanup when connections die) with key-pattern scanning (to find all devices for a user). Redis supports both natively. Memcached does not support key scanning — we would need a separate index to track "all sessions for user X," which is additional complexity for no performance benefit.

Why not store sessions in Cassandra? Session data is extremely high-churn. Users connect, disconnect, switch networks, and switch devices constantly. At 30 million concurrent connections with heartbeats every 15 seconds, the session registry processes 2 million updates per second just from heartbeats. Cassandra could absorb this write volume, but it would pollute the durable message store with ephemeral churn — increasing compaction pressure and storage noise for data that is meaningless after it expires. Keeping ephemeral state (sessions) separate from durable state (messages) is a foundational principle.

Tech decision: Kafka for inter-service message routing.

After a message is persisted in Cassandra, the chat service publishes it to Kafka. Fanout workers consume from Kafka, look up recipient sessions in Redis, and push messages to the appropriate gateway servers.

Why Kafka? At 463,000 messages/s peak, we need a durable, high-throughput stream that provides per-chat ordering. Kafka's partitioned log model gives us exactly this: by keying messages on chatId, all messages for a single chat land in the same partition, preserving their order (as guaranteed by Kafka's per-partition ordering model). Fanout workers consume partitions in parallel, achieving both ordering and throughput.

How many partitions? At roughly 5,000 messages/s per partition as a safe target, 463,000 peak requires at least 93 partitions. We choose 128 partitions — the headroom gives us broker-failure budget (one broker out of 8 going down redistributes 16 partitions across 7 survivors, keeping per-partition load within limits).

Why not RabbitMQ? RabbitMQ is designed for task queuing with per-message acknowledgment and redelivery. It excels at ensuring each message is processed exactly once by a single consumer. But our pattern is fundamentally different: we need a durable log with consumer-group-based parallel consumption and offset replay for recovery. Kafka's log abstraction fits this naturally. RabbitMQ would require us to build replay logic on top of a system that was not designed for it.

Why not a direct server-to-server push (skipping Kafka entirely)? The sender's chat server could look up recipient sessions and push directly to their gateway servers. This is simpler but fragile: if the sender's server crashes mid-fanout, some recipients get the message and others do not — with no record of who received what. Kafka provides a durable intermediate state: even if the fanout worker crashes, the message remains in the Kafka partition and a restarted worker picks up where it left off.

What this step gives us: real-time message delivery to online users. A message is persisted, published to Kafka, consumed by fanout workers, routed via session lookups, and pushed to recipient WebSocket connections. But we have not solved offline delivery — if a recipient is not connected, the message sits in Cassandra and nobody tells them about it. That is next.

Step C: Offline Recovery and Delivery Guarantees

The hardest problem in chat is not sending messages to online users — it is guaranteeing that offline users never miss anything. A user's phone might be in airplane mode for an hour, or their network might drop for 30 seconds. When they come back, every message must be there.

Our solution has three layers, each addressing a different failure mode:

Layer 1: Durable storage as source of truth. The message is always written to Cassandra before any delivery attempt. If every other system fails — Kafka is down, push notifications are broken, WebSocket gateways are crashing — the message is safe in Cassandra. This is non-negotiable.

Layer 2: Push notifications as a wake-up signal. When the fanout worker finds that a recipient has no active session in Redis (they are offline), it sends a push notification via FCM (Android) or APNs (iOS). This notification is not the message itself — it is a wake-up signal that tells the app "you have new messages, connect and sync." Why not deliver the message through push? Because push notification delivery is best-effort. FCM and APNs documentation explicitly state that delivery is not guaranteed — notifications can be delayed, throttled, collapsed, or dropped entirely. If we relied on push for message delivery, every dropped notification would mean a lost message. Push is a courtesy knock on the door; the database is the room where the messages actually live.

Layer 3: Cursor-based sync on reconnect. When the client reconnects (either from the push wake-up or from the user opening the app), it calls syncSince(chatId, lastDeliveredSeq) for each chat. The server reads all messages in that chat after the cursor position from Cassandra and streams them to the client. The client then sends ackDelivered to advance the cursor. This closes any gaps — regardless of what happened to push notifications, WebSocket connections, or Kafka consumers in the meantime.

code

ONLINE DELIVERY:
  Message ──▶ Cassandra ──▶ Kafka ──▶ Fanout Worker ──▶ Session Lookup
                                                            │
                                              ┌─────────────┼──────────────┐
                                              ▼             ▼              ▼
                                          Online?       Online?       Offline?
                                           Push          Push        Send push
                                          via WS        via WS      notification
                                                                    (wake-up only)

OFFLINE RECOVERY (on reconnect):
  Client ──syncSince──▶ Chat Service ──range scan──▶ Cassandra
  Client ◀──missed messages──────────────────────────┘
  Client ──ackDelivered──▶ Update DeliveryCursor

What this step gives us: a complete delivery guarantee. Online users get messages in real-time via WebSocket. Offline users get a push wake-up and sync on reconnect. The durable store is always the source of truth. No message is lost, regardless of the delivery channel's health.

What is still missing: media. A user sends a photo, and right now we have no place to put the binary blob. That is the final piece.

Step D: Media Flow Separation

When a user sends a photo in a chat, the naive approach is to send the binary through the same WebSocket connection and store it in the same database as text messages. This would be disastrous. A 3 MB photo flowing through the WebSocket would block all other messages on that connection (WebSocket is ordered per connection). Storing it in Cassandra would bloat compaction, destroy read amplification for text messages, and consume storage at a rate of hundreds of terabytes per day.

Instead, we separate the media flow completely:

code

1. Client ──request upload token──▶ Media Service ──▶ signed URL
2. Client ──upload binary directly──▶ Object Storage (S3/GCS)
3. Client ──sendMessage with mediaRef──▶ Chat Service (normal message flow)
4. Recipient ──fetch media via CDN──▶ Object Storage

Step 1: The client requests an upload token from the media service. This is a lightweight HTTP call that returns a signed URL (pre-authenticated direct upload URL) and a media ID.

Step 2: The client uploads the binary directly to object storage (S3, GCS, or equivalent) using the signed URL. The chat service never touches the binary — the upload goes straight from the client to the blob store. This keeps the chat servers focused on their core job: small, fast message processing.

Step 3: Once the upload completes, the client sends a normal sendMessage with mediaRef pointing to the stored object. This message row in Cassandra is still ~500 bytes — it contains a reference, not the binary.

Step 4: When recipients view the message, their client fetches the media via a CDN URL derived from the mediaRef. The CDN caches popular media (viral group photos, frequently viewed images) and serves them from edge locations close to the user.

Why not serve media directly from the API servers? At 200M DAU, even if 10% of messages include media, that is 400M media fetches per day. Serving these through our chat servers would consume bandwidth and CPU that should be dedicated to message routing. Object storage + CDN is purpose-built for large binary delivery at scale; our chat servers are purpose-built for small message throughput.

What this step gives us: the complete chat system. Text messages flow through the real-time path. Media binaries flow through a separate, optimized path. Both reference each other through the mediaRef field.

The Life of a Message — One Send, Start to Finish

We have built many components across four steps. Let us trace exactly what happens when Alice sends "hello" to a group chat with Bob (online) and Carol (offline):

Alice's client sends sendMessage over her WebSocket connection to WS Gateway #47 (the instance she is connected to). The command includes chatId, clientMsgId (for idempotency), and the message body.

Gateway #47 validates membership — confirms Alice is a participant in this chat by checking ChatParticipant.

Gateway #47 persists the message to Cassandra — writes to partition (chatId, today's dayBucket). The message gets a Snowflake messageId and serverTs. The message is now durable. Even if everything else fails from this point, the message is safe.

Gateway #47 returns sendMessageAck to Alice — her client shows the single checkmark (sent, server-acknowledged).

Gateway #47 publishes the message to Kafka — keyed on chatId, so it lands in the partition that preserves this chat's ordering.

Fanout worker consumes the message from Kafka — looks up all participants (Alice, Bob, Carol) in ChatParticipant, then checks their sessions in Redis.

Bob is online — Redis shows bob:phone → gateway#12. The fanout worker pushes the message to Gateway #12, which delivers it to Bob's WebSocket. Bob's client sends ackDelivered back, updating his DeliveryCursor. Alice sees the double checkmark.

Carol is offline — Redis has no active session for Carol. The fanout worker sends a push notification via FCM/APNs: "New message in Group Chat." Carol's phone displays the notification.

Carol opens the app 20 minutes later — her client connects to WS Gateway #83 (whichever gateway the load balancer assigns), then calls syncSince(chatId, lastDeliveredSeq) for each of her chats. The server reads missed messages from Cassandra and streams them. Carol sees Alice's "hello" along with any other messages she missed. Her client sends ackDelivered, updating her cursor.

Total latency for the online path (steps 1-7): ~100-300 ms in-region, well within our 500 ms p95 target. The bottleneck is typically the Cassandra write (~5-10 ms) plus Kafka publish and consume (~10-30 ms) plus network hops.

Notice the dependency chain for message durability: Cassandra only. Kafka is a delivery optimization, not a durability mechanism. If Kafka is down, messages are still safe in Cassandra — they just are not pushed in real-time, and recipients sync on their next reconnect.

What if the Kafka publish fails after Cassandra succeeds? Steps 3 and 5 are separate operations — the message is durable but never enters the fanout stream. Without a catch-up mechanism, that message would only reach recipients when they happen to open the app and sync. To close this gap, a background catch-up scanner periodically reads recently persisted messages from Cassandra and checks whether they have a corresponding Kafka offset (tracked via a lightweight published flag on the message row or a separate outbox table). Any message older than a few seconds without a confirmed publish is re-published to Kafka. This scanner runs at low frequency (every 5-10 seconds) and processes only the small fraction of messages that slipped through — typically near zero under normal operation, but critical during Kafka outages or gateway crashes.

Now that the architecture is complete, the data model needs to catch up.

6. Core Entities (v2) — How the Architecture Changed Our Data

When we defined entities in Section 3, we were working from requirements alone. Building the architecture revealed new query patterns and new operational needs.

code

Message(chatId, dayBucket, serverTs, messageId, senderId, type, body, mediaRef, status)
UserChats(userId, chatId, lastActivityTs)
DeliveryCursor(chatId, userId, deviceId, lastDeliveredSeq, lastReadSeq, updatedAt)
ClientSession(userId, deviceId, serverId, connState, lastHeartbeatTs)
MediaObject(mediaId, ownerId, size, mime, storageKey, checksum)

Message gained dayBucket and status. The Cassandra partition strategy from Step A requires dayBucket as part of the partition key. status supports moderation — flagged or deleted messages can be marked without physically removing the row (preserving audit trails and consistent sequence numbering).

UserChats is new. It did not exist in v1 because we had not yet faced the "list all chats for a user" query. When a user opens the app, the first screen shows all their conversations with the latest message preview. Without UserChats, answering this requires scanning all ChatParticipant rows for the user and then querying each chat's latest message — an expensive fan-out read. UserChats denormalizes this into a single partition read per user, updated whenever a chat receives a new message.

DeliveryCursor gained deviceId. A user with a phone and a laptop needs independent cursors for each device. The phone might be offline while the laptop is active — they are at different positions in the message stream. Without per-device cursors, syncing one device would advance the cursor for all devices, causing the other to skip messages.

MediaObject is new. Step D's media flow requires metadata tracking — who uploaded it, how large it is, what MIME type, and a checksum for integrity verification. The storageKey maps to the actual object in blob storage.

With the full architecture and refined data model in place, we are ready to stress-test the weak points.

7. Deep Dives — Breaking Things on Purpose

An architecture diagram shows how a system works when everything goes right. Deep dives explore what happens when things go wrong, when scale hits limits, and when edge cases collide. We will examine six critical areas.

Deep Dive 1: Message Ordering — What "In Order" Actually Means

Two users in a group chat both tap send at the same instant. Their messages arrive at different gateway servers at slightly different times. Which message appears first? And can different group members see them in different orders?

We established in Step B that Kafka partition key = chatId gives us per-chat ordering. But let us go deeper into what this actually guarantees and where it breaks down.

The server assigns order, not the client. When two messages arrive at different gateways simultaneously, each gateway independently writes to Cassandra and publishes to Kafka. Kafka orders them by arrival time at the partition leader. The serverTs assigned by each gateway may differ by a few milliseconds, but the Kafka offset within the partition is the canonical sequence. All consumers — all fanout workers, all sync queries — see these two messages in the same order.

But what about the client display? Here is a subtlety most designs miss. Bob is online and receives both messages via WebSocket push. They arrive in Kafka order. Good. But what if Bob's network has a micro-delay and message #2 arrives before message #1 over the WebSocket? Bob's client must buffer incoming messages and sort them by serverTs before display, inserting any out-of-order arrivals into the correct position in the conversation. This is a client-side responsibility — the server guarantees a consistent ordering exists, but network jitter means clients must enforce it locally.

What we do NOT guarantee: global ordering across chats. Messages in Chat A and Chat B have no ordering relationship, and we do not pretend otherwise. Kafka processes them on different partitions in parallel. A user might see a reply in Chat B before seeing the original message in Chat A that prompted it. This is acceptable — and unavoidable without serializing all messages globally, which would collapse throughput to a single partition's limit (~5,000 msg/s vs our 463,000 msg/s requirement).

Conflict resolution for simultaneous edits? In our v1 design, messages are immutable after send (no edit feature in scope). If we later add editing, the serverTs of the edit operation provides the ordering — last-write-wins, resolved by the server clock. A vector-clock or CRDT approach would be needed for true concurrent editing of the same message, but that is out of scope.

Deep Dive 2: Delivery Guarantees Under Failures

In Step C, we built three layers of delivery guarantees: Cassandra persistence, WebSocket push, and cursor-based sync. Let us stress-test each failure point.

Scenario: a fanout worker crashes mid-delivery. A group message with 100 participants is being delivered. The fanout worker has pushed the message to 40 recipients and crashes. What happens to the other 60?

The message is safe in Cassandra (Layer 1). The Kafka consumer group detects the worker's failure and reassigns its partitions to surviving workers. But the Kafka offset was not committed (the worker crashed before finishing). The surviving worker re-reads the message from the partition and re-delivers to all 100 recipients. The 40 who already received it get a duplicate — which is fine because the messageId is deterministic, and the client's deduplication logic (based on messageId) silently discards the second copy. The 60 who were missed receive it for the first time. No message is lost, no message is permanently duplicated in the conversation UI.

Scenario: Kafka is completely unreachable for 5 minutes. The chat service persists messages to Cassandra successfully but cannot publish to Kafka. During these 5 minutes at peak traffic (463,000 msg/s), roughly 139 million messages are stored but not fanned out. Online recipients do not receive real-time pushes. What happens?

The messages are durable (Layer 1 holds). Push notifications are also disrupted since the fanout worker (which triggers push) is not consuming. But when Kafka recovers, the chat service's publish backlog (if buffered) or a catch-up mechanism replays the messages. Meanwhile, any user who opens the app triggers syncSince, which reads directly from Cassandra — bypassing Kafka entirely. The sync path has zero dependency on Kafka. Real-time push is degraded, but no messages are lost, and active users still see their messages within seconds of opening the app.

Scenario: a WebSocket gateway crashes with 50,000 active connections. All 50,000 users instantly lose their connections. Their clients detect the disconnection (via TCP reset or heartbeat timeout) and attempt to reconnect — the load balancer assigns them to other healthy gateways. On reconnect, each client runs syncSince for all their chats, recovering any messages delivered while they were briefly disconnected. The Redis session records for the crashed gateway expire via TTL (typically 30-60 seconds). During that TTL window, fanout workers may still try to push messages to the dead gateway — these pushes fail, and the workers treat the recipient as offline (triggering push notification fallback). Once the TTL expires, Redis correctly reflects that those users have no active session.

What is the total message loss across all three scenarios? Zero. The system is designed so that the only operation that makes a message durable is the Cassandra write. Everything else — Kafka, Redis, WebSocket, push notifications — is a delivery optimization. If any of them fail, the sync path recovers from Cassandra.

What about Cassandra itself? With a replication factor of 3, every message is written to three nodes before the write is acknowledged (at QUORUM consistency, two of three must succeed). A single node failure loses no data — the other two replicas serve reads while the failed node is replaced and re-streams its data. Even two simultaneous node failures in the same replica set are survivable if the third node is healthy. The scenario that would cause data loss — all three replicas of a partition failing simultaneously before any hint or repair can run — is extraordinarily unlikely in a properly operated cluster with rack-aware placement across failure domains.

Deep Dive 3: Write Amplification in Group Messaging

A user sends a message to a 256-member group. In a naive inbox-per-recipient design, this generates 255 additional writes. Let us calculate the actual cost and see why the cursor model is essential.

Inbox model math at peak: If 10% of peak messages are group messages (46,300 msg/s) with an average effective fanout of 50 recipients (accounting for a mix of small and large groups), the inbox model generates: 46,300 × 50 = 2,315,000 additional writes per second. Combined with the base 463,000 writes, total write volume becomes ~2.78 million writes/s — a 6× amplification. This would require roughly 6× the Cassandra cluster capacity just for inbox pointer maintenance.

Cursor model math: Write the message once (463,000 writes/s stays at 463,000 writes/s). Cursor updates happen only when a user syncs or acknowledges — which is bounded by the number of active users, not the number of messages × recipients. At 30 million concurrent users, even if every user syncs every 30 seconds, that is 1 million cursor updates per second — a fixed, predictable load independent of group size or message volume.

The tradeoff: the cursor model shifts complexity to the client. On reconnect, the client must sync each chat individually, and for a user in many active groups, this sync burst can involve dozens of range scans. In practice, clients sync chats in priority order (most recent activity first, visible chats before background chats) and paginate large catch-up windows. The server-side savings — 6× less write amplification — are worth this client-side complexity.

What about the "unread count" problem? When a user opens the app, they want to see "Chat X: 3 new messages." With the inbox model, this is trivial — count inbox rows. With the cursor model, we compute it: currentMaxSeq(chat) - lastDeliveredSeq(user, chat). This requires one read per chat, which is fast for the active chats displayed on screen. For the badge count (total unread across all chats), we maintain a lightweight counter that increments on message delivery and decrements on cursor advancement.

Deep Dive 4: Multi-Device Consistency

A user has WhatsApp open on their phone and their laptop simultaneously. Alice sends them a message. Both devices must show it. If they read it on the laptop, the phone should show it as read too. This is more nuanced than it appears.

The per-device cursor model: In v2, DeliveryCursor includes deviceId. Each device maintains its own lastDeliveredSeq — the phone might be at message #750 while the laptop is at #780 (the laptop received the last 30 messages while the phone was locked). When the phone wakes up, it syncs from #750 independently.

But read receipts must cross devices. When the user reads a message on the laptop, the laptop sends ackRead(chatId, messageId). The server updates lastReadSeq — but this read status must propagate to the phone (to dismiss the notification badge and show the message as read). The server pushes a readSync event to all of the user's other connected devices via their WebSocket connections. If the phone is offline, it picks up the updated lastReadSeq on its next syncSince.

The edge case: simultaneous sends from two devices. The user types "yes" on their phone and "okay" on their laptop at the same moment. Both devices send sendMessage with different clientMsgId values. Both messages are persisted — they are genuinely two separate messages, and the system correctly delivers both to the chat. The conversation shows both "yes" and "okay" from the same user, which matches WhatsApp's actual behavior.

Device limit and session management: We cap each user at 5 simultaneous devices (matching WhatsApp's linked devices limit). The session registry in Redis stores up to 5 entries per user. When a 6th device connects, the oldest session is invalidated and that device is forced to re-authenticate. This bounds the per-user fanout — delivering a message to one user touches at most 5 sessions, not an unbounded number.

Deep Dive 5: Multi-Region Deployment

A user in São Paulo sends a message to a user in Mumbai. If the chat service runs in a single US-East region, both users pay cross-ocean round-trip latency — ~200 ms from São Paulo to US-East, plus ~200 ms from US-East to Mumbai. That is ~400 ms just in network transit, consuming most of our 500 ms p95 budget before any processing happens.

The good approach: regional WebSocket gateways with a centralized message store.

Deploy WebSocket gateway fleets in each major region (US-East, EU-West, AP-South, SA-East). Users connect to the nearest gateway, reducing connection latency to ~10-30 ms. Message persistence and Kafka still run in a primary region. The gateway proxies the sendMessage to the primary region for persistence, then Kafka fanout delivers to gateways in all regions.

This halves the latency for the sender (São Paulo → SA-East gateway is fast; SA-East → US-East for persistence is one cross-region hop). But the recipient in Mumbai still pays US-East → AP-South for the push delivery. Total: ~200-250 ms, a significant improvement over 400 ms.

The great approach: region-local persistence with async cross-region replication.

Deploy Cassandra with multi-datacenter replication (Cassandra natively supports this with NetworkTopologyStrategy). Messages are written to the local region's Cassandra cluster and asynchronously replicated to other regions. Kafka topics are per-region, and cross-region message routing happens via a bridge consumer that reads from one region's Kafka and publishes to another's.

The sender's message is persisted locally in ~5-10 ms and pushed to local recipients immediately. Cross-region delivery happens asynchronously with typical replication lag of 100-500 ms. For the Mumbai recipient, the message arrives via the AP-South Kafka topic within 200-300 ms of the original send — well within our 500 ms target.

The tradeoff: cross-region consistency lag. If two users in different regions are in the same chat, they might briefly see messages in slightly different orders during the replication window. Since our ordering is per-chat and the authoritative order is determined by serverTs at the origin region, clients reconcile any temporary inconsistency on the next sync. For a chat application, 200-500 ms of cross-region lag is imperceptible.

Deep Dive 6: Abuse Protection

A bot connects to the chat service and blasts 1,000 messages per second into a 256-member group. Without rate limiting, this generates 1,000 Cassandra writes/s, 1,000 Kafka publishes, and up to 256,000 fanout attempts per second — from a single malicious user. Multiply by several bots, and system health degrades for everyone.

Per-user rate limiting: a token bucket per user with a budget of 20 messages per 10-second window. This allows natural burst typing (several quick messages in a row) while capping sustained throughput at 2 msg/s per user. At 200M DAU, even if every user is sending at maximum rate, the system-wide load is 400M msg/s — close to our peak design. But in practice, the vast majority of users send far below the limit. The rate limit primarily constrains bots and misbehaving clients.

Per-chat rate limiting: a separate bucket per chat caps total message throughput at, say, 100 messages per 10-second window. This prevents a large group from being flooded even if many individual users are below their personal limits.

Adaptive limits: users with established history (account age > 30 days, verified phone, consistent usage) get slightly higher limits. New accounts get stricter limits. Accounts that repeatedly hit rate limits get progressively throttled (exponential backoff on the limit itself). Device reputation signals (known emulator, unusual connection patterns) can trigger additional scrutiny.

The chat server enforces limits before persisting — a rate-limited message is rejected with a throttle response before it touches Cassandra or Kafka, ensuring that abuse does not consume durable storage or fanout resources.

8. Tying It All Together — From One Box to a Global Chat System

We started with a single chat service and a database. Let us look at what we built and why each piece was added:

Step	Problem Solved	Components Added	Number That Forced It
A	Durable message storage + retrieval	Chat Service + Cassandra	463k peak writes/s; append-only, partition-scoped reads
B	Real-time fanout to online users	WS Gateway fleet + Redis sessions + Kafka	30M concurrent connections; cross-server routing
C	Offline recovery + delivery guarantees	DeliveryCursor + syncSince + Push worker	FCM/APNs are best-effort; DB must be source of truth
D	Media without degrading message path	Media Service + Object Storage + CDN	Media binaries would 100x storage; read amplification

The data model evolved alongside:

Entity	In v1?	In v2?	What Triggered the Change
Message	Yes	Yes (+ dayBucket, status)	Cassandra partition strategy; moderation needs
DeliveryCursor	Yes	Yes (+ deviceId)	Multi-device sync requires per-device cursors
UserChats	No	Yes	"List my chats" query appeared when building the app
MediaObject	No	Yes	Step D's upload flow needed metadata tracking
ClientSession	Yes	Yes (+ connState)	Gateway crash handling needs connection state

Every component traces to a number. Every entity field traces to a query pattern. Nothing is here because "everyone uses Kafka" or "Redis is popular." Everything is here because the math demanded it.

9. Common Mistakes — What Gets Candidates Rejected

Treating push notifications as durable delivery. FCM/APNs are wake-up signals, not message transport. If your design loses messages when push fails, it is broken.

Not defining ordering scope. "Messages are ordered" is not a design. Per-chat ordering via Kafka partition key, with client-side reorder buffering for network jitter, is a design.

Ignoring write amplification. The difference between the inbox model (6× write amplification) and the cursor model (fixed cursor updates) is the difference between a system that scales and one that collapses under group messaging load.

Storing media in the message database. Blobs in Cassandra destroy read amplification for text messages and make storage planning impossible.

No reconnect sync mechanism. If your design cannot explain exactly how a user who was offline for 20 minutes gets every missed message, it has a gap.

Choosing technologies by name without mechanism. "We use Kafka" is not enough. "We key on chatId for per-chat ordering, use 128 partitions for 463k peak throughput, and rely on offset replay for consumer crash recovery" is a design decision.

Forgetting multi-device. A user with a phone and a laptop is the common case, not the edge case. Per-device cursors and cross-device read sync are core requirements.

10. What the Interviewer Is Actually Evaluating

Did you make ordering guarantees explicit? Not just "Kafka orders things" but "per-partition ordering keyed on chatId, with client-side sort buffer for network jitter, and no global cross-chat ordering."

Did you protect durability independently from real-time delivery? The message must survive in Cassandra regardless of what happens to Kafka, Redis, WebSocket, or push notifications.

Did you quantify write amplification? The inbox-vs-cursor trade-off, with concrete math for group fanout at peak, separates engineers who think about scale from those who draw boxes.

Did you show how online delivery, offline recovery, and multi-device work together? These three capabilities must compose cleanly. The "Life of a Message" trace — from send through persist, fanout, sync, and ack — demonstrates that they do.

Did you ground decisions in research? Citing RFC 6455's bidirectional semantics for WebSocket, Kafka's per-partition ordering guarantee, and FCM's best-effort delivery model shows your decisions come from understanding, not memorization.

Final Thought

A chat system is deceptively simple in concept — one user sends a message, another receives it. But behind that simplicity is a careful layering of durability, delivery optimization, and failure recovery. The database is the source of truth. The message bus is the delivery accelerator. The WebSocket is the real-time channel. The cursor is the recovery mechanism. And push notifications are the wake-up call.

We started with a single service writing to a database. We ended with a multi-region, multi-device architecture with durable persistence, ordered fanout, cursor-based sync, and separated media handling. Not because we planned it all upfront, but because each problem we encountered — throughput limits, fanout complexity, offline gaps, media bloat — forced exactly one architectural addition.

That progressive approach — start simple, let numbers drive decisions, and always know what breaks first — is what makes a system design convincing in an interview and reliable in production.

Reading Progress

On This Page

Continue Learning

Real-Time Messaging

PRO

Design a Global Chat Platform

Alice in New York sends a message to Bob in Tokyo. Within 300 milliseconds, that message must be written durably, ordered correctly against Bob's other conversations, delivered to his phone and lap...

Real-Time Messaging

PRO

Design a Slack/Discord-like Workspace Messaging Platform

The CEO opens Slack at 8:47 AM and types in #announcements: "@channel We're acquiring TechCorp. More details at the all-hands in 30 minutes." She presses Enter. Within 2 seconds, 47,000 employees a...

Web Services

Design a URL Shortener

A URL shortener looks trivial. Accept a long URL, return a short one, redirect anyone who clicks it. You could build a working prototype in an afternoon with a hash map and a web server.

🎉 Launch Sale!

30% off annual plans with code LAUNCH30

View Pricing