Design a Notification System
The Invisible Infrastructure Behind Every Alert
You receive dozens of notifications every day: a shipping update, a login verification code, a friend's message, a price drop alert. Each one feels trivial. But behind that "Your order has shipped" push notification is a system that must decide which channel to use (push, email, or SMS), respect your quiet hours, retry if the first attempt fails, avoid sending duplicates, handle provider outages without losing the notification, and do all of this at a rate of 74,000 delivery attempts per second during peak traffic.
Notification systems are deceptively complex because they sit at the intersection of user preferences, multiple unreliable external providers, retry logic, and strict delivery guarantees. The push notification channel alone (the one most people assume "just works") is best-effort by design. FCM and APNs explicitly do not guarantee delivery. If your system treats push as a reliable transport, you will lose notifications silently.
This post walks through designing a multi-channel notification system from first principles. We will build progressively: accept and queue notifications first, then route them through user preferences and templates, add per-channel delivery with retry logic, and finally layer on observability and reconciliation. Every decision will be justified by capacity math and grounded in how providers actually behave.
Let us begin.
1. Requirements: What This System Must Do
A notification system serves two masters: the internal services that trigger notifications (order service, auth service, marketing platform) and the end users who receive them. The system must be fast enough that producers are not blocked, reliable enough that no notification is silently lost, and respectful enough that users are not spammed at 3 AM.
Functional Requirements
- Send notifications via push, email, and SMS. Each channel has different providers (FCM/APNs for push, SMTP/SendGrid for email, Twilio/SNS for SMS), different latency profiles, and different failure modes. The system must abstract these differences behind a unified interface.
- Respect user preferences. Users can opt out of specific channels, set quiet hours (no notifications between 10 PM and 8 AM in their timezone), and configure language/locale. The system must check these before attempting delivery, not after.
- Retry failed deliveries and track final status. Provider failures are common: transient 5xx errors, rate limit 429 responses, network timeouts. The system must retry intelligently and provide an auditable trail of every delivery attempt.
Non-Functional Requirements
- Ingest latency: p95 under 100 ms. Producers (internal services) call the notification API and need a fast acknowledgment. They cannot wait for the notification to actually be delivered, which might take seconds (push) or minutes (email with retries).
- At-least-once delivery to channel workers. No notification trigger should be silently dropped between ingest and delivery attempt.
- Channel failure isolation. An SMS provider outage must not delay push notifications or emails. Each channel operates independently.
- Auditable delivery state. For any notification, we can answer: was it delivered? Which channel? How many attempts? What was the final status?
- Basic abuse protection. Rate limits on both the producer side (prevent runaway services from flooding the system) and the provider side (stay within FCM/SMTP/Twilio quotas).
Scope Control
In scope: notification orchestration, multi-channel delivery, retry logic, user preferences, delivery tracking.
Out of scope: template editor UI, campaign builder, A/B testing of notification content.
Now that we know what we are building, the next question is: how much load does this system face? The answer determines whether we can process synchronously or must decouple with queues, and how large our worker fleet needs to be.
2. Capacity Estimation: The Numbers That Shape the Pipeline
Notification systems have a unique traffic pattern: a single trigger event can fan out to multiple channels, and each channel attempt can spawn retries. The effective throughput is much higher than the ingest rate.
Throughput
- Notification triggers per day: 500 million (from all internal services combined)
- Average channels per trigger: 1.6 (some notifications go to push only; others go to push + email)
- Total channel delivery attempts per day: 500M × 1.6 = 800 million
- Average attempts per second: 800M ÷ 86,400 ≈ 9,260/s
- Peak multiplier: 8× (flash sales, security incident alerts, batch marketing sends)
- Peak attempts per second: 9,260 × 8 ≈ 74,000/s
Worker Sizing
Each channel worker makes an outbound API call to a provider (FCM, SendGrid, Twilio). These calls have variable latency: push is typically 20-50 ms, email 50-200 ms, SMS 100-300 ms. If we assume an average effective throughput of 500 delivery calls per second per worker (accounting for connection overhead, serialization, and retry logic), we need:
- Push workers (80% of load): 59,200 peak attempts/s ÷ 500/worker → ~120 workers
- Email workers (15%): 11,100/s ÷ 500/worker → ~22 workers
- SMS workers (5%): 3,700/s ÷ 500/worker → ~8 workers
We round up and add headroom for rolling deployments: ~150 push, ~30 email, ~12 SMS workers at peak. This is why per-channel queues matter: each channel scales its worker fleet independently based on its own traffic profile.
Storage
Each delivery attempt generates a DeliveryAttempt record tracking the attempt number, status, provider response, and timing. Let us break down the row size: deliveryId (16 bytes), eventId (16 bytes), channel enum (2 bytes), provider (8 bytes), status enum (2 bytes), attemptNo (4 bytes), nextRetryAt timestamp (8 bytes), errorCode (32 bytes for provider error string), providerMessageId (48 bytes for external correlation), latencyMs (4 bytes), timestamps (created, updated, terminal: 24 bytes), plus Postgres row overhead and index entries (~236 bytes). That gives us approximately 400 bytes per delivery attempt row.
- Daily storage: 800M attempts × 400 bytes = 320 GB/day
- 30-day retention: ~9.6 TB (single copy)
Retry metadata and delivery logs accumulate fast. We will need explicit retention policies: detailed attempt rows live for 14-30 days, while compact aggregates (daily success/failure counts per channel) are retained for a year.
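The throughput, worker, and storage estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the constant and function names are illustrative and not part of the original design. It computes exact ceilings, so the worker counts come out slightly different from the rounded figures in the text (119, 23, and 8 before headroom).

```python
import math

# Inputs taken from the estimates in this section.
TRIGGERS_PER_DAY = 500_000_000
CHANNELS_PER_TRIGGER = 1.6
PEAK_MULTIPLIER = 8
CALLS_PER_WORKER_PER_S = 500
ROW_BYTES = 400
CHANNEL_SHARE = {"push": 0.80, "email": 0.15, "sms": 0.05}
SECONDS_PER_DAY = 86_400


def capacity_estimate():
    """Return peak throughput, per-channel worker counts, and storage figures."""
    attempts_per_day = TRIGGERS_PER_DAY * CHANNELS_PER_TRIGGER
    avg_attempts_per_s = attempts_per_day / SECONDS_PER_DAY
    peak_attempts_per_s = avg_attempts_per_s * PEAK_MULTIPLIER
    # Workers needed at peak, before adding deployment headroom.
    workers = {
        channel: math.ceil(peak_attempts_per_s * share / CALLS_PER_WORKER_PER_S)
        for channel, share in CHANNEL_SHARE.items()
    }
    daily_storage_gb = attempts_per_day * ROW_BYTES / 1e9
    return {
        "avg_attempts_per_s": avg_attempts_per_s,
        "peak_attempts_per_s": peak_attempts_per_s,
        "workers": workers,
        "daily_storage_gb": daily_storage_gb,
        "storage_30d_tb": daily_storage_gb * 30 / 1000,
    }


if __name__ == "__main__":
    for key, value in capacity_estimate().items():
        print(key, value)
```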
What These Numbers Force
74,000 peak attempts per second means synchronous provider calls inside the API request path are impossible. A single slow provider (email taking 200 ms) would block the API and cascade into producer timeouts. The pipeline must be asynchronous: accept fast, process in background.
Independent channel load profiles (80/15/5 split) mean a single shared queue would waste resources: push workers would sit idle while SMS workers are overwhelmed during an SMS outage. Per-channel queues with independent worker pools are mandatory.
320 GB/day of delivery logs means we cannot keep everything forever. Retention policies and aggregation are architectural requirements, not operational nice-to-haves.
With these numbers established, we can define the data that flows through the system.
3. Core Entities (v1): Modeling the Notification Lifecycle
A notification has a clear lifecycle: it is triggered, preferences are checked, content is rendered, delivery is attempted (possibly multiple times across multiple channels), and a final status is recorded. Our entities must model each stage.
NotificationEvent
NotificationEvent(eventId, userId, type, payloadRef, priority, createdAt)
This is the immutable trigger record, the starting point of every notification. When an internal service calls our API, this is what gets created.
eventId is the idempotency key. Producers may retry their API call if they do not receive a response (network timeout, load balancer hiccup). Without an idempotency key, a retry creates a duplicate notification, and the user gets two "Your order has shipped" emails. With eventId, the system detects the duplicate and returns the original response. This is not optional: peak ingest is roughly 46,000 triggers/s (74,000 delivery attempts ÷ 1.6 channels), so even a 1% retry rate produces about 460 duplicate triggers per second without deduplication.
payloadRef is a reference to the notification content (stored separately), not the content itself. Why not embed the full payload? Because the same notification event fans out to multiple channels, each with different content formats (push has a title + body + icon; email has HTML; SMS has a 160-character text). Storing a reference keeps the event record lean (~200 bytes instead of potentially kilobytes), which matters when these records flow through queues at 74,000/s.
priority distinguishes transactional notifications (OTP codes, security alerts; must arrive within seconds) from marketing notifications (weekly digest, promo offers; can tolerate minutes of delay). This field drives queue routing: high-priority notifications get dedicated fast lanes.
UserPreference
UserPreference(userId, channel, enabled, quietStart, quietEnd, timezone, locale)
This entity controls whether a notification reaches a user on a given channel. The orchestrator checks preferences before enqueueing any delivery attempt.
quietStart/quietEnd + timezone together implement quiet hours. A notification triggered at 2 AM UTC lands at 2 AM local time for a user in London (inside a 10 PM to 8 AM quiet window) but at 9 PM for a user in New York (outside it), so the same trigger may be held for one user and delivered immediately to the other. The orchestrator converts the current time to the user's timezone and checks whether it falls within their quiet window. If it does, the notification is either delayed until the window ends or suppressed entirely, depending on priority: an OTP code ignores quiet hours; a marketing email respects them.
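The quiet-hours check can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, and the example uses fixed UTC offsets as stand-ins for real IANA zones (a production system would more likely resolve the stored timezone string with `zoneinfo.ZoneInfo`). The main subtlety is that a window such as 22:00 to 08:00 wraps past midnight.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, quiet_start: time, quiet_end: time,
                   user_tz: tzinfo) -> bool:
    """True if now_utc falls inside the user's local quiet window.

    now_utc must be timezone-aware. The window may wrap past midnight.
    """
    local = now_utc.astimezone(user_tz).time()
    if quiet_start <= quiet_end:
        # Window within a single day, e.g. 13:00 to 15:00.
        return quiet_start <= local < quiet_end
    # Window wraps midnight, e.g. 22:00 to 08:00.
    return local >= quiet_start or local < quiet_end


def should_send_now(priority: str, now_utc: datetime, quiet_start: time,
                    quiet_end: time, user_tz: tzinfo) -> bool:
    """Transactional notifications bypass quiet hours; others respect them."""
    if priority == "transactional":
        return True
    return not in_quiet_hours(now_utc, quiet_start, quiet_end, user_tz)


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)
    london = timezone.utc                     # stand-in for Europe/London in winter
    new_york = timezone(timedelta(hours=-5))  # stand-in for America/New_York in winter
    print(in_quiet_hours(now, time(22), time(8), london))
    print(in_quiet_hours(now, time(22), time(8), new_york))
```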
locale determines which template variant to render. A user with locale=es-MX gets the Spanish version of the "order shipped" email.
DeliveryAttempt
DeliveryAttempt(deliveryId, eventId, channel, provider, status, attemptNo, nextRetryAt, errorCode, providerMessageId, latencyMs, createdAt, terminalAt)
This is the operational heart of the system: one row per delivery attempt. A single notification event might generate multiple attempts: attempt #1 to FCM fails with a 503, attempt #2 succeeds 15 seconds later. Both are recorded.
attemptNo tracks how many times we have tried this specific channel. Combined with the retry policy (max 5 attempts for transactional, max 3 for marketing), it determines whether to retry or mark the attempt as terminally failed.
nextRetryAt tells the retry scheduler when to re-enqueue this attempt. It is calculated using exponential backoff with jitter: attempt #1 retries after ~5 seconds, attempt #2 after ~15 seconds, attempt #3 after ~45 seconds. The jitter prevents thousands of failed attempts from retrying at the exact same moment (which would create a retry storm against an already-struggling provider).
providerMessageId is the external correlation ID returned by the provider (FCM message ID, SendGrid message ID, Twilio SID). When the provider later sends a delivery receipt via webhook, this ID is how we match it to our internal delivery attempt.
status follows a state machine: PENDING → SENT → DELIVERED on the happy path, or PENDING → SENT → FAILED → RETRYING → SENT → DELIVERED with retries, or PENDING → SENT → FAILED → TERMINAL when max retries are exhausted.
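One way to keep status updates honest is to encode the allowed transitions as a lookup table and reject anything else. The sketch below is an assumption about how this could be enforced, built only from the three paths listed above; the names `ALLOWED_TRANSITIONS`, `transition`, and `status_after_failure` are hypothetical.

```python
# Allowed status transitions, derived from the three paths described above.
ALLOWED_TRANSITIONS = {
    "PENDING": {"SENT"},
    "SENT": {"DELIVERED", "FAILED"},
    "FAILED": {"RETRYING", "TERMINAL"},
    "RETRYING": {"SENT"},
    "DELIVERED": set(),  # final state
    "TERMINAL": set(),   # final state
}


def transition(current: str, new: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new}")
    return new


def status_after_failure(attempt_no: int, max_attempts: int) -> str:
    """Pick the follow-up state for a FAILED attempt based on the retry budget."""
    return "RETRYING" if attempt_no < max_attempts else "TERMINAL"
```

Rejecting illegal moves also gives some protection against late or out-of-order updates, for example a stale SENT update arriving after DELIVERED has been recorded.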
Template
Template(templateId, channel, version, locale, contentRef)
Templates store the notification content in a channel-appropriate format. The "order shipped" notification has a push template (short title + body), an email template (HTML with tracking pixels), and an SMS template (160 characters). Versioning allows gradual rollouts of new content without disrupting in-flight notifications.
Connecting Entities to Requirements
- FR1 (send via push/email/SMS) → NotificationEvent triggers the pipeline; DeliveryAttempt tracks each channel-specific delivery.
- FR2 (respect preferences) → UserPreference is checked before any delivery attempt is created. Quiet hours, channel opt-outs, and locale are all enforced here.
- FR3 (retry and track status) → DeliveryAttempt with attemptNo, nextRetryAt, status, and errorCode provides the full retry lifecycle and audit trail.
These entities are a v1 draft. As the architecture reveals operational needs (channel quotas, deduplication keys, provider circuit breaker state), the model will evolve.
With entities defined, we can design the API contract that producers use to trigger notifications.
4. API Design: Fast Acceptance, Async Processing
The fundamental principle of the notification API is: accept fast, process later. A producer calling our API should get a response in under 100 ms. The actual delivery (preference checks, template rendering, provider calls, retries) happens asynchronously in the background.
POST /v1/notifications: Trigger a Notification
{
  "eventId": "evt-abc-123",
  "userId": "u789",
  "type": "order_shipped",
  "payloadRef": "payload/order-shipped/u789/1710412345",
  "priority": "transactional"
}
The producer provides an eventId (idempotency key), the target user, the notification type (which determines which templates to use), a reference to the payload data, and a priority level.
The response is 202 Accepted, not 200 OK. This distinction matters. 200 implies the work is done. 202 explicitly tells the caller: "I have received your request, it is durably stored, and it will be processed asynchronously." The producer does not wait for delivery; it checks status later if needed.
Why not synchronous delivery? At peak (roughly 46,000 triggers/s, fanning out to 74,000 delivery attempts/s), if the API waited for even one provider call (say, FCM at 50 ms p95), the API's capacity would be gated by provider latency. A single FCM slowdown would cascade into producer timeouts across all services that trigger notifications. Async handoff keeps the ingest path predictable: persist the event, enqueue it, return 202. Total: ~10-20 ms.
409 Conflict: returned when the eventId has already been used. This is the idempotency response: the producer retried, and the system already has this event. The response includes the original event's status so the producer knows it was already accepted.
429 Too Many Requests: returned when the producer exceeds its rate limit. Each internal service has a quota (e.g., the marketing service gets 10,000 triggers/minute; the auth service gets 50,000/minute for OTP codes). This prevents a runaway batch job from flooding the notification pipeline and delaying time-sensitive transactional notifications.
GET /v1/notifications/{eventId}/status: Check Delivery Status
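The 202/409 contract can be illustrated with a small in-memory sketch. Everything here is a hypothetical stand-in: the `IngestService` class, its dict-backed store, its list-backed queue, and the "ACCEPTED" status label are assumptions for illustration. In the real pipeline the duplicate check would need to be atomic, for example via a unique constraint on eventId in Postgres, and the 429 path would sit in front of this handler.

```python
from dataclasses import dataclass, field


@dataclass
class IngestService:
    """In-memory stand-in for the event store and the ingest queue."""
    events: dict = field(default_factory=dict)
    queue: list = field(default_factory=list)

    def post_notification(self, body: dict) -> tuple:
        """Return (http_status, response_body) for a trigger request."""
        event_id = body["eventId"]
        existing = self.events.get(event_id)
        if existing is not None:
            # Duplicate trigger: report the original event's status, enqueue nothing.
            return 409, {"eventId": event_id, "status": existing["status"]}
        # First time we see this eventId: persist, enqueue, acknowledge.
        self.events[event_id] = {**body, "status": "ACCEPTED"}
        self.queue.append(event_id)
        return 202, {"eventId": event_id, "status": "ACCEPTED"}


if __name__ == "__main__":
    svc = IngestService()
    trigger = {"eventId": "evt-abc-123", "userId": "u789",
               "type": "order_shipped", "priority": "transactional"}
    print(svc.post_notification(trigger))  # first call
    print(svc.post_notification(trigger))  # producer retry
```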
Returns the delivery state for each channel that was attempted:
{
  "eventId": "evt-abc-123",
  "channels": [
    { "channel": "push", "status": "DELIVERED", "attempts": 1, "latencyMs": 47 },
    { "channel": "email", "status": "SENT", "attempts": 1, "latencyMs": 210 }
  ]
}
This endpoint is for operational monitoring and debugging: a producer can check whether a critical OTP notification was delivered, and if not, see why (e.g., "status": "FAILED", "errorCode": "DEVICE_NOT_REGISTERED").
Provider Callback Webhook: Delivery Receipts
Providers send delivery confirmations asynchronously. FCM sends a delivery receipt when the device acknowledges the push. SendGrid sends webhook events for delivered/bounced/opened. Twilio sends status callbacks for SMS delivery.
The system ingests these callbacks, matches them to DeliveryAttempt records via providerMessageId, and updates the status. But (and this is critical) callbacks are not reliable. They can be delayed, lost, or arrive out of order. We never rely solely on callbacks for delivery status. A separate reconciliation job handles this, which we will design in Step D.
Mapping APIs to Requirements
- FR1 (send via push/email/SMS) → POST /v1/notifications triggers async orchestration across channels.
- FR2 (respect preferences) → Preference checks happen in the orchestrator after ingest but before channel delivery is enqueued.
- FR3 (retry and track status) → GET /v1/notifications/{eventId}/status exposes the delivery lifecycle; the callback webhook ingests provider confirmations.
With the API contract clear, we can build the architecture that implements it, starting with the simplest possible pipeline and adding channel isolation, retry logic, and observability as the numbers demand.
5. High-Level Design: Building the Notification Pipeline Step by Step
We will build four progressive versions of the pipeline. Each step adds a component only when a specific problem forces us to.
Step A: Accept, Persist, Enqueue
The simplest possible starting point: the API accepts a notification trigger, stores it durably, publishes it to a message queue, and returns 202.
┌──────────────┐     ┌──────────────┐     ┌─────────────┐     ┌──────────────┐
│   Producer   │────▶│  Notif API   │────▶│  Postgres   │     │     SQS      │
│  (internal   │◀202─│  (validate,  │────▶│  (events +  │     │   (ingest    │
│   service)   │     │   persist,   │     │  attempts)  │     │    queue)    │
└──────────────┘     │   enqueue)   │     └─────────────┘     └──────┬───────┘
                     └──────────────┘                                │
                                                                     ▼
                                                              ┌──────────────┐
                                                              │ Orchestrator │
                                                              │    Worker    │
                                                              └──────────────┘
Tech decision: Postgres for event and delivery attempt storage.
Our notification data is relational by nature: events have delivery attempts, attempts reference users and preferences, and we need transactional guarantees for idempotency checks (event creation must be atomic with the idempotency key check). The query patterns include point lookups (get event by ID), filtered scans (get all pending attempts for retry), and joins (event + attempts for the status endpoint).
Why not Cassandra? Cassandra excels at high-throughput append-only writes with partition-scoped reads, which is exactly what we used for chat messages. But notification data has different access patterns: we need to query across events (e.g., "find all PENDING delivery attempts older than 5 minutes" for the reconciliation job), update rows in place (attempt status transitions), and perform atomic compare-and-set for idempotency. These are relational workloads. At our scale (74,000 peak delivery attempts/s), Postgres with connection pooling (PgBouncer) and a modest read replica fleet handles the write volume comfortably, because each delivery attempt write is a small indexed insert, not a heavy transaction.
Why not DynamoDB? DynamoDB could handle the throughput, but the reconciliation job (Step D) needs to scan for stale PENDING records across all events, a pattern that DynamoDB handles poorly without GSI proliferation. Postgres's indexed queries on status + timestamp make this straightforward.
Tech decision: SQS for the ingest queue.
After persisting the event, the API publishes a message to SQS and returns 202. The orchestrator worker polls SQS and processes events asynchronously.
Why SQS and not Kafka? Kafka is excellent when you need ordered processing within partitions and offset-based replay (as we needed for chat message fanout). Notification processing does not require ordering: a "password reset OTP" and an "order shipped" notification have no ordering relationship. SQS provides exactly what we need: durable message delivery with at-least-once semantics, automatic visibility timeout for processing safety, built-in dead-letter queue support, and no partition management overhead. At 74,000 messages/s, the throughput of SQS standard queues (nearly unlimited; FIFO queues have lower quotas, but we do not need their ordering guarantees) is more than sufficient.
Why not RabbitMQ? RabbitMQ provides similar queuing semantics but requires cluster management, monitoring, and capacity planning. SQS is fully managed: we trade a small per-message cost for zero operational overhead on the queuing layer. For a notification system where the operational focus should be on provider integration and retry logic, this tradeoff is worth it.
What this step gives us: a durable, async notification pipeline. Producers get fast 202 responses. Events are persisted in Postgres and queued for processing. But every notification goes to every channel: we have not yet added preference filtering, template rendering, or channel routing. That is next.
Step B: Preference Filtering, Template Rendering, Channel Routing
The orchestrator worker consumes an event from the ingest queue and now must answer three questions: Which channels should this notification go to? (preferences), What content should each channel deliver? (templates), and How do we route to per-channel workers without coupling their fates? (channel isolation).
                    ┌──────────────┐
                    │  Ingest SQS  │
                    └──────┬───────┘
                           │ consume
                           ▼
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│  Postgres   │────▶│ Orchestrator │────▶│   Template   │
│(preferences)│     │    Worker    │     │   Service    │
└─────────────┘     └──────┬───────┘     └──────────────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         ┌──────────┐ ┌──────────┐ ┌──────────┐
         │ Push SQS │ │Email SQS │ │ SMS SQS  │
         └──────────┘ └──────────┘ └──────────┘
The orchestrator's pipeline, step by step:
- Load user preferences from Postgres: which channels is the user opted into? What is their timezone? Are we in their quiet hours right now?
- Apply quiet hours logic. If the notification is marketing priority and the user's local time is within their quiet window, the orchestrator either delays the event (re-enqueues with a visibility timeout until the quiet window ends) or skips the notification entirely, depending on the notification type's policy. Transactional notifications (OTP codes, security alerts) always bypass quiet hours: a user who just clicked "reset password" needs that code regardless of the time.
- Render content per eligible channel. The template service takes the notification type, the payload data (fetched via payloadRef), the user's locale, and the target channel, and produces the final content: a push payload (title, body, icon URL), an email (HTML body, subject line), or an SMS (plain text, 160-character limit). Template rendering is a stateless operation: it reads the template definition, substitutes variables, and returns the result. Even at peak load, if rendering averages 2 ms per channel, the orchestrator workers spend roughly 3.2 ms per event on rendering (1.6 channels on average). This is fast enough to be synchronous within the orchestrator's processing loop.
- Enqueue delivery attempts to per-channel queues. For each eligible channel, the orchestrator writes a DeliveryAttempt row to Postgres (status: PENDING) and publishes a delivery message to the appropriate channel queue: Push SQS, Email SQS, or SMS SQS.
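The filtering and routing steps above can be sketched in a few lines. This is a simplified illustration, not the actual orchestrator: `route_event` is a hypothetical name, in-memory deques stand in for the three SQS queues, template rendering and the Postgres write are omitted, and a held event is simply skipped rather than re-enqueued.

```python
from collections import deque


def route_event(event: dict, channel_prefs: dict, in_quiet_window: bool,
                queues: dict) -> list:
    """Create PENDING attempts for eligible channels and enqueue them per channel.

    Returns the attempts created. An empty list means the event was held back
    by quiet hours or the user has no enabled channel.
    """
    # Transactional notifications bypass quiet hours; everything else waits.
    if in_quiet_window and event["priority"] != "transactional":
        return []
    attempts = []
    for channel, enabled in channel_prefs.items():
        if not enabled:
            continue  # user opted out of this channel
        attempt = {
            "eventId": event["eventId"],
            "channel": channel,
            "status": "PENDING",
            "attemptNo": 1,
        }
        queues[channel].append(attempt)  # one independent queue per channel
        attempts.append(attempt)
    return attempts


if __name__ == "__main__":
    queues = {"push": deque(), "email": deque(), "sms": deque()}
    prefs = {"push": True, "email": True, "sms": False}
    event = {"eventId": "evt-abc-123", "priority": "transactional"}
    print(route_event(event, prefs, in_quiet_window=True, queues=queues))
```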
Why per-channel queues instead of a single shared queue?
This is the most important architectural decision in the notification system. Consider what happens with a shared queue during an SMS provider outage: SMS delivery attempts fail, get retried, fail again, and accumulate in the queue. At 74,000 peak attempts/s, if SMS is 5% of traffic and fails for 15 minutes, that is 3,700/s × 900 s = 3.33 million SMS retries clogging the queue alongside push and email messages. Push workers (which are healthy) now have to wade through millions of SMS messages they cannot process. Push delivery latency spikes from 50 ms to seconds, not because push is broken, but because SMS is broken.
With per-channel queues, the SMS queue backs up and SMS workers retry at their own pace. The push queue and email queue are completely unaffected. This isolation is non-negotiable for a multi-channel system.
Why not Kafka with per-channel topics? Kafka would work technically, but each channel topic would need partition management, consumer group configuration, and offset tracking. SQS queues are simpler for this pattern: each queue is independent, auto-scales, has built-in DLQ support, and requires no partition planning. The operational simplicity matters when you are managing integrations with 4-6 external providers that each have their own failure modes.
What this step gives us: notifications are filtered by user preferences, rendered into channel-appropriate content, and routed to independent per-channel queues. A failure in one channel cannot cascade to others. But we have not yet built the actual delivery workers or retry logic. That is next.
Step C: Channel Workers, Provider Adapters, and Retry Logic
Channel workers consume from their respective queues and make the actual provider API calls: FCM for push, SendGrid/SMTP for email, Twilio for SMS. This is where the notification meets the outside world, and the outside world is unreliable.
┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Push SQS │────▶│ Push Worker  │────▶│ FCM Adapter  │────▶│   FCM/APNs   │
└──────────┘     │ (150 fleet)  │     │              │     │  (external)  │
                 └──────────────┘     └──────────────┘     └──────────────┘
┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│Email SQS │────▶│ Email Worker │────▶│ SMTP Adapter │────▶│   SendGrid   │
└──────────┘     │ (30 fleet)   │     │              │     │  (external)  │
                 └──────────────┘     └──────────────┘     └──────────────┘
┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ SMS SQS  │────▶│  SMS Worker  │────▶│ SMS Adapter  │────▶│    Twilio    │
└──────────┘     │ (12 fleet)   │     │              │     │  (external)  │
                 └──────────────┘     └──────────────┘     └──────────────┘
On failure: retry with exponential backoff + jitter
After max retries: move to channel DLQ
Provider adapters abstract the specifics of each provider behind a common interface: send(recipient, content) → (success | transientFailure | permanentFailure). The FCM adapter handles Firebase authentication, payload formatting, and response parsing. The SMTP adapter manages connection pooling, DKIM signing, and bounce handling. The SMS adapter handles Twilio's API, message segmentation for long texts, and delivery receipts. When provider credentials rotate (FCM keys expire, APNs certificates need renewal), only the adapter changes; the worker and retry logic are unaffected.
Retry policy: exponential backoff with jitter and circuit breaking.
When a provider call fails, the worker must decide: retry or give up? The answer depends on the failure type:
Transient failures (HTTP 500, 502, 503, 429, network timeouts): retry with exponential backoff. The schedule for transactional notifications: 5s → 15s → 45s → 2min → 5min (5 attempts max). For marketing notifications: 5s → 30s → 2min (3 attempts max). Jitter adds ±30% randomness to each delay, preventing a stampede when thousands of attempts fail at the same moment and all schedule their retry for the same second.
Why not retry immediately? At peak, if 20% of 74,000 attempts fail transiently (14,800/s) and all retry instantly, the provider that is already struggling receives a 14,800/s surge of retry traffic on top of the normal 74,000/s. This self-DDoS can turn a minor provider hiccup into a prolonged outage. Backoff and jitter spread the retry load over time.
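The retry schedule can be sketched as a lookup table plus a jitter factor. This is a minimal illustration: the `retry_delay` function and its 1-based `failed_attempt_no` argument are assumptions, and the delays are the ones listed above, treated as the wait after the first, second, and later failures.

```python
import random

# Base delays in seconds, taken from the schedules described above.
TRANSACTIONAL_DELAYS_S = [5, 15, 45, 120, 300]
MARKETING_DELAYS_S = [5, 30, 120]
JITTER = 0.30  # each delay is scaled by a random factor in [0.7, 1.3]


def retry_delay(priority: str, failed_attempt_no: int, rng=random):
    """Return the jittered delay before the next retry, or None if exhausted.

    failed_attempt_no is 1-based: 1 means the first failure.
    """
    schedule = (TRANSACTIONAL_DELAYS_S if priority == "transactional"
                else MARKETING_DELAYS_S)
    if failed_attempt_no < 1 or failed_attempt_no > len(schedule):
        return None  # retry budget used up: caller marks the attempt TERMINAL
    base = schedule[failed_attempt_no - 1]
    return base * rng.uniform(1 - JITTER, 1 + JITTER)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, retry_delay("transactional", n))
```

Because every failed attempt draws its own random factor, retries that were scheduled in the same second end up spread across a window of roughly ±30% around the base delay.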
Permanent failures (HTTP 400, invalid device token, email bounced, phone number deactivated): do not retry. Mark the attempt as TERMINAL and move it to the dead-letter queue (DLQ) for investigation. Permanently failed device tokens should trigger a cleanup workflow that removes the stale token from the user's profile.
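The adapter interface and the transient/permanent split might look like the following sketch. The names `SendResult`, `ProviderAdapter`, `classify_http_status`, and `StubAdapter` are hypothetical. The mapping also makes one assumption beyond the text: any 5xx status not listed above is treated as transient, and any other 4xx as permanent. Network timeouts and provider-specific error bodies are not modelled.

```python
from enum import Enum
from typing import Protocol


class SendResult(Enum):
    SUCCESS = "success"
    TRANSIENT_FAILURE = "transientFailure"
    PERMANENT_FAILURE = "permanentFailure"


class ProviderAdapter(Protocol):
    """Common interface that every channel adapter implements."""

    def send(self, recipient: str, content: dict) -> SendResult: ...


def classify_http_status(status: int) -> SendResult:
    """Map a provider HTTP status to a retry decision."""
    if 200 <= status < 300:
        return SendResult.SUCCESS
    if status == 429 or status >= 500:
        return SendResult.TRANSIENT_FAILURE  # retry with backoff
    return SendResult.PERMANENT_FAILURE      # do not retry, send to DLQ


class StubAdapter:
    """Test double that always answers with a fixed HTTP status."""

    def __init__(self, status: int) -> None:
        self.status = status

    def send(self, recipient: str, content: dict) -> SendResult:
        return classify_http_status(self.status)


if __name__ == "__main__":
    adapter: ProviderAdapter = StubAdapter(503)
    print(adapter.send("device-token", {"title": "Order Shipped!"}))
```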
Circuit breaker per provider: if a provider's error rate exceeds 50% over a 60-second window, the circuit breaker opens. While open, the worker stops sending to that provider and immediately moves attempts to a delayed retry queue. Every 30 seconds, the breaker allows a small probe batch (1% of traffic) to test whether the provider has recovered. When the probe succeeds, the breaker closes and normal traffic resumes. This prevents the system from wasting resources hammering a dead provider.
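A per-provider breaker along these lines could be sketched as below. It is deliberately simplified and rests on several assumptions: timestamps are passed in explicitly so the logic is easy to test, a `min_samples` floor (not in the text) stops a handful of early errors from tripping the breaker, and a single probe request per interval stands in for the 1% probe batch.

```python
from collections import deque


class CircuitBreaker:
    """Sliding-window error-rate breaker for one provider."""

    def __init__(self, window_s=60.0, threshold=0.5,
                 probe_interval_s=30.0, min_samples=20):
        self.window_s = window_s
        self.threshold = threshold
        self.probe_interval_s = probe_interval_s
        self.min_samples = min_samples
        self.samples = deque()   # (timestamp, ok) pairs inside the window
        self.opened_at = None    # None means the breaker is closed
        self.last_probe_at = None

    def _error_rate(self, now):
        # Drop samples that have fallen out of the sliding window.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()
        if len(self.samples) < self.min_samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def allow(self, now):
        """Return True if a request may be sent to the provider right now."""
        if self.opened_at is None:
            return True
        # Open: let one probe through per probe interval.
        if now - self.last_probe_at >= self.probe_interval_s:
            self.last_probe_at = now
            return True
        return False

    def record(self, now, ok):
        """Record the outcome of a provider call."""
        if self.opened_at is not None:
            if ok:  # probe succeeded: close and start with a clean window
                self.opened_at = None
                self.samples.clear()
            return
        self.samples.append((now, ok))
        if self._error_rate(now) > self.threshold:
            self.opened_at = now
            self.last_probe_at = now
```

A worker would call `allow()` before each provider call and `record()` after it, and move the attempt to the delayed retry queue whenever `allow()` returns False.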
Provider rate limiting: FCM allows roughly 600,000 messages/minute per project; Twilio enforces per-account rate limits. The worker pool uses a distributed token bucket (backed by Redis) to enforce outbound rate limits per provider. At 59,200 peak push attempts/s, the system stays within FCM's ceiling of ~10,000/s per-project by distributing across multiple FCM projects or queuing excess attempts with slight delay.
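The token bucket itself is simple enough to sketch in-process. The version below is a single-process stand-in for the Redis-backed distributed bucket described above; the class name, the explicit `now` argument, and the choice of one second's worth of burst capacity are all assumptions for illustration.

```python
class TokenBucket:
    """Token bucket refilled continuously at rate_per_s up to capacity."""

    def __init__(self, rate_per_s: float, capacity: float, now: float = 0.0):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = now

    def try_acquire(self, now: float, n: float = 1) -> bool:
        """Take n tokens if available; return False when the caller must wait."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


if __name__ == "__main__":
    # 600,000 messages/minute is 10,000 per second for one FCM project.
    fcm_bucket = TokenBucket(rate_per_s=10_000, capacity=10_000)
    print(fcm_bucket.try_acquire(now=0.0, n=10_000))
    print(fcm_bucket.try_acquire(now=0.0))
```

When `try_acquire` returns False, the worker would leave the message in the queue, or re-enqueue it with a short delay, rather than call the provider.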
What this step gives us: actual delivery to external providers with intelligent retry logic, failure classification, circuit breaking, and rate limiting. But we have no way to confirm whether delivery actually reached the user's device, or to detect and fix delivery states that got stuck. That is the final piece.
Step D: Observability, Callbacks, and Reconciliation
The delivery attempt reaches the provider, and the provider says "accepted." But "accepted by FCM" does not mean "displayed on the user's phone." The notification might sit in FCM's queue while the user's phone is in doze mode, or the device might have uninstalled the app. The actual delivery status arrives later via provider callbacks, or sometimes never arrives at all.
┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  Providers   │────▶│   Callback   │────▶│  Postgres   │
│  (webhooks)  │     │   Ingester   │     │  (update    │
└──────────────┘     └──────────────┘     │   attempt   │
                                          │   status)   │
                                          └─────────────┘
                                                 ▲
┌────────────────┐                               │
│ Reconciliation │───scan stale PENDING──────────┘
│      Job       │───poll provider APIs──────────
└────────────────┘
Callback ingestion: providers send delivery receipts asynchronously: FCM via a data message callback, SendGrid via webhook events (delivered, bounced, opened), Twilio via status callback URL. The callback ingester matches each receipt to a DeliveryAttempt via providerMessageId and updates the status to DELIVERED, BOUNCED, or FAILED.
But callbacks are not reliable. This is a critical design principle: you cannot trust that every callback will arrive. Webhooks traverse the open internet: they can be dropped by network issues, delayed by provider batching, or lost entirely if your callback endpoint was briefly unavailable. Even a small loss rate is significant at scale: 0.1% of 800 million daily attempts = 800,000 uncertain delivery states per day that would remain permanently stuck in SENT status without correction.
Reconciliation job: a background process runs every 5 minutes and scans for DeliveryAttempt records with status SENT that are older than a configurable threshold (e.g., 10 minutes for push, 30 minutes for email). For each stale record, the job either polls the provider's delivery status API (FCM offers message status queries; SendGrid offers event lookup) or marks the attempt as UNCERTAIN after a final timeout. With roughly 800,000 uncertain records per day to resolve, the job's scan volume is modest: a Postgres indexed query on (status, createdAt) returns the relevant rows efficiently, and provider API polling is rate-limited to stay within quotas.
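The stale-record selection at the heart of the reconciliation job can be sketched over in-memory rows. This is an illustration only: in the design above the filter would run as an indexed Postgres query, and the function name, the dict-shaped rows, and the 30-minute fallback for channels without a stated threshold (such as SMS) are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Per-channel staleness thresholds from the examples above.
STALE_AFTER = {
    "push": timedelta(minutes=10),
    "email": timedelta(minutes=30),
}
DEFAULT_STALE_AFTER = timedelta(minutes=30)  # assumed fallback, e.g. for SMS


def find_stale_sent(attempts: list, now: datetime) -> list:
    """Return attempts stuck in SENT for longer than their channel's threshold."""
    stale = []
    for attempt in attempts:
        if attempt["status"] != "SENT":
            continue  # only attempts awaiting a delivery receipt are candidates
        threshold = STALE_AFTER.get(attempt["channel"], DEFAULT_STALE_AFTER)
        if now - attempt["createdAt"] > threshold:
            stale.append(attempt)
    return stale


if __name__ == "__main__":
    now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
    rows = [
        {"deliveryId": "d1", "channel": "push", "status": "SENT",
         "createdAt": now - timedelta(minutes=11)},
        {"deliveryId": "d2", "channel": "email", "status": "SENT",
         "createdAt": now - timedelta(minutes=20)},
    ]
    print([row["deliveryId"] for row in find_stale_sent(rows, now)])
```

Each returned row would then either be checked against the provider or marked UNCERTAIN once the final timeout passes.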
Metrics and alerting: the system exposes key operational metrics: success rate per channel per provider, retry rate, DLQ depth, p95 delivery latency (from ingest to provider acceptance), and provider error rate (input to the circuit breaker). When DLQ depth exceeds a threshold or a provider's success rate drops below 90%, alerts fire.
The Life of a Notification: One Trigger, Start to Finish
Let us trace what happens when the order service triggers a "Your order has shipped" notification for a user who has push and email enabled, with quiet hours from 10 PM to 8 AM in US-Pacific time:
- Order service calls POST /v1/notifications with eventId, userId, type: "order_shipped", payloadRef, and priority: "transactional". Time: 3:15 PM Pacific.
- Notification API validates and persists. Checks the eventId for idempotency (not a duplicate). Writes the NotificationEvent to Postgres. Publishes to the ingest SQS queue. Returns 202 Accepted. Latency: ~15 ms.
- Orchestrator worker consumes the event. Loads the user's preferences from Postgres: push enabled, email enabled, SMS disabled. Quiet hours are 10 PM - 8 AM Pacific; current time is 3:15 PM, outside quiet hours. Both push and email proceed.
- Template rendering. The orchestrator fetches the "order_shipped" templates for push and email, substitutes the order details from payloadRef, and produces: a push payload ({ title: "Order Shipped!", body: "Your order #1234 is on its way" }) and an email (HTML with tracking number and delivery estimate). Rendering: ~3 ms.
- Channel routing. The orchestrator writes two DeliveryAttempt rows to Postgres (one for push, one for email, both status: PENDING). Publishes the push attempt to Push SQS and the email attempt to Email SQS.
- Push worker delivers. Consumes from Push SQS. Calls the FCM adapter with the user's device token and push payload. FCM returns 200 with a providerMessageId. Worker updates the attempt status to SENT. Latency: ~50 ms.
- Email worker delivers. Consumes from Email SQS. Calls the SendGrid adapter with the user's email and rendered HTML. SendGrid accepts and returns a message ID. Worker updates the attempt status to SENT. Latency: ~150 ms.
- Provider callbacks arrive. FCM sends a delivery receipt 2 seconds later, once the user's phone has acknowledged the push. Callback ingester updates push attempt to DELIVERED. SendGrid sends a "delivered" webhook 45 seconds later. Email attempt updated to DELIVERED.
- If the push had failed (e.g., FCM returned 503): the worker schedules a retry with 5-second backoff. After 5 seconds, the message becomes visible in Push SQS again. The worker retries. If it fails again, backoff increases to 15 seconds. After 5 failed attempts, the attempt is marked TERMINAL and moved to the DLQ.
Total latency from trigger to push display on the user's phone: ~200-500 ms (ingest + queue + orchestration + provider call + FCM delivery). Well within conversational responsiveness for transactional notifications.
Now that the full pipeline is built, the data model needs to catch up.
6. Core Entities (v2): How the Pipeline Changed Our Data
Building the pipeline revealed operational needs that v1 did not anticipate.
NotificationEvent(eventId, userId, type, payloadRef, priority, dedupeKey, source, createdAt)
DeliveryAttempt(deliveryId, eventId, channel, provider, status, attemptNo, nextRetryAt,
errorCode, providerMessageId, latencyMs, createdAt, terminalAt)
ChannelQuota(channel, provider, perMinuteLimit, burstLimit)
UserPreference(userId, channel, enabled, quietStart, quietEnd, timezone, locale, muteUntil)
Template(templateId, channel, version, locale, contentRef)
NotificationEvent gained dedupeKey and source. dedupeKey is a content-level deduplication key, separate from eventId. Consider: the order service triggers "order shipped" for order #1234. Due to a bug, it triggers the same notification again with a different eventId. The eventId idempotency check passes (it is a new ID), but the dedupeKey (set to order_shipped:order_1234) catches the content-level duplicate. source identifies which internal service triggered the notification, enabling per-source rate limiting and debugging.
ChannelQuota is new. It did not exist in v1 because we had not yet designed provider rate limiting. Each provider has outbound rate limits: FCM allows ~600k messages/minute, Twilio has per-account limits, SendGrid has hourly sending caps. ChannelQuota stores these limits so the worker pool's distributed token bucket can enforce them dynamically: when we upgrade our Twilio plan, we update the quota row, and workers immediately send faster.
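To make the quota enforcement concrete, here is a minimal single-process token bucket in Python driven by the two ChannelQuota fields. It is a sketch under stated assumptions: the `TokenBucket` class and its method names are hypothetical, the clock is passed in explicitly so the behavior is deterministic, and a real deployment would keep the bucket state in a shared store (for example Redis) so that all workers draw from the same budget.

```python
class TokenBucket:
    """Single-process token bucket; a distributed version would share this state."""

    def __init__(self, per_minute_limit, burst_limit, now=0.0):
        self.rate = per_minute_limit / 60.0   # tokens refilled per second
        self.capacity = float(burst_limit)    # maximum tokens held at once
        self.tokens = float(burst_limit)
        self.updated = now

    def try_acquire(self, now, n=1):
        # Refill according to elapsed time, never exceeding the burst capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

# Illustrative quota row: 600 messages/minute (10/s) with a burst of 5.
bucket = TokenBucket(per_minute_limit=600, burst_limit=5)
burst = [bucket.try_acquire(now=0.0) for _ in range(6)]
after_refill = bucket.try_acquire(now=0.5)
```

The first five calls drain the burst, the sixth is refused, and half a second later the refill allows sending again. Updating the quota row would amount to constructing the bucket with new limits.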
UserPreference gained muteUntil. When a user long-presses a notification and selects "Mute for 1 hour," the client sets muteUntil to the current time + 1 hour. The orchestrator checks this field alongside quiet hours. This is simpler than managing a separate mute entity and covers the common case of temporary notification suppression.
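The combined quiet-hours and mute check might look like the following Python sketch. The function names are hypothetical. The one subtle point is that a window such as 10 PM to 8 AM wraps past midnight, so a plain start <= now < end comparison is wrong. The example uses a fixed UTC-8 offset as a stand-in for US-Pacific; in practice one would pass `zoneinfo.ZoneInfo("America/Los_Angeles")` so that daylight saving time is handled.

```python
from datetime import datetime, time, timedelta, timezone

def in_quiet_hours(now_utc, quiet_start, quiet_end, tz):
    """True if the user's local time falls inside the quiet window."""
    local = now_utc.astimezone(tz).time()
    if quiet_start <= quiet_end:
        return quiet_start <= local < quiet_end
    # The window wraps past midnight (e.g. 22:00 to 08:00).
    return local >= quiet_start or local < quiet_end

def is_suppressed(now_utc, quiet_start, quiet_end, tz, mute_until=None):
    """Check the temporary mute first, then the recurring quiet hours."""
    if mute_until is not None and now_utc < mute_until:
        return True
    return in_quiet_hours(now_utc, quiet_start, quiet_end, tz)

pacific = timezone(timedelta(hours=-8))  # fixed-offset stand-in, no DST
afternoon = datetime(2024, 1, 15, 23, 15, tzinfo=timezone.utc)   # 3:15 PM local
late_night = datetime(2024, 1, 16, 7, 30, tzinfo=timezone.utc)   # 11:30 PM local

afternoon_quiet = in_quiet_hours(afternoon, time(22), time(8), pacific)
late_night_quiet = in_quiet_hours(late_night, time(22), time(8), pacific)
muted = is_suppressed(afternoon, time(22), time(8), pacific,
                      mute_until=afternoon + timedelta(hours=1))
```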
With the full pipeline and refined data model in place, we are ready to stress-test the weak points.
7. Deep Dives: Stress-Testing the Pipeline
Deep Dive 1: Channel Isolation Under Provider Outages
It is Black Friday morning. Your SMS provider, Twilio, goes down with a 503 for all requests. You are sending 3,700 SMS attempts per second at peak. Within minutes, the SMS DLQ is filling with failed attempts. Here is the question: does this SMS outage affect push and email delivery?
With a shared queue: yes, catastrophically. SMS retries flood the queue. At 3,700/s × 5 retry attempts each = up to 18,500 retry messages accumulating per second. Push workers poll the queue and pull SMS messages they cannot process. Push latency spikes. Email stalls. One provider outage has taken down all three channels.
With per-channel queues: no impact. The SMS queue backs up. SMS workers hit the circuit breaker (error rate >50% for 60 seconds), stop sending, and queue retries for later. Meanwhile, the push queue processes 59,200/s and the email queue processes 11,100/s without any awareness that SMS is struggling. The user's push notification arrives in 200 ms; the SMS is delayed until Twilio recovers. This is exactly the behavior we want: graceful degradation of the broken channel with zero impact on healthy channels.
Going further: priority partitioning within a channel. Even within push, not all notifications are equal. An OTP code (transactional, high priority) must not wait behind a batch of 100,000 marketing push notifications. We split each channel into priority tiers: push-high and push-normal SQS queues. High-priority workers process the push-high queue first. At peak, if marketing push is 70% of traffic (41,440/s) and transactional push is 30% (17,760/s), priority splitting ensures that a large marketing batch does not delay security alerts. The tradeoff is more queues to manage (6 instead of 3), but the isolation between "user just clicked reset password" and "weekly promo digest" is worth it.
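As a rough illustration of the tier split, the Python sketch below routes attempts to a per-channel, per-tier queue and has the worker drain the high tier first. The function names are hypothetical, and in-memory deques stand in for the SQS queues.

```python
from collections import deque

def queue_name(channel, priority):
    """Map a delivery attempt to its per-channel, per-tier queue."""
    tier = "high" if priority == "transactional" else "normal"
    return f"{channel}-{tier}"

def next_message(queues, channel):
    """Strict priority: drain the high tier before touching normal."""
    for tier in ("high", "normal"):
        queue = queues[f"{channel}-{tier}"]
        if queue:
            return queue.popleft()
    return None

queues = {"push-high": deque(), "push-normal": deque()}
queues[queue_name("push", "marketing")].append("weekly promo digest")
queues[queue_name("push", "transactional")].append("password reset OTP")

first = next_message(queues, "push")
second = next_message(queues, "push")
```

Even though the marketing message was enqueued first, the OTP is served first. Strict priority can starve the normal tier under sustained high-priority load, so dedicated worker pools per tier are a common alternative.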
Deep Dive 2: Idempotency at Every Layer
A producer's HTTP call to the notification API times out. The producer retries. Without idempotency, the user gets two "Your order has shipped" emails. Annoying at best, trust-destroying at worst. But idempotency is needed at more than just the API layer.
Layer 1: ingest idempotency via eventId. The API checks whether eventId already exists in Postgres before inserting. If it does, it returns the original response (202 with the existing event's status). This is a simple unique constraint on eventId. At 74,000 peak/s, this check is a single indexed lookup, typically sub-millisecond.
Layer 2: content deduplication via dedupeKey. Even with eventId deduplication, a buggy producer might send the same logical notification with different event IDs. The dedupeKey (e.g., order_shipped:order_1234) catches this: the orchestrator checks whether a notification with the same dedupeKey was already processed within a configurable window (e.g., 24 hours). If so, it skips processing. This requires a lightweight lookup in Postgres on (dedupeKey, createdAt), an indexed query that costs ~1 ms.
Layer 3: delivery deduplication via eventId + channel + userId. If a channel worker crashes after successfully calling FCM but before acknowledging the SQS message, the message becomes visible again and another worker picks it up. Without delivery-level deduplication, the provider call is made twice, and the user might receive two push notifications. The worker checks Postgres for an existing SENT or DELIVERED attempt with the same (eventId, channel, userId) before making the provider call. If one exists, it skips the duplicate.
The cost of three-layer deduplication: three Postgres lookups per notification (eventId, dedupeKey, and delivery key). At 74,000/s, this is 222,000 additional point lookups/s. With proper indexing and connection pooling, each lookup is ~1 ms, and the lookups can be batched or cached in Redis for hot paths. The alternative, sending duplicate notifications to users, is worse in every dimension: user trust, provider cost, and support ticket volume.
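The three layers can be sketched together in Python. This is an in-memory illustration only: `Deduplicator` and its methods are hypothetical names, and the sets and dict stand in for the Postgres unique constraint and indexed lookups described above.

```python
class Deduplicator:
    def __init__(self, window_s=24 * 3600):
        self.window_s = window_s
        self.event_ids = set()        # Layer 1: eventId already ingested
        self.dedupe_first_seen = {}   # Layer 2: dedupeKey -> first-seen time
        self.sent = set()             # Layer 3: (eventId, channel, userId)

    def accept_event(self, event_id):
        """Layer 1: reject a replay of the same eventId."""
        if event_id in self.event_ids:
            return False
        self.event_ids.add(event_id)
        return True

    def accept_content(self, dedupe_key, now_s):
        """Layer 2: reject the same logical notification inside the window."""
        first_seen = self.dedupe_first_seen.get(dedupe_key)
        if first_seen is not None and now_s - first_seen < self.window_s:
            return False
        self.dedupe_first_seen[dedupe_key] = now_s
        return True

    def should_send(self, event_id, channel, user_id):
        """Layer 3: skip the provider call if this delivery was already made."""
        return (event_id, channel, user_id) not in self.sent

    def mark_sent(self, event_id, channel, user_id):
        self.sent.add((event_id, channel, user_id))

d = Deduplicator()
results = [
    d.accept_event("evt-1"),
    d.accept_event("evt-1"),                                  # producer retry
    d.accept_event("evt-2"),
    d.accept_content("order_shipped:order_1234", now_s=0),
    d.accept_content("order_shipped:order_1234", now_s=60),   # buggy re-trigger
]
d.mark_sent("evt-1", "push", "user-7")
redelivery_allowed = d.should_send("evt-1", "push", "user-7")
```

Layer 3 narrows the duplicate window but does not close it: a worker that crashes after the provider call and before recording SENT will still cause one repeat, which is inherent to at-least-once queues.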
Deep Dive 3: Push Is Not Guaranteed (The Fallback Chain)
A user has an Android phone in doze mode. FCM accepts our push notification and returns 200, so the notification is "sent." But the phone is asleep, battery optimization is aggressive, and the notification is never displayed. The user misses their OTP code and cannot log in.
Why push is unreliable by design. FCM and APNs documentation both describe push delivery as best-effort. The provider accepts the message, but actual delivery depends on device state (doze mode, battery saver, app killed by OS), network connectivity (phone in a subway), and platform-specific throttling (Android may batch low-priority notifications). Firebase's own documentation states that high-priority messages can wake a device from doze, but delivery is still not guaranteed if the device is offline.
The naive approach: treat push as reliable and assume the notification was delivered. Users miss critical alerts. Support tickets pile up.
The good approach: track delivery status and retry. If FCM's delivery receipt does not arrive within 60 seconds for a high-priority notification, schedule a retry push. If the retry also fails, the notification enters a pending state for the reconciliation job.
The great approach: policy-driven fallback chain. Define escalation rules per notification type:
- OTP codes: push first → if no delivery confirmation within 90 seconds → send SMS. The 90-second window gives push a fair chance while ensuring the user gets their code via a more reliable channel before they give up.
- Security alerts (new login, password change): push + email simultaneously. Both channels fire in parallel because the user must see this alert through any means necessary.
- Order updates: push only. If push fails, no fallback: the user can check order status in the app. The cost of an SMS for every order update is not justified.
- Marketing: push only. If push fails, no retry. Marketing notifications are low-value individually and not worth the SMS cost.
The fallback decision worker monitors delivery attempts for eligible notification types. When the push attempt's status remains SENT (no DELIVERED callback) past the fallback timeout, it creates a new delivery attempt for the fallback channel (SMS or email) and enqueues it.
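The escalation rules above lend themselves to a small policy table. The Python sketch below is one possible encoding, with hypothetical type keys (`otp`, `security_alert`, and so on) and a hypothetical `fallback_channel` function. Security alerts have no entry because they already fan out to push and email in parallel.

```python
# notification type -> (fallback channel, timeout in seconds), or None
FALLBACK_POLICY = {
    "otp": ("sms", 90),
    "security_alert": None,   # push + email fire in parallel, nothing to escalate
    "order_update": None,
    "marketing": None,
}

def fallback_channel(notification_type, push_status, seconds_since_sent):
    """Return the channel to escalate to, or None if no fallback applies."""
    policy = FALLBACK_POLICY.get(notification_type)
    if policy is None:
        return None
    channel, timeout_s = policy
    # Escalate only while the push is still unconfirmed past the timeout.
    if push_status == "SENT" and seconds_since_sent >= timeout_s:
        return channel
    return None

otp_timed_out = fallback_channel("otp", "SENT", 95)
otp_still_waiting = fallback_channel("otp", "SENT", 30)
otp_confirmed = fallback_channel("otp", "DELIVERED", 120)
marketing_unconfirmed = fallback_channel("marketing", "SENT", 600)
```

Keeping the rules in data rather than code means a new notification type only needs a new row, not a new branch in the worker.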
The cost tradeoff: SMS costs ~$0.01 per message. If 5% of OTP pushes fail and trigger SMS fallback, at 10 million OTPs/day, that is 500,000 SMS messages/day × $0.01 = $5,000/day in SMS fallback cost. This is the price of reliable OTP delivery, and it is almost always worth it compared to the user drop-off when OTP codes do not arrive.
Deep Dive 4: Retry Storms and Circuit Breaking
The email provider (SendGrid) returns 503 for 3 minutes during a maintenance window. During that window, at 11,100 email attempts/s, roughly 2 million email attempts fail. If all 2 million retry immediately when SendGrid recovers, the sudden surge can trigger another failure: a classic retry storm.
Without backoff: all 2 million retries fire within seconds of the maintenance window ending. SendGrid receives its normal 11,100/s traffic plus 2 million backlogged retries. Even spread over 30 seconds, that is 11,100 + 66,667 = 77,767/s, about 7× normal load. SendGrid's rate limiter kicks in (429 responses), which triggers more retries, which creates more 429s. The outage extends from 3 minutes to 15 minutes.
With exponential backoff + jitter: retries are spread across a window of 5 seconds to 5 minutes, with random jitter. The retry load after recovery is a gentle ramp, not a spike. SendGrid absorbs it alongside normal traffic. Recovery is smooth.
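A retry schedule along these lines can be sketched in Python. Several values are assumptions chosen to match the numbers in this post: a 5-second base, a growth factor of 3 (inferred from the 5-second then 15-second example), a 5-minute cap, and 5 attempts. The jitter variant, drawing uniformly between the base and the current ceiling, is also an assumption, since the text does not say which one is used.

```python
import random

BASE_S = 5.0        # first retry delay
FACTOR = 3.0        # assumed growth factor (5 s -> 15 s -> 45 s ...)
CAP_S = 300.0       # never wait longer than 5 minutes
MAX_ATTEMPTS = 5    # after this many failures the attempt goes to the DLQ

def backoff_ceiling(attempt_no):
    """Upper bound on the delay before retrying after the given failed attempt."""
    return min(CAP_S, BASE_S * FACTOR ** (attempt_no - 1))

def retry_delay(attempt_no, rng=random):
    """Jittered delay: uniform between the base and the current ceiling."""
    return rng.uniform(BASE_S, backoff_ceiling(attempt_no))

def next_action(failed_attempts):
    """Decide whether to retry or give up and move the attempt to the DLQ."""
    return "dlq" if failed_attempts >= MAX_ATTEMPTS else "retry"

ceilings = [backoff_ceiling(n) for n in range(1, MAX_ATTEMPTS + 1)]
rng = random.Random(42)  # seeded only to make the example repeatable
delays = [retry_delay(n, rng) for n in range(1, MAX_ATTEMPTS + 1)]
```

Because every failed attempt picks its own random delay, the 2 million backlogged retries spread across minutes instead of arriving in the same second.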
With circuit breaker: when the email provider's error rate exceeds 50% for 60 seconds, the circuit breaker opens. Workers stop calling SendGrid entirely and immediately re-enqueue attempts with a delayed visibility timeout (5 minutes). Once the breaker is open, about a minute into the 3-minute outage, no further calls or retries hit SendGrid; they are all queued for later. When the breaker's probe detects recovery (a small test batch succeeds), the breaker closes, and the delayed retries process normally. Our own retry traffic no longer prolongs the outage: it lasts its 3 minutes and not a second longer.
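A minimal breaker with these thresholds might look like the Python sketch below. It makes several assumptions beyond the text: "error rate exceeds 50% for 60 seconds" is interpreted as the error rate over a trailing 60-second window, a minimum call count guards against tripping on a handful of samples, and a single probe is allowed through after a 30-second wait. The class and parameter names are hypothetical.

```python
from collections import deque

class CircuitBreaker:
    def __init__(self, window_s=60.0, error_threshold=0.5,
                 min_calls=10, probe_after_s=30.0):
        self.window_s = window_s
        self.error_threshold = error_threshold
        self.min_calls = min_calls          # assumed guard against tiny samples
        self.probe_after_s = probe_after_s  # assumed wait before probing
        self.calls = deque()                # (timestamp, succeeded)
        self.opened_at = None               # None means the breaker is closed

    def allow(self, now):
        """Closed: always allow. Open: allow only a probe after the wait."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.probe_after_s

    def record(self, now, succeeded):
        if self.opened_at is not None:
            # This call was a probe: close on success, restart the wait on failure.
            if succeeded:
                self.opened_at = None
                self.calls.clear()
            else:
                self.opened_at = now
            return
        self.calls.append((now, succeeded))
        # Drop results that have aged out of the trailing window.
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, ok in self.calls if not ok)
        if (len(self.calls) >= self.min_calls
                and failures / len(self.calls) > self.error_threshold):
            self.opened_at = now

breaker = CircuitBreaker()
for t in range(10):
    breaker.record(float(t), succeeded=False)   # provider returns 503s
blocked = breaker.allow(10.0)
probe_allowed = breaker.allow(39.0)
breaker.record(39.0, succeeded=True)            # probe succeeds
recovered = breaker.allow(40.0)
```

While `allow` returns False, the worker would skip the provider call and re-enqueue the attempt with a delayed visibility timeout, as described above.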
Deep Dive 5: Data Retention and Cost Control
At 320 GB/day of delivery attempt records, storage grows fast. After one year of unmanaged retention, we would have ~117 TB of delivery logs. At typical Postgres storage costs (on the order of $0.10/GB/month for general-purpose SSD EBS volumes), that is about $11,700/month just for storage, plus degraded query performance as tables grow.
The retention strategy:
- Hot tier (0-14 days): full DeliveryAttempt rows in Postgres. This covers all active retries, recent delivery lookups, and the reconciliation job's scan window. Size: ~4.5 TB.
- Warm tier (14-90 days): archive completed (DELIVERED or TERMINAL) attempts to compressed Parquet files in S3. The GET /v1/notifications/{eventId}/status endpoint checks Postgres first, then falls back to S3 if the event is older than 14 days. Lookup is slower (~200 ms vs 5 ms) but rare for old notifications.
- Cold tier (90 days+): only compact aggregates survive. Per-day, per-channel success/failure counts are stored in a summary table in Postgres (~1 KB per day per channel = negligible storage). Raw attempt data is deleted.
The hot tier's 4.5 TB is manageable for Postgres with proper partitioning (partition the DeliveryAttempt table by createdAt day, drop old partitions). Queries on recent data stay fast because they only scan the current partition.
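The tier boundaries and the sizing arithmetic can be checked with a short Python sketch. `retention_tier` is a hypothetical helper, and treating a record that is exactly 14 or 90 days old as belonging to the older tier is an assumption, since the ranges in the list above overlap at their edges.

```python
DAILY_GB = 320  # delivery attempt volume per day, from the capacity estimate

def retention_tier(age_days):
    """Classify a delivery attempt record by its age in days."""
    if age_days < 14:
        return "hot"    # full rows in Postgres
    if age_days < 90:
        return "warm"   # compressed Parquet in S3
    return "cold"       # only daily aggregates remain

hot_tier_tb = DAILY_GB * 14 / 1000          # 14 days of full rows
unmanaged_year_tb = DAILY_GB * 365 / 1000   # one year with no retention policy

print(f"hot tier: {hot_tier_tb:.2f} TB, unmanaged year: {unmanaged_year_tb:.1f} TB")
```

The hot tier works out to roughly 4.5 TB against roughly 117 TB for a year of unmanaged growth, which is the gap the partition-dropping strategy closes.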
8. Tying It All Together: From One Queue to a Multi-Channel Pipeline
We started with an API that writes to a database and a queue. Let us look at what we built and why each piece was added:
| Step | Problem Solved | Components Added | Number That Forced It |
|---|---|---|---|
| A | Fast ingest, async processing | Notification API + Postgres + Ingest SQS | 74k peak/s; sync provider calls would block API |
| B | Preference filtering + channel isolation | Orchestrator + Per-channel SQS queues + Template Service | 80/15/5 channel split; SMS outage must not block push |
| C | Actual delivery + intelligent retries | Channel workers + Provider adapters + Circuit breakers | 20% transient failure rate → 14.8k retry/s without backoff |
| D | Delivery confirmation + gap detection | Callback ingester + Reconciliation job + Metrics | 0.1% callback loss → 800k uncertain states/day |
The data model evolved alongside:
| Entity | In v1? | In v2? | What Triggered the Change |
|---|---|---|---|
| NotificationEvent | Yes | Yes (+ dedupeKey, source) | Content-level dedup; per-source rate limiting |
| DeliveryAttempt | Yes | Yes (+ providerMessageId, latencyMs) | Callback matching; performance tracking |
| ChannelQuota | No | Yes | Provider rate limits needed dynamic enforcement |
| UserPreference | Yes | Yes (+ muteUntil) | Temporary mute feature |
Every component traces to a capacity number. Every entity field traces to an operational need. Nothing is here because "everyone uses SQS"; everything is here because the math or the failure mode demanded it.
9. Common Mistakes: What Gets Candidates Rejected
Synchronous provider calls in the API path. At 74,000 peak/s, if the API waits for FCM (50 ms) before responding, you need 3,700 concurrent connections just for FCM. One slow provider cascades into producer timeouts across every internal service.
Missing idempotency. A 1% producer retry rate on 74k/s means 740 duplicate notifications per second. Users notice.
One queue for all channels. A 15-minute SMS outage should not delay push delivery by a single millisecond. Per-channel queues are non-negotiable.
No DLQ or replay workflow. When a channel worker crashes, the message must not vanish. DLQs capture permanent failures; visibility timeouts protect in-flight processing.
Assuming push means guaranteed delivery. FCM and APNs are best-effort. If your OTP system has no fallback to SMS, users will fail to log in and blame your product.
No reconciliation for callback losses. Even 0.1% callback loss produces 800k uncertain delivery states per day. Without active reconciliation, your delivery status dashboard is unreliable.
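The 3,700 concurrent connections quoted in the first mistake follow from Little's law: requests in flight equal arrival rate times time in system. A small Python check, with a hypothetical helper name, also shows how quickly the number grows when the provider slows down:

```python
def in_flight_requests(rate_per_s, latency_ms):
    """Little's law: average concurrency = arrival rate x time in system."""
    return rate_per_s * latency_ms / 1000

normal = in_flight_requests(74_000, 50)     # FCM at its usual ~50 ms
degraded = in_flight_requests(74_000, 500)  # the same provider at 500 ms

print(f"normal: {normal:.0f} connections, degraded: {degraded:.0f} connections")
```

A tenfold latency increase means ten times as many open connections at the same request rate, which is why a slow provider in the synchronous path exhausts the API's connection pool.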
10. What the Interviewer Is Actually Evaluating
Did you separate the fast ingest path from the slow delivery path? The API must be fast (202 Accepted); delivery is asynchronous and can take seconds or minutes with retries.
Did you isolate channels from each other? Per-channel queues, independent worker pools, and per-provider circuit breakers demonstrate mature failure thinking.
Did you design retries correctly? Exponential backoff + jitter + circuit breaker + max attempts. Not immediate retries. Not infinite retries. Not one-size-fits-all retry policies.
Did you understand provider semantics? Push is best-effort (FCM/APNs docs say so). Callbacks are unreliable. Provider rate limits are real. Showing awareness of these constraints, ideally by citing the actual documentation, signals genuine engineering depth.
Did you address observability and reconciliation? A system that sends notifications but cannot tell you whether they were delivered is incomplete. The reconciliation job and the metrics pipeline are not nice-to-haves; they are core requirements.
Final Thought
A notification system is invisible infrastructure: users only notice it when it fails. The "Your order has shipped" push notification that arrives within 500 ms is the product of a pipeline that accepted the trigger asynchronously, checked user preferences, rendered channel-specific content, routed to an isolated per-channel queue, delivered through a provider adapter with retry logic and circuit breaking, confirmed delivery via webhook callbacks, and reconciled any gaps with a background scan.
We started with an API and a queue. We ended with a multi-channel pipeline with preference filtering, template rendering, per-channel isolation, exponential backoff with circuit breaking, provider rate limiting, fallback chains, three-layer idempotency, callback ingestion, and active reconciliation. Not because we designed it all upfront, but because each failure mode we encountered (provider outages, retry storms, callback losses, quiet hour violations) demanded exactly one architectural addition.
That is the discipline: complexity earned by evidence, never assumed by convention.