Design a Notification System (Interviewer Walkthrough)
Goal: deliver notifications across multiple channels (push/email/SMS) reliably, with user preferences and retries.
---
0) Pre-Design Research Inputs
Research used before architecture:
- FCM priorities (high vs normal) affect delivery urgency and battery behavior.
- APNs payload limits and best-effort delivery model.
- SNS/SQS fanout pattern for decoupled multi-consumer processing.
- 429 semantics for rate limiting on provider/client APIs.
Design implications:
- Push channels are delivery paths, not durable storage of record.
- Multi-channel processing should be decoupled with queues.
- Per-channel retries/backoff and DLQ are mandatory.
---
1) Requirements (~5 min)
Now that we know the problem scope, let us first lock the minimum requirements so architecture decisions stay focused and testable.
Functional
- Send notifications via push/email/SMS.
- Respect user preferences (channel opt-in, quiet hours).
- Retry failed deliveries and track final status.
Non-functional
- High ingest throughput (event-driven fanout).
- At-least-once delivery to channel workers.
- p95 ingest latency < 100ms.
- Provider outages should not stop all channels.
- Auditable delivery state.
Scope
- In scope: orchestration + delivery + retries + preferences.
- Out of scope: template editor UI, campaign builder UI.
---
2) Capacity Estimation
With requirements fixed, we now quantify event and channel throughput so we can decide where asynchronous boundaries are mandatory.
Assumptions:
- Notification triggers/day: 500M
- Avg fanout channels/event: 1.6
- Channel delivery attempts/day: 800M
- Peak factor: 8x
Numbers:
- Avg attempts/s: ~9.26k/s
- Peak attempts/s: ~74k/s
Implication:
- Synchronous delivery inside request path is unsafe.
- Queue-based worker model is required.
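The arithmetic above can be checked with a few lines. This is a back-of-envelope sketch using the stated assumptions, not production code:

```python
# Back-of-envelope check of the capacity numbers above.
SECONDS_PER_DAY = 86_400

triggers_per_day = 500_000_000   # notification triggers/day
fanout_channels = 1.6            # avg channels per event
peak_factor = 8                  # peak-to-average ratio

attempts_per_day = triggers_per_day * fanout_channels      # 800M/day
avg_attempts_per_s = attempts_per_day / SECONDS_PER_DAY    # ~9.26k/s
peak_attempts_per_s = avg_attempts_per_s * peak_factor     # ~74k/s
```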
---
2.1) Storage + Ops
Based on throughput, let us estimate storage growth and operational cost, because retries and delivery logs can dominate this system quickly.
Assume a NotificationDelivery row is ~400B:
- 800M/day * 400B = 320GB/day
- 30-day retention ~ 9.6TB (single copy)
Retry metadata + logs can dominate storage quickly. Decision:
- Keep detailed delivery logs short retention.
- Keep final state/aggregates longer.
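The storage figures follow directly from the row-size assumption (illustrative arithmetic; using 1 GB = 10^9 B):

```python
# Storage growth estimate for delivery rows at the stated load.
row_bytes = 400                    # ~400B per NotificationDelivery row
attempts_per_day = 800_000_000

daily_gb = attempts_per_day * row_bytes / 1e9     # 320 GB/day, single copy
thirty_day_tb = daily_gb * 30 / 1e3               # ~9.6 TB at 30-day retention
```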
---
3) Core Entities v1
Now that the load picture is clear, we define entities around lifecycle and observability: trigger, preference, attempt, and template.
- NotificationEvent(eventId, userId, type, payloadRef, createdAt)
- UserPreference(userId, channel, enabled, quietHours, locale)
- DeliveryAttempt(deliveryId, eventId, channel, provider, status, attemptNo, nextRetryAt, errorCode)
- Template(templateId, channel, version, contentRef)
Thought process:
- NotificationEvent is an immutable trigger record.
- DeliveryAttempt tracks retry lifecycle and auditability.
- Preferences are separate so one update applies to all future events.
Functional requirement traceability:
- FR1 (send via push/email/SMS) -> NotificationEvent + DeliveryAttempt.channel/provider.
- FR2 (respect preferences) -> UserPreference controls channel eligibility.
- FR3 (retry and final status tracking) -> DeliveryAttempt.status/attemptNo/nextRetryAt and terminal fields.
Why this mapping matters:
- It proves each entity exists to satisfy an explicit requirement, not as generic notification tables.
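The v1 entities can be sketched as dataclasses. Field names follow the list above; the types and the Status enum values are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    PENDING = "PENDING"
    SENT = "SENT"
    DELIVERED = "DELIVERED"
    FAILED = "FAILED"

@dataclass(frozen=True)              # immutable trigger record
class NotificationEvent:
    event_id: str
    user_id: str
    type: str
    payload_ref: str
    created_at: float

@dataclass
class DeliveryAttempt:               # retry lifecycle + audit trail
    delivery_id: str
    event_id: str
    channel: str
    provider: str
    status: Status
    attempt_no: int
    next_retry_at: Optional[float]
    error_code: Optional[str]
```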
---
4) API / Interface
With entities in place, let us define the control-plane and data-plane contracts, especially around idempotency and async acceptance.
- POST /v1/notifications (internal producer API)
- GET /v1/notifications/{eventId}/status
- Provider callback/webhook endpoint for delivery receipts
Key parameter reasoning:
- eventId / idempotency key: prevents duplicate notification creation on client retries.
- payloadRef instead of a large inline body: keeps queue messages lean and reusable.
Error semantics:
- 202: accepted for async processing.
- 409: duplicate idempotency key conflict.
- 429: producer throttling.
Functional requirement to API mapping:
- FR1 -> POST /v1/notifications starts async orchestration.
- FR2 -> preference checks happen in the orchestrator before enqueueing channel attempts.
- FR3 -> GET /v1/notifications/{eventId}/status + the callback endpoint expose the delivery lifecycle.
Why this mapping matters:
- Endpoint set directly mirrors requirements, keeping contract minimal and testable.
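The 202/409 idempotency semantics can be sketched as a handler. This is a minimal illustration, assuming an in-memory dict stands in for the event table:

```python
# Idempotent ingest sketch: same key returns the original acceptance,
# same key with a different payload conflicts.
events: dict = {}   # eventId -> stored event row

def post_notification(event_id: str, user_id: str, payload_ref: str):
    if event_id in events:
        existing = events[event_id]
        if existing["payload_ref"] != payload_ref:
            return 409, {"error": "idempotency key reused with different payload"}
        return 202, existing   # safe client retry: no new event created
    event = {"event_id": event_id, "user_id": user_id,
             "payload_ref": payload_ref, "status": "ACCEPTED"}
    events[event_id] = event   # persist; enqueue to broker omitted here
    return 202, event
```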
---
5) High-Level Design (progressive)
Now we can build the architecture progressively, starting with durable ingest and then isolating channel failures and retry behavior.
Step A: Accept + persist + enqueue
Producer calls POST /v1/notifications with eventId, userId, and payloadRef. The API persists the event, publishes to a broker, and returns 202 Accepted. Downstream channel workers consume from the broker and attempt delivery. The question is: how do we acknowledge ingest quickly without blocking on slow provider calls?
Components:
- Notification API
- Event store
- Broker topic/queue
Decision: API returns 202 after durable enqueue
- Need: low ingest latency and decoupled processing.
- Why: provider calls are slow/variable; should not block API.
- Why not sync send: one slow provider spikes ingest latency.
- Choice details:
  - persist the event row first,
  - publish the event to the broker,
  - return 202 once both are durable.
- How with numbers:
  - peak attempts ~74k/s; synchronous downstream calls at this rate would couple API latency directly to provider p95.
  - async handoff keeps ingest p95 predictable while workers absorb downstream variance.
- Tradeoff:
- client gets async acceptance, not immediate delivery result.
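The Step A ordering (persist, then enqueue, then ack) can be sketched as follows. `db` and `broker` are hypothetical in-memory stand-ins for the event store and the broker:

```python
# Step A sketch: ack 202 only after the event row and the enqueue are durable.
from collections import deque

db: dict = {}
broker: deque = deque()

def accept(event: dict) -> int:
    db[event["event_id"]] = event       # 1) durable event row
    broker.append(event["event_id"])    # 2) durable enqueue for workers
    return 202                          # 3) ack after both succeed
```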
Step B: Preference + template + channel routing
After the orchestrator worker consumes an event, it checks UserPreference for channel opt-in and quiet hours, renders content via the template service, and enqueues delivery attempts to per-channel queues (push/email/SMS). The question is: how do we isolate channel failures so one slow or failing provider does not block delivery on other channels?
Components:
- Orchestrator worker
- Preference store
- Template service
- Per-channel queues (push/email/SMS)
Decision: split per-channel queues
- Need: independent scaling/retry semantics.
- Why not single shared queue: one failing channel can starve others.
- Choice details:
  - use an SNS topic for fanout from the orchestrator,
  - subscribe separate SQS queues for push/email/SMS workers.
- Why this solves:
- each channel gets independent retry/DLQ/autoscaling.
- Why not Kafka-only for this exact layer:
- possible, but SNS/SQS gives managed fanout + queue isolation with lower ops burden for channel workers.
How with numbers:
- If push is 80% of load and email only 15%, separate queues avoid consumer imbalance.
- Example:
  - at 74k/s peak attempts, push may take ~59k/s.
  - an SMS outage should not block push throughput; isolated queues preserve this behavior.
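The orchestrator's preference-gated fanout to per-channel queues can be sketched like this, with in-memory deques standing in for the SNS/SQS layer:

```python
# Step B sketch: route one event to each opted-in channel's own queue,
# so backlogs and retries stay isolated per channel.
from collections import deque

channel_queues = {"push": deque(), "email": deque(), "sms": deque()}

def route(event_id: str, prefs: dict) -> list:
    """prefs maps channel name -> opted-in flag from UserPreference."""
    enqueued = []
    for channel, enabled in prefs.items():
        if enabled and channel in channel_queues:
            channel_queues[channel].append(event_id)  # isolated backlog
            enqueued.append(channel)
    return enqueued
```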
Step C: Provider adapters + retry policy
Channel workers dequeue attempts and call provider adapters (FCM/APNs/SMTP/SMS). Providers return success, transient failure, or permanent failure. On transient failure, workers must retry with appropriate backoff; on permanent failure, the attempt is marked terminal and moved to DLQ. The question is: how do we retry intelligently without creating retry storms or hitting provider rate limits?
Components:
- Channel workers
- Provider adapters (FCM/APNs/SMTP/SMS)
- Retry scheduler + DLQ
Decision: exponential backoff with jitter + max attempts
- Need: transient provider failures are common.
- Why not aggressive immediate retries: creates retry storm and provider throttling.
- Choice details:
  - class-based backoff schedules (transactional vs marketing),
  - per-provider circuit breakers,
  - DLQ after terminal retries.
- Why this solves:
- reduces synchronized retry bursts and protects provider quotas.
- How with numbers:
  - if 20% of 74k/s attempts fail transiently, naive immediate retry can add ~14.8k/s extra load instantly.
  - jittered exponential retries spread this load over time and reduce second-order failures.
- Tradeoff:
- longer tail for eventual success in degraded provider windows.
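The backoff decision above can be sketched as a delay function with full jitter and a max-attempts cap; the base, cap, and attempt limit are illustrative values:

```python
# Exponential backoff with full jitter; None signals a terminal attempt
# that should be routed to the DLQ.
import random
from typing import Optional

def retry_delay(attempt: int, base: float = 5.0, cap: float = 120.0,
                max_attempts: int = 5) -> Optional[float]:
    if attempt >= max_attempts:
        return None                          # terminal: move to DLQ
    exp = min(cap, base * (3 ** attempt))    # 5s, 15s, 45s, capped at 2m
    return random.uniform(0, exp)            # full jitter desynchronizes workers
```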
Step D: Observability + reconciliation
GET /v1/notifications/{eventId}/status returns delivery state per channel. Providers send callbacks via webhook endpoint to report final delivery/failure. However, callbacks can be lost or delayed. The question is: how do we ensure delivery state is eventually accurate when webhooks are unreliable?
- Delivery status table
- Metrics: success rate, retry rate, DLQ depth, provider latency
- Reconciliation job for callback misses/timeouts
Decision: keep explicit reconciliation pipeline (not callbacks only)
- Need: callback/webhook loss and delayed acknowledgements are real in distributed systems.
- Choice:
  - ingest provider callbacks,
  - periodically poll for stale PENDING attempts,
  - run a repair job to finalize delivery state.
- Why this solves:
- prevents orphaned attempts and inconsistent delivery views.
- Why not webhook-only:
- transport failures can drop callbacks; no correction path.
- How with numbers:
- even a small callback loss rate (e.g., 0.1% over 800M/day) is 800k uncertain attempts/day without reconciliation.
- Tradeoff:
- additional background compute and provider API polling cost.
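The reconciliation repair job can be sketched as a periodic sweep over stale PENDING attempts. `provider_status` is a hypothetical polling call to the provider's status API; the staleness window is illustrative:

```python
# Reconciliation sketch: pull authoritative state for attempts whose
# callback never arrived within the staleness window.
STALE_AFTER_S = 600  # treat PENDING older than 10 min as callback-lost

def reconcile(attempts: list, provider_status, now: float) -> int:
    repaired = 0
    for a in attempts:
        if a["status"] == "PENDING" and now - a["sent_at"] > STALE_AFTER_S:
            # Authoritative pull replaces the missing webhook.
            a["status"] = provider_status(a["provider_message_id"])
            repaired += 1
    return repaired
```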
---
Core Entities v2
After evolving the architecture, we refine entities so they capture operational controls (priority, quota, provider correlation IDs).
- NotificationEvent(..., priority, dedupeKey, source)
- DeliveryAttempt(..., providerMessageId, latencyMs, terminalAt)
- ChannelQuota(channel, perMinuteLimit, burst)
- UserPreference(..., timezone, muteUntil)
Why changed:
- Added explicit dedupe and priority for operational control.
- Added quota entity to avoid provider/API bans.
---
6) Deep Dives (numeric + mechanism)
With the end-to-end pipeline ready, let us test the highest-risk failure patterns in sequence.
Deep Dive 1: Queue topology
Now that routing exists, first validate whether topology itself causes or prevents failure coupling. Bad: one shared queue for push/email/SMS.
- Why bad: failure coupling across channels.
- Example: SMS provider returns 5xx for 15 minutes.
- How technically:
  - retries for SMS flood the same queue.
  - at peak 74k attempts/s, if SMS is 20% and fails hard, retry load can double queue pressure.
  - push/email messages wait behind the SMS backlog -> user-visible latency spikes on unrelated channels.
Good: per-channel queues.
- Why good: isolates backlog, retries, and autoscaling by channel.
- Example: SMS queue depth rises, push queue remains healthy.
- How: channel workers scale independently and use channel-specific retry policies.
Great: per-channel + per-priority partitions (e.g., push-high, push-normal).
- Why great: protects urgent traffic from normal batch traffic.
- Example: OTP/security alerts stay low-latency while marketing notifications queue.
- Tradeoff: more routing rules and ops complexity.
Deep Dive 2: Idempotency
Next, once topology is stable, we verify retry safety so duplicates do not leak to users. Bad: no dedupe key on ingest.
- Why bad: producer retries create duplicate events.
- Example: API timeout at caller causes retry with same payload.
- How technically:
  - at peak 74k/s, even 1% retry duplicates = 740 duplicate events/s.
  - duplicates inflate cost and may send users duplicate notifications.
Good: ingest-level idempotency key persisted with initial event.
- Why good: same key returns existing result instead of creating new event.
- Example: eventId reused within 24h returns the original accepted response.
Great: dedupe at ingest + delivery worker (deliveryKey = eventId + channel + recipient).
- Why great: protects against duplicates from both producer retries and worker reprocessing.
- Tradeoff: extra storage/lookup overhead in hot path.
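The delivery-level half of the "Great" option can be sketched as a key check before each send. An in-memory set stands in for what would be a conditional write (unique index or SETNX with TTL) in production:

```python
# Delivery-level dedupe: deliveryKey = eventId + channel + recipient
# protects against worker reprocessing, not just producer retries.
seen_deliveries: set = set()

def should_deliver(event_id: str, channel: str, recipient: str) -> bool:
    key = f"{event_id}:{channel}:{recipient}"
    if key in seen_deliveries:
        return False            # reprocessed message: skip duplicate send
    seen_deliveries.add(key)    # production: atomic conditional write + TTL
    return True
```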
Deep Dive 3: Push delivery assumptions
Now we challenge delivery assumptions, because push channels are often misunderstood as guaranteed delivery. Bad: treat push provider as durable guaranteed delivery.
- Why bad: push networks are best-effort and device-state dependent.
- Example: device in doze/offline state misses immediate delivery.
- How: if app relies only on push as source of truth, message/alert can be dropped with no recovery.
Good: model push as delivery attempt with status tracking.
- Why good: system can distinguish SENT vs DELIVERED vs FAILED.
- Example: unacked push after timeout triggers retry or fallback policy.
Great: policy-driven fallback channel chain.
- Example: "push fail 3 times in 10 min -> send email/SMS fallback."
- How: fallback decision worker evaluates attempt history and SLA class.
- Tradeoff: higher channel spend and policy tuning complexity.
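The fallback decision worker's core rule can be sketched as a chain lookup; the chain order and threshold are illustrative policy values:

```python
# Fallback chain sketch: stay on the current channel until the failure
# threshold inside the window is hit, then escalate to the next channel.
from typing import Optional

FALLBACK_CHAIN = ["push", "email", "sms"]

def next_channel(channel: str, failures_in_window: int,
                 threshold: int = 3) -> Optional[str]:
    if failures_in_window < threshold:
        return channel                       # keep retrying current channel
    i = FALLBACK_CHAIN.index(channel)
    if i + 1 < len(FALLBACK_CHAIN):
        return FALLBACK_CHAIN[i + 1]         # escalate: push -> email -> sms
    return None                              # chain exhausted: terminal failure
```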
Deep Dive 4: Retry storms
With channel semantics clear, we evaluate retry control under provider incidents. Bad: immediate retries with fixed short delay.
- Why bad: synchronized retry storm worsens outage.
- Example: provider returns 503 for 2 minutes.
- How technically:
  - the failed batch is retried almost instantly by all workers.
  - outbound QPS multiplies (self-DDoS); queue lag and error rates spike.
Good: exponential backoff + jitter + max attempts.
- Why good: spreads retry load in time and reduces sync spikes.
- Example: retry schedule 5s, 15s, 45s, 2m with randomness.
Great: circuit breaker + adaptive retry budget.
- Why great: when provider is clearly down, system pauses costly retries and preserves resources.
- Example: if provider error rate > 50% for 60s, open breaker for that provider route.
- Tradeoff: delayed recovery for borderline incidents if thresholds are too aggressive.
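The per-provider breaker rule ("error rate > 50% -> open for 60s") can be sketched as a small class. Window handling is simplified to a rolling counter that resets on trip; thresholds are illustrative:

```python
# Error-rate circuit breaker per provider route.
from typing import Optional

class Breaker:
    def __init__(self, threshold: float = 0.5, open_for: float = 60.0,
                 min_samples: int = 10):
        self.threshold, self.open_for, self.min_samples = threshold, open_for, min_samples
        self.ok = self.fail = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is not None and now - self.opened_at < self.open_for:
            return False                  # open: pause costly retries
        return True

    def record(self, success: bool, now: float) -> None:
        self.ok += int(success)
        self.fail += int(not success)
        total = self.ok + self.fail
        if total >= self.min_samples and self.fail / total > self.threshold:
            self.opened_at = now          # trip breaker for this route
            self.ok = self.fail = 0       # reset the sampling window
```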
Deep Dive 5: Data retention
Finally, after reliability logic is set, we optimize retention and storage so operations stay sustainable long-term. Bad: keep all delivery attempt logs forever.
- Why bad: unbounded cost growth.
- Example:
  - 320GB/day of raw attempts (single copy) becomes ~9.6TB in 30 days.
  - with replication, storage and query costs multiply.
- How: large history tables hurt index performance and slow operational queries.
Good: TTL detailed attempt rows after short retention window (e.g., 14-30 days).
- Why good: keeps hot operational dataset small.
Great: retain compact aggregates/final state for long-term reporting.
- Example: keep per-day channel success metrics for 1 year, raw attempts for 30 days.
- Tradeoff: deep forensic analysis beyond TTL needs archived cold storage.
---
7) Common mistakes
- Doing synchronous provider calls in ingest API.
- Missing idempotency.
- One queue for all channels.
- No DLQ/replay workflow.
- Assuming push means guaranteed delivery.
---
8) Interviewer signals
- Did you separate control plane (policy/templates) from data plane (delivery)?
- Did you design retries and failure isolation per channel?
- Did you use provider semantics correctly (best effort, limits)?
---
9) References
- Firebase priorities: https://firebase.google.com/docs/cloud-messaging/android/message-priority
- APNs docs: https://developer.apple.com/documentation/usernotifications/setting_up_a_remote_notification_server/sending_notification_requests_to_apns
- SNS/SQS decision guide: https://docs.aws.amazon.com/decision-guides/latest/sns-or-sqs-or-eventbridge/sns-or-sqs-or-eventbridge.html
- RFC 6585 (429): https://www.rfc-editor.org/rfc/rfc6585
Key Takeaways
- Send notifications via push/email/SMS.
- Respect user preferences (channel opt-in, quiet hours).
- Retry failed deliveries and track final status.
- Keep ingest throughput high (event-driven fanout).
- Provide at-least-once delivery to channel workers.