
Design a Notification System (Interviewer Walkthrough)

Goal: deliver notifications across multiple channels (push/email/SMS) reliably, with user preferences and retries.

---

0) Pre-Design Research Inputs

Research used before architecture:

  1. FCM priorities (high vs normal) affect delivery urgency and battery behavior.
  2. APNs payload limits and best-effort delivery model.
  3. SNS/SQS fanout pattern for decoupled multi-consumer processing.
  4. 429 semantics for rate limiting on provider/client APIs.

Design implications:

  • Push channels are delivery paths, not durable storage of record.
  • Multi-channel processing should be decoupled with queues.
  • Per-channel retries/backoff and DLQ are mandatory.

---

1) Requirements (~5 min)

Now that we know the problem scope, let us first lock in the minimum requirements so architecture decisions stay focused and testable.

Functional

  1. Send notifications via push/email/SMS.
  2. Respect user preferences (channel opt-in, quiet hours).
  3. Retry failed deliveries and track final status.

Non-functional

  1. High ingest throughput (event-driven fanout).
  2. At-least-once delivery to channel workers.
  3. p95 ingest latency < 100ms.
  4. Provider outages should not stop all channels.
  5. Auditable delivery state.

Scope

  • In scope: orchestration + delivery + retries + preferences.
  • Out of scope: template editor UI, campaign builder UI.

---

2) Capacity Estimation

With requirements fixed, we now quantify event and channel throughput so we can decide where asynchronous boundaries are mandatory.

Assumptions:

  • Notification triggers/day: 500M
  • Avg fanout channels/event: 1.6
  • Channel delivery attempts/day: 800M
  • Peak factor: 8x

Numbers:

  • Avg attempts/s: ~9.26k/s
  • Peak attempts/s: ~74k/s

Implication:

  • Synchronous delivery inside request path is unsafe.
  • Queue-based worker model is required.
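
The arithmetic above can be checked in a few lines; a sketch, using only the assumptions stated in this section:

```python
# Back-of-envelope throughput check using the assumptions above.
SECONDS_PER_DAY = 24 * 60 * 60

triggers_per_day = 500_000_000
fanout_channels_per_event = 1.6
peak_factor = 8

attempts_per_day = triggers_per_day * fanout_channels_per_event  # 800M attempts/day
avg_attempts_per_sec = attempts_per_day / SECONDS_PER_DAY        # ~9.26k/s
peak_attempts_per_sec = avg_attempts_per_sec * peak_factor       # ~74k/s
```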

---

2.1) Storage + Ops

Based on throughput, let us estimate storage growth and operational cost, because retries and delivery logs can dominate this system quickly.

Assume NotificationDelivery row ~400B:

  • 800M/day * 400B = 320GB/day
  • 30-day retention ~9.6TB single copy

Retry metadata + logs can dominate storage quickly. Decision:

  • Keep detailed delivery logs short retention.
  • Keep final state/aggregates longer.

---

3) Core Entities v1

Now that the load picture is clear, we define entities around lifecycle and observability: trigger, preference, attempt, and template.

  • NotificationEvent(eventId, userId, type, payloadRef, createdAt)
  • UserPreference(userId, channel, enabled, quietHours, locale)
  • DeliveryAttempt(deliveryId, eventId, channel, provider, status, attemptNo, nextRetryAt, errorCode)
  • Template(templateId, channel, version, contentRef)

Thought process:

  • NotificationEvent is an immutable trigger record.
  • DeliveryAttempt tracks the retry lifecycle and provides auditability.
  • Preferences are stored separately so one update applies to all future events.

Functional requirement traceability:

  • FR1 (send via push/email/SMS) -> NotificationEvent + DeliveryAttempt.channel/provider.
  • FR2 (respect preferences) -> UserPreference controls channel eligibility.
  • FR3 (retry and final status tracking) -> DeliveryAttempt.status/attemptNo/nextRetryAt and terminal fields.

Why this mapping matters:

  • It proves each entity exists to satisfy an explicit requirement, not as generic notification tables.
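
As one possible concretization, the v1 entities can be sketched as dataclasses; the field types and status values are assumptions for illustration, not part of the original spec:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass(frozen=True)          # immutable trigger record
class NotificationEvent:
    event_id: str
    user_id: str
    type: str
    payload_ref: str             # pointer to payload blob, not an inline body
    created_at: datetime

@dataclass
class UserPreference:
    user_id: str
    channel: str                 # "push" | "email" | "sms"
    enabled: bool
    quiet_hours: tuple[int, int] # assumed local start/end hour, e.g. (22, 7)
    locale: str

@dataclass
class DeliveryAttempt:
    delivery_id: str
    event_id: str
    channel: str
    provider: str
    status: str                  # e.g. PENDING | SENT | DELIVERED | FAILED
    attempt_no: int
    next_retry_at: Optional[datetime]
    error_code: Optional[str]
```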

---

4) API / Interface

With entities in place, let us define the control-plane and data-plane contracts, especially around idempotency and async acceptance.

  • POST /v1/notifications (internal producer API)
  • GET /v1/notifications/{eventId}/status
  • Provider callback/webhook endpoint for delivery receipts

Key parameter reasoning:

  • eventId/idempotency key: prevents duplicate notification creation on client retries.
  • payloadRef instead of large body inline: keeps queue message lean and reusable.

Error semantics:

  • 202 accepted for async processing.
  • 409 for duplicate idempotency key conflict.
  • 429 for producer throttling.

Functional requirement to API mapping:

  • FR1 -> POST /v1/notifications starts async orchestration.
  • FR2 -> preference checks happen in orchestrator before enqueueing channel attempts.
  • FR3 -> GET /v1/notifications/{eventId}/status + callback endpoint expose delivery lifecycle.

Why this mapping matters:

  • Endpoint set directly mirrors requirements, keeping contract minimal and testable.
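
The error semantics above (202 on first accept, 409 on a duplicate idempotency key) can be sketched as follows, with an in-memory dict standing in for the event store's unique-key check:

```python
# Minimal sketch of idempotent ingest. In production the dedupe would be a
# unique index or conditional write in the event store, not a dict.
events: dict[str, dict] = {}   # event_id -> stored event

def post_notification(event_id: str, user_id: str, payload_ref: str) -> tuple[int, dict]:
    """Returns (http_status, body): 202 on first accept, 409 on a duplicate key."""
    if event_id in events:
        return 409, {"error": "duplicate", "event_id": event_id}
    event = {"event_id": event_id, "user_id": user_id,
             "payload_ref": payload_ref, "status": "ACCEPTED"}
    events[event_id] = event       # persist before acknowledging
    return 202, event
```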

---

5) High-Level Design (progressive)

Now we can build the architecture progressively, starting with durable ingest and then isolating channel failures and retry behavior.

Step A: Accept + persist + enqueue

Producer calls POST /v1/notifications with eventId, userId, and payloadRef. The API persists the event, publishes to a broker, and returns 202 Accepted. Downstream channel workers consume from the broker and attempt delivery. The question is: how do we acknowledge ingest quickly without blocking on slow provider calls?

Components:

  • Notification API
  • Event store
  • Broker topic/queue

Decision: API returns 202 after durable enqueue

  • Need: low ingest latency and decoupled processing.
  • Why: provider calls are slow/variable; should not block API.
  • Why not sync send: one slow provider spikes ingest latency.
  • Choice details:

- persist the event row first,
- publish the event to the broker,
- return 202 once both are durable.

  • How with numbers:

- peak attempts ~74k/s; synchronous downstream calls at this rate would couple API latency directly to provider p95.
- async handoff keeps ingest p95 predictable while workers absorb downstream variance.

  • Tradeoff:

- client gets async acceptance, not immediate delivery result.
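
Step A's ordering (persist, then enqueue, then ack) can be sketched like this; `db` and `broker` are in-memory stand-ins for the event store and the broker:

```python
import queue

db: dict[str, dict] = {}                    # stand-in for the event store
broker: "queue.Queue[dict]" = queue.Queue() # stand-in for the broker topic

def accept(event: dict) -> int:
    """Persist the event, hand it to the broker, then acknowledge with 202."""
    db[event["event_id"]] = event           # 1) durable event row
    try:
        broker.put(event)                   # 2) durable enqueue for workers
    except Exception:
        db[event["event_id"]]["status"] = "ENQUEUE_FAILED"
        return 500                          # do not ack if handoff failed
    return 202                              # 3) ack; no provider call in this path
```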

Step B: Preference + template + channel routing

After the orchestrator worker consumes an event, it checks UserPreference for channel opt-in and quiet hours, renders content via the template service, and enqueues delivery attempts to per-channel queues (push/email/SMS). The question is: how do we isolate channel failures so one slow or failing provider does not block delivery on other channels?

Components:

  • Orchestrator worker
  • Preference store
  • Template service
  • Per-channel queues (push/email/SMS)

Decision: split per-channel queues

  • Need: independent scaling/retry semantics.
  • Why not a single shared queue: one failing channel can starve the others.
  • Choice details:

- use an SNS topic for fanout from the orchestrator,
- subscribe separate SQS queues for push/email/SMS workers.

  • Why this solves:

- each channel gets independent retry/DLQ/autoscaling.

  • Why not Kafka-only for this exact layer:

- possible, but SNS/SQS gives managed fanout + queue isolation with lower ops burden for channel workers.

How with numbers:

  • If push is 80% of load and email only 15%, separate queues avoid consumer imbalance.
  • Example:

- at 74k/s peak attempts, push may take ~59k/s.
- an SMS outage should not block push throughput; isolated queues preserve this behavior.
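
A sketch of the per-channel split, with plain in-memory queues standing in for the SNS-to-SQS fanout:

```python
import queue

# One queue per channel so a failing channel's backlog cannot starve the others.
channel_queues: dict[str, "queue.Queue[dict]"] = {
    ch: queue.Queue() for ch in ("push", "email", "sms")
}

def route(event: dict, preferences: dict[str, bool]) -> list[str]:
    """Enqueue one delivery attempt per opted-in channel; return channels used."""
    routed = []
    for channel, q in channel_queues.items():
        if preferences.get(channel, False):
            q.put({"event_id": event["event_id"], "channel": channel})
            routed.append(channel)
    return routed
```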

Step C: Provider adapters + retry policy

Channel workers dequeue attempts and call provider adapters (FCM/APNs/SMTP/SMS). Providers return success, transient failure, or permanent failure. On transient failure, workers must retry with appropriate backoff; on permanent failure, the attempt is marked terminal and moved to DLQ. The question is: how do we retry intelligently without creating retry storms or hitting provider rate limits?

Components:

  • Channel workers
  • Provider adapters (FCM/APNs/SMTP/SMS)
  • Retry scheduler + DLQ

Decision: exponential backoff with jitter + max attempts

  • Need: transient provider failures are common.
  • Why not aggressive immediate retries: creates retry storm and provider throttling.
  • Choice details:

- class-based backoff schedules (transactional vs marketing),
- per-provider circuit breaker,
- DLQ after terminal retries.

  • Why this solves:

- reduces synchronized retry bursts and protects provider quotas.

  • How with numbers:

- if 20% of 74k/s attempts fail transiently, naive immediate retry can add ~14.8k/s extra load instantly.
- jittered exponential retries spread this load over time and reduce second-order failures.

  • Tradeoff:

- longer tail for eventual success in degraded provider windows.
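
The jittered exponential backoff can be sketched as "full jitter" (delay drawn uniformly below an exponentially growing, capped ceiling); the base and cap values are illustrative, not prescribed by the design:

```python
import random

def backoff_delay(attempt_no: int, base: float = 5.0, cap: float = 120.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2^(n-1))] seconds.

    Randomizing the whole interval (not just a small offset) is what breaks up
    synchronized retry waves across many workers.
    """
    ceiling = min(cap, base * (2 ** (attempt_no - 1)))
    return random.uniform(0, ceiling)
```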

Step D: Observability + reconciliation

GET /v1/notifications/{eventId}/status returns delivery state per channel. Providers send callbacks via webhook endpoint to report final delivery/failure. However, callbacks can be lost or delayed. The question is: how do we ensure delivery state is eventually accurate when webhooks are unreliable?

  • Delivery status table
  • Metrics: success rate, retry rate, DLQ depth, provider latency
  • Reconciliation job for callback misses/timeouts

Decision: keep explicit reconciliation pipeline (not callbacks only)

  • Need: callback/webhook loss and delayed acknowledgements are real in distributed systems.
  • Choice:

- ingest provider callbacks,
- periodically poll for stale PENDING attempts,
- run a repair job to finalize delivery state.

  • Why this solves:

- prevents orphaned attempts and inconsistent delivery views.

  • Why not webhook-only:

- transport failures can drop callbacks; no correction path.

  • How with numbers:

- even a small callback loss rate (e.g., 0.1% over 800M/day) is 800k uncertain attempts/day without reconciliation.

  • Tradeoff:

- additional background compute and provider API polling cost.
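
The repair job can be sketched as a sweep over stale PENDING attempts; `poll_provider` is a hypothetical provider status API returning the final state, or None when still unknown:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)   # illustrative staleness threshold

def reconcile(attempts: list[dict], poll_provider, now: datetime = None) -> int:
    """Finalize PENDING attempts whose callback never arrived.

    `poll_provider(attempt)` -> "DELIVERED" | "FAILED" | None.
    None means the provider still does not know; leave the row alone.
    Returns the number of rows repaired.
    """
    now = now or datetime.now(timezone.utc)
    repaired = 0
    for a in attempts:
        if a["status"] == "PENDING" and now - a["sent_at"] > STALE_AFTER:
            result = poll_provider(a)
            if result is not None:
                a["status"] = result
                repaired += 1
    return repaired
```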

---

Core Entities v2

After evolving the architecture, we refine entities so they capture operational controls (priority, quota, provider correlation IDs).

  • NotificationEvent(..., priority, dedupeKey, source)
  • DeliveryAttempt(..., providerMessageId, latencyMs, terminalAt)
  • ChannelQuota(channel, perMinuteLimit, burst)
  • UserPreference(..., timezone, muteUntil)

Why changed:

  • Added explicit dedupe and priority for operational control.
  • Added quota entity to avoid provider/API bans.
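
ChannelQuota maps naturally onto a token bucket; a sketch, assuming `perMinuteLimit` sets the refill rate and `burst` the bucket capacity:

```python
import time

class ChannelQuota:
    """Token-bucket limiter for ChannelQuota(channel, perMinuteLimit, burst)."""

    def __init__(self, per_minute_limit: int, burst: int):
        self.rate = per_minute_limit / 60.0   # tokens refilled per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        """Spend one token if available; otherwise the caller should delay."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```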

---

6) Deep Dives (numeric + mechanism)

With the end-to-end pipeline ready, let us test the highest-risk failure patterns in sequence.

Deep Dive 1: Queue topology

Now that routing exists, first validate whether the topology itself causes or prevents failure coupling.

Bad: one shared queue for push/email/SMS.

  • Why bad: failure coupling across channels.
  • Example: SMS provider returns 5xx for 15 minutes.
  • How technically:

- retries for SMS flood the same queue.
- with peak 74k attempts/s, if SMS is 20% and fails hard, retry load can double queue pressure.
- push/email messages wait behind the SMS backlog -> user-visible latency spike on unrelated channels.

Good: per-channel queues.

  • Why good: isolates backlog, retries, and autoscaling by channel.
  • Example: SMS queue depth rises, push queue remains healthy.
  • How: channel workers scale independently and use channel-specific retry policies.

Great: per-channel + per-priority partitions (e.g., push-high, push-normal).

  • Why great: protects urgent traffic from normal batch traffic.
  • Example: OTP/security alerts stay low-latency while marketing notifications queue.
  • Tradeoff: more routing rules and ops complexity.

Deep Dive 2: Idempotency

Next, once topology is stable, we verify retry safety so duplicates do not leak to users.

Bad: no dedupe key on ingest.

  • Why bad: producer retries create duplicate events.
  • Example: API timeout at caller causes retry with same payload.
  • How technically:

- at peak 74k/s, even 1% retry duplicates = 740 duplicate events/s.
- duplicates inflate cost and may send users duplicate notifications.

Good: ingest-level idempotency key persisted with initial event.

  • Why good: same key returns existing result instead of creating new event.
  • Example: eventId reused within 24h returns original accepted response.

Great: dedupe at ingest + delivery worker (deliveryKey = eventId + channel + recipient).

  • Why great: protects against duplicates from both producer retries and worker reprocessing.
  • Tradeoff: extra storage/lookup overhead in hot path.
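
The "great" variant's delivery-level key can be sketched with a set standing in for a unique index or conditional write:

```python
# deliveryKey = (eventId, channel, recipient), checked before any provider call.
# A set stands in here for a unique index / conditional write in real storage.
seen_deliveries: set[tuple[str, str, str]] = set()

def should_deliver(event_id: str, channel: str, recipient: str) -> bool:
    """False when this exact delivery was already attempted (worker replay
    or producer retry); True the first time, recording the key."""
    key = (event_id, channel, recipient)
    if key in seen_deliveries:
        return False
    seen_deliveries.add(key)
    return True
```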

Deep Dive 3: Push delivery assumptions

Now we challenge delivery assumptions, because push channels are often misunderstood as guaranteed delivery.

Bad: treat the push provider as durable, guaranteed delivery.

  • Why bad: push networks are best-effort and device-state dependent.
  • Example: device in doze/offline state misses immediate delivery.
  • How: if app relies only on push as source of truth, message/alert can be dropped with no recovery.

Good: model push as delivery attempt with status tracking.

  • Why good: system can distinguish SENT vs DELIVERED vs FAILED.
  • Example: unacked push after timeout triggers retry or fallback policy.

Great: policy-driven fallback channel chain.

  • Example: "push fail 3 times in 10 min -> send email/SMS fallback."
  • How: fallback decision worker evaluates attempt history and SLA class.
  • Tradeoff: higher channel spend and policy tuning complexity.

Deep Dive 4: Retry storms

With channel semantics clear, we evaluate retry control under provider incidents.

Bad: immediate retries with a fixed short delay.

  • Why bad: synchronized retry storm worsens outage.
  • Example: provider returns 503 for 2 minutes.
  • How technically:

- the failed batch is retried almost instantly by all workers.
- outbound QPS multiplies (self-DDoS); queue lag and error rates spike.

Good: exponential backoff + jitter + max attempts.

  • Why good: spreads retry load in time and reduces sync spikes.
  • Example: retry schedule 5s, 15s, 45s, 2m with randomness.

Great: circuit breaker + adaptive retry budget.

  • Why great: when provider is clearly down, system pauses costly retries and preserves resources.
  • Example: if provider error rate > 50% for 60s, open breaker for that provider route.
  • Tradeoff: delayed recovery for borderline incidents if thresholds are too aggressive.
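
The breaker rule ("error rate > 50% -> open the route") can be sketched over a sliding window of recent results; the threshold, window size, and minimum sample are illustrative:

```python
class ProviderBreaker:
    """Open the breaker when the error rate over recent calls exceeds a threshold."""

    def __init__(self, threshold: float = 0.5, window: int = 100, min_sample: int = 20):
        self.threshold = threshold
        self.window = window
        self.min_sample = min_sample    # avoid tripping on a handful of calls
        self.results: list[bool] = []   # True = success
        self.open = False

    def record(self, success: bool) -> None:
        self.results.append(success)
        self.results = self.results[-self.window:]
        if len(self.results) >= self.min_sample:
            failures = self.results.count(False)
            self.open = failures / len(self.results) > self.threshold

    def allow(self) -> bool:
        return not self.open            # when open, skip costly retries
```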

Deep Dive 5: Data retention

Finally, after reliability logic is set, we optimize retention and storage so operations stay sustainable long-term.

Bad: keep all delivery attempt logs forever.

  • Why bad: unbounded cost growth.
  • Example:

- 320GB/day of raw attempts (single copy) becomes ~9.6TB in 30 days.
- with replication, storage and query costs multiply.

  • How: large history tables hurt index performance and slow operational queries.

Good: TTL detailed attempt rows after short retention window (e.g., 14-30 days).

  • Why good: keeps hot operational dataset small.

Great: retain compact aggregates/final state for long-term reporting.

  • Example: keep per-day channel success metrics for 1 year, raw attempts for 30 days.
  • Tradeoff: deep forensic analysis beyond TTL needs archived cold storage.

---

7) Common mistakes

  1. Doing synchronous provider calls in ingest API.
  2. Missing idempotency.
  3. One queue for all channels.
  4. No DLQ/replay workflow.
  5. Assuming push means guaranteed delivery.

---

8) Interviewer signals

  • Did you separate control plane (policy/templates) from data plane (delivery)?
  • Did you design retries and failure isolation per channel?
  • Did you use provider semantics correctly (best effort, limits)?

---

9) References

  • Firebase priorities: https://firebase.google.com/docs/cloud-messaging/android/message-priority
  • APNs docs: https://developer.apple.com/documentation/usernotifications/setting_up_a_remote_notification_server/sending_notification_requests_to_apns
  • SNS/SQS decision guide: https://docs.aws.amazon.com/decision-guides/latest/sns-or-sqs-or-eventbridge/sns-or-sqs-or-eventbridge.html
  • RFC 6585 (429): https://www.rfc-editor.org/rfc/rfc6585

Key Takeaways

  1. Send notifications via push/email/SMS.
  2. Respect user preferences (channel opt-in, quiet hours).
  3. Retry failed deliveries and track final status.
  4. High ingest throughput (event-driven fanout).
  5. At-least-once delivery to channel workers.
