Notifications • Intermediate • 39 min read

Design a Notification System

Asked at: Meta, Google, Amazon, Uber, Airbnb
Tech: Partitioning, Message queue, Rate limiting, Circuit breaker, Redis, Kafka

The Invisible Infrastructure Behind Every Alert

You receive dozens of notifications every day: a shipping update, a login verification code, a friend's message, a price drop alert. Each one feels trivial. But behind that "Your order has shipped" push notification is a system that must decide which channel to use (push, email, or SMS), respect your quiet hours, retry if the first attempt fails, avoid sending duplicates, handle provider outages without losing the notification, and do all of this at a rate of 74,000 delivery attempts per second during peak traffic.

Notification systems are deceptively complex because they sit at the intersection of user preferences, multiple unreliable external providers, retry logic, and strict delivery guarantees. The push notification channel alone, the one most people assume "just works", is best-effort by design. FCM and APNs explicitly do not guarantee delivery. If your system treats push as a reliable transport, you will lose notifications silently.

This post walks through designing a multi-channel notification system from first principles. We will build progressively: accept and queue notifications first, then route them through user preferences and templates, add per-channel delivery with retry logic, and finally layer on observability and reconciliation. Every decision will be justified by capacity math and grounded in how providers actually behave.

Let us begin.

1. Requirements: What This System Must Do

A notification system serves two masters: the internal services that trigger notifications (order service, auth service, marketing platform) and the end users who receive them. The system must be fast enough that producers are not blocked, reliable enough that no notification is silently lost, and respectful enough that users are not spammed at 3 AM.
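
The "not spammed at 3 AM" constraint reduces to a local-time window check that runs before any delivery attempt. Below is a minimal sketch, assuming a quiet window of 22:00 to 08:00 that wraps past midnight. The helper name `in_quiet_hours` and the fixed UTC offsets are illustrative assumptions, not part of the original design.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo,
                   start: time = time(22, 0), end: time = time(8, 0)) -> bool:
    """Return True if now_utc falls inside the user's local quiet window."""
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        # Window contained in a single day, e.g. 13:00 to 15:00.
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00 to 08:00.
    return local >= start or local < end


# Fixed offsets keep the sketch self-contained (assumption); a real system
# would more likely pass zoneinfo.ZoneInfo("America/New_York") to handle DST.
utc_minus_5 = timezone(timedelta(hours=-5))
utc_plus_9 = timezone(timedelta(hours=9))
now = datetime(2024, 1, 15, 4, 0, tzinfo=timezone.utc)

print(in_quiet_hours(now, utc_minus_5))  # 23:00 local, inside the window
print(in_quiet_hours(now, utc_plus_9))   # 13:00 local, outside the window
```

The same instant is quiet for one user and not for another, which is why the check has to use the recipient's timezone rather than the server's.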

Functional Requirements

  1. Send notifications via push, email, and SMS. Each channel has different providers (FCM/APNs for push, SMTP/SendGrid for email, Twilio/SNS for SMS), different latency profiles, and different failure modes. The system must abstract these differences behind a unified interface.
  2. Respect user preferences. Users can opt out of specific channels, set quiet hours (no notifications between 10 PM and 8 AM in their timezone), and configure language/locale. The system must check these before attempting delivery, not after.
  3. Retry failed deliveries and track final status. Provider failures are common: transient 5xx errors, 429 rate-limit responses, and network timeouts. The system must retry intelligently and provide an auditable trail of every delivery attempt.
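
One way to read "retry intelligently" from requirement 3 is to retry only failures that are likely transient and to space the attempts out. Below is a hedged sketch, assuming exponential backoff with full jitter (a common choice, not one the requirements name) and treating `None` as a network timeout. The function names, the attempt limit of 5, and the 60-second cap are illustrative assumptions.

```python
import random
import time
from typing import Callable, List, Optional, Tuple


def should_retry(status: Optional[int], attempt: int, max_attempts: int = 5) -> bool:
    """Retry timeouts (None), 429, and 5xx; attempt is 1-based."""
    if attempt >= max_attempts:
        return False
    return status is None or status == 429 or 500 <= status <= 599


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**(attempt - 1))]."""
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))


def deliver(send: Callable[[], Optional[int]], max_attempts: int = 5,
            sleep: Callable[[float], None] = time.sleep
            ) -> Tuple[str, List[Optional[int]]]:
    """Call send() until success or a non-retryable failure.

    Returns the final status plus the per-attempt results, which is the raw
    material for an audit trail.
    """
    attempts: List[Optional[int]] = []
    for attempt in range(1, max_attempts + 1):
        status = send()
        attempts.append(status)
        if status is not None and 200 <= status < 300:
            return "delivered", attempts
        if not should_retry(status, attempt, max_attempts):
            return "failed", attempts
        sleep(backoff_delay(attempt))
    return "failed", attempts


# Simulated provider: a 503, then a timeout, then success.
responses = iter([503, None, 200])
print(deliver(lambda: next(responses), sleep=lambda s: None))
```

A 4xx response other than 429 fails immediately in this sketch, since resending an invalid request would only burn provider quota.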

Non-Functional Requirements

  1. Ingest latency: p95 under 100 ms. Producers (internal services) call the notification API and need a fast acknowledgment. They cannot wait for the notification to actually be delivered, which might take seconds (push) or minutes (email with retries).
  2. At-least-once delivery to channel workers. No notification trigger should be silently dropped between ingest and delivery attempt.
  3. Channel failure isolation. An SMS provider outage must not delay push notifications or emails. Each channel operates independently.
  4. Auditable delivery state. For any notification, we can answer: was it delivered? Which channel? How many attempts? What was the final status?
  5. Basic abuse protection. Rate limits on both the producer side (prevent runaway services from flooding the system) and the provider side (stay within FCM/SMTP/Twilio quotas).
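
The rate limits in requirement 5 can be enforced in several ways. The sketch below uses a token bucket, which is an assumption on our part since the requirements do not name an algorithm. It is in-process and single-node only, and the class name, parameters, and injectable clock are hypothetical. A shared limiter across many workers would need a central store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per producer (or per provider), keyed by a hypothetical ID.
buckets = {"order-service": TokenBucket(rate=100.0, capacity=200.0)}
print(buckets["order-service"].allow())
```

The same structure works on both sides: a bucket per producer protects the system from a runaway caller, and a bucket per provider keeps outbound traffic within that provider's quota.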

Scope Control

In scope: notification orchestration, multi-channel delivery, retry logic, user preferences, delivery tracking.

Out of scope: template editor UI, campaign builder, A/B testing of notification content.

Now that we know what we are building, the next question is: how much load does this system face? The answer determines whether we can process synchronously or must decouple with queues, and how large our worker fleet needs to be.
