Design an Ad Click and Impression Tracking System
When Counting Is a Billion-Dollar Problem
Every time you see an ad on a website, an impression event fires. Every time you click one, a click event fires. These events look trivial β tiny JSON payloads, maybe 320 bytes each. But multiply that by 130 billion events per day, and you are processing 41 terabytes of raw data daily, making real-time deduplication decisions at 5 million events per second during peak, joining clicks to their originating impressions across a 30-minute attribution window, filtering out bot traffic that could inflate an advertiser's bill by millions, and producing two different versions of the truth β a fast provisional count for real-time dashboards and a reconciled final count for billing.
Get the counting wrong by 1%, and at $10 billion in annual ad spend flowing through your platform, that is a $100 million discrepancy. Advertisers dispute. Publishers lose trust. Auditors investigate.
This post walks through designing an ad event tracking system from first principles. We will build progressively: durable event ingestion first, then deduplication, attribution joins, real-time analytics, fraud filtering, and billing reconciliation. Every decision will be driven by the numbers β because in ad tech, the numbers are not just engineering constraints, they are the product.
Let us begin.
1. Requirements β What Gets Counted and How
Ad tracking systems serve two masters with fundamentally different needs: product teams want real-time dashboards showing campaign performance right now, and finance teams want billing-grade numbers that can survive an audit. These needs conflict β speed versus correctness β and the architecture must serve both.
Functional Requirements
- Record ad impressions and clicks from web, mobile, and server-to-server channels. Publishers send batches of events; the system must ingest them durably at extreme throughput.
- Deduplicate retries and preserve attribution. Mobile SDKs retry on timeout. Server-to-server integrations replay on failure. Every duplicate must be suppressed. Every click must link to a prior eligible impression β without this link, the click is unbillable.
- Provide near-real-time metrics β impressions, clicks, CTR β sliced by campaign, ad, geo, and device. Dashboard freshness target: under 10 seconds.
- Produce billing-grade daily aggregates with invalid-traffic filtering and reconciliation. These numbers go on invoices. They must be versioned, auditable, and reproducible.
- Support reprocessing and backfill when fraud models are updated or attribution rules change. If a new model reclassifies 2% of traffic as invalid, the system must recompute affected billing periods without manual intervention.
Non-Functional Requirements
- Sustained ingest: 1.5 million events per second. Peak: 5 million events/s during major events (Super Bowl, Black Friday, election nights).
- Ingestion availability: 99.99%. Every lost event is lost revenue for a publisher or overbilling for an advertiser.
- Data loss under 0.01% for accepted events. Late events tolerated up to 24 hours.
- Dashboard freshness under 10 seconds. Billing closure within T+1 (numbers finalized by end of next business day).
- 90-day hot retention for raw events; archived to cold storage beyond that.
Scope Control
In scope: ingest API, streaming pipeline, deduplication, attribution join, real-time OLAP serving, billing aggregates, invalid-traffic scoring.
Out of scope: ad auction ranking logic, creative serving CDN, advertiser UI.
Now we need to understand the scale β because at 130 billion events per day, every architectural choice is load-bearing.
Login to continue reading
You reached the preview limit. Sign in to unlock the remaining sections.
2. Capacity Estimation β The Numbers That Build the Pipeline
Throughput
- Average event rate: 1.5 million events/s
- Peak event rate: 5 million events/s (short bursts during major events)
- Event mix: 85% impressions, 15% clicks
- Daily events: 1.5M Γ 86,400 = ~130 billion events/day
Event Size
Each event carries: eventId (16 bytes), impressionId or clickId (16 bytes), adId (16 bytes), campaignId (16 bytes), publisherId (16 bytes), userKey hash (32 bytes), device enum + attributes (24 bytes), geo (country/region/city codes β 12 bytes), ts (8 bytes), requestId (16 bytes), HMAC signature (32 bytes), plus serialization overhead (JSON field names, delimiters β ~136 bytes). Total: ~320 bytes per event.
Storage
- Raw daily volume: 130B events Γ 320 bytes = ~41.5 TB/day
- With 3Γ replication (Kafka + storage): ~124.5 TB/day of durable write pressure
- 90-day hot retention: ~3.7 PB raw (single copy)
Dedupe State
The deduplication window is 24 hours (late events can arrive up to 24h after the original). Each dedupe record stores the idempotency key and minimal metadata β approximately 40 bytes effective (key hash + first-seen timestamp + source + partition info + overhead).
- 24h dedupe state: 130B events Γ 40 bytes = ~5.2 TB logical
- This is too large for pure RAM. It must be served from a sharded SSD-backed store with a probabilistic front filter (Bloom filter) to absorb the majority of lookups in memory.
What These Numbers Force
5M peak events/s means the ingest tier must be stateless and horizontally scaled, writing to a durable append-only log (not directly to a database). Any synchronous processing in the ingest path β deduplication, attribution, aggregation β would create back-pressure that drops events during spikes.
41.5 TB/day means raw event storage must be columnar and partitioned for efficient analytical queries. Row-oriented databases would buckle under the scan volume.
5.2 TB dedupe state means deduplication cannot be pure in-memory. A tiered approach β Bloom filter in memory for fast negative answers, SSD-backed store for exact checks β is mandatory.
Two output paths (10-second dashboard freshness vs T+1 billing closure) mean we need separate aggregation pipelines with different correctness guarantees.
With these numbers established, let us define the data that flows through the system.
3. Core Entities (v1) β What Gets Recorded and Linked
Ad tracking entities fall into three categories: raw events (immutable records of what happened), processing state (deduplication, attribution), and output aggregates (metrics, billing).
ImpressionEvent and ClickEvent
ImpressionEvent(eventId, impressionId, adId, campaignId, publisherId, userKey,
device, geo, ts, requestId, signature)
ClickEvent(eventId, clickId, impressionId, adId, campaignId, publisherId, userKey,
device, geo, ts, requestId, signature)These are the raw, immutable event records. Once written to the durable log, they are never modified β only read by downstream processors.
impressionId on ClickEvent is the attribution anchor. It links every click to the specific impression that the user saw before clicking. Without this link, a click has no proof that an ad was actually displayed β which means it is unbillable. Advertisers pay for clicks on ads that were shown. An orphan click (no matching impression) could be a bot fabricating clicks, a user clicking a cached page where the impression tracking failed, or a timing artifact. The attribution join (Step C) determines whether the link is valid.
signature is an HMAC computed by the publisher's SDK using a shared secret key: HMAC-SHA256(secretKey, eventId + ts + adId + publisherId). The ingest API validates this signature before accepting the event. This prevents event spoofing β a malicious actor cannot fabricate impression events to inflate a publisher's revenue without possessing the publisher's secret key. Signature validation happens at the edge (ingest tier), before the event enters the durable log, rejecting malformed or spoofed events at the earliest possible point.
eventId and requestId together form the idempotency key. eventId is generated by the SDK per event. requestId is the batch delivery ID. The combined key hash(publisherId + eventId + eventType) ensures that retries of the same event from the same publisher are deduplicated, while identical events from different publishers (different publisherId) are correctly treated as separate occurrences.
DedupeRecord
DedupeRecord(idempotencyKey, firstSeenTs, source, partitionKey)One record per accepted event, stored in the dedupe store with a 24-hour TTL. When a new event arrives, the system checks: does this idempotency key already exist? If yes, the event is a duplicate β return 409 and discard. If no, accept the event and create the record.
AttributionLink
AttributionLink(clickId, impressionId, attributionModel, lookbackWindowSec, isValid, reasonCode)Created by the stream processor when a click is successfully joined to an impression within the attribution window. isValid indicates whether the join meets all criteria (impression exists, within time window, same user, not already attributed). reasonCode explains rejection: IMPRESSION_NOT_FOUND, OUTSIDE_WINDOW, USER_MISMATCH, DUPLICATE_ATTRIBUTION.
RealtimeMetricBucket
RealtimeMetricBucket(bucketTs, campaignId, adId, geo, device, impressions, clicks, invalidEvents)Pre-aggregated metrics at 10-second bucket granularity. These power the real-time dashboard. They are provisional β they reflect the best-known state as of the stream processing watermark, but late events and fraud filtering may adjust them later.
BillingDailyAggregate
BillingDailyAggregate(day, advertiserId, campaignId, billableImpressions, billableClicks,
spendMicros, adjustments, version)The billing source of truth. Produced by the nightly reconciliation pipeline. version increments on every reprocessing run, so every published number is traceable to a specific pipeline execution. spendMicros stores monetary amounts in microcurrency (millionths of a dollar) to avoid floating-point precision issues β critical when aggregating billions of events into invoices.
Connecting Entities to Requirements
- FR1 (record events) β
ImpressionEvent,ClickEventβ raw durable records. - FR2 (dedupe + attribution) β
DedupeRecordsuppresses duplicates;AttributionLinkvalidates click-to-impression joins. - FR3 (real-time metrics) β
RealtimeMetricBucketserves dashboards at 10-second granularity. - FR4 (billing) β
BillingDailyAggregatewith invalid-traffic filtering and versioning. - FR5 (reprocessing) β immutable raw events + versioned output snapshots enable reproducible reruns.
With entities defined, let us design the API.
4. API Design β Batch Ingest and Dual-Path Queries
Ad event tracking APIs serve three distinct consumers: publishers submitting events, product teams querying dashboards, and finance teams pulling billing data. The ingest path must be blazing fast (accept and acknowledge batches at 5M events/s peak); the query paths serve different correctness contracts.
Event Ingest
POST /v1/events/impressions:batch and POST /v1/events/clicks:batch
Publishers send batches of events (typically 100-1,000 events per request) to reduce per-event HTTP overhead. Each event includes eventId, ts, adId, campaignId, publisherId, requestId, and signature.
The response is 202 Accepted β the events are schema-validated, signature-verified, and committed to the durable log, but not yet deduplicated, attributed, or aggregated. Those steps happen asynchronously in the stream pipeline. Why 202? At 5M events/s peak, blocking on deduplication or attribution would add latency that causes publisher SDKs to timeout and retry β which creates more duplicates. Accept fast, process later.
Error responses:
- 400 β malformed payload, missing required fields.
- 401/403 β invalid publisher credentials or signature verification failure.
- 409 β duplicate
eventIddetected within the active dedupe window (fast-path dedupe at the ingest tier catches obvious retries). - 422 β event timestamp outside the accepted clock skew window (Β±5 minutes). Events with wildly inaccurate timestamps are rejected rather than allowed to pollute attribution joins.
- 429 β publisher rate limit exceeded. Each publisher has a per-second ingest quota to prevent one publisher's burst from overwhelming shared infrastructure.
Dashboard Query
GET /v1/metrics/realtime?campaignId=X&from=T1&to=T2&dimensions=geo,device
Returns provisional real-time metrics from the streaming aggregation layer. These numbers are fresh (under 10 seconds old) but subject to change β late events, fraud reclassification, and reconciliation will adjust them. The response includes a dataFreshness timestamp so consumers know exactly how current the data is.
Billing Query
GET /v1/billing/daily?advertiserId=X&day=2026-03-15
Returns the reconciled billing aggregate for a specific day. These numbers are final (after T+1 closure), versioned, and auditable. The response includes snapshotVersion and generatedTs so any discrepancy can be traced to a specific pipeline run.
Reprocessing
POST /v1/reprocess?from=2026-03-10&to=2026-03-15&reason=fraud_model_update&modelVersion=v2.3
Triggers a backfill of the specified date range with the new model version. The system reads immutable raw events from the archive, reprocesses them through the updated pipeline, and produces new versioned output snapshots. The old snapshots are retained for audit comparison.
With the API defined, we can build the architecture.
5. High-Level Design β Building the Tracking Pipeline Step by Step
Step A: Durable Ingest at Scale
The first and most critical decision: where do we draw the durability boundary? When a publisher sends a batch of events, at what point do we acknowledge them?
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββββββ
β Publisher ββββββΆβ Ingest API ββββββΆβ Kafka β
β SDK/Server ββ202ββ (stateless, β β (durable log, β
β β β validate, β β partitioned by β
ββββββββββββββββ β sign check)β β campaignId) β
ββββββββββββββββ ββββββββββββββββββββββββ
~200 instances
at peak Tech decision: Kafka as the durable ingestion spine.
We acknowledge the event after it is committed to Kafka β not after it is deduplicated, not after it is aggregated, not after it hits the OLAP store. Kafka provides the durability guarantee: once committed, the event is replicated across brokers and will not be lost. Everything downstream reads from Kafka.
Why Kafka and not Kinesis? Kafka gives us control over partition count, retention, and consumer group semantics. At 5M events/s peak with 320-byte events, we need significant partition parallelism. Kafka allows us to size partitions precisely: at ~50,000 events/s per partition as a comfortable target, we need 100 partitions minimum at peak. We choose 128 partitions for headroom. Kinesis's shard model works but has per-shard throughput limits (1 MB/s write, 2 MB/s read) that would require ~1,600 shards at peak β more operational overhead for this scale.
Why not writing directly to ClickHouse or a database? At 5M events/s peak, direct writes to an analytical database create back-pressure during compaction, schema changes, or query load spikes. The database becomes an ingest bottleneck. Kafka decouples ingest throughput from processing speed β if a downstream consumer falls behind, events buffer in Kafka instead of being dropped.
Partition key: campaignId. All events for the same campaign land in the same partition, enabling the stream processor to join clicks to impressions (Step C) without cross-partition lookups. This means a very hot campaign (Super Bowl advertiser) might dominate one partition β we handle this with partition splitting for extreme cases.
Ingest tier sizing: at 5M events/s peak, if each ingest node handles ~25,000 events/s (accounting for signature validation, schema parsing, and Kafka produce latency), we need ~200 ingest instances at peak. These are stateless β any instance can handle any publisher's events. A load balancer distributes traffic, and auto-scaling adjusts the fleet based on ingest rate.
Signature validation at the edge: every event's HMAC signature is verified at the ingest tier before the event enters Kafka. This rejects spoofed events at the cheapest possible point β before they consume Kafka bandwidth, stream processing CPU, or storage. At 320 bytes per event and 5M events/s, allowing spoofed events through would waste 1.6 GB/s of pipeline capacity on garbage.
What this step gives us: events are durably captured at up to 5M/s with 99.99% availability. Publishers get fast 202 acknowledgments. Everything downstream reads from Kafka at its own pace. But we have not yet handled duplicates. That is next.
Step B: Deduplication at Scale
Publisher SDKs retry on network timeout. Server-to-server integrations replay on failure. At 5M events/s peak with a typical retry rate of 5-10% during incidents, that is 250,000-500,000 duplicate events per second during a spike. Without deduplication, those duplicates inflate impression counts, corrupt CTR calculations, and overbill advertisers.
ββββββββββββββββββββ ββββββββββββββββββββββ ββββββββββββββββ
β Kafka ββββββΆβ Dedupe Stream ββββββΆβ Kafka β
β (raw events) β β Processor β β (deduped β
β β β β β events) β
ββββββββββββββββββββ β ββββββββββββββββ β ββββββββββββββββ
β β Bloom Filter β β
β β (in-memory) β β
β ββββββββ¬ββββββββ β
β β uncertain β
β βΌ β
β ββββββββββββββββ β
β β RocksDB β β
β β (SSD-backed β β
β β exact store)β β
β ββββββββββββββββ β
ββββββββββββββββββββββTech decision: tiered deduplication β Bloom filter + RocksDB.
The idempotency key is hash(publisherId + eventId + eventType). For each incoming event, the dedupe processor first checks a Bloom filter (in-memory, ~10 GB across all instances for a 0.1% false-positive rate at 130B daily keys). If the Bloom filter says "definitely not seen" (the common case for non-duplicate events), the event passes through immediately β no disk I/O. If the Bloom filter says "maybe seen" (either a true duplicate or a false positive), the processor checks the exact store.
Tech decision: RocksDB for the exact dedupe store.
RocksDB is an embedded, SSD-optimized key-value store that handles the 5.2 TB logical dedupe state with efficient compaction and TTL-based expiration. Each stream processor instance maintains its own RocksDB store, partitioned by the same Kafka partition key. When a key's 24-hour TTL expires, RocksDB's compaction naturally removes it.
Why not Redis? Redis is excellent for in-memory lookups, but 5.2 TB of dedupe state far exceeds practical Redis memory. Even with a cluster, this would require ~100 Redis nodes with 64 GB each β expensive for data that is fundamentally a TTL-expiring index. RocksDB on local SSDs is an order of magnitude cheaper for this access pattern: write-once, read-occasionally, expire-after-24h.
Why not DynamoDB? DynamoDB could handle the throughput, but at 130 billion writes per day (one per event for dedupe record creation) plus reads for duplicate checks, the cost at DynamoDB's on-demand pricing would be substantial. RocksDB on local SSD is effectively free after the instance cost.
Numeric impact of tiered dedup: at 5M events/s peak, if the Bloom filter resolves 92% of checks in memory (non-duplicates pass through, obvious non-matches skip the exact store), only 400,000 events/s hit RocksDB. This is comfortably within SSD IOPS limits for a sharded fleet of stream processor instances.
What this step gives us: duplicate-free event stream in Kafka. Every event that passes the dedupe stage has been verified as unique within the 24h window. But we have not yet linked clicks to impressions. That is next.
Step C: Attribution Join β Connecting Clicks to Impressions
When a user sees an ad (impression) and clicks it 2 minutes later (click), the click event carries the impressionId of the ad they saw. The stream processor must verify: did this impression actually happen? Was it within the attribution window (typically 30 minutes)? Is the user the same? Only validated joins produce billable clicks.
ββββββββββββββββββββ βββββββββββββββββββββββββββββ ββββββββββββββββ
β Kafka ββββββΆβ Flink Stream Processor ββββββΆβ Kafka β
β (deduped β β β β (attributed β
β events) β β Impression state: β β events) β
β β β - Buffer recent β ββββββββββββββββ
ββββββββββββββββββββ β impressions per user β
β - 30min window β
β - ~50GB state β
β β
β Click arrives: β
β - Lookup impressionId β
β - Validate window/user β
β - Emit AttributionLink β
βββββββββββββββββββββββββββββTech decision: Apache Flink for stream processing.
Attribution requires stateful stream processing: the processor must buffer recent impressions (keyed by impressionId) and join incoming clicks against them. Flink is purpose-built for this β it provides event-time processing with watermarks (critical for handling late events), exactly-once state semantics (critical for billing accuracy), and efficient state backends (RocksDB for large state).
Why not Spark Structured Streaming? Spark's micro-batch model introduces latency (typically 1-10 seconds per batch). For the real-time dashboard path (10-second freshness target), Flink's continuous processing model gives us lower, more predictable latency. For the billing path, either would work, but using Flink for both paths simplifies the architecture.
Why not a custom stream processor? The attribution join requires: event-time windowing, watermark tracking for late events, exactly-once state checkpointing, and efficient key-based state access at 5M events/s. Building this correctly from scratch is a multi-year project. Flink provides it out of the box.
How the join works: Flink maintains a keyed state store (backed by RocksDB) mapping impressionId β (userId, timestamp, adId, campaignId) for all impressions within the 30-minute attribution window. When a click arrives with an impressionId, Flink looks up the impression state:
- Found, within window, user matches: valid attribution. Emit an
AttributionLinkwithisValid = true. The click is billable. - Found, outside window: expired attribution. Emit with
isValid = false, reasonCode = OUTSIDE_WINDOW. The click is not billable. - Not found: orphan click β the impression either never happened, arrived after the click (out-of-order), or has already expired from state. Emit with
isValid = false, reasonCode = IMPRESSION_NOT_FOUND.
State size: at 1.275M impressions/s (85% of 1.5M), a 30-minute window buffers ~2.3 billion impression records. At ~22 bytes per record (impressionId + userId + timestamp), that is ~50 GB of state β manageable with Flink's RocksDB state backend on SSD.
Late event handling: mobile events frequently arrive out of order β a click might arrive before its impression (the impression was delayed by the SDK's batch upload cycle). Flink's watermark mechanism handles this: events arriving within the allowed lateness window (configurable, e.g., 5 minutes beyond the watermark) are processed normally and update state retroactively. Events arriving after the allowed lateness window are emitted to a side output for the nightly reconciliation pipeline to process.
What this step gives us: a validated event stream where every click is either attributed to an impression (billable) or marked as orphaned/invalid (not billable). Both the real-time aggregation and billing reconciliation read from this attributed stream. But we still need to turn these events into queryable metrics and handle fraud. That is the final step.
Step D: Dual-Path Serving β Dashboard Speed vs Billing Truth
The attributed event stream feeds two separate output paths with different correctness contracts.
βββββββββββββββββββββββββ
βββββΆβ Real-Time Aggregation ββββΆ ClickHouse (OLAP)
β β (Flink, 10s buckets) β β
ββββββββββββββββ β βββββββββββββββββββββββββ βΌ
β Kafka ββββββ€ GET /v1/metrics/realtime
β (attributed β β
β events) β β βββββββββββββββββββββββββ
βββββΆβ Nightly ReconciliationββββΆ Postgres (billing)
β (Flink batch, β β
β fraud model, β βΌ
β versioned snapshots) β GET /v1/billing/daily
βββββββββββββββββββββββββPath 1: Real-time dashboard (ClickHouse).
A Flink job continuously aggregates attributed events into RealtimeMetricBucket rows: impressions, clicks, and invalid counts bucketed by 10-second intervals, sliced by campaign, ad, geo, and device. These buckets are written to ClickHouse β a columnar OLAP database designed for fast analytical queries on pre-aggregated data.
Why ClickHouse? The dashboard query pattern is: "show me impressions and clicks for campaign X, grouped by geo, for the last 6 hours." This is a scan over time-partitioned, dimension-filtered aggregate rows β exactly what ClickHouse is optimized for. At 130B events/day pre-aggregated into 10-second buckets across ~50,000 dimension combinations, the OLAP table grows at a manageable rate (~432M rows/day).
Why not Druid? Druid is a strong alternative for real-time OLAP ingestion. ClickHouse's advantage here is simpler operations (single binary, no ZooKeeper dependency in recent versions) and faster ad-hoc query performance for the dimension combinations our dashboards require. Either would work; we choose ClickHouse for operational simplicity.
These numbers are provisional. The dashboard prominently labels data as "provisional" and includes a dataFreshness timestamp. Product teams understand that these numbers may shift by 1-2% after nightly reconciliation.
Path 2: Billing reconciliation (Postgres).
A nightly batch Flink job reads the full day's attributed events from Kafka (retained for 90 days), applies the latest fraud/invalid-traffic model, recomputes attribution for any late-arriving events that the stream processor missed, and produces BillingDailyAggregate rows in Postgres. Each run creates a new snapshotVersion β the old version is retained for audit.
Why Postgres for billing? Billing data is relational: aggregates reference advertisers, campaigns, and adjustment records. The query volume is low (finance teams pulling daily reports, not 10,000 QPS dashboards). The data size is small compared to raw events β one row per campaign per day per dimension. ACID transactions ensure that a billing snapshot is either fully written or not at all.
Invalid traffic filtering runs as part of the nightly pipeline. A scoring model evaluates each event based on: click cadence per user (impossible speeds indicate bots), IP reputation (known bot farm ranges), device fingerprint anomalies (headless browsers, emulators), and click-through patterns (abnormally high CTR from a single publisher). Events scoring above the invalid-traffic threshold are excluded from billable aggregates and recorded in InvalidTrafficDecision rows with the model version and reason codes.
The T+1 billing closure process: at 2 AM UTC (after the day's events are fully ingested plus the 24h late-event window), the reconciliation pipeline runs. It takes approximately 2-4 hours to process a full day's events. By 6 AM UTC, the billing snapshot is ready. A data quality check compares the provisional dashboard numbers to the reconciled billing numbers: if the delta exceeds 2% for any campaign, an alert fires for the data engineering team. Finance reviews and approves the snapshot before it becomes the invoice source.
The Life of an Ad Event β From Impression to Invoice
Let us trace exactly what happens when a user sees an ad and clicks it:
- The user loads a webpage. The ad server selects ad #4521 from campaign #789 (Acme Shoes) and serves it. The publisher's JavaScript SDK fires an impression event:
{eventId: "imp-001", impressionId: "imp-001", adId: 4521, campaignId: 789, publisherId: "pub-55", userKey: "hash-abc", ts: 1710412345, signature: "hmac..."}.
- Ingest API receives the event. Validates the schema, verifies the HMAC signature against publisher "pub-55"'s secret key, checks the timestamp is within Β±5 minutes of server time. Produces the event to Kafka partition #42 (determined by
campaignId: 789). Returns 202 Accepted. Latency: ~5 ms.
- Dedupe processor reads the event. Computes idempotency key:
hash("pub-55" + "imp-001" + "impression"). Checks Bloom filter β not seen. Checks RocksDB β not found. Accepts the event, writes the dedupe record. Produces the clean event to the deduped Kafka topic.
- Attribution processor buffers the impression. Flink stores
impressionId: "imp-001" β {userId: "hash-abc", ts: 1710412345, adId: 4521}in keyed state. The impression is now available for click attribution for the next 30 minutes.
- Two minutes later, the user clicks the ad. The SDK fires a click event:
{eventId: "click-001", clickId: "click-001", impressionId: "imp-001", adId: 4521, ...}.
- Click passes through ingest and dedupe (same path as the impression).
- Attribution processor joins the click. Flink looks up
impressionId: "imp-001"β found, timestamp 2 minutes ago (within 30-minute window), sameuserKey. Valid attribution. EmitsAttributionLink(clickId: "click-001", impressionId: "imp-001", isValid: true)to the attributed events topic.
- Real-time aggregation updates the dashboard. The Flink aggregation job increments the
clickscounter in the current 10-secondRealtimeMetricBucketfor campaign #789, ad #4521, US geo, mobile device. Within 10 seconds, the Acme Shoes campaign manager sees the click on their dashboard.
- Nightly reconciliation. At 2 AM UTC, the batch pipeline reprocesses all of today's events. The fraud model scores the click: normal user behavior, legitimate device, reasonable click timing. Score: 12 (well below the invalid-traffic threshold of 60). The click is included in the billing aggregate.
BillingDailyAggregatefor campaign #789 incrementsbillableClicksby 1 andspendMicrosby $0.35 (350,000 microdollars β the click's CPC bid).
- Finance approves the snapshot. The quality check shows the provisional-final delta for campaign #789 is 0.8% β within the 2% threshold. The snapshot is approved and becomes the invoice source.
Now that the full pipeline is built, the data model needs to evolve.
6. Core Entities (v2) β How the Pipeline Changed Our Data
EventEnvelope(eventGuid, tenantId, eventType, logicalTs, ingestTs, schemaVersion, traceId)
ImpressionEvent / ClickEvent (unchanged from v1, wrapped in EventEnvelope)
AttributionState(impressionId, firstSeenTs, expiryTs, joinedClickCount, lastWatermarkTs, status)
ProvisionalMetricBucket(bucketTs, dimsHash, impressions, clicks, invalid, latenessAdjustments)
FinalMetricSnapshot(day, dimsHash, billableImpressions, billableClicks, spendMicros, snapshotVersion, generatedTs)
FraudModelRun(runId, modelVersion, windowStart, windowEnd, threshold, precision, recall)
InvalidTrafficDecision(eventId, score, modelVersion, decision, reason, decidedTs)
ReprocessJob(jobId, rangeStart, rangeEnd, reason, inputSnapshot, outputSnapshot, status, startedAt, endedAt)
DataQualityCheckpoint(checkpointId, windowStart, windowEnd, expectedEvents, acceptedEvents, duplicateRate, lateRate, driftFlags)EventEnvelope is new. Wrapping raw events in a versioned envelope allows the pipeline to handle schema evolution gracefully. When the event format changes (new fields, renamed fields), the schemaVersion tells each processor which parser to use. Without versioning, a schema change requires reprocessing the entire archive.
AttributionState replaces the static view. The stream processor needs to track join state actively: how many clicks have been attributed to an impression, what the current watermark is, and whether the impression is still in the active window. This operational state drives the join logic.
ProvisionalMetricBucket gained latenessAdjustments. Late events that arrive after the initial aggregation must be accounted for β the bucket's counts are adjusted retroactively, and the adjustment is tracked explicitly for debugging.
FraudModelRun is new. Each nightly reconciliation run records which fraud model version was used, what threshold was applied, and the model's precision/recall metrics. This is critical for audit: when an advertiser disputes a billing adjustment, the system can trace exactly which model version made each decision.
DataQualityCheckpoint is new. The pipeline tracks expected vs actual event counts, duplicate rates, and late arrival rates per time window. When drift exceeds thresholds (e.g., duplicate rate suddenly spikes to 15%), automated alerts fire before bad data propagates to billing.
7. Deep Dives β Where Counting Breaks
Deep Dive 1: Duplicate Suppression Under Retry Storms
It is Black Friday. Your largest publisher's mobile SDK has a bug: on every network timeout (which happens frequently under load), the SDK retries 3 times. Your normal 8% retry rate spikes to 25%. At 5M events/s peak, that is 1.25 million duplicate events per second flooding the pipeline. Without deduplication, your impression counts for the day are inflated by 25%. An advertiser with a $500,000 daily budget sees their budget exhausted 25% faster β their ads stop showing by 3 PM instead of midnight. They lose $125,000 in sales opportunity because your system counted ghost impressions.
Without dedup: all 1.25M duplicate events/s are counted as real. Reported impressions are 125% of actual. Billing is overstated by 25%. The advertiser disputes, your finance team manually investigates, and trust is damaged.
With tiered dedup: the Bloom filter catches 92% of duplicate checks in memory. Only 100,000 events/s hit RocksDB. Of those, true duplicates are rejected in ~0.5 ms (SSD read). The pipeline handles the retry storm with no visible impact on counting accuracy or ingest latency. Reported impressions are accurate to within the Bloom filter's false-positive rate (0.1%).
The critical design choice: the Bloom filter's false-positive rate. At 0.1%, one in 1,000 legitimate events might be flagged as "maybe duplicate" and sent to the exact store for verification. This adds ~0.5 ms of latency for 0.1% of events β negligible. If we tightened to 0.01%, the Bloom filter's memory would grow from ~10 GB to ~20 GB per instance. The 0.1% rate is the sweet spot: low enough to be invisible in counting accuracy, small enough to fit in memory.
Deep Dive 2: Attribution Under Event Disorder
A mobile user sees an ad at 2:00:00 PM. Their SDK batches the impression event and uploads it at 2:03:00 PM (3-minute delay β normal for mobile). At 2:01:30 PM, the user clicks the ad. The click event arrives at the server at 2:01:31 PM β before the impression event. The click references impressionId: "imp-xyz", but the impression is not yet in the attribution state.
Without late-event handling: the click is marked as an orphan (IMPRESSION_NOT_FOUND). It is excluded from billing. The advertiser paid for the impression but does not get credit for the click that the user genuinely made. Across millions of mobile events, this orphan rate can reach 3-5%, underreporting clicks and deflating CTR for mobile campaigns.
With watermark-based processing: Flink's event-time processing and watermarks handle this. The watermark represents "all events with timestamps before this point have arrived." Flink allows configurable lateness β events arriving up to 5 minutes after the watermark are still processed. The impression at 2:00:00 PM arrives at the server at 2:03:00 PM; if the watermark is currently at 2:01:00 PM (allowing 2 minutes of lateness), the impression is accepted, buffered in state, and the previously orphaned click is retroactively attributed in a side-output correction.
With nightly reconciliation: the nightly pipeline reads the full day's events in event-time order (sorted from Kafka). Late impressions that missed the stream watermark are now present. The click-to-impression join succeeds. The orphan rate drops from 3-5% (stream-only) to under 0.2% (after reconciliation). The advertiser's billing accurately reflects the clicks their ads generated.
The tradeoff: maintaining two versions of the truth (provisional and final) requires clear communication. The dashboard shows a label: "Provisional β final numbers available after T+1 reconciliation." Finance never uses dashboard numbers for invoicing.
Deep Dive 3: Invalid Traffic and the Trust Tax
A bot farm rotates through 10,000 IP addresses and generates click events at a rate of 500 clicks/second across 50 campaigns. Each click references a real impressionId (scraped from the publisher's page source). The clicks pass deduplication (each has a unique eventId). They pass attribution (the impressionId exists and is within the window). Without fraud detection, these 500 clicks/second are billed to advertisers.
At $0.50 CPC average, that is $250/second, $900,000/hour, $21.6 million/day of fraudulent billing.
The scoring model evaluates each event on multiple signals:
- Click cadence: the same
userKeyclicking 50 ads in 10 seconds is physically impossible for a human. Score: +40. - IP reputation: the IP is in a known hosting/proxy range. Score: +15.
- Device fingerprint: the user agent claims to be a mobile browser, but TLS fingerprinting shows a headless Chrome instance. Score: +20.
- CTR anomaly: the publisher's click-through rate for this campaign is 12% (industry average is 1-3%). Score: +15.
Total score: 90 (threshold: 60). Decision: INVALID. The event is excluded from billable aggregates. The InvalidTrafficDecision row records the score, model version, and reason codes.
The false-positive risk: if the threshold is too aggressive, legitimate high-engagement campaigns might be flagged. A gaming ad on a gaming site legitimately has 5-8% CTR. The model must be tuned per vertical, and advertisers must have an appeal workflow to challenge invalid-traffic decisions. Each appeal references the specific FraudModelRun, score, and reasonCodes β making the dispute resolution process data-driven rather than adversarial.
Deep Dive 4: Reprocessing Without Breaking Trust
The fraud team deploys a new model version that reclassifies 2% of previously valid traffic as invalid for a major advertiser's March campaigns. The billing impact is $1.2 million in refunds. How does the system handle this without creating silent inconsistencies?
The naive approach: rerun the pipeline, overwrite the billing aggregates. The advertiser's March invoice suddenly shows a different number than what was previously reported. Finance cannot explain the change. Audit fails because the original numbers are gone.
The safe approach: immutable raw events + versioned output snapshots + explicit publish gate.
- The
POST /v1/reprocessendpoint creates aReprocessJobrecord with the date range, reason, and new model version. - The reprocessing pipeline reads immutable raw events from the archive (unchanged since original ingest).
- It produces new
FinalMetricSnapshotrows withsnapshotVersion = N+1, alongside the originalsnapshotVersion = N. - A data quality check compares the two versions: "Campaign #789: billable clicks decreased by 2.1%, spend decreased by $1.2M."
- A human approver reviews the comparison, confirms the change is expected, and promotes version N+1 as the active billing source.
- The old version N is retained permanently for audit.
At no point are numbers silently overwritten. Every published figure is traceable to a specific pipeline run, model version, and input snapshot.
8. Tying It All Together
| Step | Problem Solved | Components Added | Number That Forced It |
|---|---|---|---|
| A | Durable ingest at extreme throughput | Ingest API fleet + Kafka (128 partitions) | 5M peak events/s; 1.6 GB/s raw bandwidth |
| B | Duplicate suppression under retries | Bloom filter + RocksDB dedupe + Flink | 5.2 TB dedupe state; 25% retry storms |
| C | Click-to-impression attribution | Flink stateful join with 30min window | 50 GB attribution state; 3-5% orphan rate without join |
| D | Dual-path serving + fraud filtering | ClickHouse (dashboards) + Postgres (billing) + Fraud model | 10s dashboard freshness vs T+1 billing; $21.6M/day fraud exposure |
| Entity | In v1? | In v2? | What Triggered the Change |
|---|---|---|---|
| ImpressionEvent/ClickEvent | Yes | Yes (+ EventEnvelope) | Schema versioning for pipeline evolution |
| AttributionLink | Yes | Evolved to AttributionState | Stream processor needs active join tracking |
| RealtimeMetricBucket | Yes | Yes (+ latenessAdjustments) | Late events require retroactive corrections |
| BillingDailyAggregate | Yes | Evolved to FinalMetricSnapshot | Versioned snapshots for audit safety |
| FraudModelRun | No | Yes | Reprocessing traceability |
| DataQualityCheckpoint | No | Yes | Automated drift detection |
9. Common Mistakes
Acknowledging before durable commit. If your ingest API returns 202 before Kafka confirms the write, a broker failure loses events. At 5M events/s, even 1 second of loss is 5 million events β potentially millions of dollars in billing discrepancy.
Treating eventId as globally unique without scoping. Two publishers might independently generate the same UUID. Without publisherId in the idempotency key, their events collide and one is silently dropped.
Mixing provisional and final metrics in one API. If the dashboard shows "1.2M impressions" and the invoice says "1.18M impressions," and both come from the same endpoint without a label, every stakeholder loses trust.
Assuming strict event ordering. Mobile events arrive out of order. If your attribution join requires clicks to arrive after impressions, you will orphan 3-5% of legitimate mobile clicks.
Running fraud filtering only in the nightly batch. This means the real-time dashboard shows 12% CTR (inflated by bots) for an entire day before the fraud model corrects it. Advertisers make budget decisions on these numbers. At minimum, apply lightweight heuristic filtering in the real-time path.
Reprocessing without versioned snapshots. If a fraud model update overwrites billing numbers without retaining the original, you cannot explain the change to an advertiser, survive an audit, or roll back a bad model.
10. What the Interviewer Is Actually Evaluating
Do you define counting semantics before building? "What counts as a billable impression?" is a business question that the architecture must answer. Served vs viewable vs billable β if you do not ask, your design is ungrounded.
Can you design for multi-million events/s with durable buffering? Kafka as the ingestion spine, partitioned by campaign, with stateless ingest nodes. The numbers must drive the partition count and fleet size.
Do you reason about deduplication correctness under disorder? The tiered dedupe model (Bloom + exact store), the scoped idempotency key, and the 24h TTL window show that you understand the problem is not just "check if we've seen this ID."
Do you separate dashboard speed from billing truth? The dual-path architecture β ClickHouse for 10-second provisional metrics, Postgres for T+1 reconciled billing β demonstrates mature data platform thinking.
Do you include fraud controls and explain their business impact? $21.6M/day in potential fraudulent billing is not a theoretical risk. The scoring model, invalid-traffic exclusion, and appeal workflow show you understand that ad tech is an adversarial environment.
Final Thought
Ad tracking is a counting problem that sounds simple and is not. Every event must be captured, deduplicated, attributed, scored for validity, aggregated into provisional metrics, reconciled into billing-grade truth, and made reproducible for audits and disputes.
We started with a Kafka topic and a batch ingest endpoint. We ended with a multi-stage streaming pipeline β deduplication with tiered Bloom+RocksDB, attribution joins with Flink's stateful processing, dual-path serving through ClickHouse and Postgres, fraud scoring with model versioning, and reprocessing with immutable lineage. Not because we drew it all on day one, but because each counting challenge β retries, late events, bots, billing disputes β demanded exactly one precise architectural response.
In ad tech, the numbers on the invoice are the product. Get them right.
Continue Learning
E-Commerce
PRODesign a Price Tracker Platform
It is Black Friday morning, 3:47 AM Eastern. A PlayStation 5 drops from $499 to $379 on Amazon β a $120 discount that will last exactly 94 seconds before inventory sells out. In your database, 847,...
Web Services
Design a URL Shortener
A URL shortener looks trivial. Accept a long URL, return a short one, redirect anyone who clicks it. You could build a working prototype in an afternoon with a hash map and a web server.
Notifications
Design a Notification System
You receive dozens of notifications every day β a shipping update, a login verification code, a friend's message, a price drop alert. Each one feels trivial. But behind that "Your order has shipped...