
Advertising · Advanced · 14 min read

Design an Ad Click + Impression Tracking System (Interview Walkthrough)

Tech: Partitioning · Replication · Consistency · Availability · CDN · Kafka


Ad measurement systems break when teams optimize ingestion throughput but ignore correctness contracts such as deduplication, attribution windows, and anti-fraud boundaries. This design prioritizes high-volume event capture with deterministic counting semantics and low-loss pipelines.

---

0) Pre-Design Research Inputs

  1. Impression and click tracking must define counting semantics up front (served vs viewable vs billable), or every downstream metric becomes disputable.
  2. Events are write-heavy and bursty; queue-first ingestion is mandatory to smooth spikes and protect storage.
  3. Exactly-once end-to-end delivery is unrealistic at large scale; the practical design uses at-least-once delivery plus idempotent writes and bounded dedupe windows.
  4. Near-real-time dashboards and billing-grade aggregates have different freshness and correctness needs and should be separated.
  5. Fraud and bot traffic can materially distort CTR and spend, so quality scoring and invalid-traffic controls belong in core design, not as an afterthought.

Design impact:

  • split online ingest from offline reconciliation,
  • treat event identity and dedupe keys as first-class,
  • support dual-path outputs: fast analytics and finance-grade rollups.

---

1) Requirements

Functional

  1. Record ad impressions and clicks from web/mobile/server channels.
  2. Deduplicate retries/replays and preserve attribution relationship (click must map to prior eligible impression).
  3. Provide near-real-time metrics (impressions, clicks, CTR) by campaign, ad, geo, and device.
  4. Produce billing-grade daily aggregates with invalid-traffic filtering and reconciliation.
  5. Support reprocessing/backfill when fraud models or attribution rules change.

Non-functional

  1. Sustained ingest 1.5M events/s, peak 5M events/s for short bursts.
  2. Ingestion availability 99.99% with no central SPOF.
  3. Data loss under 0.01% for accepted events; late event tolerance up to 24h.
  4. Dashboard freshness under 10s; billing closure within T+1.
  5. Regional data handling and retention controls (for example 90d raw hot retention + archive).

In scope:

  • ingest API, stream pipeline, dedupe/idempotency, attribution join, OLAP serving, billing aggregates, invalid-traffic hooks.

Out of scope:

  • ad auction ranking logic, creative serving CDN internals, advertiser UI details.

Now we quantify scale to decide partitioning, queue depth, and storage layout.

---

2) Capacity Estimation

Assumptions:

  • average 1.5M events/s, peak 5M events/s,
  • event mix: 85% impressions, 15% clicks,
  • average serialized event payload 320 bytes before replication,
  • replication factor 3 for durable streaming/storage path.

Throughput:

  • average ingest bandwidth: 1.5M * 320B = 480 MB/s (~3.8 Gbps) raw.
  • peak ingest bandwidth: 5M * 320B = 1.6 GB/s (~12.8 Gbps) raw.
  • with RF=3, durable write pressure at peak is ~4.8 GB/s.

Daily volume:

  • average events/day: 1.5M * 86400 = 129.6B events/day.
  • raw/day: 129.6B * 320B = 41.5 TB/day.
  • replicated durable footprint/day (RF=3): ~124.5 TB/day.

Hot-state implications:

  • 24h dedupe window needs a key store sized for accepted event IDs.
  • if dedupe key footprint is 40 bytes effective (key + metadata + overhead), daily memory/disk for dedupe index is roughly 129.6B * 40B = 5.18 TB logical, so it must be sharded and mostly SSD-backed (not pure RAM).
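The arithmetic above can be sanity-checked with a few lines; all inputs are the stated assumptions, not measurements.

```python
# Back-of-envelope capacity check mirroring the section's numbers.
AVG_EPS = 1_500_000          # sustained events/sec
PEAK_EPS = 5_000_000         # burst events/sec
EVENT_BYTES = 320            # serialized payload before replication
RF = 3                       # replication factor on the durable path
DEDUPE_KEY_BYTES = 40        # key + metadata + overhead per accepted event

avg_bw_mb_s = AVG_EPS * EVENT_BYTES / 1e6             # 480 MB/s raw
peak_bw_gb_s = PEAK_EPS * EVENT_BYTES / 1e9           # 1.6 GB/s raw
peak_durable_gb_s = peak_bw_gb_s * RF                 # ~4.8 GB/s with RF=3

events_per_day = AVG_EPS * 86_400                     # 129.6B events/day
raw_tb_day = events_per_day * EVENT_BYTES / 1e12      # ~41.5 TB/day raw
dedupe_tb = events_per_day * DEDUPE_KEY_BYTES / 1e12  # ~5.18 TB dedupe index/day
```

Running the numbers this way makes it obvious why the dedupe index must be sharded and SSD-backed: ~5 TB of hot key state per 24h window does not fit comfortably in RAM.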

What this forces architecturally:

  • partitioned append-only log as ingestion spine,
  • stateless ingest tier with deterministic idempotency key derivation,
  • separate stores for raw immutable events, real-time aggregates, and billing snapshots.

Now we define entities around these write/read patterns.

---

3) Core Entities v1

  • ImpressionEvent(eventId, impressionId, adId, campaignId, publisherId, userKey, device, geo, ts, requestId, signature)
  • ClickEvent(eventId, clickId, impressionId, adId, campaignId, publisherId, userKey, device, geo, ts, requestId, signature)
  • DedupeRecord(idempotencyKey, firstSeenTs, source, status, partitionKey)
  • AttributionLink(clickId, impressionId, attributionModel, lookbackWindowSec, isValid, reasonCode)
  • RealtimeMetricBucket(bucketTs, campaignId, adId, geo, device, impressions, clicks, invalidEvents)
  • BillingDailyAggregate(day, advertiserId, campaignId, billableImpressions, billableClicks, spendMicros, adjustments, version)
  • InvalidTrafficDecision(eventId, score, modelVersion, decision, reason, decidedTs)

Why fields matter:

  • eventId and idempotencyKey protect against retry duplication.
  • impressionId on click is the attribution anchor; without it CTR and billing are contestable.
  • version on billing aggregate supports reruns and audit-safe corrections.
  • signature helps reject malformed or spoofed client events at the edge.

Functional Requirement Traceability:

  • FR1 ingest -> ImpressionEvent, ClickEvent.
  • FR2 dedupe/attribution -> DedupeRecord, AttributionLink.
  • FR3 real-time analytics -> RealtimeMetricBucket.
  • FR4 billing + invalid traffic -> BillingDailyAggregate, InvalidTrafficDecision.
  • FR5 reprocessing -> immutable raw events + versioned aggregate outputs.

With entities defined, we can expose system contracts.

---

4) API / Interface

  • POST /v1/events/impressions:batch
  • POST /v1/events/clicks:batch
  • POST /v1/events/validate-signature
  • GET /v1/metrics/realtime?campaignId=&from=&to=&dimensions=
  • GET /v1/billing/daily?advertiserId=&day=
  • POST /v1/reprocess?from=&to=&reason=&modelVersion=

Batch payload contract:

  • required: eventId, ts, adId, campaignId, publisherId, requestId.
  • impression-only: impressionId.
  • click-only: clickId, impressionId.
  • optional context: userKey, device, geo, ipHash, uaHash.

Error semantics:

  • 202 accepted for async ingest after schema and auth pass.
  • 400 malformed payload or missing required identity fields.
  • 401/403 bad publisher credentials/signature.
  • 409 duplicate idempotency key in active dedupe window.
  • 422 event timestamp outside accepted skew window.
  • 429 per-publisher rate limit exceeded.
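A minimal sketch of this status mapping, assuming a hypothetical `classify` helper with an in-memory duplicate set and a 15-minute skew window (both illustrative, not part of the contract):

```python
REQUIRED_FIELDS = {"eventId", "ts", "adId", "campaignId", "publisherId", "requestId"}
MAX_SKEW_SEC = 15 * 60  # accepted clock-skew window (illustrative value)

def classify(event: dict, authed: bool, seen_keys: set, now: float) -> int:
    """Map an incoming batch item to the ingest status codes above."""
    if not authed:
        return 401                    # bad publisher credentials/signature
    if not REQUIRED_FIELDS.issubset(event):
        return 400                    # missing required identity fields
    if abs(now - event["ts"]) > MAX_SKEW_SEC:
        return 422                    # timestamp outside accepted skew window
    key = (event["publisherId"], event["eventId"])
    if key in seen_keys:
        return 409                    # duplicate within active dedupe window
    seen_keys.add(key)
    return 202                        # accepted for async ingest
```

Per-publisher rate limiting (429) would sit in front of this check and is omitted from the sketch.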

FR to API mapping:

  • FR1 -> impression/click batch ingest endpoints.
  • FR2 -> dedupe via eventId + requestId derived idempotency on ingest.
  • FR3 -> real-time metrics query endpoint backed by stream aggregates.
  • FR4 -> billing daily endpoint from reconciled snapshot tables.
  • FR5 -> reprocess endpoint with explicit range and model version.

With contracts clear, we can build architecture progressively.

---

5) High-Level Design (progressive steps)

Step A: Ingestion entrypoint and durability boundary

Publishers call POST /v1/events/impressions:batch and POST /v1/events/clicks:batch with eventId, ts, adId, campaignId, and publisher credentials; the platform validates schema and auth before acknowledging. The question is where to acknowledge events—before or after durable persistence.

Options:

  • Option 1: write directly to OLTP/OLAP storage before ack.
  • Option 2: validate minimally, append to durable log, then ack.

Why Option 1 is weak here:

  • peak spikes can overload primary storage and increase drop risk in hot path.

Decision:

  • Option 2 queue-first ingest; ack on durable log commit.

How this solves:

  • isolates producer burst from downstream consumers and gives replay capability.

Tradeoff:

  • introduces eventual consistency for dashboards and billing outputs.

Step B: Dedupe and idempotency semantics

Each event carries eventId and requestId; the platform derives an idempotency key and returns 409 for duplicates within the active dedupe window. The question is how to enforce counting correctness under retries now that durable ingest is in place.

Bad:

  • dedupe only by eventId generated client-side with no publisher scoping.
  • Why this fails: ID collisions or malicious reuse can suppress valid events.
  • Example: two publishers emit same buggy UUID sequence and undercount each other.

Good:

  • idempotency key = hash(publisherId + eventId + eventType) with 24h TTL store.
  • Why this works: scoped uniqueness avoids cross-tenant collision.
  • Example: repeated mobile retry after timeout is dropped safely as duplicate.
  • Tradeoff: large dedupe-state footprint on SSD clusters.

Great:

  • scoped key above + probabilistic front filter (Bloom/Cuckoo) + exact KV fallback.
  • Why this is great: reduces exact-store read amplification at multi-million EPS.
  • Example: front filter rejects most clear duplicates in-memory; uncertain cases hit exact store.
  • Numeric effect: at 1.5M EPS with 8% retries, ~120k duplicate deliveries/s arrive; filtering 90% of them in memory leaves only ~12k/s hitting the exact KV store.
  • Tradeoff: operational complexity and false-positive tuning in front filter.
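The scoped key from the Good option, plus the read-amplification arithmetic behind the Great option, can be sketched as follows; the hash choice and field separator are assumptions, not a wire format.

```python
import hashlib

def idempotency_key(publisher_id: str, event_id: str, event_type: str) -> str:
    # Tenant scoping: the same eventId from two publishers yields distinct keys,
    # so one publisher's buggy UUID sequence cannot suppress another's events.
    raw = f"{publisher_id}|{event_id}|{event_type}".encode()
    return hashlib.sha256(raw).hexdigest()

# Read-amplification arithmetic for the probabilistic front filter:
EPS = 1_500_000              # sustained events/sec (stated assumption)
RETRY_RATE = 0.08            # 8% of deliveries are retries
FILTER_EFFECTIVENESS = 0.90  # share of clear duplicates caught in memory

duplicate_checks_per_s = EPS * RETRY_RATE                                       # ~120k/s
kv_duplicate_reads_per_s = duplicate_checks_per_s * (1 - FILTER_EFFECTIVENESS)  # ~12k/s
```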

Step C: Attribution correctness

ClickEvent contains impressionId linking back to a prior ImpressionEvent; the stream processor must join clicks to impressions within the configured lookback window to populate AttributionLink. The question is how to connect clicks to eligible impressions for CTR and billing correctness.

Bad:

  • count clicks independently without impression join.
  • Why this fails: allows orphan clicks, inflates CTR, and breaks billing trust.
  • Example: bot-generated clicks without prior impression appear billable.

Good:

  • stream join click->impression using impressionId within lookback window.
  • Why this works: only attributable clicks become valid for billing metrics.
  • Example: click at T+2m joins impression at T, valid under 30m window.
  • Tradeoff: late/out-of-order event handling increases join state and requires watermark tuning.

Great:

  • dual-stage attribution: real-time provisional join + nightly reconciliation join.
  • Why this is great: fast dashboards and finance-grade corrections coexist.
  • Example: late impression arriving after watermark is repaired in nightly pass.
  • Numeric effect: provisional error can drop from ~1-2% to sub-0.2% after T+1 reconciliation.
  • Tradeoff: two metric states require clear product semantics (provisional vs final).
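A toy version of the lookback join from the Good option, assuming a plain in-memory dict in place of real keyed stream-processor state (e.g., Flink) with watermarks:

```python
LOOKBACK_SEC = 30 * 60  # 30-minute window (assumed web-channel default)

impressions: dict[str, float] = {}  # impressionId -> event-time ts

def record_impression(impression_id: str, ts: float) -> None:
    impressions[impression_id] = ts

def attribute_click(impression_id: str, ts: float) -> bool:
    """A click is valid only if it follows an eligible impression within the window."""
    imp_ts = impressions.get(impression_id)
    return imp_ts is not None and 0 <= ts - imp_ts <= LOOKBACK_SEC
```

For example, a click at T+2m joins an impression at T, while a click with no prior impression (or one outside the window) is rejected rather than counted as billable.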

Step D: Serving and anti-fraud integration

Real-time dashboards query GET /v1/metrics/realtime?campaignId=&from=&to=&dimensions= against streaming aggregates; billing uses GET /v1/billing/daily?advertiserId=&day= from reconciled snapshots with invalid-traffic filtering applied. The question is how to separate serving paths by correctness needs—fast analytics versus finance-grade outputs.

Options:

  • Option 1: single aggregate table for all use cases.
  • Option 2: real-time OLAP path + reconciled billing warehouse path.

Why Option 1 is risky:

  • conflicting freshness/correction requirements lead to either stale dashboards or unstable invoices.

Decision:

  • Option 2 dual-path outputs with shared raw-event lineage.

How this solves:

  • dashboard path targets <10s freshness; billing path enforces invalid-traffic model, adjustments, and audit versioning.

Tradeoff:

  • duplicated compute pipelines and stronger lineage governance needed.

After these additions, entities evolve for lifecycle, lineage, and reconciliation controls.

---

Core Entities v2

  • EventEnvelope(eventGuid, tenantId, eventType, logicalTs, ingestTs, schemaVersion, traceId)
  • AttributionState(impressionId, firstSeenTs, expiryTs, joinedClickCount, lastWatermarkTs, status)
  • ProvisionalMetricBucket(bucketTs, dimsHash, impressions, clicks, invalid, latenessAdjustments)
  • FinalMetricSnapshot(day, dimsHash, billableImpressions, billableClicks, spendMicros, snapshotVersion, generatedTs)
  • FraudModelRun(runId, modelVersion, windowStart, windowEnd, threshold, precision, recall)
  • ReprocessJob(jobId, rangeStart, rangeEnd, reason, inputSnapshot, outputSnapshot, status, startedAt, endedAt)
  • DataQualityCheckpoint(checkpointId, windowStart, windowEnd, expectedEvents, acceptedEvents, duplicateRate, lateRate, driftFlags)

Why changed:

  • Added explicit event envelope versioning, attribution state tracking, provisional/final metric split, and reproducible reprocess metadata.

With full architecture in place, we stress-test high-risk decisions.

---

6) Deep Dives

Deep Dive 1: Duplicate suppression at extreme retry rates

Now that pipeline stages are clear, we test whether retry storms break counting.

Bad:

  • no persistent dedupe; rely on producer not retrying.
  • Why this fails: network retries can inflate impressions/clicks and spend.
  • Example: gateway timeout triggers 3 retries, all counted as unique.
  • Numeric implication: at a 10% retry rate during an incident, reported spend can be overstated by a similar fraction.

Good:

  • persistent idempotency store with bounded TTL and deterministic key.
  • Why this works: repeated deliveries collapse to single accepted event.
  • Example: same key seen again within 24h returns duplicate decision.
  • Numeric implication: duplicate inflation stays near store miss/error rate instead of retry rate.
  • Tradeoff: heavy write/read pressure on dedupe cluster.

Great:

  • tiered dedupe (in-memory probabilistic filter + sharded exact store) with backpressure.
  • Why this is great: maintains dedupe accuracy while controlling tail latency at high QPS.
  • Example: during 5M EPS burst, ingest still holds p99 because most duplicate checks avoid exact reads.
  • Numeric implication: dedupe latency budget can stay under single-digit milliseconds on hot path.
  • Tradeoff: tuning filter false positives and rebalancing shards is non-trivial.
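A toy version of the tiered check, with a deliberately small illustrative Bloom filter in front of an exact store; production filters are sized for target false-positive rates and sharded.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits: int = 1 << 20, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, key: str):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key: str) -> None:
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, key: str) -> bool:
        # No false negatives: a miss here is a definite first sighting.
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def is_duplicate(key: str, bloom: BloomFilter, exact_store: set) -> bool:
    if not bloom.maybe_contains(key):
        bloom.add(key)
        exact_store.add(key)   # persist so the exact fallback stays authoritative
        return False           # definite miss: the cheap path, no exact-store read
    if key in exact_store:     # filter hit: confirm against the exact store
        return True
    bloom.add(key)             # filter false positive: accept and record
    exact_store.add(key)
    return False
```

Most first-time events take the cheap "definite miss" path, which is what keeps exact-store read QPS bounded during a burst.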

Deep Dive 2: Attribution window and late-event behavior

With dedupe stable, next risk is attribution drift from event disorder.

Bad:

  • strict event-time ordering assumption with no late-event policy.
  • Why this fails: mobile/offline uploads create orphan clicks and volatile CTR.
  • Example: click arrives before delayed impression and is permanently marked invalid.
  • Numeric implication: late events can bias campaign CTR by measurable basis points at scale.

Good:

  • watermark-based stream join plus a configurable lookback (e.g., 30m default; up to 24h for some channels).
  • Why this works: expected disorder is handled while bounding state size.
  • Example: click buffered until watermark advances enough to resolve join.
  • Numeric implication: orphan rate decreases as window matches channel behavior.
  • Tradeoff: larger windows increase state memory and join CPU.

Great:

  • channel-aware windows + provisional validity + nightly corrective join.
  • Why this is great: protects dashboard freshness and final correctness together.
  • Example: mobile traffic uses longer late tolerance than web; nightly pass corrects residual misses.
  • Numeric implication: reconciliation reduces final orphan-click error significantly versus stream-only approach.
  • Tradeoff: product and finance teams must align on provisional-vs-final metric semantics.
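The channel-aware windows plus provisional validity can be sketched as a small policy function; the channel names and window values here are assumptions for illustration.

```python
LOOKBACK_BY_CHANNEL = {
    "web": 30 * 60,       # 30m: browsers deliver events promptly
    "mobile": 24 * 3600,  # 24h: offline batching and deferred uploads
}
NIGHTLY_REPAIR = True     # residual misses are corrected in the T+1 pass

def click_state(channel: str, click_ts: float, impression_ts: float) -> str:
    window = LOOKBACK_BY_CHANNEL.get(channel, 30 * 60)
    if 0 <= click_ts - impression_ts <= window:
        return "valid"
    # Outside the stream window: provisional until the nightly join re-checks it.
    return "provisional_orphan" if NIGHTLY_REPAIR else "invalid"
```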

Deep Dive 3: Invalid traffic and fraud controls

Now we evaluate whether metrics remain trustworthy under adversarial traffic.

Bad:

  • block only obvious malformed payloads, no behavioral fraud scoring.
  • Why this fails: sophisticated bot farms pass basic schema checks and poison billing.
  • Example: high-frequency click bursts from rotating IPs evade simple rules.
  • Numeric implication: CTR inflation can trigger bad budget allocation and invoice disputes.

Good:

  • rules + model scoring pipeline marks suspicious events and excludes from billable aggregates.
  • Why this works: combines deterministic heuristics with adaptive detection.
  • Example: impossible click cadence per userKey gets downgraded to invalid.
  • Numeric implication: invalid-traffic rate is observable per publisher and can drive enforcement.
  • Tradeoff: false positives risk under-billing if thresholds are too aggressive.

Great:

  • online lightweight scoring + offline model retraining + appeal/audit workflow.
  • Why this is great: near-real-time protection with explainable post-hoc adjustments.
  • Example: advertiser dispute references model version and reason codes for each adjustment batch.
  • Numeric implication: dispute resolution time and adjustment variance both improve with lineage.
  • Tradeoff: governance overhead for model lifecycle and policy tuning.
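A minimal rules-plus-score classifier in the spirit of the Good option; the cadence threshold and score cutoff are placeholders, and a real system would combine many behavioral signals with a trained model.

```python
MAX_CLICKS_PER_MINUTE = 10  # "impossible cadence" heuristic (assumed value)
SCORE_THRESHOLD = 0.8       # model score above which traffic is marked invalid

def classify_traffic(clicks_last_minute: int, model_score: float) -> str:
    if clicks_last_minute > MAX_CLICKS_PER_MINUTE:
        return "invalid:cadence"   # deterministic rule fires first, cheap and explainable
    if model_score >= SCORE_THRESHOLD:
        return "invalid:model"     # adaptive detection catches subtler patterns
    return "valid"
```

Returning a reason code rather than a bare boolean is what makes the appeal/audit workflow in the Great option possible.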

Deep Dive 4: Real-time dashboard vs billing truth

With traffic quality controls in place, we test output contract clarity.

Bad:

  • expose one number everywhere without indicating correction state.
  • Why this fails: dashboards and invoices diverge later, causing trust erosion.
  • Example: finance exports noon dashboard number as final bill.

Good:

  • explicit provisional real-time metrics and separate final daily snapshots.
  • Why this works: consumers know expected drift and closure schedule.
  • Example: dashboard shows rolling data; invoice references snapshot version at T+1.
  • Numeric implication: expected reconciliation delta can be tracked and bounded by SLA.
  • Tradeoff: extra education and API/documentation complexity.

Great:

  • contract-level metric states + automated reconciliation reports + alerting on abnormal deltas.
  • Why this is great: metric drift is observable, auditable, and operationally actionable.
  • Example: if provisional-final delta exceeds threshold for a campaign, pipeline owners get paged.
  • Numeric implication: large hidden billing errors are caught early.
  • Tradeoff: additional reporting jobs and on-call signals to tune.
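The delta alert from the Great option reduces to a small check; the 2% threshold here is an assumption, not a recommended SLA.

```python
DELTA_ALERT_THRESHOLD = 0.02  # page owners when provisional drifts >2% from final

def reconciliation_delta(provisional: int, final: int) -> float:
    """Relative gap between the provisional dashboard count and the final snapshot."""
    if final == 0:
        return 0.0 if provisional == 0 else float("inf")
    return abs(provisional - final) / final

def should_page(provisional: int, final: int) -> bool:
    return reconciliation_delta(provisional, final) > DELTA_ALERT_THRESHOLD
```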

Deep Dive 5: Reprocessing safety and reproducibility

Finally, we verify that correction runs do not create silent inconsistencies.

Bad:

  • ad-hoc backfill jobs overwrite aggregates without snapshot lineage.
  • Why this fails: teams cannot explain why numbers changed after disputes.
  • Example: rerun with new fraud threshold updates totals but prior inputs are lost.

Good:

  • immutable raw log + versioned output snapshots for each reprocess job.
  • Why this works: every published number is reproducible from specific inputs and code/model version.
  • Example: snapshotVersion=42 links to modelVersion=fraud-2026-03-10.
  • Numeric implication: audit queries answer "what changed and by how much" quickly.
  • Tradeoff: storage overhead from retained snapshots.

Great:

  • full lineage graph (data + code + model) and safe publish gate (compare, approve, then promote).
  • Why this is great: corrections are controlled releases, not blind overwrites.
  • Example: promote only when campaign-level deltas are within expected bounds or explicitly approved.
  • Numeric implication: reduces risk of accidental multi-million-dollar billing swings.
  • Tradeoff: slower correction turnaround for urgent incidents.
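The publish gate can be sketched as a comparison between snapshot versions; the field names and the 5% bound are illustrative, not a policy recommendation.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    version: int
    model_version: str
    totals: dict  # campaignId -> billable count

def promote(current: Snapshot, candidate: Snapshot,
            max_rel_delta: float = 0.05, approved: bool = False) -> bool:
    """Promote a rerun's output only if deltas are in bounds or explicitly approved."""
    if approved:
        return True  # explicit human override for reviewed large swings
    for campaign, old in current.totals.items():
        new = candidate.totals.get(campaign, 0)
        if old and abs(new - old) / old > max_rel_delta:
            return False  # out-of-bounds campaign delta blocks auto-promotion
    return True
```

An out-of-bounds rerun is held for review instead of silently overwriting billing totals, which is the "controlled release" property the Great option is after.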

---

7) Common mistakes

  1. Acknowledging ingest before durable append, then losing events on crash.
  2. Treating eventId as globally unique without tenant scoping.
  3. Mixing provisional and final metrics in the same API contract.
  4. Ignoring late-event policy and assuming strict order in distributed clients.
  5. Running fraud filtering only in batch, allowing near-real-time metrics to be gamed.
  6. Reprocessing without snapshot/version lineage.

---

8) What interviewer evaluates

  • Do you define measurement semantics (impression, click, billable) clearly before architecture?
  • Can you design for multi-million events/sec with durable buffering and partition strategy?
  • Do you reason about idempotency, dedupe windows, and attribution correctness under disorder?
  • Do you separate low-latency analytics from billing-grade reconciliation?
  • Do you include fraud/invalid-traffic controls and explain their business impact?
  • Can you show tradeoffs with concrete numeric implications, not generic component lists?

---

9) References

  • IAB digital ad measurement guidelines: https://www.iab.com/guidelines/
  • Google Ads invalid traffic overview: https://support.google.com/google-ads/answer/11182074
  • Apache Kafka design docs: https://kafka.apache.org/documentation/
  • Apache Flink event time and watermarks: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/time/
  • Designing Data-Intensive Applications (stream processing patterns): https://dataintensive.net/

Key Takeaways

  1. Record ad impressions and clicks from web/mobile/server channels.
  2. Deduplicate retries/replays and preserve the attribution relationship (`click` must map to a prior eligible impression).
  3. Provide near-real-time metrics (`impressions`, `clicks`, `CTR`) by campaign, ad, geo, and device.
  4. Produce billing-grade daily aggregates with invalid-traffic filtering and reconciliation.
  5. Support reprocessing/backfill when fraud models or attribution rules change.
