Design a Ride-Hailing System (Uber-Style, Interview Walkthrough)

Tech: Replication · Consistency · Availability · Redis · WebSocket · Idempotency

Designing a ride-hailing system is about making real-time decisions under uncertainty: live locations, fast matching, ETA quality, and a failure-safe trip lifecycle. The goal is to build a correct trip pipeline first, then harden dispatch latency, geospatial scale, and reliability.

---

0) Pre-Design Research Inputs (done before architecture)

Before choosing components, we anchor on external behavior constraints:

  1. WebSocket protocol (RFC 6455) supports persistent bi-directional updates for driver/rider location streams.
  2. Uber H3 describes hierarchical hex indexing for efficient proximity grouping.
  3. Redis GEOSEARCH provides fast radius search semantics on indexed coordinates.
  4. PostGIS KNN / ST_DWithin supports nearest-neighbor + radius filtering on relational GIS data.
  5. HTTP 202 semantics (RFC 7231) fit async workflows where processing completes later.

How this shapes our design:

  • Use persistent realtime channels for location updates.
  • Use geospatial cell indexing to reduce candidate search space.
  • Use async trip state transitions where provider dependencies are not instantaneous.

---

1) Requirements (~5 min)

Now that research constraints are known, let us pin down scope so the design stays focused.

Functional requirements (top 4)

  1. Rider can request a ride with pickup/drop.
  2. Nearby drivers are matched and one is assigned.
  3. Both sides can see realtime trip/location updates.
  4. Trip lifecycle is tracked (REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED).

Non-functional requirements

  1. Match decision p95 < 2s in hot cities.
  2. High availability for dispatch and trip state updates.
  3. Location ingestion at very high write rate.
  4. Idempotent request/accept flows to avoid duplicate trips.
  5. Graceful handling for driver disconnects/location lag.

Scope control

  • In scope: matching, location stream, trip state, basic ETA pipeline.
  • Out of scope: deep fraud scoring, full ML surge model, payments ledger internals.

---

2) Capacity Estimation (decision-driving)

With requirements fixed, let us quantify load to force architecture decisions.

Assumptions

  • DAU: 30M
  • Daily trips: 25M
  • Peak trip requests: ~30k req/s (city/event bursts)
  • Active online drivers at peak globally: 1.5M
  • Driver location updates: every 3 seconds

Derived load

  • Driver location writes: 1.5M / 3 ~= 500k updates/s
  • Trip create attempts peak: ~30k/s
  • Match lookups peak: same order as trip requests, but each involves geo candidate search

What this forces

  • Location path is write-heavy and must be in-memory/realtime friendly.
  • Matching must avoid global nearest-neighbor scans.
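The derived figures above are simple arithmetic; a quick sketch of the math, with all inputs taken from the assumptions:

```python
# Back-of-envelope load numbers from the assumptions above (approximate).
online_drivers = 1_500_000
update_interval_s = 3
location_writes_per_s = online_drivers // update_interval_s  # 500k writes/s

peak_trip_requests_per_s = 30_000  # city/event bursts
```

The 500k/s location write rate dwarfs the 30k/s trip path, which is why the two flows get separate storage.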

---

2.1) Storage + Ops Estimation

Now estimate state footprint so we avoid accidental bottlenecks.

Assumptions:

  • Driver session record (status, cell, coords, last heartbeat): ~200B
  • Active driver states: 1.5M -> ~300MB raw (before replication/overhead)
  • Trip row average: ~1KB (state transitions + pointers + metadata)

Trip storage:

  • 25M trips/day * 1KB = ~25GB/day raw
  • 365 days retained raw: ~9.1TB in the operational store before replication

Implication:

  • Hot operational state (drivers + active trips) should be separated from cold historical analytics.
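The storage math, sketched:

```python
# Trip storage growth from the assumptions above.
trips_per_day = 25_000_000
trip_row_bytes = 1_000

daily_gb = trips_per_day * trip_row_bytes / 1e9  # 25.0 GB/day raw
yearly_tb = daily_gb * 365 / 1e3                 # ~9.1 TB/year before replication
```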

---

3) Core Entities v1 (~2 min)

Now that load patterns are clear, define entities around dispatch and state transitions.

  • Driver(driverId, vehicleType, status, rating, currentCell, lastLocTs)
  • Rider(riderId, rating, defaultPaymentRef?)
  • RideRequest(requestId, riderId, pickup, dropoff, requestedAt, status)
  • Trip(tripId, requestId, riderId, driverId, state, createdAt, updatedAt)
  • DriverLocation(driverId, lat, lon, speed, heading, ts, cellId)
  • DispatchAttempt(requestId, candidateDriverId, attemptNo, outcome, ts)
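A minimal sketch of the lifecycle-critical entities as Python dataclasses; field types are illustrative assumptions, not a schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class TripState(Enum):
    REQUESTED = "REQUESTED"
    MATCHED = "MATCHED"
    ON_TRIP = "ON_TRIP"
    COMPLETED = "COMPLETED"
    CANCELLED = "CANCELLED"


@dataclass
class RideRequest:
    request_id: str
    rider_id: str
    pickup: Tuple[float, float]   # (lat, lon)
    dropoff: Tuple[float, float]
    requested_at: float           # epoch seconds
    status: str


@dataclass
class Trip:
    trip_id: str
    request_id: str
    rider_id: str
    driver_id: Optional[str]      # None until MATCHED
    state: TripState
    created_at: float
    updated_at: float
```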

Thought process:

  • DriverLocation is high-churn realtime state.
  • Trip is authoritative lifecycle entity.
  • DispatchAttempt helps tuning and debugging failed/slow matches.

Functional requirement traceability:

  • FR1 (rider requests ride) -> RideRequest + Trip track creation and lifecycle transition.
  • FR2 (nearby driver match) -> DriverLocation, Driver.currentCell, and DispatchAttempt support candidate search and offer outcome tracking.
  • FR3 (realtime updates) -> DriverLocation feeds location/trip events (a dedicated DriverSession entity is added in v2).
  • FR4 (trip lifecycle) -> Trip.state + Trip.updatedAt + DispatchAttempt provide auditable state progression.

Why this mapping matters:

  • We are not adding entities as generic tables; each entity exists because one or more functional requirements need a specific query or state transition.

---

4) API / System Interface (~5 min)

With entities in place, define contracts that map directly to trip flow.

Rider APIs

  • POST /v1/ride-requests (idempotency key required)
  • GET /v1/trips/{tripId}
  • POST /v1/trips/{tripId}/cancel

Driver APIs

  • POST /v1/driver/location (or websocket stream event)
  • POST /v1/dispatch/{requestId}/accept
  • POST /v1/dispatch/{requestId}/reject

Realtime channels

  • WebSocket topics/events:

- tripUpdate
- driverLocationUpdate
- dispatchOffer

Parameter reasoning:

  • idempotencyKey on ride request avoids duplicate trip creation on mobile retries.
  • requestId on dispatch accept/reject ties decision to exact offer instance.
  • ts on location updates enables stale update rejection.

Error semantics:

  • 409 Conflict when request already matched/cancelled.
  • 202 Accepted for async workflows where final match completes shortly after enqueue.
  • 429 for throttling abusive clients/devices.

Functional requirement to API mapping:

  • FR1 (request ride) -> POST /v1/ride-requests with idempotency key to prevent duplicate trips on retries.
  • FR2 (match driver) -> internal dispatch flow uses request context and POST /v1/dispatch/{requestId}/accept|reject for candidate resolution.
  • FR3 (realtime updates) -> websocket events tripUpdate, driverLocationUpdate, dispatchOffer keep rider/driver clients in sync.
  • FR4 (lifecycle visibility) -> GET /v1/trips/{tripId} and cancel endpoint expose current state safely.

Why this mapping matters:

  • Each endpoint is justified by a requirement. If an endpoint does not map to a requirement, it should not be in the interview scope.

---

5) High-Level Design (progressive)

Now build architecture in sequence: correctness first, then match speed, then resilience.

Step A: Minimal trip creation and assignment baseline

Rider calls POST /v1/ride-requests with pickup/dropoff and idempotencyKey. The service creates a RideRequest, transitions it through REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED, and returns trip state via GET /v1/trips/{tripId}. Driver receives dispatchOffer via WebSocket and responds with POST /v1/dispatch/{requestId}/accept or reject. The question is: how do we ensure exactly one valid trip state when request, cancel, and accept can race?

Options we could pick:

  • Option A: eventually-consistent key-value writes for lifecycle.
  • Option B: transactional lifecycle store with explicit state transitions.

Why Option A is weaker here:

  • it scales well for append traffic, but request/accept/cancel races are harder to enforce safely.

Decision:

  • choose a transactional DB for RideRequest and Trip lifecycle transitions.

What this decision unlocks:

  • strict state machine checks (REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED)
  • clean conflict handling (409) for late/duplicate transitions
  • idempotent ride-create semantics under client retries.

Components for this step:

  • Ride API service
  • Trip store (transactional DB)
  • Basic dispatch service

How with numbers:

  • at 30k req/s request peaks, even a small race rate creates many conflicting transitions per second.
  • strict transactional checks reduce illegal state writes at the cost of higher coordination overhead.

Tradeoff:

  • more write coordination than pure append stores, but this is the correct tradeoff for lifecycle correctness.
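The strict state machine can be sketched as a compare-and-set over an explicit transition table. In a real system this runs as a conditional `UPDATE ... WHERE state = :expected` inside the transactional store; this in-memory version only illustrates the guard logic:

```python
# Legal lifecycle moves; anything else is rejected with 409 by the API layer.
ALLOWED_TRANSITIONS = {
    "REQUESTED": {"MATCHED", "CANCELLED"},
    "MATCHED":   {"ON_TRIP", "CANCELLED"},
    "ON_TRIP":   {"COMPLETED", "CANCELLED"},
}


def try_transition(trip, expected, new_state):
    # Compare-and-set on the current state: of two racing writers
    # (e.g. accept vs cancel), only one observes `expected` and wins.
    if trip["state"] != expected:
        return False  # lost the race -> 409 to caller
    if new_state not in ALLOWED_TRANSITIONS.get(expected, set()):
        return False  # illegal move -> 409 to caller
    trip["state"] = new_state
    return True
```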

Step B: Geospatial candidate discovery at scale

When POST /v1/ride-requests is received, the dispatch service must find nearby available drivers. Drivers continuously stream POST /v1/driver/location updates with lat, lon, and ts, which are indexed in a geospatial store. The service queries for candidates within a radius of the pickup point. The question is: how do we find nearby drivers quickly without scanning all online drivers?

Options we could pick:

  • Option A: brute-force nearest search across all online drivers.
  • Option B: geospatial cell index, then local ring/radius candidate search.

Why Option A is weaker here:

  • with ~500k location updates/s and ~30k dispatch req/s, brute-force nearest search is computationally wasteful and latency-unstable.

Decision:

  • use cell-based indexing (H3/S2) and search local cells + neighboring rings before ranking.

What this decision unlocks:

  • bounded candidate set for every request,
  • predictable dispatch latency in dense and sparse areas,
  • scalable matching pipeline.

Components added:

  • Geospatial index service (H3/S2 cell mapping)
  • Hot driver location store (Redis geo or equivalent)

How with numbers:

  • candidate pool shrinks from broad city-level sets to tens/hundreds before scoring.
  • this keeps match latency within budget while preserving enough options for quality matching.

Tradeoff:

  • requires careful cell-size/ring tuning near boundaries.
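A toy sketch of cell-indexed ring search. It uses a square lat/lon grid as a stand-in for H3/S2 hex cells; the cell size, function names, and index shape are all assumptions for illustration:

```python
import math

CELL_DEG = 0.01  # ~1.1 km squares; crude stand-in for H3/S2 cells


def cell_of(lat, lon):
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def ring(center, k):
    # Cells at Chebyshev distance exactly k from the center cell.
    cx, cy = center
    if k == 0:
        return [center]
    return [(cx + dx, cy + dy)
            for dx in range(-k, k + 1)
            for dy in range(-k, k + 1)
            if max(abs(dx), abs(dy)) == k]


def find_candidates(index, pickup, max_ring=3):
    # index: cell -> list of available driverIds (the hot location store).
    center = cell_of(*pickup)
    for k in range(max_ring + 1):
        found = [d for cell in ring(center, k) for d in index.get(cell, [])]
        if found:
            return found, k  # bounded candidate set; stop at first hit
    return [], max_ring
```

The key property is that work per request is bounded by ring size, not by the number of online drivers.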

Step C: Realtime delivery and dispatch offers

After candidates are ranked, the dispatch service sends a dispatchOffer event to the top driver via WebSocket. The driver must respond with POST /v1/dispatch/{requestId}/accept or reject within the offer timeout. If timeout/reject occurs, the service offers to the next candidate. The question is: how do we deliver offers and receive responses fast enough to meet the match SLA?

Options we could pick:

  • Option A: polling-based offer retrieval.
  • Option B: push offers on persistent realtime channels.

Why Option A is weaker here:

  • polling intervals add fixed delay; in mobile networks this quickly consumes dispatch budget.

Decision:

  • use websocket push for dispatch offers with short accept timeout and fast fallback to next candidate.

What this decision unlocks:

  • lower round-trip overhead,
  • faster re-offer path on timeout/reject,
  • better match success under tight SLA.

Components added:

  • WebSocket gateway cluster
  • Offer queue/stream
  • Driver session registry

How with numbers:

  • if polling interval is 1s, two poll cycles can already consume most of a <2s match target.
  • push model removes this idle wait and keeps offer loop reactive.

Tradeoff:

  • session lifecycle/reconnect handling becomes a core operational concern.
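The offer/fallback loop can be sketched as below; `send_offer` is an assumed interface standing in for the websocket push plus wait-for-ack:

```python
def offer_loop(ranked_candidates, send_offer):
    # send_offer(driver_id) abstracts the websocket push + bounded wait;
    # it returns "accept", "reject", or "timeout" (no ack in time).
    for attempt_no, driver_id in enumerate(ranked_candidates, start=1):
        outcome = send_offer(driver_id)
        if outcome == "accept":
            return driver_id, attempt_no
        # reject/timeout: fall through to the next ranked candidate now
    return None, len(ranked_candidates)
```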

Step D: Reliability hardening and degraded modes

The dispatch flow depends on fresh DriverLocation data, healthy WebSocket delivery, and ETA service responses. When POST /v1/driver/location updates stop arriving (stale drivers), or WebSocket dispatchOffer delivery fails, or ETA calls timeout, the system must continue functioning. The question is: how do we keep dispatch available when dependencies are degraded?

Options we could pick:

  • Option A: hard dependency chain (if one dependency fails, matching fails).
  • Option B: graceful degradation with bounded quality loss.

Why Option A is weaker here:

  • partial outages become full dispatch outages.

Decision:

  • degrade gracefully with stale-driver quarantine, adaptive radius widening, and fallback ETA confidence bands.

What this decision unlocks:

  • dispatch availability remains high during partial failures,
  • quality drops temporarily instead of complete service interruption.

Components added:

  • Stale-location detector
  • Fallback matching mode
  • Replay/reconciliation workers for stuck trips

How with numbers:

  • if ~5% driver sessions become stale in a region, freshness filtering plus fallback search prevents cascading timeout loops.

Tradeoff:

  • temporary drop in ETA precision and match optimality during degraded operation.
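One way to encode the degraded-mode decision; the 5% stale threshold is taken from the example above, while the plan structure and field names are assumptions:

```python
def degraded_plan(stale_fraction, ws_delivery_ok, eta_service_up):
    # Choose bounded quality loss over failing dispatch outright.
    plan = {
        "quarantine_stale": True,  # always filter stale drivers from offers
        "extra_search_rings": 0,
        "eta_mode": "precise" if eta_service_up else "confidence_band",
        "offer_channel": "websocket" if ws_delivery_ok else "push_fallback",
    }
    if stale_fraction > 0.05:  # e.g. regional location-ingest degradation
        plan["extra_search_rings"] = 1  # widen search to recover lost supply
    return plan
```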

---

Core Entities v2 (after evolution)

After architecture hardening, update data model to reflect operational realities:

  • Driver(driverId, status, currentCell, lastLocTs, staleFlag, regionId)
  • DispatchAttempt(requestId, candidateDriverId, rankScore, offerTs, expiryTs, outcome)
  • Trip(tripId, state, stateVersion, assignedAt, pickupEtaSec, cancelReason?)
  • DriverSession(driverId, socketId, serverId, lastHeartbeatTs, connState)
  • CellSupply(cellId, availableDrivers, demandRequests, updatedAt) (for surge/dispatch balancing)

Why changed:

  • Added stateVersion for optimistic transition safety.
  • Added staleFlag and expiryTs to control invalid offers and timeout loops.
  • Added cell-level supply snapshot for fast balancing and surge inputs.

---

6) Deep Dives (numeric + mechanism)

With full design in place, now stress-test the highest-risk areas in order.

Deep Dive 1: Candidate search quality vs speed

Now that matching exists, first verify whether geospatial search is both fast and accurate.

Bad:

  • One fixed tiny radius search and immediate failure if no driver is found.
  • Why bad: this behaves okay only in dense downtown cells; suburban and airport edges fail often.
  • Example: rider requests from low-density area where nearest driver is 2.8 km away, but system only searches 1 km radius.
  • How technically:

- dispatch returns "no drivers" too early,
- client retries repeatedly,
- request load rises while completion rate drops.

  • Why not acceptable:

- we create self-inflicted retry traffic and poor rider conversion.

Good:

  • Adaptive radius expansion (ring 1 -> ring 2 -> ring 3) with timeout budget.
  • Why good:

- keeps fast path in dense cells,
- broadens search only when needed.

  • Example:

- try ring-1 for 200ms, ring-2 for next 300ms, ring-3 until 1.5s total budget.

  • How technically:

- bounded progressive search avoids both under-search and unbounded scans.

Great:

  • Adaptive radius + supply-aware scoring by cell.
  • Example: if local supply low, widen ring and prioritize shorter ETA confidence.
  • How technically:

- combine distance with freshness and expected pickup time score,
- prune candidates with stale heartbeat or poor acceptance history.

  • Why this solves:

- improves both match success and ETA realism under uneven supply.

  • Tradeoff: more tuning complexity.

Deep Dive 2: Stale location handling

Now we evaluate location freshness, because bad location data causes wrong dispatch.

Bad:

  • Use last location indefinitely.
  • Why bad: disconnected/offline driver still gets offers.
  • Example:

- driver app goes background or loses network, last ping remains in index.
- dispatch keeps offering rides to this phantom "nearby" driver.

  • How technically:

- offer timeout chain grows,
- rider wait time and cancellation probability increase.

  • Why not acceptable:

- stale data poisons candidate quality and lowers completion rate.

Good:

  • Mark stale if now - lastLocTs exceeds threshold (e.g., 10-15s).
  • Why good:

- quick filter removes obviously stale candidates before offer.

  • Example:

- any driver older than 12s heartbeat is removed from eligible set.

Great:

  • Freshness score in ranking + heartbeat gating for offer eligibility.
  • How with numbers:

- if 500k updates/s ingest drops in a region, freshness gating prevents false positives from stale entries.

  • Additional mechanism:

- keep hard cutoff for eligibility plus soft freshness penalty in ranking.

  • Tradeoff:

- overly strict thresholds can reduce supply visibility in weak-network zones.
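A sketch of the hard-cutoff-plus-soft-penalty ranking described above; the 12s threshold comes from the earlier example, and the penalty formula is an illustrative choice:

```python
HARD_STALE_S = 12.0  # hard eligibility cutoff (per the example above)


def rank_by_freshness(drivers, now):
    # drivers: list of dicts with driver_id, distance_m, last_loc_ts.
    eligible = [d for d in drivers if now - d["last_loc_ts"] <= HARD_STALE_S]

    def score(d):
        # Soft penalty: older location inflates effective distance, so a
        # slightly farther but fresher driver can outrank a borderline-stale one.
        age_s = now - d["last_loc_ts"]
        return d["distance_m"] * (1.0 + age_s / HARD_STALE_S)

    return sorted(eligible, key=score)
```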

Deep Dive 3: Offer acceptance race

Next, test the concurrency edge where multiple drivers may accept same request.

Bad:

  • First accept wins without atomic guard.
  • How: duplicate acceptance can create inconsistent trip assignment.
  • Example:

- two offers delivered nearly simultaneously due to retry/rebalance.
- both drivers tap accept within 100ms.

  • Why bad:

- without atomic state transition, both accept paths may proceed.

  • Consequence:

- double assignment, cancellation churn, and support incidents.

Good:

  • Atomic compare-and-set on request/trip state (REQUESTED -> MATCHED once).
  • Why good:

- only one write can successfully transition state.

  • Example:

- first accept updates state version; second receives conflict and is rejected.

Great:

  • CAS + offer token + expiry window.
  • Why: only valid non-expired offer can transition state.
  • How technically:

- token ties accept call to a specific dispatch attempt,
- expired token blocks late accepts after reassignment.

  • Why this solves:

- protects against out-of-order network delivery and delayed clients.

  • Tradeoff: extra state tracking.
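The CAS + token + expiry mechanism can be sketched as follows. This is a dict-based, single-node illustration; a real system would perform it as one conditional write in the trip store:

```python
def accept_offer(trip, token, now):
    # Accept is valid only against the live, unexpired dispatch attempt.
    if trip["state"] != "REQUESTED":
        return "conflict"        # already matched/cancelled -> 409
    offer = trip.get("active_offer")
    if offer is None or offer["token"] != token:
        return "invalid_token"   # accept tied to a superseded offer
    if now > offer["expires_at"]:
        return "expired"         # late accept after reassignment window
    trip["state"] = "MATCHED"    # the single successful CAS transition
    trip["driver_id"] = offer["driver_id"]
    trip["active_offer"] = None
    return "accepted"
```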

Deep Dive 4: Realtime channel failure

Now evaluate what happens when websocket delivery degrades.

Bad:

  • No fallback when driver socket is disconnected.
  • Why bad:

- dispatch waits for timeout before moving to next candidate, burning dispatch budget.

  • Example:

- driver socket dropped; offer never reaches device; rider waits full timeout repeatedly.

  • How technically:

- each failed attempt consumes offer TTL and increases p95 match latency.

Good:

  • Detect failed push and move to next candidate quickly.
  • Why good:

- reduces wasted time on unreachable candidates.

  • Example:

- if no delivery ack in 300ms, fail fast to next ranked driver.

Great:

  • Multi-channel fallback (websocket + push wakeup + re-offer queue) with bounded retries.
  • How technically:

- websocket primary,
- push wakeup for sleeping app,
- re-offer queue prevents infinite loops.

  • Why this solves:

- keeps dispatch alive across mobile network churn.

  • Tradeoff: more moving parts to operate, but materially higher offer delivery success.

Deep Dive 5: Multi-region dispatch

Finally, verify region behavior for latency and resilience.

Bad:

  • Global single-region dispatch.
  • Why bad: high RTT and large outage blast radius.
  • Example:

- APAC rider and driver matched through US control plane during peak.

  • How technically:

- extra round trips inflate offer latency and timeout risk,
- regional outage can pause global matching.

Good:

  • Region-local dispatch control plane.
  • Why good:

- keeps request, candidate search, and offer loop close to users.

  • Example:

- city-region dispatch can keep matching decisions within local latency envelope.

Great:

  • Region-local matching + async cross-region failover only when local supply exhausted.
  • How technically:

- default local dispatch,
- cross-region expansion only after bounded local attempts.

  • Why this solves:

- preserves low latency first and uses failover as controlled fallback.

  • Tradeoff: more operational complexity and cross-region consistency decisions.

---

7) Common mistakes

  1. Treating geospatial match as simple nearest-point query without candidate/ranking pipeline.
  2. No stale-location filtering.
  3. Missing idempotency for ride request and driver accept.
  4. No atomic state transition guard for accept race.
  5. Over-coupling ETA/ranking dependencies to dispatch availability.

---

8) What the interviewer is evaluating

  • Did you separate high-churn realtime state from authoritative trip lifecycle state?
  • Did you map throughput numbers to matching/indexing choices?
  • Did you handle offer race and stale-driver failure modes explicitly?
  • Did you design graceful degraded modes instead of brittle hard dependencies?

---

9) References

  • RFC 6455 WebSocket: https://www.rfc-editor.org/rfc/rfc6455
  • Uber H3 overview: https://www.uber.com/en-CA/blog/h3/
  • H3 docs: https://uber.github.io/h3/
  • Redis GEOSEARCH docs: https://redis.io/docs/latest/commands/geosearch/
  • PostGIS KNN/ST_DWithin guides: https://www.postgis.net/workshops/postgis-intro/knn.html
  • RFC 7231 (202 Accepted): https://www.rfc-editor.org/rfc/rfc7231.html

Key Takeaways

  1. Rider can request a ride with pickup/drop.
  2. Nearby drivers are matched and one is assigned.
  3. Both sides can see realtime trip/location updates.
  4. Trip lifecycle is tracked (`REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED`).
  5. Match decision p95 < 2s in hot cities.
