Design a Ride-Hailing System (Uber-Style, Interview Walkthrough)
Designing ride hailing is about real-time decisions under uncertainty: live locations, fast matching, ETA quality, and failure-safe trip lifecycle. The goal is to build a correct trip pipeline first, then harden dispatch latency, geospatial scale, and reliability.
---
0) Pre-Design Research Inputs (done before architecture)
Before choosing components, we anchor on external behavior constraints:
- WebSocket protocol (RFC 6455) supports persistent bi-directional updates for driver/rider location streams.
- Uber H3 describes hierarchical hex indexing for efficient proximity grouping.
- Redis GEOSEARCH provides fast radius search semantics on indexed coordinates.
- PostGIS KNN / ST_DWithin supports nearest-neighbor + radius filtering on relational GIS data.
- HTTP 202 semantics (RFC 7231) fit async workflows where processing completes later.
How this shapes our design:
- Use persistent realtime channels for location updates.
- Use geospatial cell indexing to reduce candidate search space.
- Use async trip state transitions where provider dependencies are not instantaneous.
---
1) Requirements (~5 min)
Now that research constraints are known, let us pin down scope so the design stays focused.
Functional requirements (top 4)
- Rider can request a ride with pickup/drop.
- Nearby drivers are matched and one is assigned.
- Both sides can see realtime trip/location updates.
- Trip lifecycle is tracked (REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED).
Non-functional requirements
- Match decision p95 < 2s in hot cities.
- High availability for dispatch and trip state updates.
- Location ingestion at very high write rate.
- Idempotent request/accept flows to avoid duplicate trips.
- Graceful handling for driver disconnects/location lag.
Scope control
- In scope: matching, location stream, trip state, basic ETA pipeline.
- Out of scope: deep fraud scoring, full ML surge model, payments ledger internals.
---
2) Capacity Estimation (decision-driving)
With requirements fixed, let us quantify load to force architecture decisions.
Assumptions
- DAU: 30M
- Daily trips: 25M
- Peak trip requests: ~30k req/s (city/event bursts)
- Active online drivers at peak globally: 1.5M
- Driver location updates: every 3 seconds
Derived load
- Driver location writes: 1.5M / 3 ~= 500k updates/s
- Trip create attempts peak: ~30k/s
- Match lookups peak: same order as trip requests, but each involves a geo candidate search
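The derived numbers above are simple arithmetic on the stated assumptions; a quick sanity check:

```python
# Back-of-envelope check of the derived load numbers above.
# Inputs are the assumptions stated in this section.

active_drivers = 1_500_000      # online drivers at peak
update_interval_s = 3           # location ping cadence (every 3 seconds)
peak_trip_requests_s = 30_000   # ~30k req/s during bursts

location_writes_per_s = active_drivers // update_interval_s
print(location_writes_per_s)    # 500000 updates/s

# Location writes dwarf trip requests by more than an order of magnitude,
# which is what pushes the location path toward an in-memory store.
write_ratio = location_writes_per_s // peak_trip_requests_s
print(write_ratio)              # ~16x more location writes than trip requests
```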
What this forces
- Location path is write-heavy and must be in-memory/realtime friendly.
- Matching must avoid global nearest-neighbor scans.
---
2.1) Storage + Ops Estimation
Now estimate state footprint so we avoid accidental bottlenecks.
Assumptions:
- Driver session record (status, cell, coords, last heartbeat): ~200B
- Active driver states: 1.5M -> ~300MB raw (before replication/overhead)
- Trip row average: ~1KB (state transitions + pointers + metadata)
Trip storage:
- 25M trips/day * 1KB = ~25GB/day raw
- 365 days of raw operational store ~9TB before replication
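The storage figures follow directly from the assumptions; worked out:

```python
# Back-of-envelope storage math for the figures above.
trips_per_day = 25_000_000
trip_row_bytes = 1_000                    # ~1KB per trip row
active_drivers = 1_500_000
driver_record_bytes = 200                 # ~200B per driver session record

daily_gb = trips_per_day * trip_row_bytes / 1e9       # trip rows per day
yearly_tb = daily_gb * 365 / 1e3                      # raw yearly store
driver_state_mb = active_drivers * driver_record_bytes / 1e6  # hot state

print(daily_gb)          # 25.0 GB/day
print(round(yearly_tb, 1))   # ~9.1 TB/year before replication
print(driver_state_mb)   # 300.0 MB of hot driver state
```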
Implication:
- Hot operational state (drivers + active trips) should be separated from cold historical analytics.
---
3) Core Entities v1 (~2 min)
Now that load patterns are clear, define entities around dispatch and state transitions.
- Driver(driverId, vehicleType, status, rating, currentCell, lastLocTs)
- Rider(riderId, rating, defaultPaymentRef?)
- RideRequest(requestId, riderId, pickup, dropoff, requestedAt, status)
- Trip(tripId, requestId, riderId, driverId, state, createdAt, updatedAt)
- DriverLocation(driverId, lat, lon, speed, heading, ts, cellId)
- DispatchAttempt(requestId, candidateDriverId, attemptNo, outcome, ts)
Thought process:
- DriverLocation is high-churn realtime state.
- Trip is the authoritative lifecycle entity.
- DispatchAttempt helps with tuning and debugging failed/slow matches.
Functional requirement traceability:
- FR1 (rider requests ride) -> RideRequest + Trip track creation and lifecycle transition.
- FR2 (nearby driver match) -> DriverLocation, Driver.currentCell, and DispatchAttempt support candidate search and offer outcome tracking.
- FR3 (realtime updates) -> DriverLocation and DriverSession feed location/trip events.
- FR4 (trip lifecycle) -> Trip.state + Trip.updatedAt + DispatchAttempt provide auditable state progression.
Why this mapping matters:
- We are not adding entities as generic tables; each entity exists because one or more functional requirements need a specific query or state transition.
---
4) API / System Interface (~5 min)
With entities in place, define contracts that map directly to trip flow.
Rider APIs
- POST /v1/ride-requests (idempotency key required)
- GET /v1/trips/{tripId}
- POST /v1/trips/{tripId}/cancel
Driver APIs
- POST /v1/driver/location (or websocket stream event)
- POST /v1/dispatch/{requestId}/accept
- POST /v1/dispatch/{requestId}/reject
Realtime channels
- WebSocket topics/events:
  - tripUpdate
  - driverLocationUpdate
  - dispatchOffer
Parameter reasoning:
- idempotencyKey on ride request avoids duplicate trip creation on mobile retries.
- requestId on dispatch accept/reject ties the decision to the exact offer instance.
- ts on location updates enables stale update rejection.
Error semantics:
- 409 Conflict when the request is already matched/cancelled.
- 202 Accepted for async workflows where the final match completes shortly after enqueue.
- 429 for throttling abusive clients/devices.
Functional requirement to API mapping:
- FR1 (request ride) -> POST /v1/ride-requests with idempotency key to prevent duplicate trips on retries.
- FR2 (match driver) -> internal dispatch flow uses request context and POST /v1/dispatch/{requestId}/accept|reject for candidate resolution.
- FR3 (realtime updates) -> websocket events tripUpdate, driverLocationUpdate, dispatchOffer keep rider/driver clients in sync.
- FR4 (lifecycle visibility) -> GET /v1/trips/{tripId} and the cancel endpoint expose current state safely.
Why this mapping matters:
- Each endpoint is justified by a requirement. If an endpoint does not map to a requirement, it should not be in the interview scope.
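The idempotency-key requirement on POST /v1/ride-requests can be sketched as a lookup on (riderId, idempotencyKey) before creating a new request. This is a minimal in-memory sketch; a real deployment would back it with a unique index in the transactional trip store. The class and method names are illustrative, not part of any real API:

```python
import uuid

class RideRequestStore:
    """Minimal in-memory sketch of idempotent ride-request creation.
    A production system would enforce this with a unique constraint on
    (riderId, idempotencyKey) inside the transactional trip store."""

    def __init__(self):
        self._by_key = {}  # (rider_id, idempotency_key) -> request_id

    def create(self, rider_id, idempotency_key, pickup, dropoff):
        key = (rider_id, idempotency_key)
        if key in self._by_key:
            # Mobile retry of an earlier request: return the same
            # request instead of creating a duplicate trip.
            return self._by_key[key], False
        request_id = str(uuid.uuid4())
        self._by_key[key] = request_id
        return request_id, True
```

A retried call with the same key returns the original requestId, so client retries never fan out into duplicate trips.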
---
5) High-Level Design (progressive)
Now build architecture in sequence: correctness first, then match speed, then resilience.
Step A: Minimal trip creation and assignment baseline
Rider calls POST /v1/ride-requests with pickup/dropoff and idempotencyKey. The service creates a RideRequest, transitions it through REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED, and returns trip state via GET /v1/trips/{tripId}. Driver receives dispatchOffer via WebSocket and responds with POST /v1/dispatch/{requestId}/accept or reject. The question is: how do we ensure exactly one valid trip state when request, cancel, and accept can race?
Options we could pick:
- Option A: eventually-consistent key-value writes for lifecycle.
- Option B: transactional lifecycle store with explicit state transitions.
Why Option A is weaker here:
- it scales well for append traffic, but request/accept/cancel races are harder to enforce safely.
Decision:
- choose a transactional DB for RideRequest and Trip lifecycle transitions.
What this decision unlocks:
- strict state machine checks (REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED)
- clean conflict handling (409) for late/duplicate transitions
- idempotent ride-create semantics under client retries.
Components for this step:
- Ride API service
- Trip store (transactional DB)
- Basic dispatch service
How with numbers:
- at 30k req/s request peaks, even a small race rate creates many conflicting transitions per second.
- strict transactional checks reduce illegal state writes at the cost of higher coordination overhead.
Tradeoff:
- more write coordination than pure append stores, but this is the correct tradeoff for lifecycle correctness.
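The strict state machine from Step A can be sketched as a transition table plus a guarded update. This is a simplified in-process sketch; in the real system the check runs inside a database transaction, and the `ALLOWED` table and `stateVersion` field are assumptions based on the lifecycle named above:

```python
# Legal lifecycle transitions: REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED.
ALLOWED = {
    "REQUESTED": {"MATCHED", "CANCELLED"},
    "MATCHED":   {"ON_TRIP", "CANCELLED"},
    "ON_TRIP":   {"COMPLETED", "CANCELLED"},
}

def transition(trip, new_state):
    """Apply a lifecycle transition only if it is legal from the current state.
    Returns False for illegal transitions, which the API layer would map
    to a 409 Conflict. In production this runs as a conditional UPDATE
    inside one transaction."""
    if new_state not in ALLOWED.get(trip["state"], set()):
        return False
    trip["state"] = new_state
    trip["stateVersion"] = trip.get("stateVersion", 0) + 1
    return True
```

A late duplicate accept that tries MATCHED again, or a cancel arriving after COMPLETED, simply fails the guard instead of corrupting the trip.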
Step B: Geospatial candidate discovery at scale
When POST /v1/ride-requests is received, the dispatch service must find nearby available drivers. Drivers continuously stream POST /v1/driver/location updates with lat, lon, and ts, which are indexed in a geospatial store. The service queries for candidates within a radius of the pickup point. The question is: how do we find nearby drivers quickly without scanning all online drivers?
Options we could pick:
- Option A: brute-force nearest search across all online drivers.
- Option B: geospatial cell index, then local ring/radius candidate search.
Why Option A is weaker here:
- with ~500k location updates/s and ~30k dispatch req/s, brute-force nearest search is computationally wasteful and latency-unstable.
Decision:
- use cell-based indexing (H3/S2) and search local cells + neighboring rings before ranking.
What this decision unlocks:
- bounded candidate set for every request,
- predictable dispatch latency in dense and sparse areas,
- scalable matching pipeline.
Components added:
- Geospatial index service (H3/S2 cell mapping)
- Hot driver location store (Redis geo or equivalent)
How with numbers:
- candidate pool shrinks from broad city-level sets to tens/hundreds before scoring.
- this keeps match latency within budget while preserving enough options for quality matching.
Tradeoff:
- requires careful cell-size/ring tuning near boundaries.
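The cell-plus-neighbor-rings idea can be shown with a toy square grid instead of real H3/S2 hexagons (the mechanics of bounding the candidate set are the same). The cell size and class names here are illustrative assumptions:

```python
from collections import defaultdict

CELL_DEG = 0.01  # ~1.1 km of latitude per cell; toy square grid, not H3/S2

def cell_of(lat, lon):
    """Map coordinates to a discrete grid cell id."""
    return (int(lat // CELL_DEG), int(lon // CELL_DEG))

class DriverIndex:
    """Sketch of a hot driver location index keyed by cell."""

    def __init__(self):
        self.cells = defaultdict(set)  # cell -> driver ids
        self.where = {}                # driver id -> current cell

    def update(self, driver_id, lat, lon):
        # Move the driver out of its previous cell so stale entries
        # do not linger in the index.
        old, new = self.where.get(driver_id), cell_of(lat, lon)
        if old is not None and old != new:
            self.cells[old].discard(driver_id)
        self.cells[new].add(driver_id)
        self.where[driver_id] = new

    def candidates(self, lat, lon, rings=1):
        """Drivers in the pickup cell plus `rings` neighboring rings --
        a bounded candidate set instead of a global nearest scan."""
        cx, cy = cell_of(lat, lon)
        found = set()
        for dx in range(-rings, rings + 1):
            for dy in range(-rings, rings + 1):
                found |= self.cells.get((cx + dx, cy + dy), set())
        return found
```

With real H3/S2 cells the lookup shape is the same: resolve the pickup cell, union the driver sets of the cell and its neighbor rings, then rank.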
Step C: Realtime delivery and dispatch offers
After candidates are ranked, the dispatch service sends a dispatchOffer event to the top driver via WebSocket. The driver must respond with POST /v1/dispatch/{requestId}/accept or reject within the offer timeout. If timeout/reject occurs, the service offers to the next candidate. The question is: how do we deliver offers and receive responses fast enough to meet the match SLA?
Options we could pick:
- Option A: polling-based offer retrieval.
- Option B: push offers on persistent realtime channels.
Why Option A is weaker here:
- polling intervals add fixed delay; in mobile networks this quickly consumes dispatch budget.
Decision:
- use websocket push for dispatch offers with short accept timeout and fast fallback to next candidate.
What this decision unlocks:
- lower round-trip overhead,
- faster re-offer path on timeout/reject,
- better match success under tight SLA.
Components added:
- WebSocket gateway cluster
- Offer queue/stream
- Driver session registry
How with numbers:
- if the polling interval is 1s, two poll cycles can already consume most of a <2s match target.
- the push model removes this idle wait and keeps the offer loop reactive.
Tradeoff:
- session lifecycle/reconnect handling becomes a core operational concern.
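The offer loop from Step C reduces to: push to the top-ranked driver, wait for accept/reject within the timeout, and fall through to the next candidate. This sketch abstracts the websocket round trip behind a callable so the control flow stands alone; the function names are illustrative:

```python
def dispatch(request_id, ranked_candidates, try_offer, offer_timeout_s=5.0):
    """Offer the request to candidates in rank order, falling through on
    timeout or reject. `try_offer(driver_id, request_id, timeout_s)` stands
    in for the websocket push + accept wait and returns True only when the
    driver accepts within the timeout."""
    for attempt_no, driver_id in enumerate(ranked_candidates, start=1):
        if try_offer(driver_id, request_id, offer_timeout_s):
            return driver_id, attempt_no
    # No candidate accepted: caller widens the search or fails the request.
    return None, len(ranked_candidates)
```

The fast fallback is the key property: a reject or dead socket costs one bounded timeout, not an open-ended wait.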
Step D: Reliability hardening and degraded modes
The dispatch flow depends on fresh DriverLocation data, healthy WebSocket delivery, and ETA service responses. When POST /v1/driver/location updates stop arriving (stale drivers), or WebSocket dispatchOffer delivery fails, or ETA calls timeout, the system must continue functioning. The question is: how do we keep dispatch available when dependencies are degraded?
Options we could pick:
- Option A: hard dependency chain (if one dependency fails, matching fails).
- Option B: graceful degradation with bounded quality loss.
Why Option A is weaker here:
- partial outages become full dispatch outages.
Decision:
- degrade gracefully with stale-driver quarantine, adaptive radius widening, and fallback ETA confidence bands.
What this decision unlocks:
- dispatch availability remains high during partial failures,
- quality drops temporarily instead of complete service interruption.
Components added:
- Stale-location detector
- Fallback matching mode
- Replay/reconciliation workers for stuck trips
How with numbers:
- if ~5% driver sessions become stale in a region, freshness filtering plus fallback search prevents cascading timeout loops.
Tradeoff:
- temporary drop in ETA precision and match optimality during degraded operation.
---
Core Entities v2 (after evolution)
After architecture hardening, update data model to reflect operational realities:
- Driver(driverId, status, currentCell, lastLocTs, staleFlag, regionId)
- DispatchAttempt(requestId, candidateDriverId, rankScore, offerTs, expiryTs, outcome)
- Trip(tripId, state, stateVersion, assignedAt, pickupEtaSec, cancelReason?)
- DriverSession(driverId, socketId, serverId, lastHeartbeatTs, connState)
- CellSupply(cellId, availableDrivers, demandRequests, updatedAt) (for surge/dispatch balancing)
Why changed:
- Added stateVersion for optimistic transition safety.
- Added staleFlag and expiryTs to control invalid offers and timeout loops.
- Added a cell-level supply snapshot for fast balancing and surge inputs.
---
6) Deep Dives (numeric + mechanism)
With full design in place, now stress-test the highest-risk areas in order.
Deep Dive 1: Candidate search quality vs speed
Now that matching exists, first verify whether geospatial search is both fast and accurate.
Bad:
- One fixed tiny radius search and immediate failure if no driver is found.
- Why bad: this behaves okay only in dense downtown cells; suburban and airport edges fail often.
- Example: rider requests from low-density area where nearest driver is 2.8 km away, but system only searches 1 km radius.
- How technically:
  - dispatch returns "no drivers" too early,
  - clients retry repeatedly,
  - request load rises while completion rate drops.
- Why not acceptable:
- we create self-inflicted retry traffic and poor rider conversion.
Good:
- Adaptive radius expansion (ring 1 -> ring 2 -> ring 3) with timeout budget.
- Why good:
  - keeps the fast path in dense cells,
  - broadens search only when needed.
- Example:
- try ring-1 for 200ms, ring-2 for next 300ms, ring-3 until 1.5s total budget.
- How technically:
- bounded progressive search avoids both under-search and unbounded scans.
Great:
- Adaptive radius + supply-aware scoring by cell.
- Example: if local supply low, widen ring and prioritize shorter ETA confidence.
- How technically:
  - combine distance with freshness and an expected pickup time score,
  - prune candidates with stale heartbeat or poor acceptance history.
- Why this solves:
- improves both match success and ETA realism under uneven supply.
- Tradeoff: more tuning complexity.
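The "Good" variant above (ring-1 for 200ms, ring-2 for 300ms, ring-3 until a 1.5s total budget) can be sketched as a time-budgeted loop. The ring search is abstracted behind a callable so the sketch stands alone; budgets mirror the example and are tuning assumptions:

```python
import time

def progressive_search(search_ring, ring_budgets_ms=(200, 300, 1000)):
    """Adaptive ring expansion with a time budget. `search_ring(r)` is any
    callable returning the candidate set within r rings of the pickup cell
    (e.g. a cell-index lookup). Rings widen only while the cumulative
    budget (200ms + 300ms + 1000ms ~= 1.5s total) allows."""
    deadline = time.monotonic()
    for ring, budget_ms in enumerate(ring_budgets_ms, start=1):
        deadline += budget_ms / 1000.0
        candidates = search_ring(ring)
        if candidates:
            return ring, candidates
        if time.monotonic() > deadline:
            break  # budget spent: stop rather than scan unboundedly
    return 0, set()
```

Dense cells return from ring 1 almost immediately; sparse areas get wider rings without ever exceeding the dispatch latency budget.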
Deep Dive 2: Stale location handling
Now we evaluate location freshness, because bad location data causes wrong dispatch.
Bad:
- Use last location indefinitely.
- Why bad: disconnected/offline driver still gets offers.
- Example:
  - driver app goes background or loses network, last ping remains in the index,
  - dispatch keeps offering rides to this phantom "nearby" driver.
- How technically:
  - offer timeout chain grows,
  - rider wait time and cancellation probability increase.
- Why not acceptable:
- stale data poisons candidate quality and lowers completion rate.
Good:
- Mark stale if now - lastLocTs exceeds a threshold (e.g., 10-15s).
- Why good:
- quick filter removes obviously stale candidates before offer.
- Example:
- any driver older than 12s heartbeat is removed from eligible set.
Great:
- Freshness score in ranking + heartbeat gating for offer eligibility.
- How with numbers:
- if 500k updates/s ingest drops in a region, freshness gating prevents false positives from stale entries.
- Additional mechanism:
- keep hard cutoff for eligibility plus soft freshness penalty in ranking.
- Tradeoff:
- overly strict thresholds can reduce supply visibility in weak-network zones.
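The "Great" variant combines a hard eligibility cutoff with a soft freshness penalty in ranking. A minimal sketch, assuming the 12s cutoff from the example and a linear penalty (the exact penalty curve is a tuning assumption):

```python
STALE_CUTOFF_S = 12.0  # hard gate from the example: older heartbeats are ineligible

def eligible(now, last_loc_ts):
    """Hard gate: drivers past the cutoff never receive offers."""
    return (now - last_loc_ts) <= STALE_CUTOFF_S

def freshness_score(now, last_loc_ts):
    """Soft penalty in [0, 1]: among eligible drivers, newer pings rank
    higher, so a 2s-old location beats a 10s-old one at equal distance."""
    age = max(0.0, now - last_loc_ts)
    return max(0.0, 1.0 - age / STALE_CUTOFF_S)
```

The gate protects correctness (no offers to phantom drivers); the score protects quality (fresher positions win ties).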
Deep Dive 3: Offer acceptance race
Next, test the concurrency edge where multiple drivers may accept same request.
Bad:
- First accept wins without atomic guard.
- How: duplicate acceptance can create inconsistent trip assignment.
- Example:
  - two offers delivered nearly simultaneously due to retry/rebalance,
  - both drivers tap accept within 100ms.
- Why bad:
- without atomic state transition, both accept paths may proceed.
- Consequence:
- double assignment, cancellation churn, and support incidents.
Good:
- Atomic compare-and-set on request/trip state (REQUESTED -> MATCHED once).
- Why good:
- only one write can successfully transition state.
- Example:
- first accept updates state version; second receives conflict and is rejected.
Great:
- CAS + offer token + expiry window.
- Why: only valid non-expired offer can transition state.
- How technically:
  - token ties the accept call to a specific dispatch attempt,
  - an expired token blocks late accepts after reassignment.
- Why this solves:
- protects against out-of-order network delivery and delayed clients.
- Tradeoff: extra state tracking.
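The "Great" variant (CAS + offer token + expiry) reduces to three guards before the state write. This in-process sketch stands in for a conditional UPDATE inside one transaction; the field names follow the v2 entities above:

```python
def accept_offer(trip, offer_token, now, expected_version):
    """Guarded accept: succeeds only for the exact, live, unexpired offer.
    In production this is a single transactional conditional UPDATE on the
    trip row; here a dict stands in for that row."""
    if trip["offerToken"] != offer_token:
        return False  # accept tied to a different dispatch attempt
    if now > trip["offerExpiryTs"]:
        return False  # late accept after the offer expired/was reassigned
    if trip["stateVersion"] != expected_version:
        return False  # lost the CAS race to a concurrent accept
    trip["state"] = "MATCHED"
    trip["stateVersion"] += 1
    return True
```

Two near-simultaneous accepts both pass the token and expiry checks, but only the first wins the version CAS; the second sees a stale version and gets a 409.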
Deep Dive 4: Realtime channel failure
Now evaluate what happens when websocket delivery degrades.
Bad:
- No fallback when driver socket is disconnected.
- Why bad:
- dispatch waits for timeout before moving to next candidate, burning dispatch budget.
- Example:
- driver socket dropped; offer never reaches device; rider waits full timeout repeatedly.
- How technically:
- each failed attempt consumes offer TTL and increases p95 match latency.
Good:
- Detect failed push and move to next candidate quickly.
- Why good:
- reduces wasted time on unreachable candidates.
- Example:
- if no delivery ack in 300ms, fail fast to next ranked driver.
Great:
- Multi-channel fallback (websocket + push wakeup + re-offer queue) with bounded retries.
- How technically:
  - websocket primary,
  - push wakeup for a sleeping app,
  - re-offer queue prevents infinite loops.
- Why this solves:
- keeps dispatch alive across mobile network churn.
- Tradeoff: more moving parts, but higher dispatch success under network churn.
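The multi-channel fallback with bounded retries can be sketched as an ordered walk over delivery channels with a hard attempt cap. Channel send functions are abstracted as callables returning a delivery ack; the names and cap are illustrative assumptions:

```python
def deliver_offer(driver_id, offer, channels, max_attempts=3):
    """Bounded multi-channel delivery: try channels in priority order
    (e.g. websocket, then push wakeup, then a re-offer queue), stopping
    after `max_attempts` total tries. `channels` is an ordered list of
    (name, send_fn) pairs where send_fn returns True on a delivery ack."""
    attempts = 0
    for name, send in channels:
        if attempts >= max_attempts:
            break
        attempts += 1
        if send(driver_id, offer):
            return name  # channel that delivered the offer
    return None  # undeliverable: caller fails fast to the next ranked driver
```

The attempt cap is what keeps this from becoming an infinite loop: an unreachable driver costs a bounded number of sends, then dispatch moves on.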
Deep Dive 5: Multi-region dispatch
Finally, verify region behavior for latency and resilience.
Bad:
- Global single-region dispatch.
- Why bad: high RTT and large outage blast radius.
- Example:
- APAC rider and driver matched through US control plane during peak.
- How technically:
  - extra round trips inflate offer latency and timeout risk,
  - a regional outage can pause global matching.
Good:
- Region-local dispatch control plane.
- Why good:
- keeps request, candidate search, and offer loop close to users.
- Example:
- city-region dispatch can keep matching decisions within local latency envelope.
Great:
- Region-local matching + async cross-region failover only when local supply exhausted.
- How technically:
  - default local dispatch,
  - cross-region expansion only after bounded local attempts.
- Why this solves:
- preserves low latency first and uses failover as controlled fallback.
- Tradeoff: more operational complexity and cross-region consistency decisions.
---
7) Common mistakes
- Treating geospatial match as simple nearest-point query without candidate/ranking pipeline.
- No stale-location filtering.
- Missing idempotency for ride request and driver accept.
- No atomic state transition guard for accept race.
- Over-coupling ETA/ranking dependencies to dispatch availability.
---
8) What interviewer is evaluating
- Did you separate high-churn realtime state from authoritative trip lifecycle state?
- Did you map throughput numbers to matching/indexing choices?
- Did you handle offer race and stale-driver failure modes explicitly?
- Did you design graceful degraded modes instead of brittle hard dependencies?
---
9) References
- RFC 6455 WebSocket: https://www.rfc-editor.org/rfc/rfc6455
- Uber H3 overview: https://www.uber.com/en-CA/blog/h3/
- H3 docs: https://uber.github.io/h3/
- Redis GEOSEARCH docs: https://redis.io/docs/latest/commands/geosearch/
- PostGIS KNN/ST_DWithin guides: https://www.postgis.net/workshops/postgis-intro/knn.html
- RFC 7231 (202 Accepted): https://www.rfc-editor.org/rfc/rfc7231.html
Key Takeaways
1. Rider can request a ride with pickup/drop.
2. Nearby drivers are matched and one is assigned.
3. Both sides can see realtime trip/location updates.
4. Trip lifecycle is tracked (`REQUESTED -> MATCHED -> ON_TRIP -> COMPLETED/CANCELLED`).
5. Match decision p95 < 2s in hot cities.