High-Velocity CQRS Bidding Service — Design Document
1. Overview
This document describes the design of a high-velocity bidding API built as a learning project to explore the Command Query Responsibility Segregation (CQRS) pattern under realistic production load.
The system separates write operations (bid placements) from read operations (highest bid queries) to allow each path to be optimised independently. Redis sits at the core of both paths for speed. PostgreSQL serves as the durable source of truth, written to asynchronously. RabbitMQ decouples persistence and real-time notification from the API's critical path. A Vue 3 frontend receives live bid updates over WebSockets.
2. Goals and Requirements
Functional Requirements
| # | Requirement |
|---|---|
| F1 | Clients can place bids on items via a REST API. |
| F2 | A bid is rejected if it does not exceed the current highest bid for the item. |
| F3 | Clients can query the current highest bid for any item. |
| F4 | Connected browser clients receive real-time updates when a new bid is placed. |
| F5 | All accepted bids are durably persisted in PostgreSQL. |
Non-Functional Requirements
| # | Requirement | Target |
|---|---|---|
| N1 | Write throughput | 50–100 bids/second |
| N2 | Read throughput | 500 queries/second |
| N3 | Read latency | Sub-millisecond (P99) |
| N4 | Bid validation | Atomic (no race conditions between concurrent bids) |
| N5 | Persistence | At-least-once delivery; idempotent inserts |
| N6 | Real-time delivery | No duplicate notifications to browser on retry |
3. System Architecture
3.1 CQRS Split
The architecture enforces a strict CQRS boundary:
- Command path: POST /api/v1/bids validates and records a bid atomically in Redis, then publishes an event. The response is 202 Accepted; Postgres is not involved.
- Query path: GET /api/v1/items/:id/bids/highest reads directly from Redis. Postgres is not involved unless Redis is cold (cache miss recovery, described in §3.4).
This split keeps Postgres off both hot paths: bid validation runs at Redis write speed, and queries run at memory-read speed.
3.2 Write Path (Command)
Client
│
▼ POST /api/v1/bids
┌─────────────────┐
│ bidding-api │
│ (Express 5) │
└────────┬────────┘
│ Lua script (atomic)
▼
┌─────────┐
│ Redis │ ← single source of bid state
└────┬────┘
│ PUBLISH bid.placed
▼
┌──────────────┐
│ RabbitMQ │ (bids_exchange, topic)
└──┬───────────┘
│
├──── bid.placed ──▶ bids_persistence_queue ──▶ bidding-worker ──▶ PostgreSQL
│
└──── bid.placed ──▶ bids_realtime_queue ──▶ bidding-gateway ──▶ Socket.io ──▶ Browser
Step-by-step:
1. The client sends POST /api/v1/bids with { itemId, userId, amount }.
2. The API executes a Redis Lua script atomically:
   - Reads the current highest bid for the item.
   - If amount <= currentHighest, returns a rejection; the API responds 409 Conflict.
   - If valid, writes the new highest bid to Redis.
3. The API publishes a BidPlaced event to the RabbitMQ topic exchange (bids_exchange) with routing key bid.placed.
4. The API returns 202 Accepted to the client. The response does not wait for Postgres.
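The command flow above can be sketched as a single function with its Redis and RabbitMQ dependencies injected. This is a minimal illustration, not the service's actual handler: the names `placeBid`, `CommandDeps`, `runBidScript`, and `publish` are hypothetical, and generating `createdAt` in the API before the script runs is an assumption.

```typescript
// Hypothetical shapes; only the status codes and routing key come from the steps above.
type BidEvent = { itemId: number; userId: number; amount: number; createdAt: string };

interface CommandDeps {
  // Runs the atomic Redis Lua script; resolves true if the bid became the new highest.
  runBidScript(event: BidEvent): Promise<boolean>;
  // Publishes the event to bids_exchange with the given routing key.
  publish(routingKey: string, event: BidEvent): Promise<void>;
}

async function placeBid(
  bid: { itemId: number; userId: number; amount: number },
  deps: CommandDeps,
  now: () => Date = () => new Date(),
): Promise<{ status: 202 | 409; event?: BidEvent }> {
  // Assumption: the timestamp is assigned once here so Redis and the event agree.
  const event: BidEvent = { ...bid, createdAt: now().toISOString() };

  const accepted = await deps.runBidScript(event);
  if (!accepted) {
    return { status: 409 }; // amount did not exceed the current highest bid
  }

  // Postgres is never touched on this path; persistence happens behind the exchange.
  await deps.publish("bid.placed", event);
  return { status: 202, event };
}
```

The sketch does not cover a publish failure after the Redis write has succeeded; the steps above do not describe that case.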
3.3 Async Persistence Path
bids_persistence_queue
│
▼
bidding-worker
(RabbitMQ consumer)
│
├── success ──▶ INSERT INTO bids (ON CONFLICT DO NOTHING) ──▶ ack
│
└── failure ──▶ retry (up to 3×, x-retry-count header)
│
└── max retries exceeded ──▶ nack
│
▼
bids_dlx (fanout)
│
▼
bids_dead_letter_queue
│
▼
bidding-dlq-worker
(exponential backoff:
1s → 2s → 4s → … → 60s cap,
retries until Postgres recovers)
Key properties:
- Retry routing key: Retried messages use bid.persist.retry (not bid.placed), and bids_realtime_queue is not bound to this key. Persistence retries therefore never reach the browser, however many occur.
- Idempotency: The INSERT uses ON CONFLICT DO NOTHING against the unique constraint on (item_id, created_at). Any number of retries produces exactly one row, provided created_at is taken from the bid event rather than the column default (NOW() would differ on every attempt).
- Dead-letter preservation: Messages that exhaust all retries are never dropped; they accumulate in bids_dead_letter_queue, where the DLQ worker keeps retrying them until Postgres recovers.
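The retry rule in the diagram (up to 3 retries tracked in an x-retry-count header, then dead-lettering) can be expressed as a small pure function. The names `decideRetry` and `RetryDecision` are illustrative and not taken from the worker's source; treating a missing or malformed header as zero retries is an assumption.

```typescript
const MAX_RETRIES = 3; // "up to 3x" in the persistence diagram
const RETRY_ROUTING_KEY = "bid.persist.retry";

type RetryDecision =
  | { action: "republish"; routingKey: string; headers: { "x-retry-count": number } }
  | { action: "dead-letter" };

// Decides what to do with a message whose INSERT has just failed.
function decideRetry(headers: Record<string, unknown> | undefined): RetryDecision {
  const raw = headers?.["x-retry-count"];
  // Assumption: anything other than a non-negative integer counts as "no retries yet".
  const retryCount = typeof raw === "number" && Number.isInteger(raw) && raw >= 0 ? raw : 0;

  if (retryCount >= MAX_RETRIES) {
    // The caller nacks without requeue so the broker routes the message to bids_dlx.
    return { action: "dead-letter" };
  }
  return {
    action: "republish",
    routingKey: RETRY_ROUTING_KEY, // never bid.placed, so the realtime queue stays quiet
    headers: { "x-retry-count": retryCount + 1 },
  };
}
```

Keeping the decision separate from the broker calls makes the retry limit easy to unit-test without a running RabbitMQ instance.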
3.4 Read Path (Query) and Cache Miss Recovery
Happy path: GET /api/v1/items/:id/bids/highest reads the item's current highest bid from Redis. No Postgres query; latency is sub-millisecond.
Cache miss path (thundering herd prevention):
When Redis is cold (e.g., after a restart that did not restore the key from the AOF), multiple concurrent requests for the same item all miss and hit Postgres simultaneously. This is the thundering herd problem.
Mitigation:
1. On a cache miss, the API attempts to acquire a Redis distributed lock (SET NX, 5 s TTL) for the item key.
2. The lock winner queries Postgres and populates Redis.
3. All other concurrent requests poll Redis every 50 ms for up to 2 s, then respond with the populated value.

As long as the lock winner finishes within the 5 s lock TTL, only one Postgres query is issued per item per cold-start event.
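The mitigation can be sketched against a small cache interface so that the control flow is visible without a Redis client. Everything named here (`CacheLike`, `setIfAbsent`, `getHighestBid`, `loadFromPostgres`) is hypothetical, the cached value is simplified to a string, and returning null when the poll window expires is an assumption because the document does not say what happens in that case.

```typescript
interface CacheLike {
  get(key: string): Promise<string | null>;
  // Models SET key value NX PX ttlMs: resolves true only for the caller that created the key.
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
  set(key: string, value: string): Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function getHighestBid(
  itemId: number,
  cache: CacheLike,
  loadFromPostgres: (itemId: number) => Promise<string | null>,
  pollMs = 50,
  maxWaitMs = 2000,
): Promise<string | null> {
  const key = `bid:highest:${itemId}`;
  const cached = await cache.get(key);
  if (cached !== null) return cached; // happy path: no lock, no Postgres

  // The lock is left to expire via its 5 s TTL rather than being released explicitly.
  const wonLock = await cache.setIfAbsent(`bid:lock:${itemId}`, "token", 5000);
  if (wonLock) {
    const fromDb = await loadFromPostgres(itemId);
    if (fromDb !== null) await cache.set(key, fromDb);
    return fromDb;
  }

  // Lock losers poll the cache instead of querying Postgres themselves.
  const deadline = Date.now() + maxWaitMs;
  while (Date.now() < deadline) {
    await sleep(pollMs);
    const value = await cache.get(key);
    if (value !== null) return value;
  }
  return null; // assumption: the timeout behaviour is not specified in the source
}
```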
3.5 Real-Time Notification Path
bids_realtime_queue
│
▼
bidding-gateway (port 3001)
(RabbitMQ consumer + Socket.io server)
│
▼ io.emit('bid_updated', bid)
All connected browser clients (Vue 3 dashboard)
The gateway is intentionally decoupled from the API:
- The API does not manage WebSocket connections, keeping it stateless under high write load.
- The gateway can be restarted independently without affecting bid acceptance.
- Additional consumers (analytics, fraud detection) can bind to bid.placed with zero changes to the publisher.
3.6 Redis Warm-Up on API Startup
Before the API begins accepting traffic, it runs a warm-up query against Postgres to pre-populate Redis with the current highest bid per item:
SELECT DISTINCT ON (item_id)
item_id, user_id, amount, created_at
FROM bids
ORDER BY item_id, amount DESC;
This ensures correct reads immediately after a Redis restart, complementing Redis AOF persistence (which restores state after a clean restart but can miss the most recent writes after a crash, depending on the appendfsync policy).
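As a rough sketch of the second half of the warm-up, the rows returned by the query can be mapped to one Redis hash write per item, using the bid:highest:{itemId} key pattern from §5.2. The row and output types and the function name `toHashWrites` are hypothetical, and receiving amount as a string is an assumption about the database driver.

```typescript
// Assumed row shape for the warm-up query result.
interface WarmUpRow {
  item_id: number;
  user_id: number;
  amount: string; // assumption: the driver returns NUMERIC as a string
  created_at: Date;
}

interface HashWrite {
  key: string;
  fields: Record<string, string>;
}

// Pure mapping step: one hash write per item, ready to be sent to Redis.
function toHashWrites(rows: WarmUpRow[]): HashWrite[] {
  return rows.map((row) => ({
    key: `bid:highest:${row.item_id}`,
    fields: {
      userId: String(row.user_id),
      amount: row.amount,
      createdAt: row.created_at.toISOString(),
    },
  }));
}
```

Each resulting entry corresponds to one hash write with the listed fields; how the writes are batched is left to the Redis client in use.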
3.7 RabbitMQ Routing Key Design
| Routing Key | Bound Queues | Purpose |
|---|---|---|
| bid.placed | bids_persistence_queue, bids_realtime_queue | New accepted bids |
| bid.persist.retry | bids_persistence_queue only | Persistence retries (invisible to UI) |
| (dead-lettered) | bids_dead_letter_queue (via bids_dlx fanout) | Failed messages awaiting DLQ worker |
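The routing table can be checked with a small in-memory model of topic-exchange matching. This is an illustrative simulation rather than broker code; the helper names `topicMatches` and `queuesFor` are made up for the example.

```typescript
// Bindings on bids_exchange from the table above. bids_dead_letter_queue is omitted
// because it is bound to the separate bids_dlx fanout exchange.
const bindings: Array<{ queue: string; pattern: string }> = [
  { queue: "bids_persistence_queue", pattern: "bid.placed" },
  { queue: "bids_realtime_queue", pattern: "bid.placed" },
  { queue: "bids_persistence_queue", pattern: "bid.persist.retry" },
];

// AMQP topic matching: "*" matches exactly one word, "#" matches zero or more words.
function topicMatches(pattern: string, routingKey: string): boolean {
  const match = (p: string[], k: string[]): boolean => {
    if (p.length === 0) return k.length === 0;
    const [head, ...rest] = p;
    if (head === "#") {
      // "#" either consumes nothing, or consumes one word and stays in play.
      return match(rest, k) || (k.length > 0 && match(p, k.slice(1)));
    }
    if (k.length === 0) return false;
    return (head === "*" || head === k[0]) && match(rest, k.slice(1));
  };
  return match(pattern.split("."), routingKey.split("."));
}

// Returns the distinct queues that would receive a message with this routing key.
function queuesFor(routingKey: string): string[] {
  const queues = bindings
    .filter((b) => topicMatches(b.pattern, routingKey))
    .map((b) => b.queue);
  return Array.from(new Set(queues));
}
```

With these bindings, a message published as bid.persist.retry reaches only the persistence queue, which is the property the retry design relies on.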
4. Component Inventory
| Service | Port | Technology | Responsibility |
|---|---|---|---|
| bidding-api | 3000 | Node.js 20, Express 5, TypeScript | Command + Query endpoints; Lua-based Redis validation; RabbitMQ publisher |
| bidding-worker | n/a | Node.js 20, TypeScript | RabbitMQ consumer → PostgreSQL persistence; retry logic |
| bidding-dlq-worker | n/a | Node.js 20, TypeScript | Dead-letter queue consumer; exponential backoff retry |
| bidding-gateway | 3001 | Node.js 20, Socket.io 4 | RabbitMQ consumer → WebSocket broadcast |
| bidding-ui | 8080 | Vue 3, Vite | Live bidding dashboard; Socket.io client |
| bidding-postgres | 5432 | PostgreSQL 15 | Durable source of truth |
| bidding-redis | 6379 | Redis 7 (AOF enabled) | Write validation state; read cache; distributed locks |
| bidding-rabbitmq | 5672 / 15672 | RabbitMQ 3 (topic exchange) | Event bus; decouples write, persistence, and real-time paths |
| bidding-elasticsearch | 9200 | Elasticsearch | Log storage and search |
| bidding-logstash | 5044 | Logstash | Log ingestion pipeline |
| bidding-kibana | 5601 | Kibana | Log visualisation and dashboards |
Orchestration: Docker Compose / Podman Compose.
5. Data Model
5.1 PostgreSQL Schema
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE items (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
starting_price NUMERIC(12, 2) NOT NULL
);
CREATE TABLE bids (
id SERIAL PRIMARY KEY,
item_id INTEGER NOT NULL REFERENCES items(id),
user_id INTEGER NOT NULL REFERENCES users(id),
amount NUMERIC(12, 2) NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Query performance
CREATE INDEX idx_bids_item_id ON bids (item_id);
CREATE INDEX idx_bids_created_at ON bids (created_at DESC);
CREATE INDEX idx_bids_item_amount ON bids (item_id, amount DESC);
-- Idempotency for retry safety
ALTER TABLE bids
ADD CONSTRAINT uq_bids_item_created_at UNIQUE (item_id, created_at);
5.2 Redis Key Schema
| Key pattern | Type | Value | TTL |
|---|---|---|---|
| bid:highest:{itemId} | Hash | { userId, amount, createdAt } | None (AOF-persisted) |
| bid:lock:{itemId} | String | Lock token | 5 s |
6. API Reference
POST /api/v1/bids — Place a bid
Request body:
{
"itemId": 1,
"userId": 42,
"amount": 150.00
}
Responses:
| Status | Condition |
|---|---|
| 202 Accepted | Bid is valid and recorded in Redis; async persistence queued |
| 400 Bad Request | Missing or invalid fields |
| 409 Conflict | amount does not exceed the current highest bid |
| 500 Internal Server Error | Redis unavailable or unexpected error |
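The 400 Bad Request case could be handled by a validation step like the one below. The specific rules (positive integer IDs, a positive finite amount) are assumptions, since the document only says "missing or invalid fields"; `parsePlaceBidBody` and its types are hypothetical names.

```typescript
type PlaceBidBody = { itemId: number; userId: number; amount: number };
type ParseResult = { ok: true; value: PlaceBidBody } | { ok: false; error: string };

// Validates an already JSON-parsed request body; a failed result maps to 400 Bad Request.
function parsePlaceBidBody(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const { itemId, userId, amount } = body as Record<string, unknown>;

  if (typeof itemId !== "number" || !Number.isInteger(itemId) || itemId <= 0) {
    return { ok: false, error: "itemId must be a positive integer" };
  }
  if (typeof userId !== "number" || !Number.isInteger(userId) || userId <= 0) {
    return { ok: false, error: "userId must be a positive integer" };
  }
  if (typeof amount !== "number" || !Number.isFinite(amount) || amount <= 0) {
    return { ok: false, error: "amount must be a positive number" };
  }
  return { ok: true, value: { itemId, userId, amount } };
}
```

Rejecting malformed input before the Lua script runs keeps invalid amounts out of Redis entirely.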
GET /api/v1/items/:id/bids/highest — Get highest bid
Response (200 OK):
{
"itemId": 1,
"userId": 42,
"amount": 150.00,
"createdAt": "2026-05-06T10:00:00.000Z"
}
Responses:
| Status | Condition |
|---|---|
| 200 OK | Highest bid returned from Redis |
| 404 Not Found | No bids exist for this item |
| 500 Internal Server Error | Redis unavailable |
GET /health — Health check
Returns 200 OK with a basic status payload. Used by Docker health checks and load balancers.
7. Key Design Decisions
7.1 Atomic bid validation via Redis Lua script
Problem: Two concurrent requests could both read the same highest bid, both pass the amount > current check, and both write — allowing a lower bid to overwrite a higher one.
Decision: Validate bids inside a Redis Lua script. Lua scripts in Redis execute as a single atomic operation: the read, comparison, and conditional write are indivisible. No locking primitives are needed at the application level.
Trade-off: Lua scripts are harder to test and debug than application code. The logic is intentionally minimal: read, compare, write or reject.
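A minimal sketch of what such a script might look like is shown below, embedded as a TypeScript string next to a pure function that mirrors its logic for unit testing. The script text, the argument order, and the names `PLACE_BID_LUA` and `applyBid` are assumptions; the project's actual script may differ, and starting_price is not considered here.

```typescript
// Assumed script: KEYS[1] = bid:highest:{itemId}, ARGV = [userId, amount, createdAt].
// Returns 1 if the bid was written, 0 if it was rejected.
const PLACE_BID_LUA = `
local current = redis.call('HGET', KEYS[1], 'amount')
if current and tonumber(ARGV[2]) <= tonumber(current) then
  return 0
end
redis.call('HSET', KEYS[1], 'userId', ARGV[1], 'amount', ARGV[2], 'createdAt', ARGV[3])
return 1
`;

type HighestBid = { userId: string; amount: string; createdAt: string };

// In-memory mirror of the script: read, compare, then write or reject.
function applyBid(store: Map<string, HighestBid>, key: string, bid: HighestBid): 0 | 1 {
  const current = store.get(key);
  if (current !== undefined && Number(bid.amount) <= Number(current.amount)) {
    return 0; // equal or lower bids are rejected, matching requirement F2
  }
  store.set(key, bid);
  return 1;
}
```

Testing the mirror does not prove the Lua script correct, but it documents the intended behaviour and gives the script a reference to be compared against in an integration test.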
7.2 RabbitMQ topic exchange over direct queues
Problem: A direct queue couples the publisher to a specific consumer. Adding a new consumer (analytics, fraud detection) would require changing the publisher.
Decision: Use a topic exchange with routing key patterns. Any new consumer declares a queue and binds it to bid.placed — zero changes to the publisher.
Trade-off: Topic exchanges are slightly more complex to configure and reason about than direct queues. The routing key table in §3.7 must be kept current as new consumers are added.
7.3 Decoupled real-time gateway
Problem: Managing WebSocket connections inside the high-throughput API creates unnecessary coupling and resource contention.
Decision: The real-time gateway (bidding-gateway) is a separate service. It consumes bids_realtime_queue and broadcasts to Socket.io clients. The API has no awareness of WebSocket state.
Benefits: The API remains stateless; the gateway scales independently; a gateway crash does not affect bid acceptance; the gateway can be restarted mid-auction without data loss (bids are preserved in RabbitMQ).
7.4 Redis-first over Postgres-first with outbox pattern
Problem: The outbox pattern guarantees consistency by writing the event to Postgres transactionally before publishing to RabbitMQ. However, it adds Postgres to the hot write path.
Decision: Write to Redis first (fast, atomic), publish to RabbitMQ, and persist to Postgres asynchronously. The DLQ worker closes the consistency window if Postgres is temporarily unavailable.
Trade-off: There is a window between Redis write and Postgres persistence during which a Redis crash could lose a bid. This is mitigated by Redis AOF persistence and the startup warm-up query, but the outbox pattern would offer stronger guarantees if Postgres write latency were acceptable.
7.5 Thundering herd prevention on cache miss
Problem: After a Redis cold start, all concurrent reads for a popular item miss the cache and issue simultaneous Postgres queries — potentially hundreds per second.
Decision: Implement a Redis distributed lock (SET NX, 5 s TTL) on cache miss. Only the lock winner queries Postgres. All other requests poll Redis every 50 ms for up to 2 s.
Trade-off: Requests that arrive while the cache is being repopulated wait up to 2 s; the lock winner pays only the Postgres query latency. This is acceptable given that cold starts are infrequent and the alternative (Postgres overload) is worse.
8. Reliability and Resilience
8.1 RabbitMQ Connection Retry
Application containers do not wait for RabbitMQ to pass its health check before starting, because not every orchestrator enforces health-check ordering (see Incident 1). The connectRabbitMQ() function implements a retry loop: 10 attempts, 3 s apart, with full error logging. Application-level retry is the one mechanism that behaves the same across Docker Compose, Podman Compose, and Kubernetes.
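The retry loop might look roughly like the following. This is a generic sketch, not the body of connectRabbitMQ(): the connection function is injected so the example does not depend on a particular AMQP client, and `connectWithRetry` is a hypothetical name.

```typescript
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls connect() until it succeeds or the attempt budget is exhausted.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  attempts = 10, // "10 attempts" from the text above
  delayMs = 3000, // "3 s apart"
  log: (message: string) => void = console.error,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      log(`RabbitMQ connection attempt ${attempt}/${attempts} failed: ${String(err)}`);
      // No need to wait after the final attempt.
      if (attempt < attempts) await delay(delayMs);
    }
  }
  throw lastError;
}
```

With the defaults, a service gives RabbitMQ roughly 27 seconds of waiting (nine 3 s pauses) plus connection time before it exits with the last error.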
8.2 Redis AOF Persistence
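Redis is started with --appendonly yes. Each write is appended to the AOF and flushed to disk according to the appendfsync policy (every second by default), so a crash can lose roughly the last second of writes unless appendfsync always is set. On restart, Redis replays the log to restore state. Combined with the startup warm-up query from Postgres, this limits data loss and stale reads after a Redis restart to that narrow window.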
8.3 Redis Command Timeout
(Identified; patch pending — see §11)
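Without a command timeout, Redis client calls hang indefinitely during an outage. The proposed fix is a 3000 ms command timeout in the Redis client, written here as socket: { commandTimeout: 3000 }; the exact option name and placement depend on the client library and should be checked against its documentation. The timeout converts infinite hangs into fast 500 errors and allows circuit-breaking behaviour at the load balancer.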
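To illustrate the effect a command timeout has on a hanging call, the sketch below wraps any promise in an application-level deadline. It is not the Redis client's own option and does not cancel the underlying command; `withTimeout` and `CommandTimeoutError` are hypothetical names used only for this example.

```typescript
class CommandTimeoutError extends Error {
  constructor(ms: number) {
    super(`Redis command timed out after ${ms} ms`);
    this.name = "CommandTimeoutError";
  }
}

// Rejects if the command has not settled within ms milliseconds.
function withTimeout<T>(command: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new CommandTimeoutError(ms)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([command, timeout]).finally(() => clearTimeout(timer));
}
```

A request handler can catch the timeout error and return a 500 immediately instead of holding the connection open for the duration of the outage.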
8.4 Dead-Letter Queue and DLQ Worker
Messages that exhaust all persistence retries are nacked into bids_dlx (a fanout dead-letter exchange) and accumulate in bids_dead_letter_queue. A dedicated DLQ worker consumes from this queue and retries with exponential backoff (1 s → 2 s → 4 s → … capped at 60 s) until Postgres recovers. No accepted bid is ever silently dropped.
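The backoff schedule described above reduces to a one-line formula. The function name `dlqBackoffMs` and the zero-based attempt numbering are assumptions made for this sketch.

```typescript
const BASE_DELAY_MS = 1000; // first retry waits 1 s
const MAX_DELAY_MS = 60000; // delays are capped at 60 s

// attempt is zero-based: 0 gives 1 s, 1 gives 2 s, 2 gives 4 s, and so on up to the cap.
function dlqBackoffMs(attempt: number): number {
  const exponent = Math.max(0, Math.floor(attempt));
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, exponent));
}
```

The cap is reached on the seventh attempt (index 6), since 64 s would exceed the 60 s limit; every later attempt waits 60 s.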
8.5 Idempotent Persistence
The INSERT INTO bids statement uses ON CONFLICT DO NOTHING against the uq_bids_item_created_at unique constraint. Any number of retries for the same bid produces exactly one row, as long as every retry carries the same created_at value from the original event.
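A possible shape for the statement, together with an in-memory stand-in that enforces the same uniqueness rule, is sketched below. The SQL text is an assumption about how the worker binds its parameters, and `FakeBidsTable` exists only to make the retry behaviour testable without a database.

```typescript
// Assumed statement: created_at is bound from the event, not left to DEFAULT NOW().
const INSERT_BID_SQL = `
  INSERT INTO bids (item_id, user_id, amount, created_at)
  VALUES ($1, $2, $3, $4)
  ON CONFLICT ON CONSTRAINT uq_bids_item_created_at DO NOTHING
`;

type PersistedBid = { itemId: number; userId: number; amount: string; createdAt: string };

// In-memory model of the bids table with the (item_id, created_at) unique constraint.
class FakeBidsTable {
  private rows = new Map<string, PersistedBid>();

  // Returns the number of rows inserted: 1 on success, 0 when the constraint conflicts.
  insertOnConflictDoNothing(bid: PersistedBid): number {
    const uniqueKey = `${bid.itemId}|${bid.createdAt}`;
    if (this.rows.has(uniqueKey)) return 0;
    this.rows.set(uniqueKey, bid);
    return 1;
  }

  count(): number {
    return this.rows.size;
  }
}
```

The model also makes the limitation visible: two distinct bids on the same item with an identical created_at would collide, so the timestamp needs enough resolution to keep accepted bids distinct.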
8.6 Graceful Shutdown
All services register SIGTERM and SIGINT handlers that close RabbitMQ channels, Redis connections, and Postgres pools cleanly before exiting. This prevents unacknowledged in-flight messages from being requeued and redelivered after a sudden connection drop.
8.7 Docker Health Checks
Each infrastructure service exposes a health check:
| Service | Check |
|---|---|
| PostgreSQL | pg_isready |
| Redis | redis-cli ping |
| RabbitMQ | rabbitmq-diagnostics ping |
These drive the Compose dependency graph and, where the orchestrator honours them (see Incident 1), delay dependent services until the infrastructure is ready.
9. Observability
Logging
All services emit structured JSON logs via Pino, chosen for its low-overhead serialisation and native JSON output.
Log pipeline: Pino → Logstash → Elasticsearch → Kibana (ELK)
Recommended log fields per bid event:
| Field | Description |
|---|---|
| itemId | Item receiving the bid |
| userId | Bidder |
| amount | Bid amount |
| result | accepted / rejected_too_low / error |
| latencyMs | End-to-end request duration |
| retryCount | Present on persistence retry events |
Load Testing
Load tests are written in K6. Current test profile: ramp to target RPS against a single item (itemId: 1). See §11 for a known gap in test coverage.
10. Incident Log
Incident 1 — All app containers fail on startup (ECONNREFUSED :5672)
| Field | Detail |
|---|---|
| Symptom | All application containers exit immediately on startup with ECONNREFUSED on port 5672. |
| Root cause | podman-compose does not honour condition: service_healthy in depends_on. Application containers started before RabbitMQ was ready to accept connections. |
| Fix | Retry loop inside connectRabbitMQ(): 10 attempts, 3 s apart, with structured logging on each failure. |
| Lesson | Health checks in Compose files are orchestrator hints, not hard barriers. Retry logic inside the application is the only mechanism that works reliably across all environments (Compose, Podman, Kubernetes). |
Incident 2 — CORS error blocks bid placement from the browser
| Field | Detail |
|---|---|
| Symptom | POST /api/v1/bids succeeds from curl but fails in the browser with a CORS error (localhost:8080 → localhost:3000). |
| Root cause | No CORS middleware was configured on the Express API. |
| Fix | Added Access-Control-Allow-Origin, Access-Control-Allow-Methods, and Access-Control-Allow-Headers response headers, plus an OPTIONS pre-flight handler returning 204 No Content. |
| Lesson | curl does not enforce the Same-Origin Policy. CORS issues are only visible from a real browser origin. Always run integration tests from the actual client. |
Incident 3 — Redis outage causes 100% error rate and request hangs
| Field | Detail |
|---|---|
| Symptom | All bid requests returned HTTP 500 for 12 minutes during a simulated Redis outage. In-flight requests hung indefinitely (no timeout). |
| Root cause (errors) | Redis unavailable; no command timeout configured, so client calls blocked indefinitely. |
| Root cause (12 min duration) | Redis AOF log replay took ~12 minutes after restart due to accumulated log size; until replay completed, state was unavailable. |
| Recovery | Redis restarted; AOF log replayed; full state restored. Auto-reconnect in the Redis client resumed traffic within ~3 s of Redis availability. |
| Fix applied | Redis AOF persistence confirmed as sufficient for state recovery. |
| Fix pending | Configure socket: { commandTimeout: 3000 } to convert indefinite hangs into fast 500s, allowing upstream circuit breakers and load balancers to route around the failure. |
| Lesson | A command timeout is not a performance optimisation — it is a correctness requirement. An unbounded hang is operationally equivalent to a deadlock. |
11. Known Gaps and Future Work
| # | Gap | Priority | Notes |
|---|---|---|---|
| G1 | No authentication | High | userId is accepted directly from the client with no verification. Any user can bid as any other user. Requires JWT or session-based auth. |
| G2 | No Redis command timeout | High | Identified in Incident 3. Fix: socket: { commandTimeout: 3000 }. Converts hangs to fast failures. |
| G3 | No circuit breaker | Medium | The API continues calling Redis during an extended outage, exhausting connection pools. A circuit breaker (e.g., opossum) would open after a threshold of failures and return 503 immediately, protecting downstream resources. |
| G4 | Real-time queue has no message TTL | Medium | On gateway restart, RabbitMQ delivers all queued messages at once, flooding connected clients with stale bid updates. A message TTL (e.g., 30 s) would discard messages that are no longer actionable. |
| G5 | Load test targets only one item | Low | All K6 load is directed at itemId: 1. Parallel writes across multiple items would expose any per-key Redis contention or queue routing issues. |
| G6 | No auction lifecycle | Low | Items accept bids indefinitely. A production system would need auction start/end times, a scheduler to close auctions, and appropriate state transitions. |