High-Velocity CQRS Bidding Service — Design Document


Table of Contents

  1. Overview

  2. Goals and Requirements

  3. System Architecture

  4. Component Inventory

  5. Data Model

  6. API Reference

  7. Key Design Decisions

  8. Reliability and Resilience

  9. Observability

  10. Incident Log

  11. Known Gaps and Future Work


1. Overview

This document describes the design of a high-velocity bidding API built as a learning project to explore the Command Query Responsibility Segregation (CQRS) pattern under realistic production load.

The system separates write operations (bid placements) from read operations (highest bid queries) to allow each path to be optimised independently. Redis sits at the core of both paths for speed. PostgreSQL serves as the durable source of truth, written to asynchronously. RabbitMQ decouples persistence and real-time notification from the API's critical path. A Vue 3 frontend receives live bid updates over WebSockets.


2. Goals and Requirements

Functional Requirements

| # | Requirement |
| --- | --- |
| F1 | Clients can place bids on items via a REST API. |
| F2 | A bid is rejected if it does not exceed the current highest bid for the item. |
| F3 | Clients can query the current highest bid for any item. |
| F4 | Connected browser clients receive real-time updates when a new bid is placed. |
| F5 | All accepted bids are durably persisted in PostgreSQL. |

Non-Functional Requirements

| # | Requirement | Target |
| --- | --- | --- |
| N1 | Write throughput | 50–100 bids/second |
| N2 | Read throughput | 500 queries/second |
| N3 | Read latency | Sub-millisecond (P99) |
| N4 | Bid validation | Atomic (no race conditions between concurrent bids) |
| N5 | Persistence | At-least-once delivery; idempotent inserts |
| N6 | Real-time delivery | No duplicate notifications to browser on retry |

3. System Architecture

3.1 CQRS Split

The architecture enforces a strict CQRS boundary:

  • Command path: POST /api/v1/bids validates and records a bid atomically in Redis, then publishes an event. The response is 202 Accepted; Postgres is not involved.

  • Query path: GET /api/v1/items/:id/bids/highest reads directly from Redis. Postgres is not involved unless Redis is cold (cache miss recovery, described in §3.4).

This split lets each path be tuned independently: the command path costs one atomic Redis script execution plus a publish, and the query path costs a single in-memory Redis read.


3.2 Write Path (Command)

Client
  │
  ▼ POST /api/v1/bids
┌─────────────────┐
│   bidding-api   │
│  (Express 5)    │
└────────┬────────┘
         │ Lua script (atomic)
         ▼
    ┌─────────┐
    │  Redis  │  ← single source of bid state
    └────┬────┘
         │ PUBLISH bid.placed
         ▼
  ┌──────────────┐
  │   RabbitMQ   │  (bids_exchange, topic)
  └──┬───────────┘
     │
     ├──── bid.placed ──▶ bids_persistence_queue ──▶ bidding-worker ──▶ PostgreSQL
     │
     └──── bid.placed ──▶ bids_realtime_queue    ──▶ bidding-gateway ──▶ Socket.io ──▶ Browser

Step-by-step:

  1. Client sends POST /api/v1/bids with { itemId, userId, amount }.

  2. The API executes a Redis Lua script atomically:

    • Reads the current highest bid for the item.

    • If amount <= currentHighest, returns a rejection; the API responds 409 Conflict and stops.

    • If valid, writes the new highest bid to Redis and reports success to the API.

  3. The API publishes a BidPlaced event to the RabbitMQ topic exchange (bids_exchange) with routing key bid.placed.

  4. The API returns 202 Accepted to the client. The response does not wait for Postgres.
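The four steps can be sketched as a single handler. This is a minimal sketch, not the project's actual code: the names handlePlaceBid, WritePathDeps, validateAndStore and publish are hypothetical, and the Redis script and RabbitMQ channel are injected as plain functions so the control flow can be read (and tested) without either service.

```typescript
// Result of the atomic Redis check (hypothetical shape, not from the source).
type ScriptResult =
  | { accepted: true }
  | { accepted: false; currentHighest: number };

interface BidEvent {
  itemId: number;
  userId: number;
  amount: number;
  createdAt: string; // ISO-8601, assigned once by the API
}

interface WritePathDeps {
  // Assumed wrapper around the Lua script from step 2.
  validateAndStore(bid: BidEvent): Promise<ScriptResult>;
  // Assumed wrapper that publishes to bids_exchange with a routing key.
  publish(routingKey: string, event: BidEvent): Promise<void>;
  now(): Date;
}

interface HttpResult {
  status: number;
  body: unknown;
}

async function handlePlaceBid(
  deps: WritePathDeps,
  input: { itemId: number; userId: number; amount: number },
): Promise<HttpResult> {
  const bid: BidEvent = { ...input, createdAt: deps.now().toISOString() };
  try {
    // Step 2: atomic read-compare-write in Redis.
    const result = await deps.validateAndStore(bid);
    if (result.accepted === false) {
      return {
        status: 409,
        body: { error: "bid_too_low", currentHighest: result.currentHighest },
      };
    }
    // Step 3: publish the event. If this throws, the bid is already in
    // Redis; this sketch only reports the failure and does not reconcile.
    await deps.publish("bid.placed", bid);
    // Step 4: respond without waiting for Postgres.
    return { status: 202, body: bid };
  } catch {
    return { status: 500, body: { error: "internal_error" } };
  }
}
```

In an Express route, the handler's result would be mapped onto the response with res.status(result.status).json(result.body).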


3.3 Async Persistence Path

bids_persistence_queue
        │
        ▼
  bidding-worker
   (RabbitMQ consumer)
        │
        ├── success ──▶ INSERT INTO bids (ON CONFLICT DO NOTHING) ──▶ ack
        │
        └── failure ──▶ retry (up to 3×, x-retry-count header)
                            │
                            └── max retries exceeded ──▶ nack
                                        │
                                        ▼
                                   bids_dlx (fanout)
                                        │
                                        ▼
                               bids_dead_letter_queue
                                        │
                                        ▼
                                bidding-dlq-worker
                               (exponential backoff:
                                1s → 2s → 4s → … → 60s cap,
                                retries until Postgres recovers)

Key properties:

  • Retry routing key: Retried messages are published with bid.persist.retry rather than bid.placed. Because bids_realtime_queue is not bound to that key, retries never reach the gateway, and the browser receives one notification per accepted bid no matter how many persistence retries occur.

  • Idempotency: The INSERT uses ON CONFLICT DO NOTHING against the unique constraint on (item_id, created_at), so any number of retries produces exactly one row. This depends on the worker inserting the created_at value carried by the event; falling back to the column default (NOW()) would give every retry a fresh timestamp and defeat the constraint.

  • Dead-letter preservation: Messages that exhaust all retries are not dropped. They accumulate in bids_dead_letter_queue, where the DLQ worker keeps retrying them until Postgres recovers.
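The worker's per-message decision can be written as a pure function. The sketch below assumes that a retry is implemented by republishing the message with an incremented x-retry-count header and acking the original, and that exhaustion is signalled by a nack without requeue so the broker routes the message to bids_dlx. The source does not show the mechanism, and decideOutcome and Outcome are hypothetical names.

```typescript
const MAX_RETRIES = 3;

type Outcome =
  | { action: "ack" }
  | {
      action: "republish";
      routingKey: "bid.persist.retry";
      headers: { "x-retry-count": number };
    }
  // nack with requeue=false; the queue's dead-letter exchange takes over.
  | { action: "dead-letter" };

function decideOutcome(
  insertSucceeded: boolean,
  headers: Record<string, unknown>,
): Outcome {
  if (insertSucceeded) return { action: "ack" };

  // A first delivery carries no header, which counts as zero retries so far.
  const raw = headers["x-retry-count"];
  const retryCount = typeof raw === "number" ? raw : 0;

  if (retryCount >= MAX_RETRIES) return { action: "dead-letter" };

  return {
    action: "republish",
    routingKey: "bid.persist.retry", // never bid.placed, so the UI is not re-notified
    headers: { "x-retry-count": retryCount + 1 },
  };
}
```

Keeping the decision separate from the channel calls makes the three branches straightforward to unit-test without a broker.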


3.4 Read Path (Query) and Cache Miss Recovery

Happy path:
GET /api/v1/items/:id/bids/highest reads the item's current highest bid from Redis. No Postgres query; latency is sub-millisecond.

Cache miss path (thundering herd prevention):
When Redis is cold (e.g., after a restart that did not fully restore state from the AOF), multiple concurrent requests for the same item would all miss and hit Postgres simultaneously. This is the thundering herd problem.

Mitigation:

  1. On a cache miss, the API attempts to acquire a Redis distributed lock (SET NX, 5 s TTL) for the item key.

  2. The lock winner queries Postgres and populates Redis.

  3. All other concurrent requests poll Redis every 50 ms for up to 2 s and respond with the value once it has been populated.

  4. As long as the lock holder finishes within the 5 s TTL, only one Postgres query is issued per item per cold-start event.
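A minimal sketch of this lock-then-poll flow follows. It makes several assumptions: the cache is reduced to three injected operations (setIfAbsent stands in for SET with NX and a TTL), the cached value is a plain string rather than the hash described in §5.2, the lock is left to expire through its TTL, and a poller that times out returns null because the source does not say what should happen in that case. All names are hypothetical.

```typescript
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
  // Resolves true only for the caller that created the key (SET ... NX PX ttlMs).
  setIfAbsent(key: string, value: string, ttlMs: number): Promise<boolean>;
}

interface ReadPathDeps {
  cache: Cache;
  loadFromPostgres(itemId: number): Promise<string | null>;
  sleep(ms: number): Promise<void>;
}

const LOCK_TTL_MS = 5000;
const POLL_INTERVAL_MS = 50;
const POLL_TIMEOUT_MS = 2000;

async function getHighestBid(
  deps: ReadPathDeps,
  itemId: number,
): Promise<string | null> {
  const key = `bid:highest:${itemId}`;
  const cached = await deps.cache.get(key);
  if (cached !== null) return cached; // happy path: no lock, no Postgres

  const wonLock = await deps.cache.setIfAbsent(
    `bid:lock:${itemId}`,
    "lock-token",
    LOCK_TTL_MS,
  );

  if (wonLock) {
    // Lock winner: the only caller that queries Postgres.
    const fromDb = await deps.loadFromPostgres(itemId);
    if (fromDb !== null) await deps.cache.set(key, fromDb);
    return fromDb;
  }

  // Lock losers: poll until the winner has populated the key.
  for (let waited = 0; waited < POLL_TIMEOUT_MS; waited += POLL_INTERVAL_MS) {
    await deps.sleep(POLL_INTERVAL_MS);
    const value = await deps.cache.get(key);
    if (value !== null) return value;
  }
  return null; // timed out; behaviour here is an assumption
}
```

In production the sleep dependency would typically wrap setTimeout in a Promise; injecting it keeps the function deterministic under test.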


3.5 Real-Time Notification Path

bids_realtime_queue
        │
        ▼
  bidding-gateway (port 3001)
  (RabbitMQ consumer + Socket.io server)
        │
        ▼ io.emit('bid_updated', bid)
  All connected browser clients (Vue 3 dashboard)

The gateway is intentionally decoupled from the API:

  • The API does not manage WebSocket connections, keeping it stateless under high write load.

  • The gateway can be restarted independently without affecting bid acceptance.

  • Additional consumers (analytics, fraud detection) can bind to bid.placed with zero changes to the publisher.
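The gateway's consumer callback can be sketched as below. The Broadcaster interface mirrors only the emit call used in the diagram (io.emit('bid_updated', bid)); the payload shape is assumed to match the fields of the highest-bid API response, and handleRealtimeMessage and isBidUpdate are hypothetical names.

```typescript
// Only the part of the Socket.io server this sketch needs.
interface Broadcaster {
  emit(event: string, payload: unknown): void;
}

interface BidUpdate {
  itemId: number;
  userId: number;
  amount: number;
  createdAt: string;
}

function isBidUpdate(value: unknown): value is BidUpdate {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.itemId === "number" &&
    typeof v.userId === "number" &&
    typeof v.amount === "number" &&
    typeof v.createdAt === "string"
  );
}

// Returns what the consumer should tell RabbitMQ about the message.
function handleRealtimeMessage(
  io: Broadcaster,
  rawContent: string,
): "ack" | "reject" {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawContent);
  } catch {
    return "reject"; // malformed JSON will not get better on redelivery
  }
  if (!isBidUpdate(parsed)) return "reject";
  io.emit("bid_updated", parsed); // broadcast to every connected client
  return "ack";
}
```

Because the function only needs something with an emit method, the real Socket.io server instance or a test double can be passed in interchangeably.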


3.6 Redis Warm-Up on API Startup

Before the API begins accepting traffic, it runs a warm-up query against Postgres to pre-populate Redis with the current highest bid per item:

SELECT DISTINCT ON (item_id)
    item_id, user_id, amount, created_at
FROM bids
ORDER BY item_id, amount DESC;

This ensures correct reads immediately after a Redis restart, complementing Redis AOF persistence (which handles clean restarts but may lag under crash recovery).
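The rows returned by this query still have to be turned into Redis writes. The sketch below maps each row onto the bid:highest:{itemId} hash from §5.2. It assumes the rows arrive as node-postgres delivers them by default (NUMERIC as a string, TIMESTAMPTZ as a Date); BidRow, HashWrite and toHashWrites are hypothetical names.

```typescript
// One row of the warm-up query, in node-postgres's default type mapping.
interface BidRow {
  item_id: number;
  user_id: number;
  amount: string; // NUMERIC is returned as a string by default
  created_at: Date; // TIMESTAMPTZ is returned as a Date
}

// One HSET to issue against Redis.
interface HashWrite {
  key: string;
  fields: Record<string, string>;
}

function toHashWrites(rows: BidRow[]): HashWrite[] {
  return rows.map((row) => ({
    key: `bid:highest:${row.item_id}`,
    fields: {
      userId: String(row.user_id),
      amount: row.amount,
      createdAt: row.created_at.toISOString(),
    },
  }));
}
```

A real warm-up would also need to decide what to do when Redis already holds a value restored from the AOF; blindly overwriting it could replace a newer bid that has not yet reached Postgres.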


3.7 RabbitMQ Routing Key Design

| Routing Key | Bound Queues | Purpose |
| --- | --- | --- |
| bid.placed | bids_persistence_queue, bids_realtime_queue | New accepted bids |
| bid.persist.retry | bids_persistence_queue only | Persistence retries (invisible to UI) |
| (dead-lettered) | bids_dead_letter_queue (via bids_dlx fanout) | Failed messages awaiting DLQ worker |
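The effect of these bindings can be modelled in a few lines. This is a simplified illustration that treats routing keys as exact matches (a real topic exchange also supports * and # wildcards); BINDINGS and queuesFor are hypothetical names.

```typescript
// Exact-match model of the bindings on bids_exchange.
const BINDINGS: Record<string, string[]> = {
  "bid.placed": ["bids_persistence_queue", "bids_realtime_queue"],
  "bid.persist.retry": ["bids_persistence_queue"],
};

// Queues that receive a message published with the given routing key.
function queuesFor(routingKey: string): string[] {
  return BINDINGS[routingKey] ?? [];
}
```

For example, queuesFor("bid.persist.retry") yields only the persistence queue, which is why retries stay invisible to the gateway, and an unbound key yields an empty list.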

4. Component Inventory

| Service | Port | Technology | Responsibility |
| --- | --- | --- | --- |
| bidding-api | 3000 | Node.js 20, Express 5, TypeScript | Command + Query endpoints; Lua-based Redis validation; RabbitMQ publisher |
| bidding-worker | - | Node.js 20, TypeScript | RabbitMQ consumer → PostgreSQL persistence; retry logic |
| bidding-dlq-worker | - | Node.js 20, TypeScript | Dead-letter queue consumer; exponential backoff retry |
| bidding-gateway | 3001 | Node.js 20, Socket.io 4 | RabbitMQ consumer → WebSocket broadcast |
| bidding-ui | 8080 | Vue 3, Vite | Live bidding dashboard; Socket.io client |
| bidding-postgres | 5432 | PostgreSQL 15 | Durable source of truth |
| bidding-redis | 6379 | Redis 7 (AOF enabled) | Write validation state; read cache; distributed locks |
| bidding-rabbitmq | 5672 / 15672 | RabbitMQ 3 (topic exchange) | Event bus; decouples write, persistence, and real-time paths |
| bidding-elasticsearch | 9200 | Elasticsearch | Log storage and search |
| bidding-logstash | 5044 | Logstash | Log ingestion pipeline |
| bidding-kibana | 5601 | Kibana | Log visualisation and dashboards |

Orchestration: Docker Compose / Podman Compose.


5. Data Model

5.1 PostgreSQL Schema

CREATE TABLE users (
    id   SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE items (
    id             SERIAL PRIMARY KEY,
    name           TEXT NOT NULL,
    starting_price NUMERIC(12, 2) NOT NULL
);

CREATE TABLE bids (
    id         SERIAL PRIMARY KEY,
    item_id    INTEGER NOT NULL REFERENCES items(id),
    user_id    INTEGER NOT NULL REFERENCES users(id),
    amount     NUMERIC(12, 2) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Query performance
CREATE INDEX idx_bids_item_id       ON bids (item_id);
CREATE INDEX idx_bids_created_at    ON bids (created_at DESC);
CREATE INDEX idx_bids_item_amount   ON bids (item_id, amount DESC);

-- Idempotency for retry safety
ALTER TABLE bids
    ADD CONSTRAINT uq_bids_item_created_at UNIQUE (item_id, created_at);

5.2 Redis Key Schema

| Key pattern | Type | Value | TTL |
| --- | --- | --- | --- |
| bid:highest:{itemId} | Hash | { userId, amount, createdAt } | None (AOF-persisted) |
| bid:lock:{itemId} | String | Lock token | 5 s |

6. API Reference

POST /api/v1/bids — Place a bid

Request body:

{
  "itemId": 1,
  "userId": 42,
  "amount": 150.00
}

Responses:

| Status | Condition |
| --- | --- |
| 202 Accepted | Bid is valid and recorded in Redis; async persistence queued |
| 400 Bad Request | Missing or invalid fields |
| 409 Conflict | amount does not exceed the current highest bid |
| 500 Internal Server Error | Redis unavailable or unexpected error |
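The 400 case implies a validation step ahead of the Redis call. The following is one possible sketch. The specific rules (positive integers for the IDs, a positive finite number for amount) are assumptions, since the source only says "missing or invalid fields", and validatePlaceBid is a hypothetical name.

```typescript
interface PlaceBidRequest {
  itemId: number;
  userId: number;
  amount: number;
}

type Validation =
  | { ok: true; value: PlaceBidRequest }
  | { ok: false; errors: string[] };

function isPositiveInt(v: unknown): v is number {
  return typeof v === "number" && Number.isInteger(v) && v > 0;
}

function validatePlaceBid(body: unknown): Validation {
  const b = (typeof body === "object" && body !== null ? body : {}) as Record<
    string,
    unknown
  >;
  const errors: string[] = [];

  if (!isPositiveInt(b.itemId)) errors.push("itemId must be a positive integer");
  if (!isPositiveInt(b.userId)) errors.push("userId must be a positive integer");

  const amount = b.amount;
  if (typeof amount !== "number" || !Number.isFinite(amount) || amount <= 0) {
    errors.push("amount must be a positive number");
  }

  if (errors.length > 0) return { ok: false, errors }; // maps to 400 Bad Request
  return {
    ok: true,
    value: {
      itemId: b.itemId as number,
      userId: b.userId as number,
      amount: amount as number,
    },
  };
}
```

A route handler would return 400 with the errors array when ok is false and only then proceed to the atomic Redis check.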

GET /api/v1/items/:id/bids/highest — Get highest bid

Response (200 OK):

{
  "itemId": 1,
  "userId": 42,
  "amount": 150.00,
  "createdAt": "2026-05-06T10:00:00.000Z"
}

Responses:

| Status | Condition |
| --- | --- |
| 200 OK | Highest bid returned from Redis |
| 404 Not Found | No bids exist for this item |
| 500 Internal Server Error | Redis unavailable |

GET /health — Health check

Returns 200 OK with a basic status payload. Used by Docker health checks and load balancers.


7. Key Design Decisions

7.1 Atomic bid validation via Redis Lua script

Problem: Two concurrent requests could both read the same highest bid, both pass the amount > current check, and both write — allowing a lower bid to overwrite a higher one.

Decision: Validate bids inside a Redis Lua script. Lua scripts in Redis execute as a single atomic operation: the read, comparison, and conditional write are indivisible. No locking primitives are needed at the application level.

Trade-off: Lua scripts are harder to test and debug than application code. The logic is intentionally minimal: read, compare, write or reject.
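A script of roughly this shape would implement the read, compare, write-or-reject step. The script text is an assumption (the source does not show it). It uses the bid:highest:{itemId} hash and field names from §5.2 and compares amounts as Lua numbers, which are doubles. The TypeScript function next to it is an in-memory model of the same logic, intended only to make the behaviour easy to unit-test; PLACE_BID_LUA and applyBid are hypothetical names.

```typescript
// KEYS[1] = bid:highest:{itemId}
// ARGV[1] = userId, ARGV[2] = amount, ARGV[3] = createdAt
// Returns 1 when the bid was stored, 0 when it was rejected.
const PLACE_BID_LUA = `
local current = redis.call('HGET', KEYS[1], 'amount')
if current and tonumber(ARGV[2]) <= tonumber(current) then
  return 0
end
redis.call('HSET', KEYS[1], 'userId', ARGV[1], 'amount', ARGV[2], 'createdAt', ARGV[3])
return 1
`;

interface HighestBid {
  userId: number;
  amount: number;
  createdAt: string;
}

// In-memory model of the script: same read, compare, conditional write.
function applyBid(
  state: Record<string, HighestBid>,
  itemId: number,
  bid: HighestBid,
): boolean {
  const key = `bid:highest:${itemId}`;
  const current = state[key];
  if (current !== undefined && bid.amount <= current.amount) return false;
  state[key] = bid;
  return true;
}
```

Because Redis runs a script to completion before serving any other command, two concurrent bids are applied one after the other, and the second is compared against the value the first one wrote.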


7.2 RabbitMQ topic exchange over direct queues

Problem: A direct queue couples the publisher to a specific consumer. Adding a new consumer (analytics, fraud detection) would require changing the publisher.

Decision: Use a topic exchange with routing key patterns. Any new consumer declares a queue and binds it to bid.placed — zero changes to the publisher.

Trade-off: Topic exchanges are slightly more complex to configure and reason about than direct queues. The routing key table in §3.7 must be kept current as new consumers are added.


7.3 Decoupled real-time gateway

Problem: Managing WebSocket connections inside the high-throughput API creates unnecessary coupling and resource contention.

Decision: The real-time gateway (bidding-gateway) is a separate service. It consumes bids_realtime_queue and broadcasts to Socket.io clients. The API has no awareness of WebSocket state.

Benefits: The API remains stateless; the gateway scales independently; a gateway crash does not affect bid acceptance; the gateway can be restarted mid-auction without data loss (bids are preserved in RabbitMQ).


7.4 Redis-first over Postgres-first with outbox pattern

Problem: The outbox pattern guarantees consistency by writing the event to Postgres transactionally before publishing to RabbitMQ. However, it adds Postgres to the hot write path.

Decision: Write to Redis first (fast, atomic), publish to RabbitMQ, and persist to Postgres asynchronously. The DLQ worker closes the consistency window if Postgres is temporarily unavailable.

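Trade-off: There is a window between the Redis write and Postgres persistence during which a bid is not yet durable in Postgres. A Redis crash in this window can lose the accepted state, and an API failure after the Redis write but before the RabbitMQ publish leaves a bid that never reaches Postgres. Redis AOF persistence and the startup warm-up query mitigate the first case, but the outbox pattern would offer stronger guarantees if Postgres write latency were acceptable.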


7.5 Thundering herd prevention on cache miss

Problem: After a Redis cold start, all concurrent reads for a popular item miss the cache and issue simultaneous Postgres queries — potentially hundreds per second.

Decision: Implement a Redis distributed lock (SET NX, 5 s TTL) on cache miss. Only the lock winner queries Postgres. All other requests poll Redis every 50 ms for up to 2 s.

Trade-off: Requests that lose the lock wait up to 2 s while the winner repopulates the key, so the first reads of an item after a cold start are slower. This is acceptable given that cold starts are infrequent and the alternative (Postgres overload) is worse.


8. Reliability and Resilience

8.1 RabbitMQ Connection Retry

Application containers do not wait for RabbitMQ to pass its health check before starting, because not every orchestrator enforces health-check ordering (see Incident 1). The connectRabbitMQ() function therefore implements a retry loop: 10 attempts, 3 s apart, with full error logging. Retrying inside the application behaves the same under Docker Compose, Podman Compose, and Kubernetes.
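A retry loop of this kind can be sketched generically. This is not the project's connectRabbitMQ() itself: connectWithRetry and RetryOptions are hypothetical names, and the connect function, the sleep function and the failure logger are all injected so that the loop carries no dependency on a particular AMQP client.

```typescript
interface RetryOptions {
  attempts: number; // e.g. 10
  delayMs: number; // e.g. 3000
  sleep(ms: number): Promise<void>;
  onFailure?(attempt: number, error: unknown): void; // e.g. structured log
}

async function connectWithRetry<T>(
  connect: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await connect();
    } catch (error) {
      lastError = error;
      opts.onFailure?.(attempt, error);
      // No point sleeping after the final attempt.
      if (attempt < opts.attempts) await opts.sleep(opts.delayMs);
    }
  }
  // All attempts failed: surface the last error so the process can exit.
  throw lastError;
}
```

With attempts set to 10 and delayMs to 3000, the service tolerates roughly 27 s of broker unavailability at startup before giving up.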

8.2 Redis AOF Persistence

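Redis is started with --appendonly yes, so every write command is appended to the AOF. How often that file is flushed to disk depends on the appendfsync policy; under the Redis default (everysec), a crash can lose up to about one second of writes. On restart, Redis replays the log to restore state. Combined with the startup warm-up query from Postgres, this keeps data loss and stale reads after a Redis restart to a minimum.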

8.3 Redis Command Timeout

(Identified; patch pending — see §11)
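Without a command timeout, Redis client calls hang indefinitely during an outage. The proposed fix is to configure socket: { commandTimeout: 3000 } in the Redis client, which converts indefinite hangs into fast 500 errors and lets a load balancer or circuit breaker react. The option name and nesting should be checked against the client library in use; ioredis, for example, accepts commandTimeout as a top-level option.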

8.4 Dead-Letter Queue and DLQ Worker

Messages that exhaust all persistence retries are nacked into bids_dlx (a fanout dead-letter exchange) and accumulate in bids_dead_letter_queue. A dedicated DLQ worker consumes from this queue and retries with exponential backoff (1 s → 2 s → 4 s → … capped at 60 s) until Postgres recovers. No accepted bid is ever silently dropped.
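The backoff schedule is a capped doubling. A minimal sketch, assuming a zero-based attempt counter (backoffDelayMs and the constant names are hypothetical):

```typescript
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 60000;

// attempt is zero-based: 0 -> 1 s, 1 -> 2 s, 2 -> 4 s, ... capped at 60 s.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}
```

The cap is reached on the seventh attempt (attempt 6 would be 64 s uncapped), after which the worker retries once a minute until Postgres accepts the insert.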

8.5 Idempotent Persistence

The INSERT INTO bids statement uses ON CONFLICT DO NOTHING against the uq_bids_item_created_at unique constraint. Any number of retries for the same bid produces exactly one row.
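The statement could be built as a parameterised query like the one below. The { text, values } shape is the query-config object that node-postgres accepts; the event shape is an assumption, and buildInsertBid is a hypothetical name. The important detail is that created_at is taken from the event rather than left to the column default, so every retry of the same bid carries the same timestamp.

```typescript
interface BidEvent {
  itemId: number;
  userId: number;
  amount: number;
  createdAt: string; // ISO-8601, assigned once when the bid was accepted
}

interface QueryConfig {
  text: string;
  values: unknown[];
}

function buildInsertBid(event: BidEvent): QueryConfig {
  return {
    text:
      "INSERT INTO bids (item_id, user_id, amount, created_at) " +
      "VALUES ($1, $2, $3, $4) " +
      "ON CONFLICT (item_id, created_at) DO NOTHING",
    // created_at comes from the event, never from DEFAULT NOW().
    values: [event.itemId, event.userId, event.amount, event.createdAt],
  };
}
```

One consequence of keying on (item_id, created_at) is that two distinct bids on the same item with identical timestamps would be treated as duplicates, so the resolution of the timestamp the API assigns matters.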

8.6 Graceful Shutdown

All services register SIGTERM and SIGINT handlers that close RabbitMQ channels, Redis connections, and Postgres pools cleanly before exiting. This lets in-flight messages be acknowledged before the connection closes, rather than being requeued and redelivered after an abrupt drop.

8.7 Docker Health Checks

Each infrastructure service exposes a health check:

| Service | Check |
| --- | --- |
| PostgreSQL | pg_isready |
| Redis | redis-cli ping |
| RabbitMQ | rabbitmq-diagnostics ping |

Where the orchestrator honours them, these checks drive the Compose dependency graph and delay dependent services until the infrastructure is ready. Incident 1 describes an environment in which they were not enforced.


9. Observability

Logging

All services emit structured JSON logs via Pino, chosen for its low-overhead serialisation and native JSON output.

Log pipeline: Pino → Logstash → Elasticsearch → Kibana (ELK)

Recommended log fields per bid event:

| Field | Description |
| --- | --- |
| itemId | Item receiving the bid |
| userId | Bidder |
| amount | Bid amount |
| result | accepted / rejected_too_low / error |
| latencyMs | End-to-end request duration |
| retryCount | Present on persistence retry events |

Load Testing

Load tests are written in k6. The current test profile ramps to the target RPS against a single item (itemId: 1). See §11 for a known gap in test coverage.


10. Incident Log

Incident 1 — All app containers fail on startup (ECONNREFUSED :5672)

| Field | Detail |
| --- | --- |
| Symptom | All application containers exit immediately on startup with ECONNREFUSED on port 5672. |
| Root cause | podman-compose does not honour condition: service_healthy in depends_on. Application containers started before RabbitMQ was ready to accept connections. |
| Fix | Retry loop inside connectRabbitMQ(): 10 attempts, 3 s apart, with structured logging on each failure. |
| Lesson | Health checks in Compose files are orchestrator hints, not hard barriers. Retry logic inside the application works regardless of environment (Compose, Podman, Kubernetes). |

Incident 2 — CORS error blocks bid placement from the browser

| Field | Detail |
| --- | --- |
| Symptom | POST /api/v1/bids succeeds from curl but fails in the browser with a CORS error (localhost:8080 → localhost:3000). |
| Root cause | No CORS middleware was configured on the Express API. |
| Fix | Added Access-Control-Allow-Origin, Access-Control-Allow-Methods, and Access-Control-Allow-Headers response headers, plus an OPTIONS pre-flight handler returning 204 No Content. |
| Lesson | curl does not enforce the Same-Origin Policy. CORS issues are only visible from a real browser origin. Always run integration tests from the actual client. |
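The fix can be expressed as a small function that computes the CORS headers for a request. This is a sketch, not the project's middleware: the allowed origin is taken from the symptom above, while the method and header lists, the Vary header, and the names corsFor and CorsResult are assumptions.

```typescript
// The bidding-ui origin from the incident; a deployment would configure this.
const ALLOWED_ORIGIN = "http://localhost:8080";

interface CorsResult {
  status?: number; // set only for pre-flight requests
  headers: Record<string, string>;
}

function corsFor(method: string, origin: string | undefined): CorsResult {
  // Unknown or missing origin: add nothing, so the browser blocks the call.
  if (origin !== ALLOWED_ORIGIN) return { headers: {} };

  const headers: Record<string, string> = {
    "Access-Control-Allow-Origin": ALLOWED_ORIGIN,
    "Access-Control-Allow-Methods": "GET, POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
    Vary: "Origin",
  };

  // Pre-flight: answer immediately with 204 No Content.
  return method === "OPTIONS" ? { status: 204, headers } : { headers };
}
```

An Express middleware would copy the headers onto the response, end the request when status is set, and otherwise call next().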

Incident 3 — Redis outage causes 100% error rate and request hangs

| Field | Detail |
| --- | --- |
| Symptom | All bid requests returned HTTP 500 for 12 minutes during a simulated Redis outage. In-flight requests hung indefinitely (no timeout). |
| Root cause (errors) | Redis unavailable; no command timeout configured, so client calls blocked indefinitely. |
| Root cause (12 min duration) | Redis AOF log replay took ~12 minutes after restart due to accumulated log size; until replay completed, state was unavailable. |
| Recovery | Redis restarted; AOF log replayed; full state restored. Auto-reconnect in the Redis client resumed traffic within ~3 s of Redis availability. |
| Fix applied | Redis AOF persistence confirmed as sufficient for state recovery. |
| Fix pending | Configure socket: { commandTimeout: 3000 } to convert indefinite hangs into fast 500s, allowing upstream circuit breakers and load balancers to route around the failure. |
| Lesson | A command timeout is not a performance optimisation; it is a correctness requirement. An unbounded hang is operationally equivalent to a deadlock. |

11. Known Gaps and Future Work

| # | Gap | Priority | Notes |
| --- | --- | --- | --- |
| G1 | No authentication | High | userId is accepted directly from the client with no verification. Any user can bid as any other user. Requires JWT or session-based auth. |
| G2 | No Redis command timeout | High | Identified in Incident 3. Fix: socket: { commandTimeout: 3000 }. Converts hangs to fast failures. |
| G3 | No circuit breaker | Medium | The API continues calling Redis during an extended outage, exhausting connection pools. A circuit breaker (e.g., opossum) would open after a threshold of failures and return 503 immediately, protecting downstream resources. |
| G4 | Real-time queue has no message TTL | Medium | On gateway restart, RabbitMQ delivers all queued messages at once, flooding connected clients with stale bid updates. A message TTL (e.g., 30 s) would discard messages that are no longer actionable. |
| G5 | Load test targets only one item | Low | All k6 load is directed at itemId: 1. Parallel writes across multiple items would expose any per-key Redis contention or queue routing issues. |
| G6 | No auction lifecycle | Low | Items accept bids indefinitely. A production system would need auction start/end times, a scheduler to close auctions, and appropriate state transitions. |
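To make G3 concrete, the state machine behind a circuit breaker can be sketched in a few lines. This is a hand-rolled illustration, not the opossum API. The class name, the thresholds and the injected clock are assumptions, and the half-open state here admits every request until one outcome is recorded, which is simpler than most production breakers.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  // False means: skip the Redis call and respond 503 immediately.
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open"; // reset window elapsed: allow a probe
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or too many failures while closed, opens the breaker.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

The API would call allowRequest() before each Redis command and report the outcome with recordSuccess() or recordFailure(); passing () => Date.now() as the clock gives wall-clock behaviour.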