Realtime systems architecture

Realtime is not a transport choice. It is a contract between clients, gateways, coordination stores, and the system of record about what must arrive how fast, in what order, with what durability, and how recovery works when networks flap or nodes restart.

Products that treat realtime as “add Socket.IO” usually discover the gap in production: messages appear out of order after reconnect, presence lies, typing indicators survive but chat history does not, or a deploy triggers a reconnect storm that takes down Redis. This note organizes the design space: primitives, write/read paths, consistency, scale, and failure—so you can prototype the hard paths before the UI freezes around wrong assumptions.

1. Specify the contract before the stack

Different features need different guarantees. Write them down explicitly:

| Question | Drives |
|----------|--------|
| Must the user see their own write immediately? | Optimistic UI vs server ack |
| Can others see it before DB commit? | Ephemeral fan-out vs transactional outbox |
| Is exact order required per thread/document? | Monotonic seq + catch-up |
| Is loss acceptable for ephemeral signals? | UDP-like shed vs at-least-once |
| What happens after 24h offline? | Snapshot + delta vs full replay |
| Who may subscribe to which channel? | ACL at handshake and per topic |

If two features share a connection but not a contract, split surfaces, or you will ship one backpressure policy that silently fails both.
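One way to keep these answers from living only in a design doc is to record them as data next to the code. The sketch below is illustrative: `FeatureContract`, its fields, and `must_split` are hypothetical names chosen for this example, and a real contract would likely carry more dimensions (offline horizon, ACL scope).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureContract:
    """Hypothetical record of the guarantees one feature needs."""
    name: str
    durable: bool        # must survive reconnects and restarts
    ordered: bool        # exact order required per thread/document
    loss_tolerant: bool  # may be shed under backpressure


def must_split(a: FeatureContract, b: FeatureContract) -> bool:
    """Features whose guarantees differ should not share one delivery surface."""
    return (a.durable, a.ordered, a.loss_tolerant) != (
        b.durable, b.ordered, b.loss_tolerant)


chat = FeatureContract("chat", durable=True, ordered=True, loss_tolerant=False)
typing_indicator = FeatureContract(
    "typing", durable=False, ordered=False, loss_tolerant=True)

if must_split(chat, typing_indicator):
    print("chat and typing need separate channels and shed policies")
```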

Client ──► Gateway ──► Fan-out bus ◄──► other Gateways
              │              ▲
              │              │ publish (after commit or policy)
              ▼              │
         Catch-up API ◄── System of record (SSOT)

Rule of thumb: anything involving money, permissions, or audit lives in SSOT first; the wire carries derived events with sequence numbers clients can reconcile.

2. Three primitives (and what each optimizes)

Most production stacks decompose into transport, fan-out, and storage—but the interfaces between them matter more than brand names.

| Primitive | Job | Typical options |
|-----------|-----|-----------------|
| Transport | Long-lived client ↔ edge path | WebSocket, SSE, HTTP/2 streams, QUIC |
| Fan-out | Cross-connection / cross-node delivery | Redis Pub/Sub, Redis Streams, NATS, Kafka |
| SSOT + coordination | Durability, ordering anchor, ephemeral state | RDBMS + outbox; Redis for presence/rate limits |

2.1 Transport selection

| Transport | Direction | Strength | Weakness |
|-----------|-----------|----------|----------|
| WebSocket | Bidirectional | Low latency; binary frames | Proxies, LB idle timeouts, sticky-session traps |
| SSE | Server → client | Simple over HTTP; auto-reconnect in browsers | No native client→server on same channel |
| Long poll | Pseudo push | Works everywhere | Latency + load at scale |
| QUIC streams | Bidirectional | Mobile/unstable networks | Ecosystem maturity varies |

Pick transport for client constraints (browser, mobile, corporate proxy), not because a tutorial defaulted to WS. Many apps combine HTTP for writes + SSE/WS for downlink—cleaner idempotency and retries on the write path.

2.2 Fan-out guarantees

| Mechanism | Delivery | Use when |
|-----------|----------|----------|
| Pub/Sub | Fire-and-forget; subscribers offline = loss | Ephemeral: typing, live cursors (if you must) |
| Streams / log | At-least-once; consumer groups | Room history fan-out, cross-region relay |
| Brokerless rooms | In-memory on one node | Prototype only; plan migration early |

Do not use Pub/Sub alone for chat message delivery. Pair DB commit (or outbox) with fan-out so reconnect has something durable to catch up from.

3. Write path: how events enter the system

The dominant production pattern:

Client sends mutation via HTTP/gRPC (idempotency key optional but recommended);
Application validates, writes SSOT in a short transaction;
Outbox row or domain event written in the same transaction;
Relay publishes to fan-out with {channel, seq, payload, trace_id};
Gateways push to subscribed sockets with backpressure-aware queues.

Anti-pattern: socket.emit in the handler that also writes DB—on crash you get ghost messages or missing messages. Anti-pattern: publish before commit—subscribers see data that rolls back.
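The write path above can be sketched with the standard-library `sqlite3` module standing in for the system of record. The table names, `send_message`, and `relay_once` are assumptions made for this illustration. The `MAX(seq) + 1` allocation is only safe with a single writer; a production store would need a row lock or a database sequence per channel.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE messages (
            channel TEXT, seq INTEGER, body TEXT,
            PRIMARY KEY (channel, seq));
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            channel TEXT, seq INTEGER, payload TEXT,
            published INTEGER DEFAULT 0);
    """)
    return db


def send_message(db, channel: str, body: str, trace_id: str) -> int:
    """Write the message and its outbox row in one transaction."""
    with db:  # commits on success, rolls back both inserts on exception
        seq = db.execute(
            "SELECT COALESCE(MAX(seq), 0) + 1 FROM messages WHERE channel = ?",
            (channel,)).fetchone()[0]
        db.execute("INSERT INTO messages VALUES (?, ?, ?)",
                   (channel, seq, body))
        db.execute(
            "INSERT INTO outbox (channel, seq, payload) VALUES (?, ?, ?)",
            (channel, seq, json.dumps({"body": body, "trace_id": trace_id})))
    return seq


def relay_once(db, publish) -> int:
    """Publish unpublished outbox rows in commit order; returns the count."""
    rows = db.execute(
        "SELECT id, channel, seq, payload FROM outbox "
        "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, channel, seq, payload in rows:
        publish({"channel": channel, "seq": seq, **json.loads(payload)})
        # Marking after publish gives at-least-once delivery: a crash between
        # the two steps re-publishes the row, so consumers must deduplicate.
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?",
                       (row_id,))
    return len(rows)
```

Because the relay only ever reads committed rows, subscribers cannot observe a message that later rolls back, and a crash after commit leaves an unpublished row for the next relay pass instead of a lost event.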

For optimistic UI, the client may render pending state locally, but server seq reconciles on ack; never let client-generated IDs be the global order.
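A minimal client-side sketch of that reconciliation, assuming the server acks each write with the client's temporary ID and the assigned seq. `Timeline` and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Timeline:
    """Client view: confirmed rows ordered by server seq, pending rows after."""
    confirmed: dict = field(default_factory=dict)  # server seq -> body
    pending: dict = field(default_factory=dict)    # client temp id -> body

    def send_local(self, client_id: str, body: str) -> None:
        # Render immediately; the temp id is only a correlation handle.
        self.pending[client_id] = body

    def on_ack(self, client_id: str, seq: int) -> None:
        # The server's seq decides the final position, not local send order.
        body = self.pending.pop(client_id, None)
        if body is not None:
            self.confirmed[seq] = body

    def render(self) -> list:
        rows = [self.confirmed[s] for s in sorted(self.confirmed)]
        return rows + [f"{body} (sending)" for body in self.pending.values()]
```

If two local sends are acked out of order, the rendered list follows the server seq, and anything still unacked stays visibly pending rather than pretending to be committed.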

4. Read path: live stream + catch-up

Reconnect is the real test. Every durable stream needs:

since_seq / since_revision on reconnect handshake or parallel HTTP API;
Gap detection — if seq jumps, client fetches [last+1, current) before applying live frames;
Idempotent apply — same (channel, seq) applied twice must not duplicate UI rows;
Snapshot boundary — after N days or M messages, offer compact snapshot + tail delta.

Live WS/SSE carries tail; HTTP catch-up carries history. Mixing both on one undifferentiated channel forces you to choose between buffer bloat and silent loss.
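Gap detection and idempotent apply can be sketched on the client as follows. `ChannelReader` and the `fetch_range(start, end)` callback are hypothetical; the callback stands in for an HTTP catch-up API that returns every event with `start <= seq < end`, and the sketch assumes that range is complete.

```python
from typing import Callable, List


class ChannelReader:
    """Applies live frames in seq order, back-filling gaps from catch-up."""

    def __init__(self, fetch_range: Callable[[int, int], List[dict]],
                 last_seq: int = 0):
        self.fetch_range = fetch_range  # hypothetical catch-up API
        self.last_seq = last_seq
        self.applied: List[dict] = []

    def _apply(self, event: dict) -> None:
        if event["seq"] <= self.last_seq:
            return  # duplicate: applying twice must not add a second row
        self.applied.append(event)
        self.last_seq = event["seq"]

    def on_live(self, event: dict) -> None:
        if event["seq"] > self.last_seq + 1:
            # Seq jumped: fetch [last+1, current) before the live frame.
            for missed in self.fetch_range(self.last_seq + 1, event["seq"]):
                self._apply(missed)
        self._apply(event)
```

On reconnect, the same reader can be seeded with the persisted `last_seq`, so the first live frame after the handshake triggers the back-fill automatically.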

5. Ordering: scope and IDs

Define order per conversation, document, or shard—not globally. Global sequence is expensive and rarely matches user mental models.

| Approach | When |
|----------|------|
| Server monotonic seq per channel | Chat, feeds, notifications |
| Version / revision per document | Collaborative docs; pair with merge policy |
| Wall-clock timestamps | Display only, not for correctness (clock skew) |
| CRDT / OT | Concurrent edits with automatic merge |

Assign seq at commit time in SSOT or outbox relay, not at gateway receive time—otherwise multi-writer races reorder under load.
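A small illustration of why wall-clock timestamps are display-only. The two events and their `client_ts` values are made up: the first sender's clock is assumed to run a few seconds fast, while `seq` reflects the order in which the server committed them.

```python
# Hypothetical events: seq assigned at commit, client_ts reported by senders.
events = [
    {"seq": 1, "body": "Are you free at 3?", "client_ts": 1_700_000_005.0},
    {"seq": 2, "body": "Yes", "client_ts": 1_700_000_002.0},
]

by_clock = [e["body"] for e in sorted(events, key=lambda e: e["client_ts"])]
by_seq = [e["body"] for e in sorted(events, key=lambda e: e["seq"])]

print(by_clock)  # the reply sorts before the question
print(by_seq)    # commit order is preserved
```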

6. Consistency spectrum

Realtime sits between “instant” and “correct.” Name your tier per feature:

| Tier | Behavior | Example |
|------|----------|---------|
| Strong per resource | Read after write sees committed state | Send message → appears in history API |
| Eventual across viewers | Sub-second skew acceptable | Feed fan-out lag |
| Ephemeral best-effort | Loss OK within bounds | Typing, presence coarse state |
| Read-your-writes | Author always sees own action | Optimistic send + server ack |

Presence and typing are not chat. Give them weaker guarantees and separate channels so backpressure does not drop messages to preserve cursors.

7. Presence: signal, not spam

Presence answers: who is here, on what resource, with what capability? (viewing, editing, admin.)

Design choices:

Heartbeat interval × TTL multiplier — e.g. 30s heartbeat, 90s TTL before “away”;
Do not rely on disconnect alone — mobile backgrounds lie; TCP half-open lingers;
Batch and throttle — aggregate updates per room (500ms–2s), not per mousemove;
Capability in payload — {userId, resourceId, mode, lastSeen} not just online boolean;
Rebuild path — on gateway restart, presence is derived; must be recomputable from heartbeats or SSOT session table.

Unless the product is live cursors, 60Hz position broadcast will crowd out real payloads under load.
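The heartbeat and TTL choices above can be sketched as an in-memory registry. `PresenceRegistry` is a hypothetical name, and the sketch keeps state in a local dict for brevity; in a multi-gateway deployment the same logic would sit on a shared store with key expiry. The clock is injectable so expiry can be tested without sleeping.

```python
import time
from typing import Callable


class PresenceRegistry:
    """Presence derived purely from heartbeats; entries expire after a TTL."""

    def __init__(self, ttl_seconds: float = 90.0,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # (resource_id, user_id) -> (last_heartbeat, mode)

    def heartbeat(self, resource_id: str, user_id: str,
                  mode: str = "viewing") -> None:
        # Capability travels with the heartbeat, not just an online flag.
        self._seen[(resource_id, user_id)] = (self.clock(), mode)

    def present(self, resource_id: str) -> dict:
        """Return {user_id: mode} for users whose heartbeat is within the TTL."""
        now = self.clock()
        self._seen = {key: value for key, value in self._seen.items()
                      if now - value[0] < self.ttl}
        return {user: mode for (res, user), (_, mode) in self._seen.items()
                if res == resource_id}
```

Because nothing depends on a disconnect event, a gateway restart only needs to wait one heartbeat interval for the registry to repopulate.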

8. Backpressure is product behavior

When egress queues fill, the system is telling you something—shed with intent, not random drop.

Suggested shed order (first to drop at the top; the highest-priority traffic is dropped last):

Typing / ephemeral indicators
Presence deltas
Read receipts
Non-critical metadata
Message payloads and state mutations

Also:

Coalesce per channel (keep latest typing, latest presence);
Cap per-connection send buffer; signal degraded to client;
Client UX — “Reconnecting…”, “Showing cached messages”, partial disable of live features beats silent loss.

Backpressure policy belongs in gateway code and runbooks, not only in postmortems.
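A minimal sketch of a per-connection send queue that combines the shed order, coalescing, and the degraded signal described above. `Priority`, `SendQueue`, and the frame shapes are hypothetical. The sketch sheds messages only when nothing lower-priority remains; a real gateway would more likely disconnect the client at that point and let catch-up repair the gap.

```python
from collections import OrderedDict
from enum import IntEnum


class Priority(IntEnum):
    TYPING = 0    # shed first
    PRESENCE = 1
    RECEIPT = 2
    METADATA = 3
    MESSAGE = 4   # shed last


COALESCED = {Priority.TYPING, Priority.PRESENCE}


class SendQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # key -> (priority, frame), oldest first
        self._next_id = 0
        self.dropped = {p: 0 for p in Priority}
        self.degraded = False  # surface this to the client as a UX signal

    def push(self, priority: Priority, channel: str, frame: dict) -> None:
        if priority in COALESCED:
            key = (priority, channel)    # keep only the latest per channel
            self._items.pop(key, None)
        else:
            key = self._next_id
            self._next_id += 1
        self._items[key] = (priority, frame)
        while len(self._items) > self.capacity:
            self._shed_one()

    def _shed_one(self) -> None:
        # Lowest priority loses; min() returns the oldest among equals.
        victim = min(self._items, key=lambda k: self._items[k][0])
        priority, _frame = self._items.pop(victim)
        self.dropped[priority] += 1
        self.degraded = True

    def pop(self):
        """Return the oldest queued frame, or None when the queue is empty."""
        if not self._items:
            return None
        _key, (_priority, frame) = self._items.popitem(last=False)
        return frame
```

The `dropped` counters map directly onto the dropped-by-priority metrics a runbook would alert on.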

9. Connection lifecycle and reconnect storms

9.1 Handshake

Bind early: userId, deviceId, protocol version, token expiry, permitted rooms. Validate Origin to prevent cross-site WebSocket hijacking (CSWSH), use WSS, and issue short-lived access tokens that are refreshed outside the socket.

9.2 Reconnect

Exponential backoff with jitter on client;
Resume token or since_seq to avoid full replay;
Server-side rate limit on handshake during incidents;
Graceful drain on deploy: stop accepting new connections → close existing ones in staggered batches (or wait for idle) → terminate, so clients do not reconnect in one synchronized thundering herd.
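Client-side backoff with jitter can be sketched as below, using the "full jitter" variant (a uniform draw between zero and an exponentially growing ceiling). The function name and the `base` and `cap` defaults are illustrative assumptions, not recommended values.

```python
import random
from typing import Optional


def reconnect_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                    rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    if rng is None:
        rng = random.Random()
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter spreads clients across the whole window, so a fleet that
    # disconnected together does not reconnect together.
    return rng.uniform(0.0, ceiling)
```

Without the jitter term, every client dropped by the same deploy would retry at the same instants, which is the reconnect storm the server-side rate limit then has to absorb.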

9.3 Scale-out

| Model | Trade-off |
|-------|-----------|
| Sticky sessions | Simple; painful deploys; uneven load |
| Stateless gateways + shared fan-out | Preferred; any node; requires shared bus |
| Shard by room/user | Hot room isolation; routing complexity |

Business tier should not hold socket handles—only gateways subscribe and push.

10. Collaboration: when CRDTs and OT enter

Not every realtime app needs CRDTs. Use server authority + seq when:

Single writer per message row;
Conflicts are rare and resolved by “last commit wins” with audit.

Consider OT/CRDT when:

Multiple users edit the same document field concurrently;
Offline edits merge on reconnect;
Keystrokes must apply locally without waiting for a server round-trip.

Hybrid is common: CRDT/OT for document body, seq-ordered chat for comments, ephemeral presence for cursors—three contracts, three channels.
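To give a feel for what "automatic merge" means, here is one of the simplest CRDTs, a last-writer-wins register, as a sketch. It is not a text-editing CRDT: sequence CRDTs for document bodies are considerably more involved, and LWW deliberately discards one of two concurrent writes. The `timestamp` is assumed to be a logical clock such as a Lamport counter, and `replica_id` is assumed unique per writer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LWWRegister:
    """Last-writer-wins register: merge keeps the write with the highest stamp."""
    value: str
    timestamp: int   # logical clock (assumed), not wall-clock time
    replica_id: str  # breaks ties deterministically between replicas

    def merge(self, other: "LWWRegister") -> "LWWRegister":
        return max(self, other, key=lambda r: (r.timestamp, r.replica_id))


a = LWWRegister("draft A", timestamp=3, replica_id="alice")
b = LWWRegister("draft B", timestamp=3, replica_id="bob")
c = LWWRegister("old", timestamp=1, replica_id="carol")

# Replicas converge regardless of the order in which they exchange state.
assert a.merge(b) == b.merge(a)
assert a.merge(b).merge(c) == a.merge(b.merge(c))
```

The merge is commutative, associative, and idempotent, which is what lets offline replicas sync in any order and still agree.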

11. Multi-region and latency

Place gateways near users; keep SSOT authoritative in one region unless you invest in multi-master conflict design;
Cross-region fan-out adds visible lag—set product copy and UI accordingly;
CRDTs help merge concurrent edits despite cross-region lag; chat seq still wants a single ordering shard or accepted fork+repair semantics.

Measure P99 push latency (commit → client render), not just ping.

12. Observability: what to measure

Correlate HTTP writes and WS pushes with shared trace_id:

| Metric | Why |
|--------|-----|
| Active connections / churn rate | Deploy and incident detector |
| Handshake failure / auth reject | Attack or config drift |
| Queue depth per connection | Backpressure early warning |
| Commit → fan-out → deliver latency | End-to-end SLO |
| Catch-up API P99 size & duration | Reconnect health |
| Gap fetch rate | Ordering or relay bugs |
| Dropped-by-priority counters | Shed policy tuning |

Log channel, seq, userId (redacted), event type—not full payloads in production.
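A sketch of such a log record, using only the standard library. `delivery_log` and its field names are hypothetical, and the keyed hash is one possible way to redact the user ID while keeping it correlatable across log lines; the key would come from configuration rather than source code.

```python
import hashlib
import hmac
import json


def delivery_log(channel: str, seq: int, user_id: str, event_type: str,
                 trace_id: str, payload: bytes, redaction_key: bytes) -> str:
    """Build one structured log line without the user ID or payload body."""
    redacted = hmac.new(redaction_key, user_id.encode("utf-8"),
                        hashlib.sha256).hexdigest()[:16]
    return json.dumps({
        "channel": channel,
        "seq": seq,
        "user": redacted,               # stable pseudonym, not the raw ID
        "type": event_type,
        "trace_id": trace_id,           # joins the HTTP write and the WS push
        "payload_bytes": len(payload),  # size only, never the content
    }, sort_keys=True)


line = delivery_log("tenant:7:room:42", 1289, "alice@example.com",
                    "message.created", "trace-abc", b"hello", b"demo-key")
print(line)
```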

13. Security seams

Subscribe ACL: validating a JWT at handshake is insufficient; check room membership on every topic subscription;
Rate limits on publish and handshake;
Payload size caps; schema validation for both binary and JSON payloads;
Cross-tenant isolation in channel names and Redis keys (tenant:{id}:room:{id});
Audit admin actions that broadcast to many users.

A leaked room ID must not grant access without server-side authorization on subscribe.
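The per-subscription check can be sketched as follows. `SubscribeGuard`, the claim names `sub` and `tenant`, and the in-memory membership set are assumptions for illustration; the claims are taken to be already verified at handshake, and membership would normally be read from the system of record.

```python
from typing import Optional


def channel_name(tenant_id: str, room_id: str) -> str:
    """Tenant-scoped channel key, following the tenant:{id}:room:{id} pattern."""
    return f"tenant:{tenant_id}:room:{room_id}"


class SubscribeGuard:
    def __init__(self, memberships: set):
        # Set of (tenant_id, room_id, user_id) tuples.
        self.memberships = memberships

    def authorize(self, claims: dict, tenant_id: str,
                  room_id: str) -> Optional[str]:
        """Return the channel to subscribe to, or None if access is denied."""
        if claims.get("tenant") != tenant_id:
            return None  # a valid token for another tenant is not enough
        if (tenant_id, room_id, claims.get("sub")) not in self.memberships:
            return None  # knowing the room ID does not grant membership
        return channel_name(tenant_id, room_id)
```

The gateway builds the channel name itself from authorized inputs; it never subscribes to a channel string supplied by the client.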

14. Evolution path

Phase A — Prototype — single node, in-memory rooms, HTTP writes; learn latency UX;
Phase B — Production core — SSOT + outbox, Redis fan-out, stateless gateways, catch-up API;
Phase C — Scale — separate gateway pool, Streams/consumer groups, hot-room sharding, regional edges;
Phase D — Collaboration depth — CRDT/OT only where product requires; keep chat on seq.

Each phase needs load tests with reconnect (deploy simulation, Wi‑Fi flap, 10k clients joining one room)—not steady-state HTTP benchmarks alone.

15. Anti-patterns

One global event bus with one priority for everything;
No seq — clients sort by Date.now();
Treating Redis Pub/Sub as message store;
Dual write socket + DB without outbox;
Sticky sessions without drain plan;
Silent drop under load with no client degraded state;
God room — millions of subscribers on one channel without edge aggregation;
Storing connection state in SSOT (rows per open socket);
Skipping catch-up API because “WS will replay”;
Global ordering “for simplicity.”

Summary

Production realtime is systems engineering with latency budgets:

Contract first — durability, order scope, loss tolerance per feature;
SSOT + outbox → fan-out → gateway — not socket-first writes;
Live tail + HTTP catch-up — reconnect is a first-class path;
Order per shard — server seq at commit; gap repair;
Tier consistency — ephemeral presence ≠ chat messages;
Backpressure with priorities — shed and coalesce deliberately;
Stateless gateways, shared bus, observable end-to-end latency;
CRDT/OT only where concurrent editing demands it.

Prototype the reconnect and deploy paths early: UI polish hides wrong contracts until they are expensive to reverse. Transport is the easy part; recovery under load is what separates demos from products.