Most products today are AI-enhanced (a feature flag on an existing app) or AI-bolted-on (a chat box without product integration). AI-native is a different contract: inference is on the critical path of value delivery, and you design inputs, memory, tools, failure modes, cost, and feedback as one system—not as a model call appended to CRUD.
The focus here is systems design: layered boundaries, orchestration choices, operational seams, and evolution gates. It complements interaction-focused writing (uncertainty UI, appropriate reliance) by answering what to build behind the screen so AI behavior stays observable, bounded, and improvable in production.
1. Three levels of “AI in the product”
| Level | Shape | Typical symptom | Design question |
|-------|-------|-----------------|-----------------|
| AI-enhanced | Deterministic app + optional assist | Feature usage spikes then flatlines | Does assist change core workflows or sit beside them? |
| AI-bolted-on | Chat shell over legacy APIs | Users re-type context the app already has | Who owns state: thread or domain aggregate? |
| AI-native | Model in the product loop | Failure modes are designed, not discovered in prod | Can you trace one user outcome to retrieval, tools, policy, and human gates? |
AI-native does not require agents everywhere. It requires that every model invocation has a defined place in the architecture: what context it sees, what it may change, and how you measure it and test it for regressions.
2. Define the loop before the stack
Before picking a provider or vector database, specify the closed loop the product optimizes:
- User intent & artifacts — What does the user bring (selection, file, ticket ID)? What must never be inferred?
- Context assembly — What is retrieved, remembered, or computed? Who curates freshness and ACL?
- Action surface — Read-only synthesis vs proposed mutations vs autonomous execution?
- Human gates — Which transitions require explicit approval, dual control, or role checks?
- Persistence & audit — What is stored for compliance, replay, and training/eval—not just chat logs?
- Feedback — Edits, rejections, thumbs, task success: how do they flow back to prompts, retrieval, or routing?
If you cannot diagram this loop on one whiteboard, you are shipping a demo. The loop is the bounded context for your AI module—same role a domain aggregate plays in DDD.
User intent → Context assembly → Inference/plan → [Gate?] → Side effects → Audit → Feedback → (retrieval/prompt/routing updates)
3. Reference architecture (logical layers)
Treat AI-native as layers with explicit contracts, not “call OpenAI from the controller.”
flowchart TB
subgraph Experience
UI[UI modes by latency tier]
HITL[Review / approve surfaces]
end
subgraph Orchestration
ORCH[Workflow / agent runtime]
ROUTE[Model & tool routing]
end
subgraph Knowledge
RAG[RAG pipeline]
MEM[Memory tiers]
end
subgraph Action
TOOLS[Tool registry]
POL[Policy engine]
end
subgraph Platform
OBS[Traces / evals / cost]
GOV[Redaction / tenancy / rate limits]
end
UI --> ORCH
HITL --> ORCH
ORCH --> ROUTE
ORCH --> RAG
ORCH --> MEM
ORCH --> TOOLS
TOOLS --> POL
ORCH --> OBS
RAG --> GOV
MEM --> GOV
| Layer | Responsibility | Must not |
|-------|----------------|----------|
| Experience | Latency-tiered surfaces, streaming, review UX | Embed retrieval SQL or tool credentials |
| Orchestration | Steps, branching, retries, cancellation, idempotency keys | Own long-term business invariants |
| Knowledge | Chunking, indexing, retrieval, memory write/read policies | Bypass ACL or tenant isolation |
| Action | Tool schemas, execution, compensation | Call arbitrary URLs from model output |
| Platform | Traces, eval harness, cost attribution, redaction | Be an afterthought bolted on at launch |
Monolith vs microservices is secondary. Seams matter: orchestration should survive swapping the embedding model, the LLM vendor, or the vector store without rewriting business rules.
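One way to picture such a seam is a pair of narrow interfaces that the business rule depends on, with vendors hidden behind them. The sketch below is illustrative only: `Embedder`, `Generator`, `FakeEmbedder`, `EchoGenerator`, and `answer` are hypothetical names, and the stand-in implementations exist so the example runs without any provider.

```python
from dataclasses import dataclass
from typing import Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...


class FakeEmbedder:
    """Stand-in embedder: a toy two-number vector, not a real embedding."""

    def embed(self, text: str) -> list[float]:
        return [float(len(text)), float(sum(c in "aeiou" for c in text))]


@dataclass
class EchoGenerator:
    """Stand-in generator so the example runs without a model provider."""

    prefix: str = "draft"

    def generate(self, prompt: str) -> str:
        return f"{self.prefix}: {prompt}"


def answer(question: str, embedder: Embedder, generator: Generator) -> str:
    # The business rule lives here and depends only on the two contracts.
    if not question.strip():
        raise ValueError("question must not be empty")
    _ = embedder.embed(question)  # would drive retrieval in a real system
    return generator.generate(question)


# Swapping the generator does not touch answer().
print(answer("summarize ticket", FakeEmbedder(), EchoGenerator()))
print(answer("summarize ticket", FakeEmbedder(), EchoGenerator(prefix="v2")))
```

Because `answer` only sees the protocols, replacing a vendor means writing a new adapter class rather than editing the rule.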
4. Latency tiers and experience modes
Users judge a surface by whether its speed matches what they are trying to do, not by raw milliseconds. Architect separate surfaces per SLA tier, never one button that sometimes feels instant and sometimes hangs for eight seconds.
| Tier | Target perceived latency | UX mode | Backend pattern |
|------|--------------------------|---------|-----------------|
| Inline assist | <500ms or immediate stream start | Ghost text, inline fix, typeahead | Small model / cached retrieval / edge route |
| Interactive draft | 2–8s with progressive output | Streaming + stop + edit | Single-shot or short chain |
| Analytic job | Seconds to minutes | Async job + notification + diff view | Queue, worker, checkpoint |
| Background monitor | Continuous | Digest, alert, dashboard | Scheduled retrieval + rules + model summary |
Anti-pattern: synchronous long-chain agent on the HTTP request thread. Pattern: return jobId early, stream progress over SSE/WebSocket, allow cancel and partial materialization.
Separate tiers also separate cost budgets and model routes—inline assist should not accidentally invoke your most expensive reasoning model.
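A minimal sketch of tier-scoped routing, assuming a static table is enough. The tier keys mirror the table above; the model aliases (`small-fast`, `mid-general`, `large-reasoning`) and the token caps are placeholders, not real product names or recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str              # hypothetical model alias
    max_output_tokens: int  # per-request cap for this tier (illustrative)
    synchronous: bool       # False means: enqueue and return a job id


ROUTES = {
    "inline_assist": Route("small-fast", 64, True),
    "interactive_draft": Route("mid-general", 1024, True),
    "analytic_job": Route("large-reasoning", 8192, False),
    "background_monitor": Route("mid-general", 2048, False),
}


def route_for(tier: str) -> Route:
    """Look up the route for a tier; unknown tiers fail loudly."""
    try:
        return ROUTES[tier]
    except KeyError:
        raise ValueError(f"unknown latency tier: {tier!r}") from None
```

With the table as the single source of truth, an inline surface cannot reach the expensive route unless someone edits the table on purpose.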
5. Orchestration: when to use which pattern
| Pattern | Use when | Risks | Controls |
|---------|----------|-------|----------|
| Single-shot completion | Classification, rewrite, extraction with schema | Schema drift | Structured output + validator |
| Fixed chain | Stable pipeline (retrieve → summarize → format) | Brittle to new intents | Version chain; feature flag steps |
| Tool-augmented loop (ReAct-style) | Variable steps, external data | Runaway loops, tool spam | Max steps, allowlist, budget |
| Plan → approve → execute | Mutations, spend, deploy | User fatigue if overused | Only on high-impact tools |
| Supervisor / workers | Parallel subtasks with merge | Duplicated work, merge conflicts | Shared scratchpad + merge policy |
Default for production mutations: plan with dry-run preview → human or policy gate → idempotent execute with audit row.
Orchestration state should be durable (DB or workflow engine), not only in memory: users refresh, workers crash, models time out.
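The controls listed for the tool-augmented loop (max steps, allowlist) can be sketched in a few lines. This is an in-memory illustration under stated assumptions: `decide` stands in for the model's next-step choice, `run_tool_loop` is a hypothetical name, and a production version would persist `observations` after each step rather than hold them in a local list.

```python
from typing import Callable, Optional

Step = Optional[tuple[str, str]]  # (tool name, argument), or None when done


def run_tool_loop(
    decide: Callable[[list[str]], Step],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> list[str]:
    """Run decide -> tool -> observe until decide returns None or the cap is hit."""
    observations: list[str] = []
    for _ in range(max_steps):
        step = decide(observations)
        if step is None:
            return observations
        name, arg = step
        if name not in tools:  # the tools dict doubles as the allowlist
            raise PermissionError(f"tool not allowed: {name}")
        observations.append(tools[name](arg))
    raise RuntimeError(f"step budget of {max_steps} exhausted")


def scripted(observations: list[str]) -> Step:
    # Stand-in for a model: look the ticket up once, then stop.
    return ("lookup", "T-1") if not observations else None


tools = {"lookup": lambda ticket_id: f"status of {ticket_id}: open"}
print(run_tool_loop(scripted, tools))
```

A decision function that never returns `None` hits the step cap and raises instead of looping indefinitely.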
6. Memory architecture
“Memory” is not one table. Architect tiers with different TTL, ACL, and eval expectations:
| Tier | Scope | Examples | Invalidation |
|------|-------|----------|--------------|
| Working | Current turn / tool scratchpad | ReAct steps, intermediate JSON | Discard after commit |
| Session | Conversation thread | Clarifications, pinned constraints | TTL + user clear |
| Episodic | User or project history | Past tickets summarized | User delete; retention policy |
| Semantic (RAG) | Org knowledge base | Docs, runbooks, code | Re-index on source change |
Rules worth enforcing early:
- Write path: not every model utterance becomes memory—explicit “remember this” or structured extraction with schema;
- Read path: memory queries respect tenant + role same as primary DB;
- Conflict: session pins beat stale RAG; show provenance when sources disagree;
- Eval: golden questions test retrieval and memory injection, not just final prose.
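The write-path and read-path rules above might look like this in miniature. `MemoryItem` and `MemoryStore` are hypothetical names, the list is a stand-in for a real store, and ordering session items ahead of episodic ones is a simplified take on the conflict rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MemoryItem:
    tenant: str
    tier: str                      # "session" or "episodic" in this sketch
    text: str
    allowed_roles: frozenset[str]


@dataclass
class MemoryStore:
    items: list[MemoryItem] = field(default_factory=list)

    def write(self, item: MemoryItem, explicit: bool) -> bool:
        # Write path: only explicit "remember this" requests are persisted.
        if not explicit:
            return False
        self.items.append(item)
        return True

    def read(self, tenant: str, role: str) -> list[str]:
        # Read path: filter by tenant and role, then put session items first.
        visible = [
            i for i in self.items
            if i.tenant == tenant and role in i.allowed_roles
        ]
        visible.sort(key=lambda i: 0 if i.tier == "session" else 1)
        return [i.text for i in visible]
```

The point of the sketch is that the ACL check sits inside `read`, so no caller can assemble context without passing through it.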
7. RAG as a pipeline, not a checkbox
Production RAG is ETL + search + ranking + context packing + regression tests:
Ingest → Parse/chunk → Embed → Index (+ metadata/ACL)
Query → Rewrite? → Hybrid retrieve → Rerank → Pack context → Generate → Cite
Design decisions that matter more than “which vector DB”:
| Decision | Trade-off |
|----------|-----------|
| Chunk size / overlap | Precision vs context window waste |
| Metadata filters | Tenant, product, doc version, effective date |
| Hybrid search | Vectors miss exact IDs/SKUs; BM25 misses paraphrase |
| Reranker | Latency vs precision on top-k |
| Context budget | What gets dropped when over the token limit: rank, don’t truncate blindly |
| Citation contract | Model must cite span IDs; UI validates links |
Anti-pattern: embed once, never re-index, blame the model for stale answers. Pattern: source-of-truth versioning tied to index generation; eval set runs on every index or prompt change.
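To make the context-budget row concrete, here is a small sketch of ranked packing: chunks are considered in score order and any chunk that would overflow the budget is skipped, instead of cutting the concatenated text at a fixed length. `Chunk` and `pack_context` are illustrative names, and counting whitespace-separated words is a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    span_id: str
    text: str
    score: float  # reranker or retrieval score; higher is better


def pack_context(chunks: list[Chunk], token_budget: int) -> list[Chunk]:
    """Keep the highest-scoring chunks that fit; skip any that would overflow."""
    packed: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = len(chunk.text.split())  # crude token estimate (assumption)
        if used + cost <= token_budget:
            packed.append(chunk)
            used += cost
    return packed


candidates = [
    Chunk("a", "one two three", 0.2),
    Chunk("b", "four five", 0.9),
    Chunk("c", "six seven eight nine", 0.5),
]
print([c.span_id for c in pack_context(candidates, token_budget=5)])  # ['b', 'a']
```

Chunk `c` ranks second but does not fit after `b`, so the lower-ranked `a` fills the remaining budget; the returned span IDs are what a citation contract would validate against.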
8. Tool and action layer
Tools are remote procedure calls with liability. Architect them like public API endpoints:
- Allowlist per workflow/tenant—not “model picks any OpenAPI operation”;
- JSON Schema inputs/outputs; reject malformed calls before execution;
- Idempotency keys for create/charge/send;
- Least privilege credentials (scoped tokens, not admin DB);
- Dry-run mode returning diff preview;
- Compensation or manual rollback playbook for partial failure.
Never pass raw model text to eval, shell, or SQL. Parameterized tools only.
The policy engine sits between orchestration and execution and checks role, amount limits, environment (prod vs staging), and data classification (for example, blocking PII).
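Several of the controls in this section (allowlist, input validation, idempotency keys, dry-run) fit into one small executor. The sketch below is an assumption-laden illustration: `Tool` and `ToolExecutor` are hypothetical, the `required` type map is a simplified stand-in for a JSON Schema validator, and the in-memory result cache would be a durable table in practice.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    required: dict[str, type]  # simplified stand-in for a JSON Schema
    run: Callable[[dict[str, Any]], str]


@dataclass
class ToolExecutor:
    allowlist: set[str]
    tools: dict[str, Tool]
    _results: dict[str, str] = field(default_factory=dict)  # idempotency key -> result

    def execute(self, name: str, args: dict[str, Any],
                idempotency_key: str, dry_run: bool = False) -> str:
        if name not in self.allowlist or name not in self.tools:
            raise PermissionError(f"tool not allowed: {name}")
        tool = self.tools[name]
        # Reject malformed calls before anything runs.
        for key, expected in tool.required.items():
            if not isinstance(args.get(key), expected):
                raise ValueError(f"argument {key!r} must be {expected.__name__}")
        if dry_run:
            return f"would run {name} with {sorted(args.items())}"
        if idempotency_key in self._results:  # replay: return the stored result
            return self._results[idempotency_key]
        result = tool.run(args)
        self._results[idempotency_key] = result
        return result
```

Calling `execute` twice with the same key runs the side effect once; a policy check (role, amount, environment) would slot in just before `tool.run`.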
9. Boundaries beat prompts
Long system prompts rot across versions, locales, and jailbreak attempts. Hard boundaries enforce what prompts merely suggest:
| Mechanism | Enforces |
|-----------|----------|
| Tool allowlist | Blast radius |
| Output schema / grammar | Parseability |
| Pre-inference redaction | PII/secrets never in context |
| Post-inference filter | Block patterns, enforce cite-before-claim |
| Rate & token caps per tenant | Cost and abuse |
| Model routing rules | Cheap model for draft, expensive for review-only paths |
Prompts set tone and task framing; boundaries set safety, cost, and correctness contracts. Version prompts like code; run regression evals when they change.
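As one example of a hard boundary, pre-inference redaction can run as a plain function before any text reaches the context window. The patterns below are deliberately simple and are assumptions for illustration: the email regex is approximate, and the `sk-` key shape is a made-up format, not any specific vendor's.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")  # hypothetical key format


def redact(text: str) -> str:
    """Replace emails and key-like strings with placeholders before inference."""
    text = EMAIL.sub("[EMAIL]", text)
    return API_KEY.sub("[SECRET]", text)


print(redact("mail bob@example.com key sk-abcdefghijklmnop1234"))
# mail [EMAIL] key [SECRET]
```

Unlike a "do not reveal secrets" instruction in a prompt, this check behaves the same regardless of model, locale, or prompt version.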
10. Human-in-the-loop as architecture
HITL is not a disclaimer—it is state machine design:
Proposed → [Review | Auto-approve if policy] → Committed → (optional) Notified
↘ Rejected → logged → may adjust retrieval/routing
| Risk class | Default gate | Autonomy ladder |
|------------|--------------|-----------------|
| Read-only summary | Stream + sources | Auto |
| Draft user message | Review before send | Auto after N accepted edits |
| Financial / legal / prod config | Dual control | Never fully auto without policy engine |
| Bulk data export | Explicit confirm + audit | Manual |
Increase autonomy only when measured outcomes support it (task success, repair cost, override rate, incident count), not demo wow factor.
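The proposal state machine can be expressed as a small pure function, which keeps gate logic testable apart from the UI. This is a sketch with assumed names: `State`, `AUTO_APPROVE`, and `resolve` are hypothetical, and the single auto-approved risk class is chosen only to mirror the read-only row of the table.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    PROPOSED = "proposed"
    COMMITTED = "committed"
    REJECTED = "rejected"


# Risk classes that policy allows to skip human review (illustrative).
AUTO_APPROVE = {"read_only_summary"}


def resolve(risk_class: str, reviewer_decision: Optional[str] = None) -> State:
    """Resolve a proposed action to its next state."""
    if risk_class in AUTO_APPROVE:
        return State.COMMITTED
    if reviewer_decision == "approve":
        return State.COMMITTED
    if reviewer_decision == "reject":
        return State.REJECTED
    return State.PROPOSED  # still waiting for a human
```

Moving a risk class up the autonomy ladder then becomes a reviewed change to `AUTO_APPROVE` rather than a prompt tweak.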
11. Observability, evals, and continuous improvement
Traditional APM is insufficient. AI-native needs trace spans per request:
- Retrieval candidates + scores (sampled);
- Prompt template version + hash;
- Model route, tokens, latency, cost;
- Tool calls (name, args hash, outcome);
- Human gate outcome;
- User edit distance / rejection reason.
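A trace record covering several of these fields might look like the following. The field names and the `InferenceSpan` type are assumptions for illustration; hashing the prompt template and tool arguments lets traces be compared without storing raw content.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def short_hash(value: str) -> str:
    """First 12 hex characters of a SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def args_hash(args: dict) -> str:
    # sort_keys makes the hash independent of dict key order.
    return short_hash(json.dumps(args, sort_keys=True))


@dataclass(frozen=True)
class InferenceSpan:
    prompt_version: str
    prompt_hash: str
    model_route: str
    input_tokens: int
    output_tokens: int
    latency_ms: int
    tool_name: str
    tool_args_hash: str
    gate_outcome: str


span = InferenceSpan(
    prompt_version="v3",
    prompt_hash=short_hash("Summarize the ticket: {ticket}"),
    model_route="small-fast",  # hypothetical route alias
    input_tokens=420,
    output_tokens=96,
    latency_ms=830,
    tool_name="lookup_ticket",
    tool_args_hash=args_hash({"ticket_id": "T-1"}),
    gate_outcome="auto_approved",
)
print(json.dumps(asdict(span), indent=2))
```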
Build a golden eval set early: 50–200 real tasks with expected behaviors (citations present, tool not called, refusal on PII). Run it on:
- Prompt changes;
- Model upgrades;
- Index rebuilds;
- Router threshold tweaks.
Offline evals gate each release; online shadow traffic detects drift. Without this, every deploy is a blind A/B on user trust.
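A release gate over a golden set can be very small. In this sketch, `GoldenCase`, `pass_rate`, and `release_allowed` are hypothetical names, each case checks a behavior rather than exact prose, and the 0.02 tolerance is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    check: Callable[[str], bool]  # expected behavior, e.g. "cites a source"


def pass_rate(system: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose check passes on the system's output."""
    if not cases:
        raise ValueError("golden set must not be empty")
    passed = sum(1 for case in cases if case.check(system(case.prompt)))
    return passed / len(cases)


def release_allowed(rate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Block the release if the pass rate regresses beyond the tolerance."""
    return rate >= baseline - tolerance


cases = [
    GoldenCase("How do I restart the worker?", lambda out: "[doc-" in out),
    GoldenCase("What is Alice's home address?", lambda out: "cannot share" in out),
]


def candidate(prompt: str) -> str:
    # Stand-in for the system under test.
    if "address" in prompt:
        return "I cannot share personal data."
    return "Run the restart job [doc-12]."


print(release_allowed(pass_rate(candidate, cases), baseline=0.95))
```

The same function can run in CI for each of the triggers listed above.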
12. Failure modes and degradation
Design degradation ladders, not binary error toasts:
| Failure | User-facing | System |
|---------|-------------|--------|
| Model timeout | Partial draft + retry | Fallback model or retrieve-only mode |
| Retrieval miss | “No evidence in your docs” | Widen search once; then abstain |
| Tool failure | Explain which action failed | Retry idempotent; don’t loop blindly |
| Context overflow | Summarize older turns | Structured compaction, not silent drop |
| Policy block | Clear reason + alternative | Log for security review |
| Provider outage | Queue or read-only features | Multi-vendor route if contract allows |
Abstain is a feature—confident wrong answers cost more than “I don’t know from approved sources.”
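The timeout and retrieval-miss rows can be combined into one ladder: primary model, fallback model, retrieve-only, then abstain. The function below is a sketch under assumptions; `answer_with_fallback` and its callables are hypothetical, and a real system would also record which rung served the request.

```python
from typing import Callable


def answer_with_fallback(
    question: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
) -> dict:
    """Try the primary model, then the fallback, then retrieve-only, then abstain."""
    for mode, model in (("primary", primary), ("fallback", fallback)):
        try:
            return {"mode": mode, "text": model(question)}
        except TimeoutError:
            continue  # step down one rung
    sources = retrieve(question)
    if sources:
        return {"mode": "retrieve_only", "text": "\n".join(sources)}
    return {"mode": "abstain", "text": "No evidence in your docs."}


def slow_model(question: str) -> str:
    raise TimeoutError("model did not respond in time")


result = answer_with_fallback(
    "How do I rotate keys?",
    primary=slow_model,
    fallback=lambda q: "Draft answer from the smaller model.",
    retrieve=lambda q: [],
)
print(result["mode"])  # fallback
```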
13. Multi-tenancy, cost, and routing
Architect cost as a first-class dimension:
- Per-tenant token budgets and soft/hard caps;
- Model routing: small/fast for classification, large for synthesis;
- Cache embeddings and repeated retrieval queries;
- Batch offline jobs off peak;
- Show cost attribution to product teams (feature × tenant × model).
Routing policy belongs in configuration, not in if-statements scattered across handlers. It is the same pattern as feature flags for model experiments.
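Per-tenant soft and hard caps reduce to a small piece of bookkeeping. The sketch below assumes an in-memory counter and a hypothetical `TenantBudget` type; the cap values in the usage lines are arbitrary, and a real deployment would keep the counter in shared storage with a reset period.

```python
from dataclasses import dataclass


@dataclass
class TenantBudget:
    soft_cap: int
    hard_cap: int
    used: int = 0

    def charge(self, tokens: int) -> str:
        """Record usage and return 'ok', 'warn' (soft cap passed), or 'deny'."""
        if self.used + tokens > self.hard_cap:
            return "deny"  # request is rejected and not counted
        self.used += tokens
        return "warn" if self.used > self.soft_cap else "ok"


budget = TenantBudget(soft_cap=100, hard_cap=150)
print(budget.charge(80))  # ok
print(budget.charge(40))  # warn
print(budget.charge(40))  # deny
```

The `warn` result is where a product might notify the tenant or downgrade the model route before the hard cap starts rejecting requests.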
14. Evolution path: copilot → governed autonomy
A pragmatic roadmap with explicit gates:
- Read-only copilot — RAG + citations; no tools; establish eval baseline;
- Propose-only tools — preview mutations; measure override rate;
- Policy-gated auto — low-risk paths auto-commit within rules;
- Supervised agents — multi-step with checkpoints; narrow domain;
- Broader autonomy — only with incident playbooks, kill switches, and regression budget.
Skipping the first two stages because “agents are the future” is how production learns about tool overreach from angry customers instead of from eval dashboards.
15. Anti-patterns (checklist)
- Chat as the only interface while the app already holds structured state;
- God prompt replacing product rules and ACL;
- Synchronous multi-tool agent on user-facing latency tier;
- Storing full transcripts as the only audit artifact;
- No idempotency on tool side effects;
- One global model for all tiers and tenants;
- Shipping prompt changes without eval regression;
- Treating RAG index as write-once infrastructure;
- Autonomy on day one for irreversible actions.
Summary
AI-native architecture is systems engineering with stochastic components:
- Loop first — intent, context, gates, audit, feedback;
- Layered seams — experience, orchestration, knowledge, action, platform;
- Tier latency and cost — different surfaces, models, and backends;
- RAG and memory as pipelines with ACL, versioning, and evals;
- Tools as governed APIs — allowlist, schema, idempotency, policy;
- Observability + golden sets — ship changes with evidence, not hope;
- Evolve autonomy in stages — propose → policy → auto, measured at each gate.
Models will change weekly; boundaries, traces, and eval harnesses are what keep the product trustworthy across replacements. Interaction design asks how users calibrate reliance; the system underneath must earn that reliance in production.