Large language models put generative, stochastic, context-bound reasoning on the critical path, while users still meet them through deterministic GUI grammar (buttons, forms, single “right” answers). Traditional software UX asked whether the system executed the spec; the generative era asks: when the system is unsure, how should people form accurate mental models, calibrate reliance, and still finish the task when things go wrong?

The focus here is representation, control, responsibility, and evaluation—not palettes or motion.

1. Reframing the problem: from usability to appropriate reliance

1.1 Two kinds of uncertainty

| Type | Meaning | UI implication |
|------|---------|----------------|
| Epistemic | Lack of knowledge; more data could help | “May be incomplete”; show sources |
| Aleatoric | Irreducible noise | Multiple samples, ranges, alternatives |

Conflating them yields misleading UI: a “try again” control that implies another sample will converge on the truth when the variance is aleatoric, or a weak “for reference only” disclaimer on epistemic errors that better sources could fix.

1.2 Appropriate reliance

Reliance should match the system's actual capability on the task at hand, a point that is easy to forget in product reviews.

  • Over-reliance: automation bias, citing hallucinations, shipping drafts as final;
  • Under-reliance: discarding useful retrieval aids, redundant work;
  • The goal is not “always trust AI” but risk-adjusted calibration.

High-stakes domains (clinical, legal, industrial) need different defaults than brainstorming or email polish—not the same cheerful tone.

2. Interaction paradigms: chat is not a universal container

2.1 Three paradigms

| Paradigm | Control | Strength | Typical failure |
|----------|---------|----------|-----------------|
| Conversational | User-turn driven | Exploration, low learn cost | Context drift, fuzzy responsibility, unscannable threads |
| Embedded copilot | User on main task | Workflow fit, partial accept | Interrupts flow; weak accept/reject signals |
| Agentic | System plans; human gates | Long automation | Opaque plans, tool overreach, hard rollback |

In mixed-initiative systems, human-led and machine-led initiative should switch with visible, reversible handoffs.

2.2 Natural language is not the only API

NL is high bandwidth but ambiguous. Complementary channels:

  • Structured slots (forms, constrained picks) reduce intent error;
  • Examples / counter-examples in UI beat empty prompts;
  • Explicit modes (“retrieve-only / reason / tools allowed”) beat hidden prompt hacks.

3. Deterministic UX grammar: state machines over magic spinners

3.1 A minimal state set

Idle → IntentParsing → Retrieving? → Generating(stream) → Review → Committed
                  ↘ NeedsClarification ↗          ↘ Failed → Retry / Fallback

Each state needs distinct visual weight:

| State | User should know | Actions |
|-------|------------------|---------|
| IntentParsing | Aligning task; no content yet | Cancel |
| Retrieving | Answer depends on evidence | See scope |
| Generating | Tokens arriving; may revise | Stop |
| NeedsClarification | Ambiguity or missing permission | Clarify / change constraints |
| Review | Draft ≠ final | Edit / accept / reject |
| Failed | Known failure class | Retry / fallback / human |
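One way to keep these states honest is to encode the allowed transitions as data and reject anything else. The sketch below is illustrative only: the `State` and `Session` names are hypothetical, and where the diagram is ambiguous (which states can fail, where Cancel and Stop lead) the transitions are assumptions rather than part of the original state set.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    INTENT_PARSING = auto()
    NEEDS_CLARIFICATION = auto()
    RETRIEVING = auto()
    GENERATING = auto()
    REVIEW = auto()
    COMMITTED = auto()
    FAILED = auto()


# Allowed moves, read off the diagram. Cancel -> IDLE, Stop -> REVIEW and
# reject -> IDLE are assumptions made for this sketch.
TRANSITIONS = {
    State.IDLE: {State.INTENT_PARSING},
    State.INTENT_PARSING: {
        State.RETRIEVING,           # retrieval is optional ("Retrieving?")
        State.GENERATING,
        State.NEEDS_CLARIFICATION,
        State.IDLE,                 # user cancelled
    },
    State.NEEDS_CLARIFICATION: {State.INTENT_PARSING, State.IDLE},
    State.RETRIEVING: {State.GENERATING, State.FAILED},
    State.GENERATING: {State.REVIEW, State.FAILED},
    State.REVIEW: {State.COMMITTED, State.IDLE},        # accept / reject
    State.FAILED: {State.INTENT_PARSING, State.IDLE},   # retry / fallback
    State.COMMITTED: set(),                             # terminal
}


class Session:
    """Tracks the current state and refuses transitions not listed above."""

    def __init__(self):
        self.state = State.IDLE
        self.history = [State.IDLE]

    def go(self, target):
        if target not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.name} -> {target.name}"
            )
        self.state = target
        self.history.append(target)
```

Because every transition passes through `go`, the UI layer can key its visual treatment off `session.state` and never needs a catch-all spinner for "something is happening".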

If “thinking” exceeds N seconds with no partial signal, escalate to a cancellable long-running task or an asynchronous notification; otherwise users assume the system has hung rather than that it is reasoning slowly.
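The escalation rule can be written as a small pure function, which makes it easy to test without timers. This is a sketch under assumptions: `Progress`, `presentation`, the returned labels, and the 8-second default are all hypothetical, since the text leaves N open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Progress:
    started_at: float                # seconds, any monotonic clock
    last_signal_at: Optional[float]  # last partial signal, None if none yet


def presentation(progress, now, threshold_s=8.0):
    """Decide how to present a pending request.

    Returns "inline_spinner" while the silence is short, and
    "background_task" once no partial signal has arrived for longer
    than threshold_s (a hypothetical stand-in for N).
    """
    if progress.last_signal_at is not None:
        reference = progress.last_signal_at
    else:
        reference = progress.started_at
    silent_for = now - reference
    return "background_task" if silent_for > threshold_s else "inline_spinner"
```

Measuring silence from the last partial signal rather than from the start means a slow but visibly progressing request is not escalated.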

3.2 Cognitive cost of streaming

Streaming cuts time-to-first-token but introduces:

  • Anchoring on early lines;
  • Unstable reading while text grows;
  • Screen-reader overload if every token is announced.

Mitigations: sentence/paragraph throttling; optional collapse-until-done; one polite aria-live summary at completion.
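Sentence-level throttling can be sketched as a buffer that holds tokens until a sentence boundary arrives. The `SentenceBuffer` class is hypothetical, and the boundary rule (terminal punctuation followed by whitespace) is a deliberate simplification that will split wrongly on abbreviations such as "e.g. ".

```python
import re

# Naive boundary: ., ! or ? followed by one whitespace character.
_SENTENCE_END = re.compile(r"[.!?]\s")


class SentenceBuffer:
    """Accumulates streamed tokens and releases only whole sentences."""

    def __init__(self):
        self._pending = ""

    def feed(self, token):
        """Add a token; return the sentences that are now complete."""
        self._pending += token
        ready = []
        while True:
            match = _SENTENCE_END.search(self._pending)
            if match is None:
                break
            ready.append(self._pending[:match.end()].strip())
            self._pending = self._pending[match.end():]
        return ready

    def flush(self):
        """Call at end of stream; returns any trailing partial sentence."""
        rest, self._pending = self._pending.strip(), ""
        return [rest] if rest else []
```

The renderer would append whatever `feed` returns and call `flush` once when the stream closes, which is also a natural moment for the single completion announcement.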

4. Representing uncertainty: beyond scalar “confidence”

4.1 Avoid fake precision

“97% confident” is often uncalibrated. Without reliability diagrams, prefer ordinal bands (low / med / high) or qualitative labels (retrieved vs inferred).
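A rough way to decide between a number and a band is to bin logged predictions as a reliability diagram would and summarise the gap between stated confidence and observed accuracy. The sketch below uses expected calibration error for that summary, which is a technique added here and not named in the text; the function names and every threshold are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


def band(confidence, low=0.5, high=0.8):
    """Map a score to an ordinal band; cut points are illustrative."""
    if confidence < low:
        return "low"
    if confidence < high:
        return "medium"
    return "high"


def display_confidence(confidence, ece, max_ece=0.05):
    """Show a percentage only if the scores have been shown to be calibrated."""
    if ece <= max_ece:
        return f"{confidence:.0%}"
    return band(confidence)
```

With scores that claim 0.9 but are right half the time, the error is about 0.4 and the interface falls back to "high" instead of printing "97%".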

4.2 More defensible representations

  1. Provenance — Separate memory, RAG spans, tool outputs; clickable evidence beats a lone score.
  2. Self-consistency made visible — Divergent samples → clustered alternatives, not forced merge.
  3. Verifiability — One-click re-run code, query DB, open logs; turn epistemic gaps into user-checkable tasks.
  4. Edit distance as signal — Chronic heavy rewrites → shorter defaults or more confirmation steps.
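The fourth item can be approximated from logged (draft, final) pairs. This sketch substitutes the similarity ratio from Python's `difflib.SequenceMatcher` for a true edit distance, and the function names and both thresholds are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def rewrite_fraction(draft, final):
    """0.0 means the user kept the draft as-is; 1.0 means nothing survived."""
    return 1.0 - SequenceMatcher(None, draft, final).ratio()


def chronic_heavy_rewrites(pairs, heavy=0.5, share=0.6):
    """True if at least `share` of sessions rewrote more than `heavy`.

    `pairs` is a list of (draft, final) strings from recent sessions.
    Both thresholds are hypothetical and would need tuning on real logs.
    """
    if not pairs:
        return False
    heavy_count = sum(
        1 for draft, final in pairs if rewrite_fraction(draft, final) > heavy
    )
    return heavy_count / len(pairs) >= share
```

A product could use a `True` result as the trigger for shorter default drafts or an extra confirmation step, as the list item suggests.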

4.3 Ethics of alternatives

Multi-answer UI should label axes of difference (conservative vs bold), highlight best-fit default, fold others, and log choices without dark patterns.

5. Control, responsibility, undo

5.1 Sense of control

Preview tool calls; plan → approve → execute; global undo to pre-generation snapshots.
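The three controls fit together in a few lines. The sketch below is a minimal illustration, assuming the editable state is a plain dictionary and each plan step carries a human-readable description plus a function to apply; `GatedExecutor` and the step format are hypothetical.

```python
import copy


class GatedExecutor:
    """Plan -> approve -> execute, with undo to pre-execution snapshots."""

    def __init__(self, document):
        self.document = document
        self._snapshots = []

    def preview(self, plan):
        """What the user sees before anything runs."""
        return [step["description"] for step in plan]

    def execute(self, plan, approved):
        """Run the plan only if approved; roll back if any step raises."""
        if not approved:
            return False
        self._snapshots.append(copy.deepcopy(self.document))
        try:
            for step in plan:
                step["apply"](self.document)
        except Exception:
            self.undo()
            raise
        return True

    def undo(self):
        """Restore the most recent snapshot, if there is one."""
        if not self._snapshots:
            return False
        self.document = self._snapshots.pop()
        return True
```

A usage example under the same assumptions:

```python
plan = [{"description": "Rename title",
         "apply": lambda doc: doc.update(title="Final")}]
executor = GatedExecutor({"title": "Draft"})
print(executor.preview(plan))        # shown to the user first
executor.execute(plan, approved=True)
executor.undo()                      # title is "Draft" again
```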

5.2 Responsibility boundaries

UI must clarify who owns outcomes. Low risk: default accept + audit log. High risk: explicit confirm + auth + tamper-evident records. Avoid anthropomorphic phrasing such as “I think…”, which blurs the line between system suggestion and user decision; keep the two visibly separate in the copy.
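One common way to make records tamper-evident is a hash chain, where each entry commits to the one before it. The text does not prescribe this mechanism, so treat the sketch as one possible approach; the `AuditLog` class and its field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(actor, action, prev):
    payload = json.dumps(
        {"actor": actor, "action": action, "prev": prev}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log; editing any past entry breaks verification."""

    def __init__(self):
        self.entries = []

    def append(self, actor, action):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append({
            "actor": actor,    # records who owns the step: system or user
            "action": action,
            "prev": prev,
            "hash": _digest(actor, action, prev),
        })

    def verify(self):
        prev = GENESIS
        for entry in self.entries:
            expected = _digest(entry["actor"], entry["action"], prev)
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True
```

Recording the actor on every entry keeps the system's suggestion and the user's confirmation as separate, attributable events. A real deployment would also need to protect the latest hash outside the log itself, which this sketch does not attempt.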

5.3 Pinning versions

Let users pin a baseline output; later edits are incremental commits—essential for collaboration and compliance.

6. Cognitive load: information architecture, not minimalism for its own sake

“Simple” is often misread as “hide the power.”

  • One primary CTA per task phase, not merely one per page;
  • Expert toggles for temperature, retrieval scope, model choice;
  • Long threads: topic segments + collapsible summaries linked to evidence.

Surface context limits: the token budget, which files are attached and when they expire, and when users must re-supply facts. This counters the false assumption that the system remembers everything.

7. Failure catalog designers should pre-empt

| Failure | User feel | Design response |
|---------|-----------|-----------------|
| Hallucinated citations | Betrayal | Forced citation preview; downgrade tone without sources |
| Sycophancy | Comfort, wrong decisions | Optional “critical” persona; show counterpoints |
| Tool overreach | Fear | Least privilege; pre-call preview |
| Silent timeout | Anxiety | Heartbeat + cancel |
| Over-automation | Loss of control | Human gates on critical steps |

8. Evaluation: interaction needs experiments

Task-level metrics beyond click A/B:

  • Task success rate under time limits;
  • Misunderstanding rate (user restatement vs system intent);
  • Reliance bias (each trial coded as “should trust” or “shouldn’t trust,” then compared with what the user actually did);
  • Repair cost (time/steps to correct outputs);
  • Abandonment (cancel, escalate to human, disable feature).

Run contrasts: same capability, different UI (with/without provenance, alternatives)—measure reliance and mistaken adoption, then decide from data.
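Reliance and mistaken adoption can be computed directly from trial logs. The sketch assumes each trial records whether the system's output was correct and whether the user adopted it; the field names `ai_correct` and `user_followed` are hypothetical.

```python
def reliance_rates(trials):
    """Summarise over- and under-reliance for one UI condition.

    over_reliance:  share of wrong outputs the user adopted anyway
                    (mistaken adoption).
    under_reliance: share of correct outputs the user rejected.
    A rate is None when the condition has no trials of that kind.
    """
    wrong = [t for t in trials if not t["ai_correct"]]
    right = [t for t in trials if t["ai_correct"]]
    over = (
        sum(1 for t in wrong if t["user_followed"]) / len(wrong)
        if wrong else None
    )
    under = (
        sum(1 for t in right if not t["user_followed"]) / len(right)
        if right else None
    )
    return {"over_reliance": over, "under_reliance": under}
```

Running this once per UI variant (for example with and without provenance) gives the pair of numbers to compare; a variant that lowers one rate while raising the other has shifted trust, not calibrated it.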

9. UI as part of alignment

RLHF/constitutional work shapes output distributions; UI is normative too: what requires re-confirm, what is refused, how preferences are collected from edits and rejections. Ignoring UI pushes the final step of reliance calibration entirely onto one-shot prompts.

Summary

Intelligent interaction is not “make AI more human.” It is:

  1. Deterministic, observable states atop stochastic inference;
  2. Provenance, alternatives, verifiability—not fake scores;
  3. Risk-adjusted control and responsibility with undo;
  4. Task experiments on appropriate reliance, not vibe checks alone.

Underneath the patterns: how people finish goals when systems are not fully reliable. Stacks change; that problem does not.