Error States that Build Trust

Finance & Banking • ~7–8 min read • Updated May 19, 2025 • By OneMind Strata Team

Context

LLM systems fail in ways traditional software doesn't: retrieval gaps, tool timeouts, policy refusals, vendor drift. Users don't judge you for failing; they judge you for how you fail. Trust grows when errors preserve context, offer a safe next step, and show enough evidence that the system isn't guessing or hiding.

What Good Looks Like

  1. Recognizable category: Label the failure (timeout, retrieval miss, policy refusal) so users know what happened.
  2. Safe fallback: Provide a partial answer, cached result, or human-escalation path—never a dead end.
  3. Show your work: Cite sources or tests used to make the decision; display last-eval date for the model/prompt pack.
  4. Preserve input & progress: Keep the user’s text and state so a retry doesn’t mean rework.
  5. Retry guidance: Offer a specific, low-effort adjustment (e.g., “Narrow by date range”).
  6. Correlation ID: Render a short error code, copyable for support and incident review. (A sketch of a payload carrying all six properties follows this list.)
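
Concretely, the six properties above can travel together in one payload so every surface renders them the same way. The TypeScript below is a minimal sketch under that assumption; the AiErrorState name and field layout are illustrative, not a prescribed schema, with category values taken from the taxonomy in Recommended Actions.

  // Hypothetical payload carrying everything the UI needs to render a
  // trustworthy error. Names are illustrative, not a fixed contract.
  type ErrorCategory =
    | "timeout" | "retrieval_miss" | "policy_refusal"
    | "tool_error" | "vendor_outage" | "quota" | "unknown";

  interface AiErrorState {
    category: ErrorCategory;        // 1. recognizable category
    fallback?: {                    // 2. safe fallback, never a dead end
      kind: "partial_answer" | "cached_result" | "human_escalation";
      content: string;
    };
    evidence: {                     // 3. show your work
      sources: string[];            //    docs or tests behind the decision
      lastEvalDate: string;         //    last eval of the model/prompt pack
    };
    preservedInput: string;         // 4. the user's text survives a retry
    retryHint?: string;             // 5. e.g. "Narrow by date range"
    correlationId: string;          // 6. short, copyable error code
  }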

Core Patterns

  • Tiered responses: Soft warning → degraded output → refusal with rationale; escalate by risk, not by guess.
  • Partial answers: Return a safe subset (e.g., the top two documents with confidence scores) with a note about the missing pieces.
  • Fallback routes: Vector miss → keyword fallback; tool timeout → cached plan; vendor outage → small local model (see the routing sketch after this list).
  • Policy-aware refusal UX: Clear “why” + alternatives; link to policy name/owner for credibility.
  • Evidence banner: Show retrieved items or tests that failed; avoid opaque “something went wrong.”
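
One way to wire those fallback routes is an ordered chain: try the primary route once, and on a recognized failure fall through to the next, tagging the result so degradation is disclosed rather than silent. A sketch under that assumption; vectorSearch, keywordSearch, cachedAnswer, and synthesize are placeholders for your own services.

  // Placeholder service signatures; swap in real clients.
  declare function vectorSearch(q: string): Promise<string[]>;
  declare function keywordSearch(q: string): Promise<string[]>;
  declare function cachedAnswer(q: string): Promise<string | null>;
  declare function synthesize(q: string, docs: string[]): string;

  interface Answer {
    text: string;
    degraded: boolean;  // surfaced to the user, never silent
    route: string;      // which route actually produced the answer
  }

  async function answerWithFallbacks(query: string): Promise<Answer> {
    try {
      const hits = await vectorSearch(query);
      if (hits.length > 0) {
        return { text: synthesize(query, hits), degraded: false, route: "vector" };
      }
      // Vector miss -> keyword fallback.
      const kw = await keywordSearch(query);
      if (kw.length > 0) {
        return { text: synthesize(query, kw), degraded: true, route: "keyword" };
      }
    } catch {
      // Tool timeout or vendor outage: fall through to the cached route.
    }
    // Last-good cached answer, clearly labeled as degraded.
    const cached = await cachedAnswer(query);
    if (cached) return { text: cached, degraded: true, route: "cache" };
    // No safe route left: refuse with rationale rather than guess.
    throw new Error("no_safe_route");
  }

The degraded flag is what lets the UI honor the evidence banner instead of quietly shipping a weaker answer.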

Recommended Actions

  1. Adopt an error taxonomy: {timeout, retrieval_miss, policy_refusal, tool_error, vendor_outage, quota, unknown}.
  2. Ship a reusable component: <AiErrorCard/> with slots for category, evidence, correlation_id, next_actions (sketched after this list).
  3. Pre-bake fallbacks: Decide the safe degraded response per use case; test it in CI with golden failures.
  4. Wire observability: Log {category, correlation_id, surface, model_version, prompt_pack, guardrail_scores}.
  5. Copy system: Centralize error copy with tone and brevity guidelines and domain examples; localize titles.
  6. Run failure drills: Chaos-test tool timeouts, retrieval zero-hits, and vendor 500s monthly.
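
For steps 2 and 4, the card and the log record can share the same underlying state. The React sketch below is one minimal take, assuming the AiErrorState shape from earlier; titleFor, telemetry, and the version constants are hypothetical stand-ins for your copy system and telemetry client.

  import React from "react";
  import type { AiErrorState } from "./ai-error-state"; // illustrative path

  // Hypothetical stand-ins for the copy system and telemetry client.
  declare function titleFor(category: AiErrorState["category"]): string;
  declare const telemetry: { emit(event: string, fields: Record<string, unknown>): void };
  declare const MODEL_VERSION: string;
  declare const PROMPT_PACK: string;
  declare function currentGuardrailScores(): Record<string, number>;

  // Minimal sketch of the reusable card from step 2; styling elided.
  function AiErrorCard({ error, nextActions }: { error: AiErrorState; nextActions: React.ReactNode }) {
    return (
      <div role="alert">
        <strong>{titleFor(error.category)}</strong>
        {error.evidence.sources.length > 0 && (
          <ul>
            {error.evidence.sources.map((s) => <li key={s}>{s}</li>)}
          </ul>
        )}
        {error.retryHint && <p>Try this: {error.retryHint}</p>}
        {nextActions}
        <button onClick={() => navigator.clipboard.writeText(error.correlationId)}>
          Copy details ({error.correlationId})
        </button>
      </div>
    );
  }

  // One structured event per rendered error, matching step 4's fields.
  function logAiError(error: AiErrorState, surface: string) {
    telemetry.emit("ai_error", {
      category: error.category,
      correlation_id: error.correlationId,
      surface,
      model_version: MODEL_VERSION,
      prompt_pack: PROMPT_PACK,
      guardrail_scores: currentGuardrailScores(),
    });
  }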

Common Pitfalls

  • Generic messages: “Try again later” teaches nothing and drives abandonment.
  • Input loss: Clearing the compose box after an error multiplies user friction.
  • Infinite retries: Looping on the same failing call without suggesting a different route (a bounded-retry sketch follows this list).
  • Opaque policy blocks: Refusing without citing the policy or offering escalation.
  • Silent degradation: Returning degraded content without telling the user.
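
One way to break the retry-loop and silent-degradation pitfalls at once is to cap attempts per route and force a route change, or an honest refusal, when the cap is hit. A small illustrative guard; every name here is hypothetical:

  // Hypothetical guard: at most maxAttempts tries per route, then advance
  // to the next route instead of looping on the same failure.
  async function tryRoutes<T>(
    routes: Array<{ name: string; run: () => Promise<T> }>,
    maxAttempts = 2,
  ): Promise<{ value: T; route: string }> {
    for (const route of routes) {
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          return { value: await route.run(), route: route.name };
        } catch {
          // Retry this route up to the cap, then move on.
        }
      }
    }
    // Every route exhausted: surface a categorized error, not a spinner.
    throw new Error("all_routes_exhausted");
  }

Returning the route name lets the caller disclose which path produced the answer, so degraded routes are never silent.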

Quick Win Checklist

  • Implement <AiErrorCard/> with a correlation ID and a “Copy details” button.
  • Add keyword fallback for retrieval zero-hits; cache a last-good answer for timeouts.
  • Expose the reason for refusal, with the policy name and a contact.
  • Keep user input; pre-fill retry with suggested edits (as sketched below).
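
Keeping input through a failure mostly means never tying the compose box's state to the request lifecycle. A minimal React sketch under that assumption, with hypothetical names:

  import { useState } from "react";

  // Illustrative hook: the draft survives errors, and the error payload's
  // retry hint becomes a one-click suggestion instead of forced retyping.
  function useComposeDraft() {
    const [draft, setDraft] = useState("");
    const [suggestedEdit, setSuggestedEdit] = useState<string | null>(null);

    // Called when an AI error arrives; never clears the draft.
    function onAiError(retryHint?: string) {
      setSuggestedEdit(retryHint ?? null);
    }

    return { draft, setDraft, suggestedEdit, onAiError };
  }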

Closing

Design error states as carefully as success states. With clear categories, evidence, and pre-planned fallbacks, you’ll turn inevitable failures into moments that increase trust—and keep work moving.