The “Two-Model” Pattern for Cost & Reliability

Context

Running everything through your best model is easy and expensive. Production teams win with a simple play: route most traffic to a cheap model and escalate only when signals say “this is hard or high-stakes.” Done well, this cuts spend, protects latency budgets, and improves reliability without sacrificing outcomes.

Core Framework

Three routes, not two: cheap → smart → human.
- Cheap: default, fast responses for clear/easy cases.
- Smart: bigger model or tool-augmented path for ambiguous/critical cases.
- Human: confirm/override for high-risk or policy-triggered cases.
Escalation signals: Decide what triggers handoff:
- Confidence: self-score, logprob bands, or evaluator micro-model.
- Policy: safety/PII classifiers, tenant/role constraints.
- Task: category-specific rules (financial advice, medical claims, critical ops).
- Structure: required fields missing; format not met.
Path SLOs: Own budgets per route: p95 latency, answerability, override burden, and cost per accepted action.
Cache & reuse first: Put caching, rerankers, and snippets before escalation to avoid unnecessary “smart” calls.

Recommended Actions

Define route criteria: A single page listing escalation signals and thresholds (cheap→smart, smart→human).
Implement a route tag: Log route=cheap|smart|human on every trace with route_reason and threshold values.
Wire a verifier: Use a tiny evaluator model or heuristics to judge “answerability to rubric.”
Set budgets: p95 latency per route, target escalation rate (e.g., ≤ 18%), and max cost-per-accepted-action.
Freeze a fallback: For timeouts or model errors, return a rule-based template or last-known-good draft.

Reference Architecture

Ingress: request + task_id + surface + context.
Cheap path: small model → validator (format/policy) → accept if pass.
Escalator: if fail or “low-confidence,” send to smart path (bigger model, retrieval + tools).
Human gate: if still low-confidence or policy at risk, open confirm/override with rationale & evidence.
Telemetry: traces include route, thresholds, model/prompt versions, guardrail events, latency, cost.

Common Pitfalls

Over-escalation: thresholds too conservative → costs spike; tune against golden sets.
Unowned budgets: no SLO owner per path → drift goes unnoticed.
Opaque routing: no route_reason → you can’t debug escalations or reduce them.
Missing fallback: timeouts on smart path stall users; always return a safe alternative.

Quick Win Checklist

Add route and route_reason to traces; graph escalation rate daily.
Stand up a tiny evaluator (rubric 0–3) and escalate on scores ≤ 1.
Set p95 latency budgets per route and block releases that regress.
Cache successful responses for frequent tasks; measure cache hit rate.
Publish a one-pager: thresholds, owners, and rollback plan.

Closing

The two-model pattern is really a three-route operating model: cheap by default, smart when justified, human when needed. With clear signals, route-aware SLOs, and solid fallbacks, you’ll ship faster, spend less, and raise trust over time.

Essay by OneMind Strata Team