The “Two-Model” Pattern for Cost & Reliability
Cross-Industry • ~8–9 min read • Updated Jan 11, 2025 • By OneMind Strata Team
Context
Running everything through your best model is easy—and expensive. Production teams win with a simple play: route most traffic to a cheap model and escalate only when signals say “this is hard or high-stakes.” Done well, this cuts spend, protects latency budgets, and improves reliability without sacrificing outcomes.
Core Framework
- Three routes, not two: cheap → smart → human.
  - Cheap: default, fast responses for clear/easy cases.
  - Smart: bigger model or tool-augmented path for ambiguous/critical cases.
  - Human: confirm/override for high-risk or policy-triggered cases.
- Escalation signals: Decide what triggers handoff:
  - Confidence: self-score, logprob bands, or evaluator micro-model.
  - Policy: safety/PII classifiers, tenant/role constraints.
  - Task: category-specific rules (financial advice, medical claims, critical ops).
  - Structure: required fields missing; format not met.
- Path SLOs: Own budgets per route: p95 latency, answerability, override burden, and cost per accepted action.
- Cache & reuse first: Put caching, rerankers, and snippets before escalation to avoid unnecessary “smart” calls.
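As a minimal sketch, the four escalation signals above could be folded into one routing decision that returns both a route and a reason. Every name and threshold here (`Signals`, `choose_route`, `HIGH_STAKES_TASKS`, the 0.7 confidence floor) is a hypothetical placeholder, and the check order (policy first, then task, structure, confidence) is an assumption rather than something this article prescribes.

```python
from dataclasses import dataclass

# Hypothetical task categories that always skip the cheap path.
HIGH_STAKES_TASKS = {"financial_advice", "medical_claim", "critical_ops"}


@dataclass
class Signals:
    confidence: float          # 0.0-1.0, e.g. from a self-score or evaluator
    policy_flag: bool          # a safety/PII classifier fired
    task: str                  # task category label
    missing_fields: list[str]  # required fields absent from the draft


def choose_route(s: Signals, min_conf: float = 0.7) -> tuple[str, str]:
    """Return (route, route_reason) for a cheap-path draft.

    min_conf is an illustrative threshold; tune it on your own data.
    """
    if s.policy_flag:
        return "human", "policy"
    if s.task in HIGH_STAKES_TASKS:
        return "smart", "task"
    if s.missing_fields:
        return "smart", "structure"
    if s.confidence < min_conf:
        return "smart", "confidence"
    return "cheap", "default"
```

Returning the reason alongside the route keeps the decision debuggable: the same tuple can be written straight into the trace.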
Recommended Actions
- Define route criteria: A single page listing escalation signals and thresholds (cheap→smart, smart→human).
- Implement a route tag: Log route=cheap|smart|human on every trace with route_reason and threshold values.
- Wire a verifier: Use a tiny evaluator model or heuristics to judge “answerability to rubric.”
- Set budgets: p95 latency per route, target escalation rate (e.g., ≤ 18%), and max cost-per-accepted-action.
- Freeze a fallback: For timeouts or model errors, return a rule-based template or last-known-good draft.
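One way the route tag and the frozen fallback might fit together is sketched below. It assumes the model client is a plain callable that raises `TimeoutError` on timeout; `call_with_fallback`, `FALLBACK_TEMPLATE`, and the trace field names other than `route` and `route_reason` are hypothetical.

```python
# Hypothetical rule-based template returned when the model path fails.
FALLBACK_TEMPLATE = "We could not complete this request right now. Please try again shortly."


def call_with_fallback(model_call, prompt, route, route_reason, thresholds):
    """Call the model; on timeout or error, return the frozen fallback.

    Returns (text, trace). The trace always carries route, route_reason,
    and the threshold values that were in force for this request.
    """
    trace = {
        "route": route,
        "route_reason": route_reason,
        "thresholds": thresholds,
        "fallback": None,  # set to a reason string if the fallback was used
    }
    try:
        text = model_call(prompt)
    except TimeoutError:
        text, trace["fallback"] = FALLBACK_TEMPLATE, "timeout"
    except Exception:
        text, trace["fallback"] = FALLBACK_TEMPLATE, "model_error"
    return text, trace
```

The returned trace dictionary is meant to be handed to whatever logging or tracing backend is already in place.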
Reference Architecture
- Ingress: request + task_id + surface + context.
- Cheap path: small model → validator (format/policy) → accept if pass.
- Escalator: if fail or “low-confidence,” send to smart path (bigger model, retrieval + tools).
- Human gate: if still low-confidence or policy at risk, open confirm/override with rationale & evidence.
- Telemetry: traces include route, thresholds, model/prompt versions, guardrail events, latency, cost.
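The flow above can be sketched as a single function, under the assumption that the models, the validator, and the confidence scorer are all injected as plain callables. `run_routes`, its parameters, and the reason strings are hypothetical names for illustration; a real human gate would open a review task rather than just return a tagged result.

```python
def run_routes(request, cheap, smart, validate, confidence, min_conf=0.7):
    """Try the cheap path, escalate to smart, then hand off to a human.

    cheap/smart: callables taking the request and returning a draft string.
    validate:    callable returning True if format/policy checks pass.
    confidence:  callable returning a 0.0-1.0 score for a draft.
    """
    def acceptable(draft):
        return validate(draft) and confidence(draft) >= min_conf

    draft = cheap(request)
    if acceptable(draft):
        return {"route": "cheap", "route_reason": "accepted", "output": draft}

    # Record why the cheap draft was rejected before escalating.
    reason = "validation_failed" if not validate(draft) else "low_confidence"
    draft = smart(request)
    if acceptable(draft):
        return {"route": "smart", "route_reason": reason, "output": draft}

    # Still failing after the smart path: flag for confirm/override.
    return {"route": "human", "route_reason": "unresolved_after_smart", "output": draft}
```

Injecting the callables keeps the routing logic testable with stubs and independent of any particular model provider.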
Common Pitfalls
- Over-escalation: thresholds too conservative → costs spike; tune against golden sets.
- Unowned budgets: no SLO owner per path → drift goes unnoticed.
- Opaque routing: no route_reason → you can’t debug escalations or reduce them.
- Missing fallback: timeouts on smart path stall users; always return a safe alternative.
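For the over-escalation pitfall, tuning against a golden set could look roughly like this. The sketch assumes each golden example is a pair of (cheap-path confidence, whether the cheap answer was actually correct); `escalation_stats`, `pick_threshold`, and the 5% missed-error budget are hypothetical choices, not values from this article.

```python
def escalation_stats(golden, threshold):
    """Return (escalation_rate, missed_error_rate) for one threshold.

    golden: list of (confidence, cheap_was_correct) pairs.
    A case escalates when its confidence is below the threshold.
    A missed error is a wrong cheap answer that did NOT escalate.
    """
    total = len(golden)
    escalated = sum(1 for conf, _ in golden if conf < threshold)
    missed = sum(1 for conf, ok in golden if conf >= threshold and not ok)
    return escalated / total, missed / total


def pick_threshold(golden, candidates, max_missed=0.05):
    """Pick the candidate with the lowest escalation rate whose
    missed-error rate stays within budget; fall back to the strictest."""
    within_budget = []
    for t in candidates:
        rate, missed = escalation_stats(golden, t)
        if missed <= max_missed:
            within_budget.append((rate, t))
    return min(within_budget)[1] if within_budget else max(candidates)
```

Sweeping a handful of candidate thresholds this way makes the trade-off explicit: each step toward a more conservative threshold buys fewer missed errors at a measurable escalation cost.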
Quick Win Checklist
- Add route and route_reason to traces; graph escalation rate daily.
- Stand up a tiny evaluator (rubric 0–3) and escalate on scores ≤ 1.
- Set p95 latency budgets per route and block releases that regress.
- Cache successful responses for frequent tasks; measure cache hit rate.
- Publish a one-pager: thresholds, owners, and rollback plan.
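A few of the checklist items reduce to small helpers. The sketch below assumes traces are dictionaries with a `route` key and that latencies are collected in milliseconds; the function names are hypothetical, and the p95 estimate uses Python's `statistics.quantiles`, which is one reasonable choice among several percentile methods.

```python
import statistics
from collections import Counter


def escalation_rate(traces):
    """Share of traces that left the cheap path (smart or human)."""
    routes = Counter(t["route"] for t in traces)
    total = sum(routes.values())
    if total == 0:
        return 0.0
    return (routes["smart"] + routes["human"]) / total


def should_escalate(rubric_score):
    """Escalate when the evaluator's 0-3 rubric score is 1 or lower."""
    return rubric_score <= 1


def p95_within_budget(latencies_ms, budget_ms):
    """Release gate: True if the p95 latency is at or under the budget.

    quantiles(n=20) returns 19 cut points; the last one is the 95th
    percentile. Needs at least two samples.
    """
    p95 = statistics.quantiles(latencies_ms, n=20)[-1]
    return p95 <= budget_ms
```

Running `escalation_rate` over each day's traces gives the daily graph, and `p95_within_budget` can be called once per route in a release check.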
Closing
The two-model pattern is really a three-route operating model: cheap by default, smart when justified, human when needed. With clear signals, route-aware SLOs, and solid fallbacks, you’ll ship faster, spend less, and raise trust over time.