Token pricing is the headline. The unit economics live in retries, retrieval, caching, and override. A framework for measuring what an answer really costs.
The token bill is not the unit cost
Finance teams budget for AI by reading token prices. Operators discover the real cost a quarter after the program ships. Retries double inference. Retrieval calls hit vector databases and re-rankers. Caches miss. Humans override and re-prompt. Evaluators re-score outputs in production for quality assurance. The unit cost of an answer is a stack, and most of it lives off the model invoice.
The gap matters because it changes which use cases survive ROI scrutiny. Programs that look profitable at the token-bill layer are sometimes loss-making at the unit-cost layer once override and retrieval are added. Programs that look marginal at the token-bill layer compound profitably once caching is honestly measured. Honest measurement reorders the portfolio.
The five layers of an answer's cost
Inference is the first layer. Input tokens and output tokens at the model's listed price. This is the number on the invoice and the one most teams stop at.
Retrieval is the second layer. Embedding generation, vector search calls, and re-ranking add cost that is often 1.5 to 3 times inference for retrieval-augmented systems. The retrieval layer also has its own latency budget that affects user experience even when it does not appear on the model bill.
Evaluation is the third layer. Continuous quality scoring runs in production, often by a more expensive model than the one producing the answer. Teams that skip this layer save money in the dashboard and lose it through undetected quality drift.
Override is the fourth layer and the most expensive when measured honestly. Human time spent correcting, escalating, or re-prompting an AI answer can dwarf the token cost. A use case with a 30 percent override rate is a different economic proposition from the same use case with a 5 percent rate, even though the model bill is identical.
Memory and caching is the fifth layer. Storage costs are modest. The operational drag of cache invalidation, refresh strategy, and stale-answer mitigation is not, and it shows up in engineering time rather than infrastructure bills.
What good measurement looks like
Honest measurement requires per-decision attribution. Tie every dollar to the decision unit it produced rather than aggregating across use cases. Maintain override telemetry that logs who overrode what, why, and how long it took. Track cache-hit rates on dashboards that operators see weekly, because cache economics dominate at scale. Use the two-model pattern, routing easy queries to a small model and reserving the large model for the hard tail. In most workloads the two-model pattern drops unit cost by 40 to 60 percent without measurable quality loss.
Teams that adopt these four practices stop arguing about token prices and start arguing about the work the system is making possible. The conversation gets simpler, more honest, and usually more defensible than the token-bill story suggested.