LLM Cost per Query: The Real Unit Economics Framework

Summary

Token price is the sticker, not the bill. The published rate covers one clean call with a first-try answer, and says nothing about the retries, retrieval, rerankers, guardrail models, and human overrides that often run three to six times the headline number. Budget on the sticker and you are the manufacturer who priced the product at raw materials and forgot labor, scrap, and rework. The right unit is the fully loaded cost of an accepted answer: everything you spent to produce one output a user actually kept. Measure it and the levers that truly cut it become obvious.

Context

Token price is the sticker, not the bill

Vendors quote LLMs in dollars per million tokens, and finance teams pin their budgets to that number. It is the wrong number. The published rate covers one successful call to a model with a clean prompt and a first-try answer. It says nothing about the retries you fired when the first output failed validation, the retrieval calls that stuffed context into the prompt, the reranker that ran before that, the guardrail model that screened the output, or the human who overrode the answer before it shipped. Those are the costs that decide whether a feature is profitable, and none of them appear on the pricing page. A team that ships on the sticker price is like a manufacturer that costs a product at the raw-material rate and forgets labor, scrap, and rework.

The right unit is the fully loaded cost of an accepted answer: total pipeline spend divided by the number of outputs a user actually kept. Consider a worked example. A support-summarization feature runs on a mid-tier model at roughly 2,000 input tokens and 400 output tokens per call. At list prices that is about 1.9 cents of raw generation. But the feature averages 1.4 generation attempts per accepted answer because the first draft often misses the ticket resolution, each call pulls twelve retrieved snippets through an embedding and rerank step, every output passes a guardrail model, and reviewers edit one answer in four. Roll all of that up and the cost of one summary a user kept is 11 cents, roughly six times the headline. Nothing was broken. The gap was simply the part of the pipeline the sticker price ignores. If you plan capacity or set customer pricing off the sticker, you underprice by that same multiple, and at ten million answers a year the difference is roughly $910,000, not a rounding error. Worse, the sticker misleads teams into optimizing the wrong thing: they chase a cheaper model to shave the 1.9-cent line while the 9 cents of surrounding work goes untouched.
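The arithmetic behind that gap can be sketched in a few lines. The inputs are the figures quoted in the example (1.9 cents sticker, 11 cents loaded, ten million answers a year); the function and parameter names are illustrative, not part of any library.

```python
def annual_budget_gap(sticker_cost, loaded_cost, answers_per_year):
    """Return (overhead multiple, annual shortfall) for a budget set on sticker price."""
    multiple = loaded_cost / sticker_cost
    shortfall = (loaded_cost - sticker_cost) * answers_per_year
    return multiple, shortfall


# Figures from the support-summarization example, in dollars per accepted answer.
multiple, shortfall = annual_budget_gap(
    sticker_cost=0.019, loaded_cost=0.11, answers_per_year=10_000_000
)
print(f"overhead multiple: {multiple:.1f}x")
print(f"annual shortfall: ${shortfall:,.0f}")
```

With these inputs the multiple comes out near 5.8x and the shortfall near $910,000 a year, which is the scale of error a sticker-price budget carries.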

The pattern

Where the money actually goes

Break the cost of one answer into its components and each becomes a line you can measure and attack. The table below decomposes that 11-cent support summary. Retries and retrieval, not raw generation, dominate the bill, which is exactly where most cost programs never look because they read the model invoice and stop there.

Cost component	Share of loaded cost	Typical driver	Lever that moves it
Base generation tokens	17%	Prompt plus completion length	Trim system prompt, cap max tokens
Retries and re-asks	29%	Failed validation, malformed output	Tighter schema constraints, fewer round trips
Retrieval and reranking	24%	Oversized top-k, no embedding cache	Cache query embeddings, tune k from 12 to 5
Guardrail and eval passes	13%	Screening every output with a second model	Route only low-confidence calls to review
Human override time	17%	Reviewers editing weak answers	Raise draft quality so edits drop

The pattern repeats across features: the model is rarely the expensive part. The expensive part is the work you do around a model that is not quite good enough on the first pass, so you pay for it two or three times. Cut the retry rate from 1.4 to 1.05 attempts per accepted answer and drop retrieval from twelve snippets to five, and the same feature falls from 11 cents to about 7 cents an answer, a 36 percent reduction with no change to the model at all. That is why the unit economics are a product problem, not a procurement problem.
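A rough way to sanity-check that reduction is to re-cost the table's components under two simplifying assumptions that are mine, not the article's: retry spend scales with the overhead above one attempt (reading 1.4 as attempts per accepted answer), and retrieval spend scales linearly with k. The names below are illustrative.

```python
# Shares of the 11-cent loaded cost, taken from the table above.
LOADED_COST_CENTS = 11.0
SHARES = {
    "base_generation": 0.17,
    "retries": 0.29,
    "retrieval": 0.24,
    "guardrail": 0.13,
    "human_override": 0.17,
}


def apply_levers(attempts_before, attempts_after, k_before, k_after):
    """Re-cost one answer after cutting retries and retrieval top-k.

    Assumptions (not from the source): retry spend is proportional to
    (attempts - 1), and retrieval spend is proportional to k.
    """
    cost = {name: share * LOADED_COST_CENTS for name, share in SHARES.items()}
    cost["retries"] *= (attempts_after - 1.0) / (attempts_before - 1.0)
    cost["retrieval"] *= k_after / k_before
    return cost


after = apply_levers(attempts_before=1.4, attempts_after=1.05, k_before=12, k_after=5)
new_total = sum(after.values())
reduction = 1.0 - new_total / LOADED_COST_CENTS
print(f"new loaded cost: {new_total:.2f} cents ({reduction:.0%} lower)")
```

Under these scaling assumptions the sketch lands a little under 7 cents, in the same range as the figure above. Real pipelines will not scale this cleanly, so treat it as a planning estimate to be replaced by traces.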

How to apply

Measure the answer, not the call

Define the accepted-answer unit first (one output a user kept), then divide total pipeline spend by that count rather than by API calls, so waste has nowhere to hide.
Instrument every stage with a request ID so a single answer's retries, retrieval, guardrail, and override costs roll up to one traceable number you can chart over time.
Track retry rate as a first-class metric; a feature at 1.4 attempts per accepted answer is paying 40 percent overhead on generation before you optimize anything else, and that overhead scales linearly with volume.
Cache aggressively where inputs repeat: query-embedding caches and prompt caching on stable system prompts can cut retrieval and base-token cost by 30 to 50 percent on high-traffic features.
Route by difficulty: send the easy 70 percent to a smaller cheaper model and reserve the frontier model for the hard tail, then compare loaded cost per accepted answer, not per token, before you trust the saving.
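The first three steps can be sketched as a small rollup over cost events keyed by request ID. This is a minimal illustration: the `CostEvent` shape, the stage names, and the dollar amounts in the sample trace are all hypothetical, not taken from a real system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CostEvent:
    request_id: str  # ties every stage back to one answer
    stage: str       # e.g. "generation", "retrieval", "guardrail", "override"
    cost_usd: float


def cost_by_request(events):
    """Roll every stage's spend up to one number per request ID."""
    totals = defaultdict(float)
    for event in events:
        totals[event.request_id] += event.cost_usd
    return dict(totals)


def loaded_cost_per_accepted_answer(events, accepted_ids):
    """Total pipeline spend, accepted or not, divided by accepted answers."""
    if not accepted_ids:
        raise ValueError("no accepted answers to divide by")
    return sum(event.cost_usd for event in events) / len(accepted_ids)


def attempts_per_accepted_answer(events, accepted_ids):
    """Generation calls across all requests per accepted answer."""
    if not accepted_ids:
        raise ValueError("no accepted answers to divide by")
    calls = sum(1 for event in events if event.stage == "generation")
    return calls / len(accepted_ids)


# Hypothetical trace: r1 needed a retry and was kept; r2 was discarded.
events = [
    CostEvent("r1", "retrieval", 0.02),
    CostEvent("r1", "generation", 0.02),
    CostEvent("r1", "generation", 0.02),  # retry after failed validation
    CostEvent("r1", "guardrail", 0.01),
    CostEvent("r2", "retrieval", 0.02),
    CostEvent("r2", "generation", 0.02),
    CostEvent("r2", "guardrail", 0.01),
]
accepted = {"r1"}

loaded = loaded_cost_per_accepted_answer(events, accepted)
attempts = attempts_per_accepted_answer(events, accepted)
print(f"loaded cost per accepted answer: ${loaded:.2f}")
print(f"attempts per accepted answer: {attempts:.1f}")
```

Dividing by accepted answers rather than by calls is the design choice that matters here: the discarded request r2 still costs money, and this denominator makes that spend show up in the unit cost instead of disappearing into a per-call average.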

Common pitfalls

Where cost models go wrong

Budgeting off sticker token price. Fix: model the fully loaded accepted-answer cost by applying an estimated overhead multiple, often 3x to 6x, until you have real traces to replace the estimate.
Ignoring retries because each one is cheap. Fix: they compound, so track attempts per accepted answer and treat anything above 1.3 as a defect to fix, not a cost to absorb.
Oversized retrieval top-k copied from a tutorial. Fix: sweep k downward while watching answer quality on a held-out set; many stacks that ran at k=12 hold quality at k=5, cutting retrieval spend by more than half.
Screening every output with a second model. Fix: gate guardrail passes on a confidence score so you only pay for review on the calls that need it, not the 80 percent that pass cleanly.
Counting only machine cost and hiding human override. Fix: put reviewer minutes into the same unit; a 90-second edit at a loaded labor rate can dwarf the token bill and reverse which lever matters most.
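The top-k sweep in the third fix can be sketched as below. The `evaluate` callable stands in for whatever held-out evaluation harness you already have, and the scores in the example are made up for illustration (chosen to mirror the 12-to-5 case), not measurements.

```python
def smallest_k_holding_quality(evaluate, k_values, baseline_k, tolerance=0.01):
    """Sweep k downward and return the smallest k that stays within
    `tolerance` of the baseline quality score.

    evaluate: callable mapping k to a quality score on a held-out set
              (higher is better). Hypothetical; supply your own harness.
    """
    baseline = evaluate(baseline_k)
    best = baseline_k
    for k in sorted(k_values, reverse=True):
        if k >= baseline_k:
            continue
        if evaluate(k) >= baseline - tolerance:
            best = k
        else:
            break  # quality dropped below the floor; stop sweeping
    return best


# Stand-in scores for illustration only; replace with a real evaluation.
fake_scores = {12: 0.90, 10: 0.90, 8: 0.895, 5: 0.893, 3: 0.85, 1: 0.70}
chosen = smallest_k_holding_quality(
    fake_scores.get, k_values=[1, 3, 5, 8, 10], baseline_k=12
)
print(f"smallest k that holds quality: {chosen}")
```

Stopping at the first k that breaks the quality floor assumes quality falls off roughly monotonically as k shrinks. If your evaluation is noisy, it may be safer to score every candidate and pick afterwards.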

Quick-win checklist

Get to a real number fast

Pick one feature and compute its loaded cost per accepted answer this week.
Add request-ID tracing so retries, retrieval, and overrides attach to one answer.
Turn on prompt caching for stable system prompts and measure the base-token drop.
Cut retrieval top-k by half and confirm quality holds on your eval set.
Publish the retry rate on your team dashboard next to latency and quality.

Stratenity is the AI Operating System for Strategic Execution™.