Summary

Most retrieval-augmented systems underperform because the knowledge base is the bottleneck. A four-week sweep fixes content quality before the model touches it.

Context

The knowledge base is the model's prompt

When retrieval-augmented systems disappoint, teams blame the model. They tune prompts. They swap embedding models. They add a re-ranker. The system improves marginally and the team moves on to the next experiment. The truth is almost always upstream. The knowledge base feeding the system has duplicates, stale versions, conflicting policies, broken metadata, orphaned drafts, and chunked content that does not match how questions are actually asked. No amount of model work can compensate for an upstream content layer that contradicts itself.

A four-week sweep done before the model is connected produces a step-change in answer quality. The sweep is unglamorous. It is also the highest-leverage work in the program. Teams that skip it spend the next twelve months tuning a model layer that was never the constraint.

Core Framework

Three disciplines that change the result

The first discipline is single-source-of-truth. For every topic the system will answer about, name the canonical document. Everything else is a derived copy that must defer to it. When the canonical document and a derived copy disagree, the model has no basis to choose between them. The model will choose anyway, and the choice will be wrong some fraction of the time. Eliminate the disagreement at the source.

The second discipline is the freshness contract. Each canonical document has a named owner, a refresh cadence, and a last-reviewed date, and the refresh window is enforced through automation. Documents that miss their refresh window are flagged in the retrieval layer or removed from the corpus. Stale content is worse than no content, because the model gives stale content the same confidence as fresh content.
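A minimal sketch of how the freshness contract could be checked in automation. The `CanonicalDoc` record, its field names, and the sample documents are illustrative assumptions, not part of any particular retrieval stack.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CanonicalDoc:
    # Hypothetical register entry; field names are assumptions for illustration.
    doc_id: str
    owner: str
    refresh_days: int      # agreed refresh cadence, in days
    last_reviewed: date


def is_stale(doc: CanonicalDoc, today: date) -> bool:
    """True when the document has missed its refresh window."""
    return today - doc.last_reviewed > timedelta(days=doc.refresh_days)


def split_by_freshness(docs, today):
    """Partition documents into (fresh, stale) lists."""
    fresh = [d for d in docs if not is_stale(d, today)]
    stale = [d for d in docs if is_stale(d, today)]
    return fresh, stale


# Made-up sample data.
docs = [
    CanonicalDoc("refund-policy", "finance-ops", 90, date(2024, 1, 10)),
    CanonicalDoc("travel-policy", "people-team", 180, date(2023, 3, 1)),
]
fresh, stale = split_by_freshness(docs, today=date(2024, 3, 1))
```

In this sketch the stale list would be either flagged in the retrieval layer or dropped from the corpus, and each stale document's named owner would be notified. Which of those applies is a policy decision the contract should state up front.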

The third discipline is retrieval-aware structure. Documents are chunked by question, not by chapter. Headings double as semantic anchors. Front-matter metadata captures audience, jurisdiction, and product scope so the retrieval layer can filter before it ranks. This is editorial work, and it is the work that determines how the model performs in the wild.
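To make "filter before it ranks" concrete, here is a small sketch under stated assumptions: the `Chunk` fields mirror the front-matter described above, and the word-overlap score is a toy stand-in for whatever ranking the real retrieval layer uses (embeddings, BM25, a re-ranker). All names and sample chunks are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading: str        # question-shaped heading, used as a semantic anchor
    text: str
    audience: str       # front-matter metadata
    jurisdiction: str
    product: str


def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required field."""
    return [c for c in chunks
            if all(getattr(c, key) == value for key, value in required.items())]


def overlap_score(query, chunk):
    """Toy ranking signal: count of lowercase words shared with the heading."""
    query_words = set(query.lower().split())
    heading_words = set(chunk.heading.lower().split())
    return len(query_words & heading_words)


def retrieve(query, chunks, top_k=3, **required):
    """Filter on metadata first, then rank only the surviving candidates."""
    candidates = filter_chunks(chunks, **required)
    ranked = sorted(candidates, key=lambda c: overlap_score(query, c), reverse=True)
    return ranked[:top_k]


# Made-up sample chunks.
chunks = [
    Chunk("How long do refunds take?", "placeholder body", "customer", "UK", "card"),
    Chunk("How long do refunds take?", "placeholder body", "customer", "US", "card"),
    Chunk("How do I escalate a refund dispute?", "placeholder body", "agent", "UK", "card"),
]
results = retrieve("how long do refunds take", chunks,
                   audience="customer", jurisdiction="UK")
```

The two refund chunks have identical headings, so no ranking signal could separate them. Only the jurisdiction filter does, which is why the metadata has to exist before the ranker runs.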

Recommended Actions

The four-week sweep

  • Week 1 — Inventory. Crawl every source the model will see. Hash the documents and identify duplicates. Tag canonical versus derived. Produce a single register the editorial team can work from.
  • Week 2 — Resolve. Retire duplicates. Resolve policy conflicts in writing rather than in retrieval logic. Stamp owner and last-reviewed date on every canonical document.
  • Week 3 — Restructure. Re-chunk the top hundred documents by traffic into question-shaped sections. Add front-matter metadata that the retrieval layer can use.
  • Week 4 — Evaluate. Run a 200-question golden set against the cleaned knowledge base. Compare scores to the dirty baseline. Decide on go or no-go for the model layer.
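The Week 1 hashing step can be sketched as follows. This is one possible approach, assuming documents are already loaded as text: it normalizes case and whitespace, then groups by SHA-256 digest. The paths and contents are invented for illustration.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a duplicate."""
    return " ".join(text.lower().split())


def content_hash(text):
    """SHA-256 digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_duplicates(docs):
    """docs maps path -> text. Returns groups of paths whose normalized content is identical."""
    groups = defaultdict(list)
    for path, text in docs.items():
        groups[content_hash(text)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]


# Made-up sample corpus.
docs = {
    "wiki/refunds.md": "Refunds are processed within 5 days.",
    "drive/refunds-copy.txt": "Refunds  are processed\nwithin 5 days.",
    "wiki/travel.md": "Book travel through the portal.",
}
duplicates = find_duplicates(docs)
```

Exact hashing only catches copies that are identical after normalization. Near-duplicates, such as a copy with one edited sentence, need a similarity measure on top, and those are often the stale versions that matter most in Week 2.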

Closing

The knowledge base is the prompt. The four weeks spent on hygiene compound into every answer the system produces for the next two years. The model layer becomes simpler, cheaper, and more accurate because the upstream work was done first. Teams that invest in the sweep stop arguing about prompts and start arguing about the work the system is making possible, which is the conversation worth having.