Knowledge Base Audit Checklist Before AI Go-Live

Summary

When a retrieval system disappoints, teams blame the model and burn a quarter tuning prompts and swapping embeddings. The real constraint sits upstream, in a corpus full of duplicates, stale versions, and policies that flatly contradict each other. No prompt can reconcile two documents that disagree. A four-week editorial sweep run before the model connects (de-duplicate, resolve conflicts, stamp ownership, re-chunk by question, then score against a golden set) commonly lifts golden-set accuracy from the low 60s to the mid 80s, in percent. It is unglamorous work, which is why it gets skipped and why it is the highest-leverage move you have.

Context

The knowledge base is the model's real prompt

When a retrieval-augmented system disappoints, the team almost always blames the model. They tune prompts, swap the embedding model, and bolt on a re-ranker. The answers improve by a few points, everyone declares partial victory, and the program moves on. The real constraint sat upstream the whole time. The corpus feeding the system carries duplicates, stale versions, conflicting policies, broken metadata, orphaned drafts, and content chunked by chapter when questions arrive by topic. No prompt can reconcile two documents that flatly disagree, and no re-ranker can promote a fresh answer that was never marked as fresh.

A four-week sweep run before the model is connected typically produces a larger jump in answer quality than a quarter of prompt engineering. On a 200-question golden set, a dirty corpus commonly scores in the low 60s (percent correct), while the same content, swept, reaches the mid 80s without touching the model at all. The work is unglamorous editorial hygiene rather than machine learning, which is exactly why it gets skipped and exactly why it is the highest-leverage move in the program. It is also cheaper to fund. Four weeks of an editor and a data analyst cost a fraction of the six months of engineering time teams burn tuning a model that was never the constraint, and the sweep leaves behind a durable asset: a clean, owned corpus that keeps paying off long after the current model is retired.

Why the model cannot fix this for you

A language model gives stale content the same confidence it gives fresh content, and it gives a duplicate the same weight it gives the canonical source. Confidence is a property of fluent text, not of correctness. So when two versions of a policy both retrieve, the model picks one, sounds certain, and is wrong a predictable fraction of the time. You cannot prompt your way out of a corpus that contradicts itself. This is also why swapping to a larger or newer model rarely moves the number: a stronger reader of a self-contradicting library still has to guess which contradiction to trust. The leverage sits in the library, not the reader. Every hour spent making the corpus internally consistent pays off across every question the system will ever answer, while every hour spent on the model pays off only until the next model release resets the baseline.

The framework

Three disciplines and a four-week plan

The sweep rests on three disciplines and runs on a fixed four-week calendar. The disciplines are a single source of truth, a freshness contract, and retrieval-aware structure. The calendar turns them into concrete weekly outputs, each with a measure that tells you whether the week actually moved the corpus. Run the weeks in order, because restructuring content you have not yet de-duplicated just multiplies the mess.

Week	Focus	Key actions	Output	Success measure
Week 1	Inventory	Crawl every source; hash documents; tag canonical vs derived	Single content register	100% of sources hashed and classified
Week 2	Resolve	Retire duplicates; settle policy conflicts in writing; stamp owner and review date	Conflict-free canonical set	Zero unresolved policy conflicts
Week 3	Restructure	Re-chunk top 100 documents by question; add front-matter metadata	Retrieval-aware corpus	Top 100 by traffic re-chunked
Week 4	Evaluate	Run 200-question golden set; compare to dirty baseline; go or no-go	Scored evaluation report	Golden-set score above agreed bar
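
The Week 1 register can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed tool: the names `RegisterEntry`, `normalize`, `build_register`, and `coverage` are invented for the example, and so is the rule that the first copy seen of each hash is tagged canonical. In a real sweep an editor would choose the canonical copy.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RegisterEntry:
    source: str   # where the document lives (path or URL)
    digest: str   # SHA-256 of the normalized text
    status: str   # "canonical" or "derived"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences hash alike."""
    return " ".join(text.lower().split())


def build_register(docs: dict[str, str]) -> list[RegisterEntry]:
    """Hash every document. Assumption: the first copy seen of a digest is canonical."""
    seen: set[str] = set()
    register = []
    for source, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        status = "derived" if digest in seen else "canonical"
        seen.add(digest)
        register.append(RegisterEntry(source, digest, status))
    return register


def coverage(register: list[RegisterEntry], total_sources: int) -> float:
    """Share of sources hashed and classified (the Week 1 success measure)."""
    return len(register) / total_sources if total_sources else 0.0
```

Exact hashing only catches copies that match after normalization. Near-duplicates, where a sentence or two has drifted, need a similarity measure rather than a hash.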

Stale content is worse than no content, because the model gives it the same confidence as the truth.

A worked example

A support team pointed a chatbot at 3,200 help articles and saw 61 percent golden-set accuracy. The Week 1 inventory found 540 near-duplicates and 47 articles describing three different refund windows. Week 2 retired the duplicates, cutting the corpus to 2,600 canonical documents, and settled the refund policy in one owned source. Week 3 re-chunked the 100 highest-traffic articles from long chapters into question-shaped sections and tagged each with product and region. Week 4 re-scored the same golden set at 87 percent, a gain of 26 points. The model, the embeddings, and the prompt never changed. The corpus did.
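
The example does not say how the near-duplicates were found. One common approach, sketched here as an assumption rather than as the team's actual method, compares documents by the overlap of their word shingles (Jaccard similarity). The function names, the three-word shingle size, and the 0.8 threshold are all illustrative choices.

```python
from itertools import combinations


def shingles(text: str, size: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union; 1.0 means identical shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs: dict[str, str],
                    threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Compare every pair of documents and return those at or above the threshold."""
    sets = {name: shingles(text) for name, text in docs.items()}
    pairs = []
    for left, right in combinations(sets, 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs
```

The pairwise loop is quadratic in the number of documents. That is workable for a few thousand articles; a much larger corpus would likely need an approximate method such as MinHash. Flagged pairs are candidates for an editor to review, not automatic deletions.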

Recommended actions

Running the sweep so it sticks

Freeze the model layer for the four weeks. Ban prompt and embedding changes so the golden-set improvement is unambiguously attributable to the corpus, not to a coincidental model tweak.
Name a canonical document for every topic the system will answer about, and demote everything else to derived, so the model never has to arbitrate between two sources of truth.
Write a freshness contract per canonical document: an owner, a refresh cadence, and a last-reviewed date enforced by automation that flags or drops anything past its window.
Re-chunk by question rather than by chapter for at least the top 100 documents by traffic, and use headings as semantic anchors so a retrieved section answers a real query on its own.
Build a 200-question golden set from actual user queries before you start, score the dirty baseline, and treat the Week 4 re-score as the go or no-go gate for connecting the model.
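
The freshness contract in the list above can be sketched as a small record plus a check that automation runs on a schedule. The field names and the `overdue` helper are hypothetical; the only elements taken from the text are the owner, the refresh cadence, and the last-reviewed date.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FreshnessContract:
    doc_id: str
    owner: str
    refresh_days: int     # agreed refresh cadence, in days
    last_reviewed: date

    def due_date(self) -> date:
        """Date by which the next review must happen."""
        return self.last_reviewed + timedelta(days=self.refresh_days)


def overdue(contracts: list[FreshnessContract], today: date) -> list[FreshnessContract]:
    """Return contracts whose review window has closed, oldest due date first."""
    late = [c for c in contracts if c.due_date() < today]
    return sorted(late, key=lambda c: c.due_date())
```

Whether an overdue document is flagged to its owner or dropped from the retrieval index is a policy choice the contract should state. The sketch only identifies which documents are past their window.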

Common pitfalls

How hygiene programs backslide

Tuning the model instead of the corpus. Fix: freeze prompts and embeddings for the sweep so the fix lands where the constraint actually is.
Resolving policy conflicts inside retrieval logic. Fix: settle the conflict in writing in one canonical document, so the truth lives in content, not in code no editor can see.
Chunking by chapter because that is how documents are written. Fix: re-chunk by question, because the retrieval layer matches queries to sections, not to table-of-contents structure.
A one-time sweep with no freshness contract. Fix: assign owners and enforced review dates, or the corpus rots back to its dirty baseline within two quarters.
Declaring victory without a golden set. Fix: score before and after against real queries, because without a baseline number the improvement is a feeling, not a result.
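
The before-and-after scoring in the last pitfall can be sketched as follows. This is a simplified illustration: `golden_set_score` and `go_no_go` are invented names, the `answer` and `judge` callables stand in for whatever system and grading method a team actually uses, and the pass bar is whatever the team agreed in advance.

```python
from typing import Callable


def golden_set_score(golden: list[tuple[str, str]],
                     answer: Callable[[str], str],
                     judge: Callable[[str, str], bool]) -> float:
    """Percentage of golden questions the system answers correctly."""
    if not golden:
        raise ValueError("golden set is empty")
    correct = sum(1 for question, expected in golden
                  if judge(answer(question), expected))
    return 100.0 * correct / len(golden)


def go_no_go(baseline: float, swept: float, bar: float) -> dict:
    """Summarize the Week 4 gate: delta over baseline and whether the bar was cleared."""
    return {
        "baseline": baseline,
        "swept": swept,
        "delta": swept - baseline,
        "go": swept > bar,
    }
```

With the worked example's figures and an assumed bar of 80, `go_no_go(61.0, 87.0, bar=80.0)` reports a delta of 26 points and a go decision. The same `answer` function and the same golden set must be used for both runs, or the delta stops measuring the corpus.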

Quick-win checklist

Hygiene moves for the next 30 days

Crawl and hash every source the model will see, and tag each document canonical or derived.
Retire the duplicate set and resolve the top ten policy conflicts in owned, canonical documents.
Stamp an owner and a last-reviewed date on every canonical document.
Re-chunk the 100 highest-traffic documents into question-shaped sections with front-matter metadata.
Score a 200-question golden set against both the dirty and the swept corpus and record the delta.
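
Re-chunking by question can be sketched as a split on headings, with the document's metadata copied onto every chunk. The sketch assumes source articles are plain text whose section headings start with `## ` and are already phrased as questions; that marker, the `chunk_by_heading` name, and the metadata keys are illustrative, not a required format.

```python
def chunk_by_heading(doc: str, metadata: dict[str, str],
                     marker: str = "## ") -> list[dict]:
    """Split a document at each heading and attach shared metadata to every chunk.

    Text before the first heading is dropped, on the assumption that it is
    introductory material that does not answer a question on its own.
    """
    chunks = []
    heading, body = None, []
    for line in doc.splitlines():
        if line.startswith(marker):
            if heading is not None:
                chunks.append({"question": heading,
                               "text": "\n".join(body).strip(), **metadata})
            heading, body = line[len(marker):].strip(), []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        chunks.append({"question": heading,
                       "text": "\n".join(body).strip(), **metadata})
    return chunks
```

Each chunk carries its heading as a semantic anchor plus tags such as product and region, so a retrieved section can be filtered and can answer a query without its neighbors. Articles still written as long chapters need an editor to rewrite the headings first; the split itself is the easy part.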

Stratenity is the AI Operating System for Strategic Execution™.
