Would a “Knowledge Coverage Audit” tool be useful for RAG/chatbot builders?
Posted by Thesmellofstupid@reddit | LocalLLaMA | 4 comments
When people build custom GPTs or RAG pipelines, they usually just upload everything because it’s not clear what the base model already covers. That creates two problems:
- Redundancy – wasting time/vector DB space chunking stuff the model already knows (basic definitions, Wikipedia-tier knowledge).
- Missed value – the real differentiator (local regs, proprietary manuals, recency gaps) doesn't always get prioritized.
The idea: a lightweight tool that runs a structured "knowledge coverage audit" against a topic or corpus before ingestion.
- It probes the base model across breadth, depth, and recency.
- It scores coverage (e.g., "Beekeeping basics = 80%, State regulations = 20%, Post-2023 advisories = 5%").
- It kicks out a practical report: "Skip general bee biology; do upload state regs, kit manuals, and recent advisories."
Basically, a triage step before RAG, so builders know what to upload vs. skip.
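For concreteness, here's a minimal sketch of what that probing step could look like. Everything in it is hypothetical: the `PROBES` table, `ask_model`, and the keyword-matching scorer are placeholder choices to illustrate the idea, not a real tool. You'd swap in your own model call and a probe set generated for your topic.

```python
# Minimal sketch of a pre-ingestion "knowledge coverage audit".
# All names here (PROBES, ask_model, audit_coverage) are hypothetical.

# topic -> list of (probe question, keyword a correct answer should contain)
PROBES = {
    "Beekeeping basics": [
        ("What does a worker bee do in a hive?", "forag"),
        ("What is royal jelly?", "queen"),
    ],
    "State regulations": [
        ("What does Ohio law require before selling honey?", "license"),
    ],
}

def ask_model(question: str) -> str:
    """Placeholder: swap in a real call to your base model
    (OpenAI API, llama.cpp server, Ollama, etc.)."""
    return ""  # stub so the script runs end to end

def audit_coverage(probes: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Score each topic by the fraction of probes the base model answers
    with the expected keyword. Crude, but enough for triage."""
    scores = {}
    for topic, questions in probes.items():
        hits = sum(
            keyword.lower() in ask_model(question).lower()
            for question, keyword in questions
        )
        scores[topic] = hits / len(questions)
    return scores

if __name__ == "__main__":
    for topic, score in audit_coverage(PROBES).items():
        verdict = "skip upload" if score >= 0.8 else "upload docs"
        print(f"{topic}: {score:.0%} covered -> {verdict}")
```

In practice you'd probably want an LLM judge or embedding similarity rather than keyword matching, but even a crude version like this would surface the "80% / 20% / 5%" style triage described above.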
Questions:
- Would this actually save you time/compute, or do you just upload everything anyway?
- For those running larger projects: would a pre-ingestion audit be valuable, or is the safer path always "dump the full corpus"?
Curious if this is a real pain point for people here, or if it’s just over-thinking.
jannemansonh@reddit
Interesting thought, will check again in a few days. RemindMe 5 days!
Thesmellofstupid@reddit (OP)
Yeah, part of me feels like I'm running my own thought experiment without much knowledge or experience in this space, but another part of me intuitively gets LLMs and how they can layer over RAG and drive efficiency. I'm very much a language person and also very much a process-aware person, so I've been thinking about this a lot. I definitely appreciate you chiming in. I know Needle's whole thing is making RAG pipelines cleaner and more production-ready; this idea is a layer before that: figuring out what the base model already handles so teams don't waste time/space dumping redundant docs. It feels like it could complement what tools like Needle are doing, since it's more about triage than infrastructure. Curious if that resonates with what you see in enterprise projects.
Lesser-than@reddit
I am a firm believer that if you want a factual answer, you need to provide the facts. LLMs so far are not a store of knowledge like an encyclopedia. They are more like a thesaurus/dictionary on steroids.
Thesmellofstupid@reddit (OP)
Totally agree LLMs aren't encyclopedias; they're pattern engines. The idea here isn't to "read the model's memory," but to probe where it tends to be strong (e.g., basics) vs. weak (e.g., local regs, recency). That way builders know which docs to actually prioritize.