Would a “Knowledge Coverage Audit” tool be useful for RAG/chatbot builders?

Posted by Thesmellofstupid@reddit | LocalLLaMA

When people build custom GPTs or RAG pipelines, they usually just upload everything because it’s not clear what the base model already covers. That creates two problems:

  1. Redundancy – wasting time/vector DB space chunking stuff the model already knows (basic definitions, Wikipedia-tier knowledge).

  2. Missed value – the real differentiator (local regs, proprietary manuals, recency gaps) doesn’t always get prioritized.

The idea: a lightweight tool that runs a structured “knowledge coverage audit” against a topic or corpus before ingestion.

  • It probes the base model across breadth, depth, and recency.

  • It scores coverage (e.g., “Beekeeping basics = 80%, State regulations = 20%, Post-2023 advisories = 5%”).

  • It kicks out a practical report: “Skip general bee biology; do upload state regs, kit manuals, and recent advisories.”

Basically, a triage step before RAG, so builders know what to upload vs. skip.
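For concreteness, here’s a rough sketch of how the probe → score → report loop could work. Everything here is a placeholder I made up to illustrate the idea, not an existing tool: the local OpenAI-compatible endpoint (`BASE_URL`, `MODEL`), the hand-written `PROBES` question/reference pairs, and the 80% skip threshold are all assumptions.

```python
# Minimal sketch of a pre-ingestion "knowledge coverage audit".
# Assumes an OpenAI-compatible local server (e.g. llama.cpp / vLLM) and
# hand-written probe questions with reference answers per topic area.
from openai import OpenAI

BASE_URL = "http://localhost:8080/v1"  # placeholder local endpoint
MODEL = "local-model"                  # placeholder model name

client = OpenAI(base_url=BASE_URL, api_key="not-needed")

# Probe sets: topic area -> (question, reference answer) pairs.
# References for niche topics would come from your own corpus.
PROBES = {
    "Beekeeping basics": [
        ("What is the role of the queen bee in a hive?",
         "She is the only fertile female and lays nearly all of the hive's eggs."),
    ],
    "State regulations": [
        ("Do backyard beekeepers in my state have to register their hives?",
         "Yes, annual registration with the state agriculture department (taken from your own regs)."),
    ],
}

def ask(prompt: str) -> str:
    """Query the base model closed-book (no retrieved context)."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def judge(question: str, reference: str, answer: str) -> bool:
    """Grade the closed-book answer against the reference using the same model."""
    verdict = ask(
        f"Question: {question}\nReference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Does the candidate match the reference on the key facts? Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def audit() -> dict[str, float]:
    """Coverage score (0-1) per topic area = fraction of probes answered correctly."""
    return {
        topic: sum(judge(q, ref, ask(q)) for q, ref in probes) / len(probes)
        for topic, probes in PROBES.items()
    }

if __name__ == "__main__":
    for topic, score in audit().items():
        action = "skip ingestion" if score >= 0.8 else "prioritize for upload"
        print(f"{topic}: {score:.0%} covered -> {action}")
```

In a real version you’d probably generate the probes automatically from corpus headings and use a separate, stronger judge model, but even a crude loop like this is enough to produce the “skip X, upload Y” report described above.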

Questions:

  • Would this actually save you time/compute, or do you just upload everything anyway?

  • For those running larger projects: would a pre-ingestion audit be valuable, or is the safer path always “dump the full corpus”?

Curious whether this is a real pain point for people here, or if I’m just overthinking it.