An Open Benchmark for Testing RAG on Realistic Company-Internal Data

Posted by Weves11@reddit | LocalLLaMA | View on Reddit | 12 comments


We built a corpus of 500,000 documents simulating a real company, then had RAG systems compete to see which performs best.

--

Introducing EnterpriseRAG-Bench, a benchmark for testing how well RAG systems work on messy, enterprise-scale internal knowledge.

Most RAG benchmarks are built on public data: Wikipedia, web pages, papers, forums, etc. That’s useful, but it doesn’t really match what a lot of people are building against in practice: Slack threads, email chains, tickets, meeting transcripts, PRs, CRM notes, docs, and wikis.

So we tried to generate a synthetic company that behaves more like a real one.

The released dataset simulates a company called Redwood Inference and includes about 500k documents across sources like Slack, email, tickets, meeting transcripts, PRs, CRM notes, docs, and wikis.

The part we spent the most time on was not just "generate a lot of docs." It was the methodology for making the docs feel like they all belong to the same company.

At a high level, the generation pipeline works like this:

  1. Create the company first. We start with a human-in-the-loop process to define the company: what it does, its products, business model, teams, initiatives, market, internal terminology, etc.
  2. Generate shared scaffolding. From there we generate things like high-level initiatives, an employee directory, source-specific folder structures, and agents.md files that describe what documents in each area should look like. For example, GitHub docs in the released corpus are pull requests and review comments, not random GitHub issues.
  3. Generate high-fidelity project documents. We break company initiatives into smaller projects/workstreams. Each project gets a set of related docs across sources: PRDs, Slack discussions, meeting notes, tickets, PRs, customer notes, etc. These documents are generated with awareness of each other, so you get realistic cross-document links and dependencies.
  4. Generate high-volume documents more cheaply. For the bulk of the corpus, we use topic scaffolding by source type. This prevents the LLM from collapsing into the same few themes over and over. In a naive experiment, when we asked an LLM to generate 100 company docs with only the company overview, over 40% had a very close duplicate/sibling. The topic scaffold was our way around that.
  5. Add realistic noise. Real enterprise data is not clean, so we intentionally add:
     - randomly misplaced docs
     - LLM-plausible misfiled docs
     - near-duplicates with changed facts
     - informal/misc files like memes, hackathon notes, random assets, etc.
     - conflicting/outdated information
  6. Generate questions designed around retrieval failure modes. The benchmark has 500 questions across 10 categories, including:
     - simple single-doc lookups
     - semantic/low-keyword-overlap questions
     - questions requiring reasoning across one long doc
     - multi-doc project questions
     - constrained queries with distractors
     - conflicting-info questions
     - completeness questions where you need all relevant docs
     - questions over miscellaneous/off-topic docs
     - high-level synthesis questions
     - unanswerable questions
  7. Use correction-aware evaluation. At 500k docs, it is hard to guarantee the original gold document set is perfect. So the eval harness can consider candidate retrieved documents, judge whether they are required/valid/invalid, and update the gold set when the evidence supports it.

The paper also reports a couple of baseline findings.

The repo includes the dataset, generation framework, evaluation harness, and leaderboard:

https://github.com/onyx-dot-app/EnterpriseRAG-Bench

Would love feedback from other people building RAG/search systems over internal company data. In particular, I’m curious what retrieval setups people think would do best here: hybrid search, rerankers, agents, metadata filters, query rewriting, graph-style traversal, etc.
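For anyone sketching a hybrid-search entry: a common way to combine a lexical ranking (e.g. BM25) with a dense-retriever ranking is reciprocal rank fusion. This is a generic technique, not something the benchmark prescribes; the retriever names are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a (hypothetical) BM25 ranking with a dense ranking.
bm25_hits = ["d1", "d2", "d3"]
dense_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

`k=60` is the conventional default from the original RRF formulation; it damps the influence of any single list's top ranks.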