An Open Benchmark for Testing RAG on Realistic Company-Internal Data
Posted by Weves11@reddit | LocalLLaMA | 12 comments
We built a corpus of 500,000 documents simulating a real company, and then let RAG systems compete to find out which one is the best.
--
Introducing EnterpriseRAG-Bench, a benchmark for testing how well RAG systems work on messy, enterprise-scale internal knowledge.
Most RAG benchmarks are built on public data: Wikipedia, web pages, papers, forums, etc. That’s useful, but it doesn’t really match what a lot of people are building against in practice: Slack threads, email chains, tickets, meeting transcripts, PRs, CRM notes, docs, and wikis.
So we tried to generate a synthetic company that behaves more like a real one.
The released dataset simulates a company called Redwood Inference and includes about 500k documents across:
- Slack
- Gmail
- Linear
- Google Drive
- HubSpot
- Fireflies
- GitHub
- Jira
- Confluence
The part we spent the most time on was not just “generate a lot of docs.” It was the methodology for making the docs feel like they belong to the same company.
At a high level, the generation pipeline works like this:
- Create the company first. We start with a human-in-the-loop process to define the company: what it does, its products, business model, teams, initiatives, market, internal terminology, etc.
- Generate shared scaffolding. From there we generate things like high-level initiatives, an employee directory, source-specific folder structures, and agents.md files that describe what documents in each area should look like. For example, GitHub docs in the released corpus are pull requests and review comments, not random GitHub issues.
- Generate high-fidelity project documents. We break company initiatives into smaller projects/workstreams. Each project gets a set of related docs across sources: PRDs, Slack discussions, meeting notes, tickets, PRs, customer notes, etc. These documents are generated with awareness of each other, so you get realistic cross-document links and dependencies.
- Generate high-volume documents more cheaply. For the bulk of the corpus, we use topic scaffolding by source type (a rough sketch follows this list). This prevents the LLM from collapsing into the same few themes over and over. In a naive experiment where we asked an LLM to generate 100 company docs from only the company overview, over 40% had a very close duplicate/sibling. The topic scaffold was our way around that.
- Add realistic noise. Real enterprise data is not clean, so we intentionally add:
- randomly misplaced docs
- LLM-plausible misfiled docs
- near-duplicates with changed facts
- informal/misc files like memes, hackathon notes, random assets, etc.
- conflicting/outdated information
- Generate questions designed around retrieval failure modes. The benchmark has 500 questions across 10 categories, including:
- simple single-doc lookups
- semantic/low-keyword-overlap questions
- questions requiring reasoning across one long doc
- multi-doc project questions
- constrained queries with distractors
- conflicting-info questions
- completeness questions where you need all relevant docs
- questions about miscellaneous/off-topic docs
- high-level synthesis questions
- unanswerable questions
- Use correction-aware evaluation. At 500k docs, it is hard to guarantee the original gold document set is perfect. So the eval harness can consider candidate retrieved documents, judge whether they are required/valid/invalid, and update the gold set when the evidence supports it (rough illustration below).
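To give a rough sense of the correction-aware idea, here's a minimal sketch of how a judge could update the gold set; the function names and verdict labels are illustrative, not the actual harness code:

```python
# Hedged sketch of correction-aware evaluation: when a retrieved doc isn't in the
# gold set, an LLM judge decides whether it is required / valid / invalid for the
# question, and "required" docs are added to the gold set. Names are illustrative.
def judge(question: str, doc_text: str) -> str:
    """Placeholder LLM judge returning 'required', 'valid', or 'invalid'."""
    raise NotImplementedError

def correct_gold_set(question: str, gold: set[str], retrieved: dict[str, str]) -> set[str]:
    updated = set(gold)
    for doc_id, text in retrieved.items():
        if doc_id in updated:
            continue
        verdict = judge(question, text)
        if verdict == "required":
            updated.add(doc_id)  # evidence shows the original gold set was incomplete
        # 'valid' extras aren't penalized; 'invalid' ones count against the system
    return updated
```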
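And here's roughly what the topic scaffolding step looks like conceptually; `llm()` is a stand-in for any chat-completion call and the prompts are illustrative, not the actual generation framework:

```python
# Hedged sketch of topic scaffolding: generate a diverse, per-source-type topic
# list first, then generate one document per topic, instead of asking the LLM for
# "100 company docs" in one shot (which is what produced the ~40% near-duplicates).
import json

def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError

def generate_topics(company_overview: str, source_type: str, n: int) -> list[str]:
    prompt = (
        f"Company overview:\n{company_overview}\n\n"
        f"List {n} distinct, non-overlapping topics that '{source_type}' documents "
        f"at this company would plausibly cover. Return a JSON list of strings."
    )
    return json.loads(llm(prompt))

def generate_docs(company_overview: str, source_type: str, topics: list[str]) -> list[str]:
    return [
        llm(
            f"Company overview:\n{company_overview}\n\n"
            f"Write one realistic {source_type} document about: {topic}"
        )
        for topic in topics
    ]

# e.g. topics = generate_topics(overview, "Slack thread", n=200)
#      docs   = generate_docs(overview, "Slack thread", topics)
```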
A couple baseline findings from the paper:
- BM25 was surprisingly strong, beating vector search on overall correctness and document recall.
- Vector search underperformed even on semantic questions, which is interesting because those were designed to reduce keyword overlap.
- Agentic/bash-style retrieval had the best completeness, especially on questions where it needed to explore related files, but it was much slower and more expensive.
- In general, getting the right docs into context mattered a lot. Once the relevant evidence was retrieved, current LLMs were usually able to produce a good answer.
The repo includes the dataset, generation framework, evaluation harness, and leaderboard:
https://github.com/onyx-dot-app/EnterpriseRAG-Bench
Would love feedback from other people building RAG/search systems over internal company data. In particular, I’m curious what retrieval setups people think would do best here: hybrid search, rerankers, agents, metadata filters, query rewriting, graph-style traversal, etc.
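For concreteness, when I say hybrid search I'm picturing something like fusing BM25 and vector rankings, e.g. reciprocal rank fusion. A minimal sketch (the input rankings and the k=60 constant are illustrative assumptions, not part of the benchmark):

```python
# Hedged sketch of hybrid retrieval via reciprocal rank fusion (RRF): combine a
# BM25 ranking and a vector-search ranking of doc ids into a single fused ranking.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists; earlier ranks contribute larger scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused_top10 = rrf_fuse([bm25_ranking, vector_ranking])[:10]
```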
9gxa05s8fa8sh@reddit
awesome work! document lookup is the next frontier because there's no reason to ask an AI to do work blind when tons of documentation for every job exists.
can you add openrag and lightrag?
https://github.com/langflow-ai/openrag
https://github.com/hkuds/lightrag
Chromix_@reddit
Good:
Unrealistic:
Weves11@reddit (OP)
thanks for the feedback! we definitely acknowledge that there's a lot of shortcomings and things we could've done better with this dataset, but hopefully it's a good enough starting point to build off of. We found ourselves wanting something like this for so long that we decided we just needed to build it 😄
dig1@reddit
Not to rain on the parade, but I find this approach too generic. In my opinion, the value of enterprise RAG lies in extracting specialized knowledge from a small set of high-signal data, as companies rarely maintain detailed documentation across the board.
Processing 500,000 documents is often unrealistic; unless a company has been meticulously documenting everything for more than 40 years, that volume is mostly noise. The same applies to Slack and Gmail - there is too much "chatter" and too little crucial information.
From my experience building a legal RAG system, hybrid search (BM25 + vector) is the baseline. Adding a knowledge graph on top (via a graph DB or search) is what makes it truly effective. However, success requires deep, domain-specific tuning. Without it, a generic RAG is often less effective than simply letting Claude or Gemini search Notion and Jira via MCP. My experience might differ in other domains, but specialized knowledge and tuning are usually the deciding factors; otherwise, you'll end up with a system that hallucinates like crazy.
Weves11@reddit (OP)
> companies rarely maintain detailed documentation across the board.
we tried our best to simulate exactly this in the dataset. We have a separate step in the process to add noise just for this purpose, because we realized that most data in companies is outdated, low-signal, or just outright noise. definitely not perfect, but we think it is a pretty close approximation
rising_air@reddit
Very interesting approach! 2 questions:
What does the overall column mean? At first I thought it's the avg of correctness, completeness and recall, but it's not. Why?
Do you really mean recall or context recall? If you mean recall, please write it as recall@k
Weves11@reddit (OP)
Exact_Guarantee4695@reddit
the bm25 finding is interesting but not totally surprising -- had the same thing happen in a project where we were sure vector search would win on semantic overlap queries and bm25 kept beating it. turned out the documents had enough shared vocabulary that tf-idf signals were actually pretty reliable. the thing that kills vector search on internal docs is usually that embeddings trained on public web text just don't understand company-internal terminology. "project falcon" means nothing to a generic embedding model. the agentic retrieval completeness result is the one I'd want to dig into more -- curious what the latency/cost penalty looks like at p95 for the harder multi-doc questions.
scottgal2@reddit
This was my whole approach with lucidRAG - not because I wanted better quality but because I needed to use tiny llms, so salience was critical. It uses a combination of classic search and vector search.
Never really productized it, but the Reduced RAG concept I built for lucidRAG is the basis for most of my stuff
https://www.mostlylucid.net/blog/reduced-rag
Weves11@reddit (OP)
interesting! am definitely gonna dig into this more to see if we had similar results here where the embedding step just didn't understand enterprise jargon
ReceptionBrave91@reddit
wow openclaw beating everything out speaks a lot to the future of RAG, maybe an agent loop with filesystem + hybrid search is the best solution
Weves11@reddit (OP)
oh 100%, the real problem here is that an agent loop is just so so slow, so finding a way to funnel the search space is an important problem to solve