An Open Benchmark for Testing RAG on Realistic Company-Internal Data
Posted by Weves11@reddit | LocalLLaMA | 12 comments
We built a corpus of 500,000 documents simulating a real company, and then let RAG systems compete to find out which one is the best.
--
Introducing EnterpriseRAG-Bench, a benchmark for testing how well RAG systems work on messy, enterprise-scale internal knowledge.
Most RAG benchmarks are built on public data: Wikipedia, web pages, papers, forums, etc. That’s useful, but it doesn’t really match what a lot of people are building against in practice: Slack threads, email chains, tickets, meeting transcripts, PRs, CRM notes, docs, and wikis.
So we tried to generate a synthetic company that behaves more like a real one.
The released dataset simulates a company called Redwood Inference and includes about 500k documents across:
- Slack
- Gmail
- Linear
- Google Drive
- HubSpot
- Fireflies
- GitHub
- Jira
- Confluence
The part we spent the most time on was not just “generate a lot of docs.” It was the methodology for making the docs feel like they belong to the same company.
At a high level, the generation pipeline works like this:
- Create the company first. We start with a human-in-the-loop process to define the company: what it does, its products, business model, teams, initiatives, market, internal terminology, etc.
- Generate shared scaffolding. From there we generate things like high-level initiatives, an employee directory, source-specific folder structures, and agents.md files that describe what documents in each area should look like. For example, GitHub docs in the released corpus are pull requests and review comments, not random GitHub issues.
- Generate high-fidelity project documents. We break company initiatives into smaller projects/workstreams. Each project gets a set of related docs across sources: PRDs, Slack discussions, meeting notes, tickets, PRs, customer notes, etc. These documents are generated with awareness of each other, so you get realistic cross-document links and dependencies.
- Generate high-volume documents more cheaply. For the bulk of the corpus, we use topic scaffolding by source type (a rough sketch follows this list). This prevents the LLM from collapsing into the same few themes over and over. In a naive experiment where we asked an LLM to generate 100 company docs from only the company overview, over 40% had a very close duplicate/sibling. The topic scaffold was our way around that.
- Add realistic noise. Real enterprise data is not clean, so we intentionally add:
- randomly misplaced docs
- LLM-plausible misfiled docs
- near-duplicates with changed facts
- informal/misc files like memes, hackathon notes, random assets, etc.
- conflicting/outdated information
- Generate questions designed around retrieval failure modes. The benchmark has 500 questions across 10 categories, including:
- simple single-doc lookups
- semantic/low-keyword-overlap questions
- questions requiring reasoning across one long doc
- multi-doc project questions
- constrained queries with distractors
- conflicting-info questions
- completeness questions where you need all relevant docs
- questions about miscellaneous/off-topic docs
- high-level synthesis questions
- unanswerable questions
- Use correction-aware evaluation. At 500k docs, it is hard to guarantee the original gold document set is perfect. So the eval harness can consider candidate retrieved documents, judge whether they are required/valid/invalid, and update the gold set when the evidence supports it (rough illustration below).
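To give a rough sense of the correction-aware idea, here's a minimal sketch of how a judge could update the gold set; the function names and verdict labels are illustrative, not the actual harness code:

```python
# Hedged sketch of correction-aware evaluation: when a retrieved doc isn't in the
# gold set, an LLM judge decides whether it is required / valid / invalid for the
# question, and "required" docs are added to the gold set. Names are illustrative.
def judge(question: str, doc_text: str) -> str:
    """Placeholder LLM judge returning 'required', 'valid', or 'invalid'."""
    raise NotImplementedError

def correct_gold_set(question: str, gold: set[str], retrieved: dict[str, str]) -> set[str]:
    updated = set(gold)
    for doc_id, text in retrieved.items():
        if doc_id in updated:
            continue
        verdict = judge(question, text)
        if verdict == "required":
            updated.add(doc_id)  # evidence shows the original gold set was incomplete
        # 'valid' extras aren't penalized; 'invalid' ones count against the system
    return updated
```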
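And here's roughly what the topic scaffolding step looks like conceptually; `llm()` is a stand-in for any chat-completion call and the prompts are illustrative, not the actual generation framework:

```python
# Hedged sketch of topic scaffolding: generate a diverse, per-source-type topic
# list first, then generate one document per topic, instead of asking the LLM for
# "100 company docs" in one shot (which is what produced the ~40% near-duplicates).
import json

def llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (OpenAI, Anthropic, a local model, etc.)."""
    raise NotImplementedError

def generate_topics(company_overview: str, source_type: str, n: int) -> list[str]:
    prompt = (
        f"Company overview:\n{company_overview}\n\n"
        f"List {n} distinct, non-overlapping topics that '{source_type}' documents "
        f"at this company would plausibly cover. Return a JSON list of strings."
    )
    return json.loads(llm(prompt))

def generate_docs(company_overview: str, source_type: str, topics: list[str]) -> list[str]:
    return [
        llm(
            f"Company overview:\n{company_overview}\n\n"
            f"Write one realistic {source_type} document about: {topic}"
        )
        for topic in topics
    ]

# e.g. topics = generate_topics(overview, "Slack thread", n=200)
#      docs   = generate_docs(overview, "Slack thread", topics)
```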
A couple baseline findings from the paper:
- BM25 was surprisingly strong, beating vector search on overall correctness and document recall.
- Vector search underperformed even on semantic questions, which is interesting because those were designed to reduce keyword overlap.
- Agentic/bash-style retrieval had the best completeness, especially on questions where it needed to explore related files, but it was much slower and more expensive.
- In general, getting the right docs into context mattered a lot. Once the relevant evidence was retrieved, current LLMs were usually able to produce a good answer.
The repo includes the dataset, generation framework, evaluation harness, and leaderboard:
https://github.com/onyx-dot-app/EnterpriseRAG-Bench
Would love feedback from other people building RAG/search systems over internal company data. In particular, I’m curious what retrieval setups people think would do best here: hybrid search, rerankers, agents, metadata filters, query rewriting, graph-style traversal, etc.
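For concreteness, when I say hybrid search I'm picturing something like fusing BM25 and vector rankings, e.g. reciprocal rank fusion. A minimal sketch (the input rankings and the k=60 constant are illustrative assumptions, not part of the benchmark):

```python
# Hedged sketch of hybrid retrieval via reciprocal rank fusion (RRF): combine a
# BM25 ranking and a vector-search ranking of doc ids into a single fused ranking.
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked doc-id lists; earlier ranks contribute larger scores."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused_top10 = rrf_fuse([bm25_ranking, vector_ranking])[:10]
```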
9gxa05s8fa8sh@reddit
awesome work! document lookup is the next frontier because there's no reason to ask an AI to do work blind when tons of documentation for every job exists.
can you add openrag and lightrag?
https://github.com/langflow-ai/openrag
https://github.com/hkuds/lightrag
Chromix_@reddit
Good:
Unrealistic:
Weves11@reddit (OP)
thanks for the feedback! we definitely acknowledge that there's a lot of shortcomings and things we could've done better with this dataset, but hopefully it's a good enough starting point to build off of. We found ourselves wanting something like this for so long that we decided we just needed to build it 😄
dig1@reddit
Not to rain on the parade, but I find this approach too generic. In my opinion, the value of enterprise RAG lies in extracting specialized knowledge from a small set of high-signal data, as companies rarely maintain detailed documentation across the board.
Processing 500,000 documents is often unrealistic; unless a company has been meticulously documenting everything for more than 40 years, that volume is mostly noise. The same applies to Slack and Gmail - there is too much "chatter" and too little crucial information.
From my experience building a legal RAG system, hybrid search (BM25 + vector) is the baseline. Adding a knowledge graph on top (via a graph DB or search) is what makes it truly effective. However, success requires deep, domain-specific tuning. Without it, a generic RAG is often less effective than simply letting Claude or Gemini search Notion and Jira via MCP. My experience might differ in other domains, but specialized knowledge and tuning are usually the deciding factors; otherwise, you'll end up with a system that hallucinates like crazy.
Weves11@reddit (OP)
> companies rarely maintain detailed documentation across the board.
we tried our best to simulate exactly this in the dataset. We have a separate step in the process to add noise just for this purpose, because we realized that most data in companies is outdated, low-signal, or just outright noise. definitely not perfect, but we think it is a pretty close approximation
rising_air@reddit
Very interesting approach! 2 questions:
What does the overall column mean? At first I thought it's the avg of correctness, completeness and recall, but it's not. Why?
Do you really mean recall or context recall? If you mean recall, please write it as recall@k
Weves11@reddit (OP)
Exact_Guarantee4695@reddit
the bm25 finding is interesting but not totally surprising -- had the same thing happen in a project where we were sure vector search would win on semantic overlap queries and bm25 kept beating it. turned out the documents had enough shared vocabulary that tf-idf signals were actually pretty reliable. the thing that kills vector search on internal docs is usually that embeddings trained on public web text just don't understand company-internal terminology. "project falcon" means nothing to a generic embedding model. the agentic retrieval completeness result is the one I'd want to dig into more -- curious what the latency/cost penalty looks like at p95 for the harder multi-doc questions.
scottgal2@reddit
This was my whole approach with lucidRAG - not because I wanted better quality but because I needed to use tiny llms, so salience was critical. It uses a combination of classic search and vector search.
Never really productized it, but the Reduced RAG concept I built for lucidRAG is the basis for most of my stuff
https://www.mostlylucid.net/blog/reduced-rag
Weves11@reddit (OP)
interesting! am definitely gonna dig into this more to see if we had similar results here where the embedding step just didn't understand enterprise jargon
ReceptionBrave91@reddit
wow openclaw beating everything out speaks a lot to the future of RAG, maybe an agent loop with filesystem + hybrid search is the best solution
Weves11@reddit (OP)
oh 100%, the real problem here is that an agent loop is just so so slow, so finding a way to funnel the search space is an important problem to solve