Benchmarking Extraction Accuracy: Why I'm moving back to Chunking despite 1M+ Context Windows
Posted by Downtown-Mixture5555@reddit | LocalLLaMA | View on Reddit | 6 comments
Hey everyone,
I’ve been working on a stealth project that requires high-precision fact extraction from technical documents (think 50-100 pages). While I’ve been using Gemini’s long-context capabilities, the "Lost in the Middle" effect is proving to be a major hurdle for production-grade reliability.
Even if the needle-in-a-haystack test passes, the "all-needles-in-the-haystack" test (extracting every key fact) fails as the context grows. I'm seeing a drop from ~95% accuracy at 10k tokens down to ~82% at 50k tokens, concentrated specifically in the middle of the document.
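For anyone wanting to reproduce this kind of "all-needles" measurement: the metric is just recall over a gold set of facts. A minimal sketch (the normalization here is illustrative; in practice you'd want fuzzier matching):

```python
def extraction_recall(extracted: list[str], gold: list[str]) -> float:
    """Fraction of gold facts recovered by the model's extraction.

    Facts are compared after whitespace/case normalization, so only
    near-verbatim matches count as hits.
    """
    norm = lambda s: " ".join(s.lower().split())
    found = {norm(f) for f in extracted}
    hits = sum(1 for g in gold if norm(g) in found)
    return hits / len(gold) if gold else 1.0
```

Bucketing the gold facts by their position in the document (beginning / middle / end) is what surfaces the "Lost in the Middle" concentration.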
The Strategy:
I'm shifting to a "Chunk, Extract, and Merge" architecture.
The Pro: Accuracy stays high because the model only "sees" 10-15 pages at a time.
The Con: Contextual awareness across the whole document (e.g., a concept defined on page 2 and used on page 40) is harder to manage.
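For concreteness, here's a minimal sketch of the chunk-and-merge half of the architecture (the extractor itself would be an LLM call; chunk size and overlap values are illustrative). The overlap is what gives each fact at least one chunk where it appears whole:

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows so a fact that
    straddles a chunk boundary still appears intact in one chunk."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

def merge_facts(per_chunk_facts: list[list[str]]) -> list[str]:
    """Merge per-chunk extractions, de-duplicating on normalized text,
    since overlapping chunks will surface the same fact twice."""
    seen, merged = set(), []
    for facts in per_chunk_facts:
        for fact in facts:
            key = " ".join(fact.lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(fact)
    return merged
```

The cross-document awareness problem (the Con above) lives in the merge step: a definitions pass over the first chunks, prepended to every later chunk's prompt, is one common mitigation.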
The Citation Problem:
I'm also struggling with "verbatim" grounding. Models are great at pointing to the correct page, but they love to "hallucinate-paraphrase" when asked for direct quotes.
Has anyone found a way to guarantee 100% verbatim grounding in long-context calls, or is "Navigate to Page X" the only honest UI/UX?
Curious to hear how others are balancing token costs vs. extraction reliability in high-stakes domains.
LocalLLaMA-ModTeam@reddit
This post has been marked as spam.
deejeycris@reddit
I think that with a specific agent pipeline, the quoting problem can be mitigated if not solved (though it does not solve initial hallucinations). Essentially you have the model decompose the output into claims, then ask it to provide quotes in support of each claim and check if they're accurate or not. To improve reliability, substring matching of the quotes against the corpus could be used to verify if they're really quoting or hallucinating.
Downtown-Mixture5555@reddit (OP)
You're totally right—trying to force the LLM to generate the claim and ground it with a verbatim quote in a single zero-shot prompt is where the hallucination slips in.
Implementing a deterministic backend step (standard substring matching of the model's proposed quote against the raw text of the cited page) is a highly practical fallback.
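The check itself is tiny; the only subtlety is normalizing whitespace so a quote reflowed across line breaks still matches. A minimal sketch (function name is mine, not from any framework):

```python
import re

def verify_quote(quote: str, page_text: str) -> bool:
    """Deterministic grounding check: does the model's proposed quote
    appear verbatim (modulo whitespace and case) in the cited page?"""
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(quote) in norm(page_text)
```

Anything that fails this check gets sent back to the model for re-quoting, or downgraded to "Navigate to Page X" in the UI.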
Really appreciate this insight! Are you using any specific framework for your agent pipelines (LangChain, LlamaIndex), or just rolling custom orchestration?
deejeycris@reddit
I've been experimenting with Hermes but have now switched to Pi; Hermes still feels quite finicky and needs more time. Pi is also at an early stage, but I like the minimalistic core and high extensibility, whereas Hermes already feels bloated in comparison. So I'd say I'm rolling my own custom orchestration for now, unless I find a really good framework that integrates well with an open-source harness.
Downtown-Mixture5555@reddit (OP)
I completely get it. The harness is everything. Honestly, a well-thought-out folder structure and a clean data pipeline are as good as any 'agent' out there right now.
AstraMythos@reddit
Chunking's a smart move for keeping outputs predictable and reducing oversight headaches in long contexts. Make sure you're logging those benchmarks to catch any drift early; it could save you from bigger governance pitfalls down the line.