Benchmarking Extraction Accuracy: Why I'm moving back to Chunking despite 1M+ Context Windows

Posted by Downtown-Mixture5555@reddit | LocalLLaMA

Hey everyone,

I’ve been working on a stealth project that requires high-precision fact extraction from technical documents (think 50-100 pages). While I’ve been using Gemini’s long-context capabilities, the "Lost in the Middle" effect is proving to be a major hurdle for production-grade reliability.

Even when the classic needle-in-a-haystack test passes, the "all-needles-in-the-haystack" test (extracting every key fact, not just one) fails as the context grows. I'm seeing accuracy drop from ~95% at 10k tokens to ~82% at 50k tokens, with the misses concentrated in the middle of the document.
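For what it's worth, my "all-needles" number is just recall over a hand-labeled gold set of facts per document. A minimal sketch of the scorer (the function names and fact strings are mine, not from any library):

```python
# Minimal "all-needles" recall scorer: what fraction of the gold facts
# did the model's extraction actually surface? Names are illustrative.

def normalize(fact: str) -> str:
    # Case- and whitespace-insensitive comparison so trivial formatting
    # differences don't count as misses.
    return " ".join(fact.lower().split())

def all_needles_recall(extracted: list[str], gold: list[str]) -> float:
    extracted_set = {normalize(f) for f in extracted}
    hits = sum(1 for g in gold if normalize(g) in extracted_set)
    return hits / len(gold) if gold else 1.0

# Example: model recovered 2 of 3 gold facts.
score = all_needles_recall(
    ["Max load is 40 kN", "Firmware v2.1 required"],
    ["max load is 40 kN", "firmware v2.1 required", "operating temp range"],
)
```

Exact-match recall is obviously strict; in practice I also track a fuzzier variant, but this is the headline metric behind the 95% → 82% numbers.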

The Strategy:

I'm shifting to a "Chunk, Extract, and Merge" architecture.

The Pro: Accuracy stays high because the model only "sees" 10-15 pages at a time.

The Con: Contextual awareness across the whole document (e.g., a concept defined on page 2 and used on page 40) is harder to manage.
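Roughly, the pipeline I have in mind looks like this. The chunk size, overlap, and `extract_facts` are all placeholders for whatever page windowing and LLM call you actually use; the overlap is my partial answer to the cross-chunk context problem:

```python
from typing import Callable

def chunk_pages(pages: list[str], size: int = 12, overlap: int = 2) -> list[str]:
    # Overlapping page windows, so a fact or definition that straddles a
    # chunk boundary appears whole in at least one chunk.
    step = size - overlap
    return ["\n".join(pages[i:i + size]) for i in range(0, len(pages), step)]

def chunk_extract_merge(pages: list[str],
                        extract_facts: Callable[[str], list[str]]) -> list[str]:
    # Extract per chunk, then merge with dedup (the overlap means the
    # same fact is often extracted twice).
    seen: set[str] = set()
    merged: list[str] = []
    for chunk in chunk_pages(pages):
        for fact in extract_facts(chunk):
            key = " ".join(fact.lower().split())
            if key not in seen:
                seen.add(key)
                merged.append(fact)
    return merged
```

The merge step is where the real design work is: naive string dedup misses facts the model phrases differently per chunk, so a second LLM pass (or embedding-based clustering) over the merged list is probably needed.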

The Citation Problem:

I'm also struggling with verbatim grounding. Models are great at pointing to the correct page, but when asked for direct quotes they love to "hallucinate-paraphrase": the quote reads plausibly and captures the gist, but doesn't match the source text character-for-character.

Has anyone found a way to guarantee 100% verbatim grounding in long-context calls, or is "Navigate to Page X" the only honest UI/UX?
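One mitigation I'm experimenting with: treat the model's quote as a candidate and verify it against the raw page text yourself, snapping to the real source span when it's nearly right and refusing (i.e., falling back to "Navigate to Page X") when it isn't. A sketch using stdlib `difflib`; the 0.95 threshold is a guess I'm still tuning:

```python
import difflib

def ground_quote(quote: str, page_text: str, threshold: float = 0.95):
    # Exact substring: verbatim grounding holds, return as-is.
    if quote in page_text:
        return quote
    # Otherwise locate the closest window in the source text. If it is
    # near-identical (whitespace/punctuation drift), return the *source*
    # span instead of the model's paraphrase.
    matcher = difflib.SequenceMatcher(None, page_text, quote, autojunk=False)
    m = matcher.find_longest_match(0, len(page_text), 0, len(quote))
    candidate = page_text[m.a:m.a + len(quote)]
    if difflib.SequenceMatcher(None, candidate, quote).ratio() >= threshold:
        return candidate
    return None  # refuse: UI falls back to "Navigate to Page X"
```

This doesn't make the model quote verbatim, but it does make the system honest: a quote is only ever shown if it provably exists in the source.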

Curious to hear how others are balancing token costs vs. extraction reliability in high-stakes domains.