I kept running into cases where retrieval was “working” but the model still gave bad answers
Posted by Silly-Effort-6843@reddit | LocalLLaMA | View on Reddit | 5 comments
I kept running into cases where retrieval was “working” but the model still gave bad answers.
So I tried something simpler:
- take ONE doc
- put it in a clean context
- ask a question that depends entirely on it
- check if the model actually uses the info
Ran this on ~20 docs (schemas, join logic, etc.) with LLaMA (Groq).
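A minimal sketch of what that single-doc injection test could look like. The prompt wording, the sample doc, and the `passes_injection_test` fact-check are illustrative assumptions, not the OP's actual harness; the model call is left out so any backend (Groq, local, etc.) can be plugged in.

```python
# Hypothetical single-doc "injection test" harness: one doc, clean context,
# one question that depends entirely on that doc, then a crude check that
# the model's answer actually used the doc's key facts.

def build_prompt(doc: str, question: str) -> str:
    """Clean context: exactly one document plus one dependent question."""
    return (
        "Use ONLY the document below to answer.\n\n"
        f"--- DOCUMENT ---\n{doc}\n--- END ---\n\n"
        f"Question: {question}"
    )

def passes_injection_test(answer: str, required_facts: list[str]) -> bool:
    """Pass if every required fact from the doc shows up in the answer."""
    return all(fact.lower() in answer.lower() for fact in required_facts)

# Illustrative doc in the table-heavy style the post recommends.
doc = (
    "| column       | type    | format             |\n"
    "|--------------|---------|--------------------|\n"
    "| customer_id  | integer | raw int (Postgres) |\n"
    "| customer_key | string  | CUST-{id} (Mongo)  |\n"
)
prompt = build_prompt(doc, "What format does the Mongo customer_key use?")

# An answer that cites the explicit format passes; one that doesn't, fails.
passes_injection_test("The key uses the CUST-{id} format.", ["CUST-{id}"])
```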
What helped a lot:
- shorter docs (<500 words)
- tables > paragraphs
- showing transformations explicitly (e.g. CUST-{id}) instead of describing them
- code examples > explanations
What didn’t work well:
- long narrative explanations
- implicit logic (“the system usually does X…”)
Example:
Postgres → Mongo join failed until I literally wrote the transformation format. After that it worked consistently.
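To make "write the transformation explicitly" concrete, here is what that join-key mapping might look like as code instead of prose. The `CUST-{id}` format is from the post; the function name and the zero-padding width are made-up assumptions for illustration.

```python
# Explicit Postgres -> Mongo key transformation, stated as code rather than
# described in prose. The padding width (8) is an assumption, not the OP's.

def pg_to_mongo_key(customer_id: int, width: int = 8) -> str:
    """Map a Postgres integer customer_id to the prefixed Mongo string key."""
    return f"CUST-{customer_id:0{width}d}"

pg_to_mongo_key(42)  # -> "CUST-00000042"
```

Dropping a snippet like this into the doc leaves the model nothing to infer: the prefix, the padding, and the type conversion are all literal.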
Curious if others are doing something like this, or just relying on retrieval + prompt tuning?
matt-k-wong@reddit
What sized model were you using? Actually, which exact model were you using? You said shorter docs gave better results and my first impression is that the model might be a small one (low parameter count).
Silly-Effort-6843@reddit (OP)
Llama 3.3 70B via Groq. So not a small model: 70B params with a 128k context window.
But I still think the finding holds even for larger models. The issue isn't that the model can't process long docs; it's that longer docs introduce more noise for the model to filter through when answering a specific question. A 200-word table with explicit column types and format examples gives the model less room to hallucinate than a 1500-word narrative explaining the same thing.
The test that made it obvious: a join key mapping doc written as prose ("the customer ID in PostgreSQL is typically an integer, while in the DuckDB tickets table it uses a prefixed format...") failed the injection test. Same info rewritten as a table with a code example (`CUST-{padded_int}`) passed immediately. Same model, same context window, same question. Would be curious if you see the same pattern at different param counts, though; might be that smaller models benefit even more from structured docs.
matt-k-wong@reddit
yes, it is a universal property (for now) that longer documents and longer contexts result in lower quality. If you can chunk, then do so. However, Karpathy just went viral with another topic where LLMs parse documents, produce md files, and continuously and iteratively improve the "wiki". Note that this is basically just another form of compressing the relevant information into higher-density and therefore smaller documents. Did you try changing the context length? I'm certain there's some tuning you could do.
Silly-Effort-6843@reddit (OP)
The Karpathy 'LLM wiki' context is exactly what we're seeing: Markdown tables effectively act as a high-density compression format for the agent.
Regarding tuning: we actually locked the injection-test context window to 4k to purposefully stress-test information density. If join key logic can't survive a 4k window, it definitely won't survive the noise of a 128k window filled with raw DB logs.
Have you found a 'sweet spot' for density? We're currently debating whether deeply nested JSON definitions outperform Markdown tables for multi-step routing logic.
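One crude way to frame that JSON-vs-table debate is to render the same rules both ways and compare size. A sketch under stated assumptions: the routing rules are invented, and raw character count is only a rough stand-in for token count (a real comparison would run the model's tokenizer).

```python
import json

# Same hypothetical routing rules rendered two ways; character count is a
# crude proxy for token count.
rules = [
    {"source": "postgres", "key": "customer_id", "target": "mongo", "transform": "CUST-{id}"},
    {"source": "duckdb", "key": "ticket_id", "target": "mongo", "transform": "TKT-{id}"},
]

as_json = json.dumps(rules, indent=2)

header = "| source | key | target | transform |"
sep = "|---|---|---|---|"
rows = [f"| {r['source']} | {r['key']} | {r['target']} | {r['transform']} |" for r in rules]
as_table = "\n".join([header, sep, *rows])

len(as_table) < len(as_json)  # flat rules compress better as a table
```

For flat, uniform rules the table wins on density because the keys appear once in the header instead of being repeated per object; nested JSON mainly earns its overhead when the logic genuinely branches.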
matt-k-wong@reddit
I didn't run experiments, but I can confirm that denser and more organized is better. A lot of people are using Obsidian because it is natively graph-enhanced, but I run my own custom graph-enhanced data stores.