What causes Out Of Order Elocution?
Posted by MushroomCharacter411@reddit | LocalLLaMA | View on Reddit | 6 comments
Yes, it's a pun on Out Of Order Execution in a CPU pipeline, but it describes a real phenomenon: the LLM manages to say all the right buzzwords, but it puts them in completely the wrong order, so that all of a sudden a bunch of information is being misattributed.
For example, I say person A has trait 1, person B has trait 2, and person C has trait 3. The LLM is remembering all three names and all three traits, but it is pairing them up incorrectly, such as linking person A with trait 2, person B with trait 3, and person C with trait 1. Sometimes it does this after a long stretch of keeping these associations straight, and then it just sort of shits the bed.
So what are some likely causes of it doing this, and what (if any) are the fixes?
MushroomCharacter411@reddit (OP)
Weird to be following up on my own post, but a fair bit has changed in the last week—namely Gemma 4. The Gemma-4-26B-A4B model seems to be less prone to this than Qwen3.5-35B-A3B. Gemma has a few problems of its own, like getting "stuck" while composing its replies, forcing me to stop the generation and repeat the prompt.
Note that Gemma-4-E4B (a 9B parameter model) is still so small as to be useless IMHO—not a meaningful improvement over similar sized Qwen 3.5 models. But in the mid-size category, Gemma just seems to eat Qwen's lunch, and Qwen seemed to be eating everyone else's lunch a month ago.
Responsible-Stock462@reddit
The longer your context, the higher the risk of a "lost in the middle" problem. You can shorten the conversation, e.g. put a summary in the prompt.
So if you write a book, instead of having a whole chapter in the context window, you put a summary of that chapter in and ask for the next one.
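The rolling-summary idea above can be sketched in a few lines. This is a minimal illustration, not anyone's actual pipeline: `summarize()` here is a trivial stand-in where a real setup would call the model itself to produce the summary.

```python
# Sketch of "summarize and feed forward" context management:
# older chapters are replaced by short digests so the context
# window carries a summary instead of the full text.

def summarize(text: str, max_chars: int = 200) -> str:
    """Placeholder summarizer (hypothetical): a real setup would
    ask the LLM for a proper summary of this chapter."""
    return text[:max_chars]

def compact_history(chapters: list[str], keep_last: int = 1) -> str:
    """Keep the most recent chapter(s) verbatim; replace everything
    older with a tagged summary line."""
    older, recent = chapters[:-keep_last], chapters[-keep_last:]
    digest = [f"[Summary] {summarize(c)}" for c in older]
    return "\n\n".join(digest + recent)
```

The trade-off is the one OP describes below: the summary is lossy, so the model may come back with clarifying questions that eat into the window you just freed.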
MushroomCharacter411@reddit (OP)
And to think I did a bunch of work to max out the context window at 262144. It had gotten to the point where I was forced to do a "summarize and feed forward" reboot every day with a 65536 context window, and then by the time the LLM got done asking me clarifying questions about the feed-forward summary, the context window was already 40% full.
I had to quantize the K and V caches all the way down to Q4_1 to fit it into my hardware, on top of using a Q4_K_M model, so most of the time it does alright but sometimes it just completely loses the plot.
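For anyone wanting to reproduce this kind of setup, llama.cpp exposes the KV-cache types as separate flags. A sketch of the invocation follows; the model filename is illustrative, and flag spellings can vary between llama.cpp versions (check `llama-server --help` for yours). Note that quantized cache types generally require flash attention to be enabled.

```shell
# Long context with a quantized KV cache (illustrative filenames/values).
llama-server \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 262144 \
  --flash-attn \
  --cache-type-k q4_1 \
  --cache-type-v q4_1
```

Each step down in cache precision roughly halves KV memory, at the cost of exactly the kind of "losing the plot" behavior described here.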
Responsible-Stock462@reddit
I have used MiniMax 2.1 (the old one) for storytelling. It was lost after chapter 6—like a person with dementia (sorry). You can put the really important things in the system prompt, but it's probably the same as having them in the normal prompt.
MushroomCharacter411@reddit (OP)
I'm using Qwen-3.5-35B-A3B-Claude-4.6-Opus at Q4_K_M quantization and now I'm using a Q5_1 quantization for the K and V caches (because llama.cpp crashes if they're not both the same). This is allowing me a context window of 204800. Things honestly weren't much different using Q8_0 or even F16 for the context caches, except they filled up faster.
maz_net_au@reddit
LLMs just do that. They generate plausible-sounding tokens; "correctness" isn't a concept they operate on, just "probable".