Is long re-processing of output as input a common "feature" or not?
Posted by alex20_202020@reddit | LocalLLaMA | 30 comments
I now use (mostly) Gemma 4 and Qwen 3.5 models. It seems that all of them, once the context grows a bit, start processing many new tokens as input even right after producing a long output for me and getting only a short prompt in response, so I have to wait a long time before new output starts.
I am using koboldcpp*, maybe on llama.cpp it works differently. I wonder: when the engine produces all this output, does it not compute the KV cache for it, so it can be reused on the next turn when the output becomes part of the story / input? How does it work internally?
TIA
* with Q4-Q5 GGUFs and usually q4 for KV cache of ~130k.
am17an@reddit
Use preserve thinking, reasoning budget 2048
alex20_202020@reddit (OP)
In koboldcpp the preserve options are: None / Newest / All.
The budget setting is something like Low/Normal/Medium/High. Which of those is closest to 2048?
D2OQZG8l5BI1S06@reddit
Good harnesses strip the thinking to not pollute context, so it needs to recompute everything since the last prompt.
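To make that concrete, here's a toy sketch (not what llama.cpp actually does internally, just the idea): the engine can only reuse cached KV entries for the longest common prefix between what it cached last turn and what it receives now, so stripping a thinking block from an earlier turn shortens the reusable prefix to that point.

```python
# Illustrative only: why stripping a thinking block forces reprocessing.
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Number of leading tokens identical in both sequences."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Toy token ids: [system, user turn, thinking, answer]
cached = [1, 2, 3, 100, 101, 102, 7, 8, 9]   # as generated, thinking included
resent = [1, 2, 3, 7, 8, 9, 4, 5]            # harness stripped the thinking, added new prompt

keep = reusable_prefix_len(cached, resent)
print(f"reuse {keep} tokens, reprocess {len(resent) - keep}")  # reuse 3, reprocess 5
```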
floconildo@reddit
Something could’ve changed in the order of your input that invalidated your cache. llama.cpp is very sensitive to that (order of required parameters in functions, order of keys in the payload, etc) so it’s hard to pinpoint exactly what could be the issue without further information.
alex20_202020@reddit (OP)
I do that very rarely. When I changed the mode of saving thinking, it re-did 90k tokens - the whole story. But it (almost) always processes hundreds of tokens as input for a couple of sentences of a new prompt.
floconildo@reddit
Oh, I know what's up! The cache is being invalidated because of the reasoning content, so it's trying to reprocess from the last checkpoint.
In llama.cpp you can actually see this behavior clearly in the logs.
This was an issue with Qwen3.5 where thinking wasn't preserved properly, and Qwen3.6 fixed that with the parameter {"preserve_thinking": true}. I'm not a koboldcpp user but afaik it uses llama.cpp under the hood, so it should work the same. Check this bit from unsloth.
alex20_202020@reddit (OP)
Today, for Gemma 4 in particular, I set "Include all thinking" (in the CoT submitted to the AI; it was "only newest thinking" before) in the koboldcpp GUI, which resulted in a re-read of 90k tokens. After that it keeps processing what looks, by size, like the last output - in addition to the new prompt.
LetsGoBrandon4256@reddit
btw that's against Google's recommendation for Multi-turn conversation
alex20_202020@reddit (OP)
Interesting, I did it because of a recommendation for Qwen 3.6 IIRC. So should I keep notes of the recommendations for every model?
I just run koboldcpp and hope it picks the "best" settings; I do not change temperature etc. For Gemma 4 26B, Thinking was OFF initially. I saw the model repeatedly failing to fix IMO easy points in code, so I tried turning Thinking ON (and a bit later added all thinking to the CoT).
LetsGoBrandon4256@reddit
Totally understandable, it's a gotcha moment since the two models have opposite preferences when it comes to sending back past reasoning as context.
Unsloth has a very concise model document on their website https://unsloth.ai/docs/models/gemma-4#recommended-settings That should give you all the parameters for a proper setup.
The rest is up to your front end. All recent front ends should allow you to create presets that you can switch between easily. The initial set-up is the most annoying part since everyone does things fucking differently, and some front ends just don't offer the option to adjust certain things. For example, Open WebUI always sends back the reasoning, so I had to write a filter function with regex to strip out the reasoning on the fly.
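Roughly, the filter does something like this (just a sketch; it assumes the reasoning comes wrapped in `<think>...</think>` tags, which depends on the model and template, and the function/field names are mine, not Open WebUI's API):

```python
import re

# Strip reasoning from assistant turns before the history goes back to the model.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(messages: list[dict]) -> list[dict]:
    """Return a copy of the chat history with reasoning removed from assistant turns."""
    cleaned = []
    for msg in messages:
        if msg.get("role") == "assistant":
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```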
floconildo@reddit
Yep, it changes per model, so might as well be the case.
r1str3tto@reddit
I just discovered that putting the current timestamp in the system prompt is a Bad Idea for this very reason. Kind of obvious in hindsight…
o0genesis0o@reddit
Something is invalidating your prompt prefix cache. Ideally, there should be very little, if any, re-processing of the previous conversation history. When I built my own agent framework, the golden rule (for me) was: never, ever mutate the history, because that invalidates the cache - unusable with a local model and expensive with a cloud model (a cache hit is very cheap vs a cache miss).
With the current batch of models, only Nvidia Nemotron 4B has a problem with llama.cpp that breaks its caching; other models are perfectly okay. If the prompt keeps being reprocessed, check the software you use to send messages to the LLM and see if it mutates the history, changes the tool list, or something like that.
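A minimal sketch of what the append-only rule looks like in practice (OpenAI-style messages list assumed, purely illustrative):

```python
# Append-only chat history: new turns only ever go on the end, so the token
# prefix the server cached last turn is still a prefix of the next request.
history = [
    {"role": "system", "content": "You are a helpful assistant."},  # fixed, no timestamp
]

def add_turn(role: str, content: str) -> None:
    history.append({"role": role, "content": content})

add_turn("user", "Summarise chapter 1.")
add_turn("assistant", "Chapter 1 introduces ...")
add_turn("user", "Now chapter 2.")  # cache hit: everything before this is unchanged

# Anti-pattern: editing an earlier message invalidates the cache from that point on,
# e.g. rewriting the system prompt each turn to include the current timestamp.
```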
audioen@reddit
By default llama.cpp caches prompts. I've seen that sometimes the issue is resolved by disabling parallel processing, i.e. allowing only a single inference context by specifying --parallel 1. The specific problem I ran into was a timeout triggering llama.cpp to choose another context and start the prompt over in there. It smelled like a bug, but something like 1-2 months ago it could completely wedge a coding agent into a never-ending context-reprocessing loop after the first timeout when reading a large file.
You can also try to set --cache-reuse=256 or something, which attempts to identify opportunities to shift the KV cache of the model. It might work with Gemma, but probably doesn't work with Qwen.
There's --cache-ram, which serves as a dumping ground for the current KV cache. The default size is 8 GB, which may be too small. In my case, with unified-RAM computers, I don't want this feature, so I set --cache-ram 0, which causes the context to live entirely in VRAM as context checkpoints or just pre-existing context, depending on the model. I actually saw failure modes where running out of cache-ram lost the active context and forced unnecessary reprocessing, so I'm not convinced about this feature at all. In my opinion, cache-ram should be stored on disk, where it can be read very fast and can be extremely large, even over 100 GB, so that dozens upon dozens of different prompt prefixes would be available for models to use. Putting it into RAM, which is very limited on a unified-memory system most of the time, is not what I want personally.
Those are the tips and pointers I know about. I have not seen the prompt reprocessing issue with --parallel again, but that's also partly because I now have --timeout 3600 everywhere, which sets up 1h timeouts on things like prompt processing, so I simply don't hit that failure mode anymore. However, I still hit unwelcome and undesired timeouts in various agent software.
For instance, Pi kills vllm tool calls after 5 minutes because vllm can't stream tool-call results, and writing a large enough file can take over 5 minutes, which completely stalls the agent. Similarly, it defaults to a maximum reply length of 16384, which is not sufficient when writing large files. I have battled with timeouts in opencode, roo code, and much other software. Usually a random hunt through the codebase, git repo, and docs finds these various timeouts and unwanted token limits, which at least to me are now the biggest problem in local model usage. The models are good enough, even if a little slow; the too-tight timeouts and other limits that have appeared in the software and libraries over the years are now the biggest hurdle left.
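For reference, pulling the flags above into one launch script looks roughly like this (just a sketch: the binary and model paths are placeholders, and exact flag availability depends on your llama.cpp build; --parallel 1 / --cache-ram 0 reflect my reasoning above, not a general recommendation):

```python
import subprocess

cmd = [
    "./llama-server",
    "-m", "model.gguf",
    "--parallel", "1",       # single inference context, avoids slot switching
    "--cache-reuse", "256",  # allow KV-cache shifting (may not help SWA/recurrent models)
    "--cache-ram", "0",      # keep context in VRAM instead of spilling to RAM
    "--timeout", "3600",     # 1h server-side timeout
]
subprocess.run(cmd, check=True)
```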
Farmadupe@reddit
Cache reuse doesn't work with either model, as it's incompatible with recurrent layers (Qwen3.5) and SWA (Gemma4).
But even then, KV shifting is an awkward thing to do, as the resulting shifted KV is totally "incorrect" by any standard. It exists to pretend that the deleted parts of the KV are still there, which, unless you're deleting a thinking trace, is almost definitely not what you wanted (if you deleted a prefix, you probably deleted it for a reason, and you'd rather your inference engine didn't pretend it was still there).
audioen@reddit
I think --swa-full might make it work with Gemma? I thought so at least; it seemed like it got fixed recently.
Farmadupe@reddit
I've not recompiled for a while, so I may be out of date, but presumably reverting the SWA to a full KV cache would allow KV shifting to work?
I'll try to explain the issue with KV shifting and why it's disabled by default. Each token that travels through the model is shaped by all the tokens that came before it. If your system prompt (or the first turn) contains the instruction "ignore all instances of the word cat", then when you write the word "cat", its token gets reshaped by the layers of the transformer into something like an "I-don't-look-like-a-cat-so-I'm-easier-to-ignore" token as it gets recorded into the KV cache.
So if you then delete that instruction and want cats to no longer be ignored, with KV shifting those "I-don't-look-like-a-cat-so-I'm-easier-to-ignore" entries still look exactly as they were, and the transformer may carry on ignoring "cat" even though you erased the instruction.
And, what's worse, if you then did do a full reprocess, all of a sudden the model would start responding to "cat" again.
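If it helps, here's the same dependency argument as a toy script. It's not how a real KV cache stores anything; it just tags each "cache entry" with a fingerprint of the token and everything before it, which is the property that shifting breaks:

```python
import hashlib

def fake_kv(tokens: list[str]) -> list[str]:
    """One fingerprint per token, computed over that token and its whole prefix."""
    return [
        hashlib.sha1(" ".join(tokens[: i + 1]).encode()).hexdigest()[:8]
        for i in range(len(tokens))
    ]

with_rule = ["<sys>", "ignore 'cat'", "the", "cat", "sat"]
without_rule = ["<sys>", "the", "cat", "sat"]

# KV shift: drop the rule's entry, keep the rest as-is.
shifted = fake_kv(with_rule)[:1] + fake_kv(with_rule)[2:]
# Full reprocess without the rule.
recomputed = fake_kv(without_rule)

print(shifted)     # entries for "the cat sat" still encode the deleted instruction
print(recomputed)  # entries differ: the model now "sees" a genuinely different history
```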
HopePupal@reddit
you can easily override OpenCode timeouts (overall and chunk) per model, in the config file. i'd be surprised if Pi doesn't have that feature, but if it doesn't, it's one of the easiest agents to understand and patch.
audioen@reddit
I know that there are those options, but it's not a great user experience. In case of Pi:
* it sends a developer role, which doesn't work in official quants
* it has a peculiar thinkingFormat or similar key for setting preserve_thinking: "on", rather than a simple KV override
* it defaults to 16384-token replies for some reason
* it has (or had) that annoying 5-minute death issue with vllm tool calls -- I saw something about this possibly being fixed in the changelog.
This is all a whole bunch of simple stuff; there's just a lot of it, and when you hit each problem in sequence, it takes a while to iron out the kinks.
Farmadupe@reddit
Currently the llama.cpp prompt cache reuse situation is unspectacular, probably for architectural reasons. Depending on settings, it can:
* Delete your existing prompt cache because another one came in in the meantime
* Refuse to reuse the existing cache if it's in the middle of one completion and you branch the conversation, because the cache is "already in use"
* Delete the cache from your original thread if you branch from an earlier point in the conversation to explore a topic, forcing a big reprocess when you come back to it
* There are options for saving/restoring idle "slots" into system RAM while they're unused, but the feature is new, buggy, and would OOM your host, so it needs to be used with caution (this may have changed, I've not recompiled in a couple of weeks)
So it's kinda not where we'd like it to be at the moment. llama.cpp does have KV cache reuse, but usability is very hit and miss.
floconildo@reddit
I'm not sure I agree. Of course there's plenty of room for improvement on the cache side of llama.cpp, but atm it works simply enough that you can optimize for it once you know how prompt processing works - and I can certainly tell you that I didn't.
And with the checkpoint system I've been able to have parallel conversations and keep my cache intact, resuming most of the time from where we left off instead of triggering a full reprocess.
Farmadupe@reddit
Have you ever noticed that when llama.cpp exhausts the available context for your sequence, the network call just... hangs forever? There are lots of rough edges in it as a piece of software.
Don't get me wrong, I love llama.cpp because it lets me run great models when no other engine could. But that doesn't mean we should ignore its flaws, nor should we discard it as useless.
floconildo@reddit
Oh, I meant specifically the cache. I agree with you on the rough edges; I'm just saying that simple is sometimes best if it leads to more predictable behavior.
Enough_Big4191@reddit
yes, it’s common with long context. models reprocess previous outputs, which causes delays. llama.cpp might handle it better, but with long context (like 130k), re-processing is still a bottleneck. optimizing KV cache or trimming context could help speed things up.
alex20_202020@reddit (OP)
Why does the need to "reprocess previous outputs" appear as the context size grows? How is a long context different?
DeltaSqueezer@reddit
Not if you manage your context properly. Verify everything in the chain from your prompt down to your engine.
alex20_202020@reddit (OP)
Verify how? I use koboldcpp with its own web GUI and just GGUF models, no agent setup. I guess I could check the source code (but I know C / HTML badly); what else?
DeltaSqueezer@reddit
Check what context is being passed at each stage. Check the server logs. I've seen breaks due to the UI adding the time or other items into the context, or running a separate request to create a title or do tagging, which kills the slot on llama.cpp.
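If you can grab two consecutive raw prompts from the logs, a quick diff shows where they diverge, and everything after that point is what gets reprocessed. Rough sketch (the file names are just placeholders for wherever you dump the two requests):

```python
def first_divergence(prev: str, curr: str) -> int:
    """Index of the first character where the two prompts differ."""
    for i, (a, b) in enumerate(zip(prev, curr)):
        if a != b:
            return i
    return min(len(prev), len(curr))

prev = open("request_1.txt").read()
curr = open("request_2.txt").read()

i = first_divergence(prev, curr)
print(f"prompts diverge at character {i}")
print("context before divergence:", repr(curr[max(0, i - 60): i]))
print("new/changed text starts:  ", repr(curr[i: i + 60]))
```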
alex20_202020@reddit (OP)
Thanks for the explanation. I just tried to compare what the logs show as output with how it is added back as input. These are the differences:
1) output is just multi-line text (markdown) starting with `Output:`; input is in json format, `Input: {`, and uses `"\n"` instead
2) (similarly?) in input `\"` is in place of `"`
3) output starts with `Output:<channel|>` and ends with just markdown text; input additionally has `<turn|> <|turn>model` at the beginning and `<turn|> <|turn>user` at the end
I guess 1 and 2 are just because of the json format for input. As for (3) - is it to be expected?