Any workaround to not re-process full prompt on each turn with hybrid attention models running on CPU?
Posted by Quagmirable@reddit | LocalLLaMA | 9 comments
Hi there, basically as the title says: with Qwen3-VL-30B-A3B and the latest llama.cpp on my CPU-only setup, follow-up questions are answered quickly using the cache. But with Qwen3.5 and Gemma4 it always logs "forcing full prompt re-processing due to lack of cache data" (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055). I'm aware that in many cases this happens because the responses were too short and the caching window needs to be adjusted, but the issue when running CPU-only appears to be different. I've tried flags like --swa-full and --flash-attn off, but they make no difference. I'm having trouble separating the real issue from the noise, because apparently this was a problem for most/all users [1] [2], but it seems to have been fixed for GPU setups.
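For anyone wondering why SWA breaks cache reuse in the first place, here's a rough illustrative sketch (my own toy model of the idea, not llama.cpp's actual cache logic): with full attention, every prefix token's KV entries survive, so a follow-up prompt that shares a long prefix can reuse them all; with a sliding window of size W, entries older than the window get evicted, so the shared prefix can no longer be restored and the whole prompt must be reprocessed.

```python
# Illustrative sketch only -- NOT llama.cpp's real cache implementation.

def reusable_prefix(cached_tokens, new_prompt, window=None):
    """Return how many leading tokens of new_prompt can be served from
    cache. `window` simulates SWA: only the last `window` cached
    positions still have materialized attention state."""
    # Length of the common prefix between the cached turn and the new prompt.
    common = 0
    for a, b in zip(cached_tokens, new_prompt):
        if a != b:
            break
        common += 1
    if window is None:
        # Full attention: the entire matching prefix is reusable.
        return common
    # SWA: state for positions before len(cached) - window was evicted,
    # so unless the window still covers the whole cached sequence,
    # there is nothing valid to resume from.
    first_kept = max(0, len(cached_tokens) - window)
    return common if first_kept == 0 and common >= len(cached_tokens) else 0

turn1 = list(range(1000))            # 1000-token first turn
turn2 = turn1 + list(range(5))       # follow-up turn extends it

print(reusable_prefix(turn1, turn2))              # full attention: 1000
print(reusable_prefix(turn1, turn2, window=128))  # SWA: 0 -> full reprocess
```

The point of the sketch: it's not that SWA caches badly, it's that the old state simply no longer exists, so there's nothing for a prefix match to restore.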
New-Inspection7034@reddit
Curious what your use case is here — are you running this for agentic/multi-turn work through a tool you built, or using something like Open WebUI / LM Studio? Trying to understand if the reprocessing cost is killing you on long system prompts or if it's more the latency on follow-up turns. I'm dealing with it in my own agentic coding tool by managing context aggressively. I'm using threshold-based compaction that fires before the cache thrash gets bad, targeting a watermark so you're not reprocessing a bloated context every turn. It doesn't eliminate the reprocessing cost but it keeps the context lean enough that it's tolerable.
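The threshold-based compaction described above might look something like this sketch (all names and thresholds are invented for illustration, not taken from any particular tool): once the running token count crosses a high watermark, the oldest non-system turns are dropped until the context is back under a low watermark, so every turn starts from a lean context instead of a bloated one.

```python
# Hypothetical sketch of threshold-based context compaction.
# HIGH_WATERMARK / LOW_WATERMARK and the message shape are illustrative.

HIGH_WATERMARK = 12000   # compact once the context exceeds this
LOW_WATERMARK = 8000     # ...down to roughly this many tokens

def compact(messages, count_tokens):
    """Drop the oldest non-system messages until the context fits under
    LOW_WATERMARK. `count_tokens` maps one message to its token cost."""
    total = sum(count_tokens(m) for m in messages)
    if total <= HIGH_WATERMARK:
        return messages                      # nothing to do yet
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = LOW_WATERMARK - sum(count_tokens(m) for m in system)
    kept, used = [], 0
    for m in reversed(rest):                 # keep the most recent turns
        if used + count_tokens(m) > budget:
            break
        kept.append(m)
        used += count_tokens(m)
    return system + list(reversed(kept))

msgs = [{"role": "system", "content": "x" * 100}] + [
    {"role": "user", "content": "y" * 1000} for _ in range(20)
]
count = lambda m: len(m["content"])          # crude 1 char = 1 token stand-in
print(sum(count(m) for m in compact(msgs, count)))  # -> 7100
```

A real tool would summarize the dropped turns rather than discard them outright, but the watermark logic is the same: compact before you hit the ceiling, not after.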
Quagmirable@reddit (OP)
Hi there, I'm just using the llama-server web interface for an initial bigger task, with follow-up questions after that's completed.
Several-Tax31@reddit
There are multiple things:
1) Add --ctx-checkpoints 128 to your command.
2) If you're using agentic frameworks, make sure the framework doesn't embed extra information (like the current time or whatever) between prompts. (Claude Code does that.) SWA cache reuse works on prompt similarity, so this breaks it entirely and forces reprocessing.
3) Multiple tool calls break the pattern in my case: after 10-15 tool calls it forces reprocessing no matter what. (I couldn't find a solution to that.)
With CPU inference it's really painful: you can end up waiting half an hour for nothing. Playing with batch sizes increases prompt-processing speed and reduces the waiting time (it reduces token-generation speed, though; they're inversely related, so choose between them).
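To illustrate what context checkpoints buy you (a rough sketch of the idea, not llama.cpp's actual mechanism): a recurrent/hybrid state can't be rolled back token by token, but if the server snapshots the state periodically, a follow-up turn only needs to reprocess from the most recent snapshot at or before the point where the prompts diverge, instead of from position 0.

```python
# Rough sketch of the context-checkpoint idea for recurrent/hybrid
# models -- illustrative only, not llama.cpp's internal implementation.

def tokens_to_reprocess(divergence_pos, checkpoints):
    """Given the position where the new prompt first differs from the
    cached one, and the positions where state snapshots were taken,
    return how many already-seen tokens must be re-run to catch up to
    the divergence point."""
    usable = [c for c in checkpoints if c <= divergence_pos]
    restore_from = max(usable, default=0)   # no checkpoint -> start at 0
    return divergence_pos - restore_from

# Snapshots taken every 512 tokens during the first turn:
ckpts = list(range(0, 4096, 512))

# Follow-up prompt shares the first 3000 tokens with the cached one:
print(tokens_to_reprocess(3000, ckpts))   # 3000 - 2560 = 440
print(tokens_to_reprocess(3000, []))      # no checkpoints: all 3000
```

More checkpoints means less to redo after a divergence, at the cost of memory for each snapshot, which is presumably why the count is a tunable flag.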
Quagmirable@reddit (OP)
Thanks! This definitely works.
It still feels like something llama.cpp needs to optimize, though, because Qwen3.5 was behaving just like this shortly after release, and now that I'm trying it again a few months later it appears to have been fixed in llama.cpp.
Several-Tax31@reddit
I know, there are many things that need improvement in llama.cpp. For CPU inference you can also check out ik_llama; it's generally faster than mainline llama.cpp, and they're quicker to land some optimizations.
New-Inspection7034@reddit
The biggest difference for me was that with ik_llama.cpp the model stayed in VRAM, while mainline llama.cpp spilled into RAM.
Farmadupe@reddit
Having run Qwen3.5 models and Gemma4 models in fresh llama.cpp builds (as in, compiled literally yesterday), cache reuse is working for me.
If either is still broken for you, you should contact the llama.cpp devs on GitHub.
Several-Tax31@reddit
It happens from time to time. It's a problem that's not entirely fixable. There are workarounds, and they work most of the time, but not always.
Quagmirable@reddit (OP)
Thanks. Are you running on a GPU or just CPU?