Any workaround to not re-process full prompt on each turn with hybrid attention models running on CPU?

Posted by Quagmirable@reddit | LocalLLaMA | 9 comments

Hi there, basically as the title says: with Qwen3-VL-30B-A3B and the latest llama.cpp on my CPU-only setup, follow-up questions are answered quickly using the cache. But with Qwen3.5 and Gemma4 it always logs `forcing full prompt re-processing due to lack of cache data`, likely because of SWA or hybrid/recurrent memory (see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055).

I'm aware that in many cases this happens because the responses were too short and the caching window needs to be adjusted, but the issue when running CPU-only appears to be something different. I've tried flags like `--swa-full --flash-attn off`, but they make no difference.

I'm having trouble separating the real issue from the noise, because apparently this was a problem for most/all users at one point [1] [2], but it seems to have been fixed for GPU setups.
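For reference, a rough sketch of the kind of CPU-only invocation I mean — the model path, context size, and thread count here are just placeholders, and the last two flags are the ones I mentioned trying:

```shell
# CPU-only llama-server run (placeholders: model path, -c, -t)
llama-server \
  -m /path/to/Qwen3-VL-30B-A3B.gguf \
  -c 8192 \
  -t 8 \
  --swa-full \
  --flash-attn off
```

With Qwen3-VL-30B-A3B this setup reuses the cache across turns as expected; with the SWA/hybrid models it re-processes the full prompt every time.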