Can somebody please explain why, for some models, output gets included in prompt token processing (possibly related to KV cache)?
Posted by alex20_202020@reddit | LocalLLaMA | 7 comments
The title mentions KV cache because I suspect the behavior below is related to it. If not, please correct me.
Recently I ran koboldcpp with default settings (ContextShift ON, FastForwarding ON, Sliding Window Attention OFF, SmartCache OFF), except for context size (131K), KV cache quantization (4-bit), and network port.
For Qwen 3.5 and Gemma 4, in the logs I see processing prompt (X / Y tokens) lines where Y is often (always?) much larger than my last prompt length (e.g. 1000 tokens for a 10-20 word last prompt), and (obviously) there is a long delay before output starts in the frontend (KoboldAI Lite). I have noticed that usually:
Y ~ length in tokens of the model's last output (from the logs) + length of my last prompt
Why? How does the engine work? Why hasn't it already processed its own output while it was generating it, and why does it need to re-process it?
I do not recall Y being much larger than length(my last prompt) for Qwen 3 and Gemma 3. Maybe the new models use some KV cache size optimization that affects this? Do the engine's command-line parameters (e.g. the ones I listed above) affect it? Do other engines behave the same as koboldcpp here?
Below is some info from the logs:
For Qwen 3.5 9B the logs contain "RNN with FF and shifting flags enabled - SmartCache will be enabled with extra slots". llama_KV_cache is ~1.2 GiB for 131K context with the 4-bit KV cache.
For Gemma 26B with the same parameters the engine allocates ~0.7+7 GiB for the KV cache; the log lists each layer in llama_KV_cache lines. The logs contain: "using full-size SWA cache", "creating non-SWA cache, size = 131328 cells". (BTW, why 131328 and not the 131072 I requested as context size? Also in the logs: "n_ctx=131328", "n_ctx_sequence (131328)", "[timestamp] CtxLimit: 1822 / 131072".)
I have thought of a workaround to reduce the delay: immediately submit a dummy prompt, then once the new output starts, ABORT in the frontend, Undo the started response, Undo the temporary prompt, and submit the actual prompt. This way, while I read the response, the engine has already processed the last output. But maybe there is a way to do this automatically, without a manual "ABORT, undo" each time?
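One way the warm-up could be automated is a small script that fires a throwaway generation request right after each reply finishes. This is an untested sketch: the port, the KoboldAI-style /api/v1/generate endpoint, and the idea that max_length=1 is enough to force prompt processing are assumptions on my part, not documented koboldcpp behavior. Note also that the warmed-up text would have to match what the frontend sends next, character for character, or the cache won't be reused.

```python
# Untested sketch of automating the "warm-up" idea. The port, the
# KoboldAI-style /api/v1/generate endpoint, and using max_length=1 to force
# prompt processing are assumptions, not documented behavior.
import requests

KOBOLD_URL = "http://localhost:5001/api/v1/generate"  # adjust host/port as needed

def warm_up(context: str) -> None:
    """Request a single token so the backend ingests (and caches) the current
    context now, instead of when the next real prompt is submitted."""
    payload = {
        "prompt": context,  # the full story/context, exactly as the frontend would send it
        "max_length": 1,    # generate (almost) nothing; we only want the prompt processing
    }
    requests.post(KOBOLD_URL, json=payload, timeout=600)

# After each model reply finishes, call warm_up() with the updated context;
# the next real prompt should then hit the cache and start much faster.
```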
TIA
LagOps91@reddit
Most likely because only the last think block is included, which triggers reprocessing due to changes in the context.
alex20_202020@reddit (OP)
included in what?
LagOps91@reddit
In the context. Most thinking models are trained to only expect the most recent think block to be included in the context.
alex20_202020@reddit (OP)
I thought the issue was how the backend processes input. Can you please explain in more detail?
LagOps91@reddit
Basically you send the model formatted text. If the existing text doesn't change, the backend can re-use what it has cached and only process the new prompt. If you remove text, such as a think block, everything from that point on needs to be reprocessed.
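To make that concrete, here is a minimal sketch (not koboldcpp's actual code, just the general "longest common prefix" rule that this kind of fast-forwarding is based on) of how a backend decides how many tokens it must reprocess:

```python
# Illustrative sketch only -- not koboldcpp's actual code. It shows the
# general "longest common prefix" rule behind KV-cache reuse.

def tokens_to_reprocess(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Return how many tokens of the new request must be (re)processed."""
    # Count how many leading tokens are identical to what is already cached.
    common = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        common += 1
    # Everything after the first mismatch must be run through the model again,
    # even if most of it was already processed during the previous request.
    return len(new_tokens) - common

# Unchanged context -> only the newly appended prompt is processed.
print(tokens_to_reprocess([1, 2, 3, 4], [1, 2, 3, 4, 5, 6]))  # 2
# Text edited/removed early in the context -> everything after it is reprocessed.
print(tokens_to_reprocess([1, 2, 3, 4], [1, 9, 3, 4, 5, 6]))  # 5
```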
alex20_202020@reddit (OP)
But it is the engine that might change the input/story to be sent as the next input. How does that explain "most thinking models are trained to only expect the most recent think block"?
LagOps91@reddit
Yes, this is something the frontend does. Only sending the most recent think block is a common default setting. The backend typically just expects text and leaves the formatting to the frontend. It's best to have a look at the prompt/context the backend actually receives; that should make it clear.
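As a purely illustrative example (made-up token counts, not from any real log): if the frontend drops the previous turn's think block, the first mismatch with the cached sequence is exactly where that block used to be, so everything after it gets reprocessed, which lines up with the observation that Y ~ last output + new prompt.

```python
# Purely illustrative, made-up token IDs -- not from any real log.
history     = list(range(1000))         # chat so far, already in the KV cache
think_block = list(range(1000, 1300))   # previous turn's <think>...</think>
answer      = list(range(1300, 1500))   # previous turn's visible answer
new_prompt  = list(range(1500, 1520))   # a short 10-20 word follow-up

cached = history + think_block + answer   # what the backend processed last turn
sent   = history + answer + new_prompt    # what the frontend sends next turn

# The first mismatch sits exactly where the think block was removed, so the
# cache only covers `history`; everything after it must be processed again.
common    = len(history)
reprocess = len(sent) - common
print(reprocess)  # 220, roughly len(answer) + len(new_prompt) -- the OP's "Y"
```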