Is long re-processing of output as input a common "feature" or not?

Posted by alex20_202020@reddit | LocalLLaMA

I now mostly use Gemma 4 and Qwen 3.5 models. It seems that with all of them, once the context grows a bit, after they produce a long output for me and I reply with a short prompt, the engine starts processing many tokens as input again, and I have to wait a long time for new output to begin.

I am using koboldcpp*; maybe it works differently on llama.cpp. I wonder: when the engine produces all that output, doesn't it fill in the KV cache as it goes, so it can reuse it on the next turn when the output becomes part of the story/input? How does this work internally?
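(For context, my understanding is that engines like llama.cpp and koboldcpp do keep the KV cache between turns and reuse it via prefix matching: only the tokens past the longest shared prefix with the cached context need a new forward pass. A minimal sketch of that idea in Python; the function names are hypothetical, not the real API:)

```python
# Hypothetical sketch of prefix-based KV-cache reuse, in the spirit of
# what llama.cpp/koboldcpp-style engines do. Not the real implementation.

def common_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Number of leading tokens shared by the cached context and the new prompt."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

def tokens_to_reprocess(cached: list[int], incoming: list[int]) -> list[int]:
    """Only the suffix after the shared prefix needs a new forward pass;
    the KV entries for the prefix are reused as-is."""
    keep = common_prefix_len(cached, incoming)
    return incoming[keep:]

# Normal case: the previous turn's output is already in the cache, so a
# short follow-up prompt only costs a few new tokens.
cached   = [1, 42, 7, 99, 3]          # tokens already in the KV cache
incoming = [1, 42, 7, 99, 3, 55, 8]   # same history + short new prompt
assert tokens_to_reprocess(cached, incoming) == [55, 8]

# But if the front of the context is truncated or rewritten (e.g. the
# window overflows and old turns are dropped, or the template re-renders
# the history), the prefix no longer matches and everything from the
# first mismatch gets re-processed.
shifted = [42, 7, 99, 3, 55, 8]
assert tokens_to_reprocess(cached, shifted) == shifted
```

(If that sketch is roughly right, long re-processing on a short prompt would suggest the cached prefix stopped matching for some reason, rather than the engine discarding the KV cache after generation.)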

TIA

* with Q4-Q5 GGUFs, usually with q4 KV-cache quantization at a ~130k context.