I tracked a major cache reuse issue down to Qwen 3.5’s chat template
Posted by onil_gova@reddit | LocalLLaMA | 23 comments
Over the last week, I’ve been investigating cache misses while optimizing local agent workflows on my M5 Max.
My setup used oMLX.ai as a backend with agents like OpenCode.ai and Pi.dev, but I reproduced the same behavior with other backends like llama.cpp too. At first, I assumed this was an inference engine issue or a cache implementation bug.
What I kept seeing was frustrating:
- the model would read a large amount of context
- it would make a chain of tool or function calls
- I’d ask a simple follow-up question
- and instead of reusing the prompt prefix, a large chunk of the conversation would get reprocessed from much earlier in the history
In practice, a follow-up turn after a tool-heavy interaction could end up redoing tens of thousands of tokens for no good reason.
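To quantify that, here's a minimal sketch (token IDs are made up for illustration) that measures how much of a follow-up request's prompt can actually be served from the prefix cache:

```python
def common_prefix_len(prev_tokens: list[int], next_tokens: list[int]) -> int:
    """Length of the shared token prefix between two consecutive requests.
    Everything past this point has to be reprocessed."""
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs: one early token differs between the requests.
prev_request = [1, 2, 3, 4, 5, 6, 7, 8]
next_request = [1, 2, 3, 9, 5, 6, 7, 8, 10]

reused = common_prefix_len(prev_request, next_request)
print(reused, len(next_request) - reused)  # 3 reused, 6 reprocessed
```

Even a single changed token early in the history forfeits reuse of everything after it, which is why drift in old turns is so expensive.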
I first found a separate issue related to multimodal / first-image transitions, and I already have an oMLX PR for that.
But the bigger text-only issue turned out to be the Qwen3.5 chat template.
After tracing prompt fingerprints and comparing rendered prompts across requests, I found that the template was emitting empty historical `<think>...</think>` blocks for prior assistant turns even when there was no reasoning content. That caused equivalent conversation history to serialize differently across requests, especially after tool use.
The template itself was introducing unnecessary prompt drift.
That matters because prompt drift hurts prefix-cache reuse, which means extra token processing, more latency, and wasted compute.
The fix is a really simple one-line change in the template:
from:
{%- if loop.index0 > ns.last_query_index %}
to:
{%- if loop.index0 > ns.last_query_index and reasoning_content %}
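To see why that one condition matters for caching, here's a toy pure-Python stand-in for the template's history loop (heavily simplified; tag layout and helper names are illustrative, not the real Jinja). It shows the same early assistant turn serializing differently once a new user query moves `last_query_index` past it, and how the `and reasoning_content` guard keeps empty-reasoning turns stable:

```python
def render(messages, fixed=False):
    """Toy stand-in for the template's history loop (illustrative only)."""
    # Index of the most recent user message, like ns.last_query_index.
    last_query_index = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"), default=-1
    )
    parts = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant":
            reasoning = m.get("reasoning_content", "")
            # Shipped condition:  i > last_query_index
            # Fixed condition:    i > last_query_index and reasoning
            wrap = i > last_query_index and (bool(reasoning) or not fixed)
            body = f"<think>\n{reasoning}\n</think>\n{m['content']}" if wrap else m["content"]
            parts.append(f"<|im_start|>assistant\n{body}<|im_end|>")
        else:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    return "\n".join(parts)

history = [
    {"role": "user", "content": "run the build"},
    {"role": "assistant", "content": "<tool_call>build()</tool_call>"},  # no reasoning
    {"role": "user", "content": "did it pass?"},  # follow-up turn
]

# Shipped condition: the tool-call turn flips serialization between requests.
print(render(history).startswith(render(history[:2])))  # False -> cache miss
# Fixed condition: prior turns stay a stable prefix of the next request.
print(render(history, fixed=True).startswith(render(history[:2], fixed=True)))  # True
```

With the guard, an assistant turn that never had reasoning content renders identically whether it sits before or after `last_query_index`, so the follow-up request keeps the earlier render as a byte-for-byte prefix.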
If you’re serving Qwen3.5 locally and relying on prefix caching, this may be quietly costing you performance. If you’ve noticed long follow-up turns getting unexpectedly reprocessed after tool use, this may be the reason.
I reproduced this across different agents and backends. The common factor was the shipped template.
If you’re debugging cache misses on Qwen3.5, check the chat template before adding more cache-layer workarounds.
I’ve opened PRs on the official Qwen3.5 model repos. For example:
https://huggingface.co/Qwen/Qwen3.5-122B-A10B/discussions/22
If you’ve seen similar behavior, help spread the word so this gets patched upstream.
TL;DR: I traced a major cache reuse problem in Qwen 3.5 back to the shipped chat template, not the inference engine. The template emits empty historical `<think>...</think>` blocks for prior assistant turns, which breaks prefix-cache reuse; a one-line template change fixes it.
EffectiveCeilingFan@reddit
Pretty sure this is intended behavior. I also don't understand how this hurts cache reuse in a way that your change fixes. The reasoning the model did is thrown away from conversation history, so it must be purged from the cache either way, so I fail to see how your change prevents that.
onil_gova@reddit (OP)
The issue is that the template was also injecting **empty historical** `<think>...</think>` blocks into prior assistant turns even when there was no reasoning content.
So it's not about keeping old reasoning around. It's about the same prior history getting serialized two different ways across requests.
That changes the prompt text for earlier turns, which breaks prefix-cache reuse even though the actual conversation history is semantically the same.
The fix just stops adding empty historical scaffolding when `reasoning_content` is empty.
x0wl@reddit
Yes, that's because the model is trained to expect the empty block in non-thinking mode, so you should insert the block when throwing away reasoning
onil_gova@reddit (OP)
You are mixing up the live generation prompt with historical message rendering.
The non-thinking generation block at the end of the template is unchanged, so non-thinking mode still inserts the empty `<think></think>` block for the current turn.
The fix only changes how the historical assistant turns are serialized. If a prior assistant message has no `reasoning_content`, it shouldn't get an empty historical `<think></think>` wrapper added to it, because that changes prompt serialization without changing the actual conversation state.
x0wl@reddit
I mean I hope it works, but I think the model was trained with the empty blocks in the content
EffectiveCeilingFan@reddit
My thoughts exactly. I don't see any reason why Qwen3.5 would publish a different official chat template than they used for training, so the model was likely trained to expect the empty thinking blocks.
onil_gova@reddit (OP)
The point is that the model is already completing the original turn correctly without those empty historical wrappers doing any useful work.
The issue is not the live turn. The issue is adding empty blocks back into old turns in history on later requests, even when there was no reasoning content there in the first place.
So this isn’t changing what the model needed to answer. It is only avoiding extra empty markup being reintroduced later and forcing unnecessary recompute.
dreamkast06@reddit
You both are misunderstanding each other 😅
Removing them in old turns affects generation because it was trained to see them in the context. Modifying the template like you did isn't optimal for the model's performance, but it is wonderful that the model does seem to be working around it.
Most users will be willing to make that compromise instead of accepting the cache invalidation.
onil_gova@reddit (OP)
That's exactly the part I'm not convinced by. If those empty wrappers were actually required for the model to stay on track, I'd expect them to matter during the original chain of tool calls too. But the model already completes those turns correctly before the later re-render happens.
The failure only shows up on follow-up requests, when old turns get serialized differently than they were effectively used during the original interaction. That's why this looks more like unnecessary history drift than a meaningful part of the model's working context.
So from what I've seen, this doesn't feel like a quality tradeoff so much as avoiding empty markup being reintroduced after the fact.
dreamkast06@reddit
It's like if you were to remove all of the newlines between paragraphs in english to save space. You'll probably still be able to read it, but might get confused since you were expecting it to be broken up.
The model is just smart enough to deal with it. Qwen 3 Coder wasn't very good at dealing with misformatted templates; 3.5 seems to be doing better.
If only GPT-OSS would have been this forgiving...
onil_gova@reddit (OP)
The model already completes the first pass and tool flow just fine without needing those empty historical wrappers to mean anything. The problem only shows up later, when the same old turn gets re-rendered differently on follow-up requests and forces unnecessary recompute.
So the fix is basically: if the model already handled the turn correctly and there was no actual reasoning content there, don't add empty history markup back in afterward.
I've tested it quite a bit locally, and the result was just better cache reuse without breaking the actual interaction flow.
DarkEye1234@reddit
I can confirm this seems to be working much better. Will play with it for a day or two. I had an issue with opencode that I attribute to it injecting some empty messages, which were causing cache invalidation and checkpoint recreation all the time.
With the fix, the flow is more like with gemma4 right now (no cache invalidations).
onil_gova@reddit (OP)
Nice, thanks for testing it 💯
That lines up with what I was seeing too. Empty history scaffolding was causing way more invalidation than it should have. Glad it's behaving better for you now.
ganhedd0@reddit
Oh bless you, my child. I was wondering why Qwen3.5-27b kept reprocessing in LM Studio...runs like a charm now.
onil_gova@reddit (OP)
🫶
FullOf_Bad_Ideas@reddit
I don't have any issues with context reprocessing of local Qwen 3.5 397B in Roo, Opencode and CC. I use TabbyAPI with some vibe coded tool call parsing support, not sure what's happening in the templates there since I never read those code edits. Just putting it out as a datapoint.
onil_gova@reddit (OP)
My guess is that TabbyAPI or the surrounding integration may already be normalizing or serializing history in a more stable way, so the template issue doesn't show up as obviously there. That's the tricky part with this bug: some stacks may already have a band-aid that accidentally masks it.
What tipped me off was tracing prompt fingerprints and cache reuse directly across follow-up turns. If you're not looking at the rendered prompt or cache behavior, it's easy to miss.
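If you want to do the same kind of tracing yourself, here's a small sketch of the fingerprinting idea (chunk size and helper names are my own, not from any particular backend):

```python
import hashlib

def fingerprint(rendered_prompt: str, chunk: int = 1024) -> list[str]:
    """Hash fixed-size chunks of a rendered prompt so two requests
    can be compared without diffing the full text."""
    return [
        hashlib.sha256(rendered_prompt[i:i + chunk].encode()).hexdigest()[:12]
        for i in range(0, len(rendered_prompt), chunk)
    ]

def first_divergence(a: str, b: str, chunk: int = 1024) -> int:
    """Approximate character offset where two rendered prompts stop
    matching; everything past it will be reprocessed."""
    for i, (x, y) in enumerate(zip(fingerprint(a, chunk), fingerprint(b, chunk))):
        if x != y:
            return i * chunk
    return min(len(a), len(b))
```

Logging `first_divergence` between the previously rendered prompt and the new one makes drift like the empty `<think>` blocks jump out immediately: the offset lands inside an old assistant turn instead of at the end of the conversation.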
b0tm0de@reddit
hey maybe this is interesting for you
https://github.com/QwenLM/Qwen3/issues/1831
https://github.com/QwenLM/Qwen3/issues/1826
onil_gova@reddit (OP)
thanks for sharing, tracking now
guiopen@reddit
I made a post a while ago talking about exactly this issue and how to fix it in the template when the reasoning is off, I think it can help you:
https://www.reddit.com/r/LocalLLaMA/s/o9GfvqDF7R
Pixer---@reddit
If this fixes the prompt reprocessing in opencode I love you
onil_gova@reddit (OP)
Yep, that's exactly what this should fix! Give it a go.
cviperr33@reddit
I had the same issues with Qwen 3.5 on llama.cpp; it would prefill the entire context on almost all messages, so I switched to Gemma.