I tracked a major cache reuse issue down to Qwen 3.5’s chat template

Posted by onil_gova@reddit | LocalLLaMA | 23 comments

Over the last week, I’ve been investigating cache misses while optimizing local agent workflows on my M5 Max.

My setup used oMLX.ai as a backend with agents like OpenCode.ai and Pi.dev, but I reproduced the same behavior with other backends like llama.cpp too. At first, I assumed this was an inference engine issue or a cache implementation bug.

What I kept seeing was frustrating:

In practice, a follow-up turn after a tool-heavy interaction could end up redoing tens of thousands of tokens for no good reason.

I first found a separate issue related to multimodal / first-image transitions, and I already have an oMLX PR for that.

But the bigger text-only issue turned out to be the Qwen3.5 chat template.

After tracing prompt fingerprints and comparing rendered prompts across requests, I found that the template was emitting empty historical `<think>...</think>` blocks for prior assistant turns even when there was no reasoning content. That caused equivalent conversation history to serialize differently across requests, especially after tool use.
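The fingerprint comparison can be sketched with a few lines of stdlib Python. The role markers here are made up for illustration; the point is that equivalent history serializing differently shows up immediately as a hash mismatch:

```python
import hashlib

def fingerprint(rendered_prompt: str) -> str:
    """Short hash of a rendered prompt; identical histories should match across requests."""
    return hashlib.sha256(rendered_prompt.encode()).hexdigest()[:16]

# If the same conversation fingerprints differently on consecutive requests,
# the template (not the cache layer) is rewriting history.
a = fingerprint("<|user|>hi<|assistant|><think></think>hello")  # empty think block emitted
b = fingerprint("<|user|>hi<|assistant|>hello")                  # same turn, no think block
print(a == b)  # False: equivalent history, different serialization
```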

The template itself was introducing unnecessary prompt drift.

That matters because prompt drift hurts prefix-cache reuse, which means extra token processing, more latency, and wasted compute.
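To see why drift is so expensive: a prefix cache can only reuse KV entries up to the first byte where the new prompt diverges from the cached one, so a stray block early in the history invalidates everything after it. A minimal sketch (with a hypothetical role marker):

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading span between two rendered prompts.

    A prefix cache can only reuse entries up to the first divergence;
    everything after that point must be reprocessed from scratch.
    """
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# The same assistant turn rendered with and without an empty think block:
turn_a = "<|assistant|><think></think>The answer is 4."
turn_b = "<|assistant|>The answer is 4."
print(common_prefix_len(turn_a, turn_b))  # 13 — reuse stops right after the role header
```

If that turn sits early in a long conversation, every token after it gets redone on the next request, which is exactly the tens-of-thousands-of-tokens reprocessing described above.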

The fix is a really simple one-line change in the template:

from:

```
{%- if loop.index0 > ns.last_query_index %}
```

to:

```
{%- if loop.index0 > ns.last_query_index and reasoning_content %}
```
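Here's a tiny Python stand-in for the relevant template branch (a hypothetical re-implementation of the Jinja logic, not the actual template) showing what the `and reasoning_content` guard changes:

```python
def render_history(messages, fixed=False):
    """Tiny stand-in for the relevant branch of the chat template.

    With fixed=False (the shipped behavior), every prior assistant turn gets a
    <think></think> block even when reasoning_content is empty; with fixed=True,
    the block is only emitted when there is actual reasoning content,
    mirroring the `and reasoning_content` guard in the one-line fix.
    """
    out = []
    for m in messages:
        if m["role"] == "assistant":
            reasoning = m.get("reasoning_content", "")
            if not fixed or reasoning:  # buggy path emits the block unconditionally
                out.append(f"<think>{reasoning}</think>")
        out.append(m["content"])
    return "".join(out)

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},  # no reasoning content this turn
]
print(render_history(history))              # hi<think></think>hello  (drifts)
print(render_history(history, fixed=True))  # hihello                 (stable)
```

With the guard in place, a turn with no reasoning content serializes identically across requests, so the prefix cache stays warm.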

If you’re serving Qwen3.5 locally and relying on prefix caching, this may be quietly costing you performance. If you’ve noticed long follow-up turns getting unexpectedly reprocessed after tool use, this may be the reason.

I reproduced this across different agents and backends. The common factor was the shipped template.

If you’re debugging cache misses on Qwen3.5, check the chat template before adding more cache-layer workarounds.

I’ve opened PRs on the official Qwen3.5 model repos. For example:

https://huggingface.co/Qwen/Qwen3.5-122B-A10B/discussions/22

If you’ve seen similar behavior, help spread the word so this gets patched upstream.

TL;DR: I traced a major cache reuse problem in Qwen 3.5 back to the shipped chat template, not the inference engine. The template emits empty historical `<think>...</think>` blocks even when there is no reasoning content, which creates prompt drift, hurts prefix-cache reuse, and causes unnecessary reprocessing of large contexts after tool use. The fix is a one-line template change, and I've opened PRs on the official Qwen 3.5 model repos.