Qwen3.5 27B refuses to stop thinking
Posted by liftheavyscheisse@reddit | LocalLLaMA | 30 comments
I've tried --chat-template-kwargs '{"enable_thinking": false}' and its successor --reasoning off in llama-server, and although it works for other models (I've tried successfully on several Qwen and Nemotron models), it doesn't work for the Qwen3.5 27B model.
It just thinks anyway (without inserting a <think> tag).
Anybody else have this problem / know how to solve it?
llama.cpp b8295
Ok_Procedure_5414@reddit
System prompt. I’ve had pretty great success not messing with the templates or budgets; instead, I give it the Gemini Pro system prompt. It actually works pretty well in terms of thinking depth while still breaking out of its thinking state and getting on with replying to you.
chansumpoh@reddit
this sounds fascinating - any chance you could point me to where to find this? Thanks in advance :)
lundrog@reddit
Post your config?
liftheavyscheisse@reddit (OP)
I'm running llama-server on my mac trying to run Qwen3.5 27B (Q8, unsloth dynamic Q4, and also Qwopus Q4; heck even the 40B frankenstein monster built out of two 27B models that's floating around huggingface) and they all have this issue despite other Qwen3.5 model sizes (2B, 9B, 35B-A3B, 122B-A10B) not. Which aspects of my config do you need more information about?
Command line flags are like ./llama-b8295/build/bin/llama-server --model ./models/Qwen3.5-27B.Q8_0.gguf --port 8000 --threads 8 --seed 1337 --cache-reuse 256 --reasoning off --temp 1 --top-p 0.95 --min-p 0.01 --top-k 40 --ctx-size 250000 --no-context-shift --batch-size 2048 --ubatch-size 2048 --jinja --presence-penalty 1.5 --repeat-penalty 1
smartsometimes@reddit
Would you be able to link that 40B frankenstein model? 😃
liftheavyscheisse@reddit (OP)
https://huggingface.co/DavidAU/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking looks like he might've updated the jinja template maybe?
lundrog@reddit
In llama.cpp I use something like --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.0 --presence-penalty 2.0 --repeat-penalty 1.0 --jinja --no-context-shift --chat-template-kwargs '{"enable_thinking": false}'
Time-Dot-1808@reddit
The dangling </think> tag when thinking is disabled is a known quirk with Qwen3.5. The model generates the closing tag because the template always expects one, but the content between the tags is empty.
For the chat template approach, you don't need to convert anything. llama.cpp lets you override just the Jinja template without modifying the model weights:
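A sketch of what that can look like (the template filename is a placeholder; export the model's template, edit the thinking conditional, and point llama-server at your copy):

```shell
# Paths are placeholders; --chat-template-file overrides the GGUF's
# built-in Jinja template without touching the model weights.
./llama-b8295/build/bin/llama-server \
  --model ./models/Qwen3.5-27B.Q8_0.gguf \
  --jinja \
  --chat-template-file ./chat_template_nothink.jinja \
  --reasoning-budget 0
```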
Or if you just want to strip it in post-processing, the easiest fix is filtering out a leading </think> tag (i.e. anything matching ^\s*</think>\s*) before displaying the response.
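For the post-processing route, a minimal shell sketch (the function name is made up; it just trims whitespace and one dangling close tag):

```shell
# strip_think: drop a single leading dangling </think> tag (and the
# whitespace around it) from a model reply. Pure parameter expansion,
# so no external tools are needed.
strip_think() {
  local s=$1
  s="${s#"${s%%[![:space:]]*}"}"   # trim leading whitespace/newlines
  s="${s#</think>}"                # drop the dangling close tag, if any
  s="${s#"${s%%[![:space:]]*}"}"   # trim again after the tag
  printf '%s' "$s"
}
```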
liftheavyscheisse@reddit (OP)
I did --reasoning-budget 1, which gives it just enough thinking budget to insert a <think> tag, which immediately gets closed out by </think>.
fallingdowndizzyvr@reddit
Add "--reasoning-budget 0" to the command line. No more thinking.
liftheavyscheisse@reddit (OP)
Heyy, this worked! No more thinking. It still leaves a hanging </think> tag, which is a little weird. Know how to get rid of that?
fallingdowndizzyvr@reddit
Hm... I never noticed so no. But I know they are working on reasoning budget still so hopefully it gets resolved.
liftheavyscheisse@reddit (OP)
I decided to set --reasoning-budget 1. This gives it just enough budget to insert a <think> tag, which immediately gets closed out by </think>.
fallingdowndizzyvr@reddit
Brilliant!
ieatdownvotes4food@reddit
using vllm there were two variables I had to set to get thinking working correctly.
--reasoning-parser qwen3 --tool-call-parser qwen3_coder
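For reference, a sketch of the full invocation (the model id and port are assumptions, not from the thread):

```shell
# Both parsers are needed so vLLM splits <think> blocks out of the
# response instead of leaking a bare trailing </think>.
vllm serve Qwen/Qwen3.5-27B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --port 8000
```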
without them you would only get the trailing </think> which sucked
david-deeeds@reddit
With regex, maybe? Replace all instances of </think> with nothing.
silenceimpaired@reddit
Ban the “Wait” token. ;)
“Send comment. Wait. If the user bans the wait token another token with similar meaning may be used.”
Ok_Diver9921@reddit
The core fix (--reasoning-budget 0) is right, but worth understanding why --reasoning off doesn't work the way you'd expect. The chat template has a conditional block that checks whether thinking is enabled, but the model's weights have been trained with thinking tokens as part of the generation flow. Setting it "off" in the template removes the tag but doesn't actually suppress the model's tendency to reason before answering - it just loses the delimiter, so you get thinking content mixed into the response without any tags.
Practical tip from running these models in production: keep thinking ON for anything involving multi-step reasoning, code generation, or math. Turn it off (budget 0) for classification, extraction, and simple Q&A where the overhead isn't worth the latency. The quality difference is dramatic on reasoning tasks - I saw a 40% drop in accuracy on multi-step code edits when thinking was suppressed, but zero difference on straightforward translation and formatting tasks.
Guardian-Spirit@reddit
According to your post history, you have a personal experience, pain or a "practical tip" on each and every post you make. And a perfectly structured text.
How is this possible for a human?
StuartGray@reddit
You’re most probably using an old, outdated GGUF conversion with a faulty built-in template.
Update your model to a more recently released quant.
This is important because, depending on who made your quant, there are likely other template issues that will do things like break tool calling.
Also, hate to say it, but even when you turn thinking off, some prompts will generate reams of thinking-like outputs outside of thinking tags.
All the Qwen 3.5 models are seriously overtrained on thinking, and anyone claiming otherwise isn’t applying them to anything other than very easy prompts that don’t need the claimed power of these models.
It’s very easy to reproduce the thinking bleed through problem with thinking turned off.
egomarker@reddit
Copy its chat template to a separate file and swap values in the "if" block at the end of it. Use built in chat template if you want it to think and your custom chat template if you don't.
liftheavyscheisse@reddit (OP)
I would need to download the .safetensors and convert to .gguf myself then? Never done that before; got any tips?
uber-linny@reddit
No, it's the chat_template.jinja
liftheavyscheisse@reddit (OP)
All the models I've used so far have come pre-quantized as .gguf and I don't see any chat_template.jinja file anywhere. How do I make use of my new custom chat_template.jinja?
liftheavyscheisse@reddit (OP)
Ah, I see. --chat-template cli argument.
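(For a template saved to disk, there's also --chat-template-file; paths here are placeholders:)

```shell
./llama-b8295/build/bin/llama-server \
  --model ./models/Qwen3.5-27B.Q8_0.gguf \
  --jinja --chat-template-file ./my_template.jinja
```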
mp3m4k3r@reddit
Yep, and the template is in the GGUF if you click on it in Hugging Face and scroll down a bit (it's the pane that shows the GGUF's baked-in context settings and such)
uber-linny@reddit
It will be on the originator hf page
Efficient_Ad_4162@reddit
Look at it, its got anxiety.
HealthyCommunicat@reddit
I was having trouble with this; for some reason the EOS token was missing from the chat template/tokenizer.
Mastertechz@reddit
I was able to fix it, but I designed my software around it to force the model with good prompting. You can try out my software, but the bottom line is: if you give the model a permanent system prompt telling it to put all thoughts in think tags, it will behave properly.