Qwen3.5 27B refuses to stop thinking
Posted by liftheavyscheisse@reddit | LocalLLaMA | 30 comments
I've tried --chat-template-kwargs '{"enable_thinking": false}' and its successor --reasoning off in llama-server, and although it works for other models (I've tried successfully on several Qwen and Nemotron models), it doesn't work for the Qwen3.5 27B model.
It just thinks anyway (without inserting a <think> tag).
Anybody else have this problem / know how to solve it?
llama.cpp b8295
Ok_Procedure_5414@reddit
System prompt. I’ve had pretty great success not messing with the templates or budgets; instead, I give it the Gemini Pro system prompt. It actually works pretty well in terms of thinking depth while still breaking out of its thinking state and getting on with replying to you.
chansumpoh@reddit
this sounds fascinating - any chance you could point me to where to find this? Thanks in advance :)
lundrog@reddit
Post your config?
liftheavyscheisse@reddit (OP)
I'm running llama-server on my mac trying to run Qwen3.5 27B (Q8, unsloth dynamic Q4, and also Qwopus Q4; heck even the 40B frankenstein monster built out of two 27B models that's floating around huggingface) and they all have this issue despite other Qwen3.5 model sizes (2B, 9B, 35B-A3B, 122B-A10B) not. Which aspects of my config do you need more information about?
Command line flags are like ./llama-b8295/build/bin/llama-server --model ./models/Qwen3.5-27B.Q8_0.gguf --port 8000 --threads 8 --seed 1337 --cache-reuse 256 --reasoning off --temp 1 --top-p 0.95 --min-p 0.01 --top-k 40 --ctx-size 250000 --no-context-shift --batch-size 2048 --ubatch-size 2048 --jinja --presence-penalty 1.5 --repeat-penalty 1
smartsometimes@reddit
Would you be able to link that 40B frankenstein model? 😃
liftheavyscheisse@reddit (OP)
https://huggingface.co/DavidAU/Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking looks like he might've updated the jinja template maybe?
lundrog@reddit
In llama.cpp I use something like --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.0 --presence-penalty 2.0 --repeat-penalty 1.0 --jinja --no-context-shift --chat-template-kwargs '{"enable_thinking": false}'
Time-Dot-1808@reddit
The dangling </think> tag when thinking is disabled is a known quirk with Qwen3.5. The model generates the closing tag because the template always expects one, but the content between the tags is empty.
For the chat template approach, you don't need to convert anything. llama.cpp lets you override just the Jinja template without modifying the model weights:
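A sketch of what that can look like (the template filename is a placeholder; export the model's template, edit the thinking conditional, and point llama-server at your copy):

```shell
# Paths are placeholders; --chat-template-file overrides the GGUF's
# built-in Jinja template without touching the model weights.
./llama-b8295/build/bin/llama-server \
  --model ./models/Qwen3.5-27B.Q8_0.gguf \
  --jinja \
  --chat-template-file ./chat_template_nothink.jinja \
  --reasoning-budget 0
```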
Or if you just want to strip it in post-processing, the easiest fix is filtering out a leading </think> tag (i.e. anything matching ^\s*</think>\s*) before displaying the response.
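For the post-processing route, a minimal shell sketch (the function name is made up; it just trims whitespace and one dangling close tag):

```shell
# strip_think: drop a single leading dangling </think> tag (and the
# whitespace around it) from a model reply. Pure parameter expansion,
# so no external tools are needed.
strip_think() {
  local s=$1
  s="${s#"${s%%[![:space:]]*}"}"   # trim leading whitespace/newlines
  s="${s#</think>}"                # drop the dangling close tag, if any
  s="${s#"${s%%[![:space:]]*}"}"   # trim again after the tag
  printf '%s' "$s"
}
```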
liftheavyscheisse@reddit (OP)
I did --reasoning-budget 1, which gives it just enough thinking budget to insert a <think> tag, which immediately gets closed out by </think>.
fallingdowndizzyvr@reddit
Add "--reasoning-budget 0" to the command line. No more thinking.
liftheavyscheisse@reddit (OP)
Heyy, this worked! No more thinking. It still leaves a hanging </think> tag, which is a little weird. Know how to get rid of that?
fallingdowndizzyvr@reddit
Hm... I never noticed so no. But I know they are working on reasoning budget still so hopefully it gets resolved.
liftheavyscheisse@reddit (OP)
I decided to set --reasoning-budget 1. This gives it just enough budget to insert a <think> tag, which immediately gets closed out by </think>.
fallingdowndizzyvr@reddit
Brilliant!
ieatdownvotes4food@reddit
using vllm there were two variables I had to set to get thinking working correctly.
--reasoning-parser qwen3 --tool-call-parser qwen3_coder
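For reference, a sketch of the full invocation (the model id and port are assumptions, not from the thread):

```shell
# Both parsers are needed so vLLM splits <think> blocks out of the
# response instead of leaking a bare trailing </think>.
vllm serve Qwen/Qwen3.5-27B \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --port 8000
```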
without them you would only get the trailing </think> which sucked
david-deeeds@reddit
With regex, maybe? Replace all instances of </think> with nothing.
silenceimpaired@reddit
Ban the “Wait” token. ;)
“Send comment. Wait. If the user bans the wait token another token with similar meaning may be used.”
Ok_Diver9921@reddit
The core fix (--reasoning-budget 0) is right, but worth understanding why --reasoning off doesn't work the way you'd expect. The chat template has a conditional block that checks whether thinking is enabled, but the model's weights have been trained with thinking tokens as part of the generation flow. Setting it "off" in the template removes the tag but doesn't actually suppress the model's tendency to reason before answering - it just loses the delimiter, so you get thinking content mixed into the response without any tags.
Practical tip from running these models in production: keep thinking ON for anything involving multi-step reasoning, code generation, or math. Turn it off (budget 0) for classification, extraction, and simple Q&A where the overhead isn't worth the latency. The quality difference is dramatic on reasoning tasks - I saw a 40% drop in accuracy on multi-step code edits when thinking was suppressed, but zero difference on straightforward translation and formatting tasks.
Guardian-Spirit@reddit
According to your post history, you have a personal experience, pain or a "practical tip" on each and every post you make. And a perfectly structured text.
How is this possible for a human?
StuartGray@reddit
You’re most probably using an old, outdated GGUF conversion with a faulty built-in template.
Update your model to a more recently released quant.
This is important because, depending on who made your quant, there are likely other template issues that will do things like break tool calling.
Also, hate to say it, but even when you turn thinking off, some prompts will generate reams of thinking-like outputs outside of thinking tags.
All the Qwen 3.5 models are seriously overtrained on thinking, and anyone claiming otherwise isn’t applying them to anything other than very easy prompts that don’t need the claimed power of these models.
It’s very easy to reproduce the thinking bleed through problem with thinking turned off.
egomarker@reddit
Copy its chat template to a separate file and swap values in the "if" block at the end of it. Use built in chat template if you want it to think and your custom chat template if you don't.
liftheavyscheisse@reddit (OP)
I would need to download the .safetensors and convert to .gguf myself then? Never done that before; got any tips?
uber-linny@reddit
No, it's the chat_template.jinja
liftheavyscheisse@reddit (OP)
All the models I've used so far have come pre-quantized as .gguf and I don't see any chat_template.jinja file anywhere. How do I make use of my new custom chat_template.jinja?
liftheavyscheisse@reddit (OP)
Ah, I see. --chat-template cli argument.
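(For a template saved to disk, there's also --chat-template-file; paths here are placeholders:)

```shell
./llama-b8295/build/bin/llama-server \
  --model ./models/Qwen3.5-27B.Q8_0.gguf \
  --jinja --chat-template-file ./my_template.jinja
```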
mp3m4k3r@reddit
Yep, and the template is in the GGUF if you click on it in Hugging Face and scroll down a bit (it's the pane that shows the GGUF's baked-in context settings and such)
uber-linny@reddit
It will be on the originator hf page
Efficient_Ad_4162@reddit
Look at it, its got anxiety.
HealthyCommunicat@reddit
I was having trouble with this; for some reason the EOS token was missing from the chat template/tokenizer.
Mastertechz@reddit
I was able to fix it, but I designed my software around it to force the model with good prompting. You can try out my software, but the bottom line is: if you give the model a permanent system prompt telling it to put all thoughts in think tags, it will behave properly.