Qwen 3.6: worse adherence?
Posted by tkon3@reddit | LocalLLaMA | View on Reddit | 42 comments
Just swapped Qwen 3.5 for the 3.6 variant (FP8, RTX 6000 Pro) using the same recommended generation settings. My stack is vLLM (v0.19.0) + Open WebUI (v0.8.12) in a RAG setup where the model has access to several document retrieval tools.
After some initial testing (single-turn, didn't try disabling interleaved reasoning yet), I’ve noticed some significant shifts:
- 3.6 is far more "talkative" with tools. Reasoning tokens have jumped from a few dozen to several hundred per tool call.
- It struggles to follow specific instructions compared to 3.5.
- It seems to ignore or weight the system prompt much less.
- Despite being prompted for exhaustive answers, the final responses are significantly shorter.
I suspect a potential issue with the chat template or how vLLM handles the new weights, even though the architecture is the same. Anyone else seeing similar problems?
finevelyn@reddit
This is interesting if true. With Qwen 3.5 people said giving the model tools fixes the overthinking issue, but to me it seemed like a bug more than anything, because it doesn't make sense that the model would need to think less with tools.
Glad-Mode9459@reddit
From what I found, Qwen 3.6 Plus (which was in free preview) defaults to temp 0.2 and top_p 0.9.
exact_constraint@reddit
Tried the model out this AM on a project I’ve been building with 3.5 27B, served via llama.cpp. 3.6 enjoys ignoring the read-only limitation while in Plan mode - it started writing files like it was Build mode.
Seems like a capable model, but ignoring system prompts makes it a non-starter.
exact_constraint@reddit
Update 2:
Okay, ran it for a while. On a bug fix, it went into a doom loop of overthinking, then started trying to run write commands in Plan mode again, while spitting out half sentences and code fragments. Think I’m heading back to Qwen 3.5 27B for a bit here lol.
Ariquitaun@reddit
I've been testing it with a simple prompt, "Summarise the entire history of the Destiny videogame universe in 500 words", three times. All three times it went into a loop about the word count that took ages to get out of. It would be too short, then add too many, then go back to the original version, and so on and so forth. The quality of the final response was impressive though, considering it used its own internal knowledge.
exact_constraint@reddit
I’m waiting for Qwen3.5 to finish knocking out some bugs, then I’m going to try 3.6 again with this runtime flag. It does seem like a decent improvement over 3.5. And holy hell is it fast relative to a 27B dense model. I really wanna love it lol.
https://www.reddit.com/r/LocalLLaMA/s/nsqpI6fSPS
cinnapear@reddit
Yeah, the speed up is no joke.
IrisColt@reddit
heh
exact_constraint@reddit
Update:
Sometime in the last 3ish hours the Unsloth page updated to include this text:
“NEW! Developer Role Support for Codex, OpenCode and more: Our uploads now support the developer role for agentic coding tools.”
Redownloaded, verified the files were different via SHA-256. Seems to have fixed the issue - Can’t get the thing to violate its plan mode prompt and write a file now. Testing more.
lolwutdo@reddit
smfh, how typical of unsloth to release as fast as possible just to reupload again
Kodix@reddit
They are consistently the best under most metrics and the fastest to upload. The ungratefulness and entitlement in this community is something to behold.
yoracale@reddit
The recent GGUF updates are not our fault. In fact, most GGUF updates are unfortunately out of our hands.
We re-uploaded Gemma4 4 times - 3 times were due to 20 llama.cpp bug fixes, some of which we helped solve as well. The 4th is an official Gemma chat template improvement from Google themselves, so these are out of our hands. All providers had to re-fix their uploads, so it's not just us.
For MiniMax 2.7 - there were NaNs, but it wasn't just ours - all quant providers had them - we found NaNs in 38% of bartowski's and 22% of ours. We identified a fix and have already applied it to ours - see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not, but is working on it. We always share our investigations.
For Qwen3.5 - we shared our 7TB of research artifacts showing which layers not to quantize - all providers' quants were suboptimal, not broken - the ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
It might seem these issues are due to us, but it's because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should update everyone.
For Qwen3.6, it was an optional chat template bonus we decided to put into the GGUF so it could work properly in Codex etc.; the original Qwen3.6 template doesn't work for them.
gingerbeer987654321@reddit
thanks for the openness and pragmatic approach.
computehungry@reddit
idk what they're doing. i didn't try 3.6 yet, but for all the previous buggy releases just downloading the safetensors and converting them myself worked without all these weird bugs.
Relevant-Magic-Card@reddit
Sounds like a problem with the harness. Should be able to read only in plan mode. Or execute tools, not write
exact_constraint@reddit
I tracked it down to a single line in the prompt.ts file - While an agent is forbidden from using write tools in plan mode, it’s a soft (prompt based) limit, and there’s a single narrow exception spelled out in prompt.ts where the agent is allowed to edit files in ~/.OpenCode/plans/ for recording its own instructions.
Some models won’t do it, even if asked, because the other read-only prompts are worded too strongly. Some (including 3.6, I guess) can contort themselves into figuring that it’s okay to do whatever in that directory, cause it’s the single exception.
I think I can override that behavior by editing my OpenCode.json file to specifically disallow it - that should take priority over the prompt.ts file. Haven’t tried it yet, Qwen3.5 is still busy knocking out bugs, but maybe this post will help someone who runs into the same problem.
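The underlying issue is that the read-only limit is prompt-based (soft), so a model can talk itself into the exception. A harness could back it with a hard check instead. A hypothetical sketch of such a guard - the allowlist directory mirrors the ~/.OpenCode/plans/ exception described above, but the function and names are illustrative, not OpenCode's actual API:

```python
# Hypothetical hard guard for plan mode: reject write-tool calls whose
# target resolves outside an explicit allowlist, instead of relying on
# prompt wording. Directory names are illustrative.
from pathlib import Path

PLAN_WRITE_ALLOWLIST = (Path.home() / ".OpenCode" / "plans",)

def write_allowed(target: str) -> bool:
    """True only if target resolves inside an allowlisted directory."""
    resolved = Path(target).expanduser().resolve()
    return any(
        resolved.is_relative_to(d.resolve()) for d in PLAN_WRITE_ALLOWLIST
    )
```

A check like this runs before the write tool executes, so no amount of reasoning by the model can route around it.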
Relevant-Magic-Card@reddit
Glad u figured it out!
IrisColt@reddit
Is it just me, or has the model's grasp of non-English languages slipped since Qwen 3.5? It feels like a step backward, but I'm not sure if I'm doing something wrong.
Cosmicdev_058@reddit
It's common for model behavior to shift between versions, especially in RAG setups. I'd double-check your chat template and vLLM config for 3.6, as well as your prompt engineering. If it's still struggling, sometimes routing to a different model via an AI router like orq.ai or even just trying a different provider can help, along with systematic evaluation to confirm the changes.
kidflashonnikes@reddit
It's going to take some time (~4 weeks) to get the configs sorted and the bugs fixed in the inference engines. Don't expect zero-day patches. Just be patient.
mlhher@reddit
From my very initial testing it works great. I just ran it and it implemented two features on two separate projects flawlessly from the first prompt (literally just two sentences) to the end without any guidance.
Like literally just a slightly better 3.5-35B. Maybe the quant (I use Q4_K_XL) or the harness (I use Late) is the issue. Another important thing to note: add --chat-template-kwargs '{"preserve_thinking": true}'. Not sure how well it does without this - I haven't tried it, and frankly I won't, since using false (the default) will increase prompt-processing time at certain points.
If you use any obscure language/framework etc. I would suggest to plug in context7/to give examples.
Exciting_Variation56@reddit
I see this is your harness. I will be building it and seeing if it can operate like my pi coding agent setup. I appreciate cutting the bloat context.
onil_gova@reddit
PSA: Qwen3.6 ships with preserve_thinking. Make sure you have it on. Details here.
Sticking_to_Decaf@reddit
Interesting. I swapped from 3.5-27B to 3.6-35B and found the tool calling in Hermes Agent much better with 3.6. It’s verbose in the reasoning but so much faster and the tool calls are still clean.
National_Meeting_749@reddit
How are you dealing with Hermes? Its self-improvement cycle has never allowed me to have a cron job run how I want it - the model always "improves" it.
Sticking_to_Decaf@reddit
I do find it likes to do its own thing. It isn’t always good at staying within task parameters. But usually that isn’t a problem. I have just gotten used to repeatedly telling it “do only X” or “do Y exactly like this…”. I also have it set to its most verbose mode so that I see all the reasoning and tool calls. That helps me correct it. But for highly deterministic workflows that I want exactly so, I use either n8n or Robomotion and skip Hermes. I use Hermes when I have a one-off task I just want done. Like take this list of YouTube URLs and download the videos to the desktop in 720p mp4 format. Or look at these two versions of this GitHub repo and summarize all the changes.
National_Meeting_749@reddit
Rip lol. I wanted an agent to set up those workflows and just do them. It's great at one off "hey do this".
-Ellary-@reddit
Qwen 3.5 done by original Qwen team.
Qwen 3.6 done by new team in short period of time.
Don't believe the benches.
unjustifiably_angry@reddit
Wasn't it like 3 people that left?
Big_Mix_4044@reddit
I used the provided chat template with llama.cpp and can confirm. It's smart but it comes with a price.
Long_comment_san@reddit
If I actually had balls, I would have worked on a dataset that makes models really good to talk to instead of becoming a developer's slave
ambient_temp_xeno@reddit
"I'm getting the word..."
"benchmaxxed"
NandaVegg@reddit
Unfortunately the whole 3.6 line kind of seems to be that, but for OpenClaw rather than benchmarks themselves. 3.6 Plus is a huge downgrade for general chat purposes that don't involve an agentic loop (it has very weird formatting and a tendency to insert EOS too early).
-Ellary-@reddit
"we fired whole team and trained 35b a3b for 2 weeks to outperform gemma 4 31b dense."
some_user_2021@reddit
Nice boobs
noctrex@reddit
Try it with the default chat template, or the template from unsloth.
leonbollerup@reddit
Will probably take a bit of time before it gets optimized, fixed and tweaked.. How is it without reasoning/thinking? (I just disable that)
Substantial_Swan_144@reddit
Gemma 4 has a bug that makes the model go into a degenerate loop if both structured output and thinking are allowed, and it behaves similarly for tools. I worked around that allowing the model to think freely and then constraining the second reply with JSON.
But guess what: Qwen 3.6 doesn't like that at all.
I'm being forced to write a second implementation for Qwen 3.6 just because of that, but I don't even know if it's a bug due to the model being too new or if it's due to how Qwen was trained.
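The two-pass workaround described above (think freely, then constrain) can be sketched harness-side with the model client injected as a plain function, so the same orchestration works against either model. Everything here is hypothetical scaffolding, not any particular library's API:

```python
# Sketch of the two-pass workaround: let the model reason unconstrained,
# then make a second, JSON-constrained call that only restates the
# conclusion. `call_model` is an injected stand-in for your real client.
import json
from typing import Callable

def two_pass(question: str, call_model: Callable[[str, bool], str]) -> dict:
    """Pass 1: free-form reasoning. Pass 2: JSON-constrained restatement."""
    reasoning = call_model(question, False)  # no structured-output constraint
    constrained = call_model(
        f"Restate this answer as JSON with an 'answer' key:\n{reasoning}",
        True,  # enable structured output for this call only
    )
    return json.loads(constrained)

# Stubbed usage: a fake client that returns JSON only when constrained.
def fake_client(prompt: str, json_mode: bool) -> str:
    return '{"answer": "42"}' if json_mode else "The answer is 42."

result = two_pass("What is 6 * 7?", fake_client)
print(result["answer"])
```

The second call never has to reason, which is what sidesteps the degenerate loop when structured output and thinking are combined.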
Varmez@reddit
I thought I was going crazy, I keep having looping with Gemma 4 and also have both those on.
Purple-Programmer-7@reddit
Help! u/yoracale and u/danielhanchen
Specter_Origin@reddit
I noticed similar behavior. I was hoping 3.6 would be all about fixing this overthinking issue of 3.5; guess I'm gonna stick with gemma...
Dany0@reddit
Seems like the same issue as Q3.5, needs a lot of context + system prompt to sit straight so to speak