Qwen3.6-35B is worse at tool use and reasoning loops than 3.5?
Posted by mr_il@reddit | LocalLLaMA | View on Reddit | 46 comments
Been running the new model all evening in different quants and coding tasks with OpenCode. Used oMLX and LM Studio. Used the recommended settings for precise tasks (temp 0.6, top-k 20, etc.) and the OpenCode agent. So far my finding is that the model goes into infinite reasoning loops more often than 3.5, and I sometimes see failed tool calls. The latter could be parser bugs, but the former is the model itself.
It’s ok on basic apps, but really struggles to move ahead on something more complex like a simple 3D game even when the context is nearly empty, as if it tries to be super defensive and rechecks itself continuously.
Does anyone else have similar observations?
milpster@reddit
Did you enable this?
https://www.reddit.com/r/LocalLLaMA/comments/1sne4gh/psa_qwen36_ships_with_preserve_thinking_make_sure/
SheepherderDense7888@reddit
After I added this piece of code to LM Studio, I haven't seen a loop in two days.
mr_il@reddit (OP)
Haven’t tried this yet. Although I’m not sure how that may help as the flag apparently allows thinking tokens to be preserved across turns, but the model sometimes gets stuck even with just one turn.
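For context, the flag changes how prior assistant turns are rendered into the prompt. A rough illustrative sketch of the idea (the real Qwen chat template is Jinja inside the GGUF, and this is not its actual code):

```python
import re

# Without preserve_thinking, prior-turn <think>...</think> blocks are stripped
# from the conversation history before the next generation; with it, they are
# kept. Field names and format here are illustrative only.

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def render_history(messages, preserve_thinking=False):
    rendered = []
    for msg in messages:
        content = msg["content"]
        if msg["role"] == "assistant" and not preserve_thinking:
            content = THINK_RE.sub("", content)
        rendered.append(f'{msg["role"]}: {content}')
    return "\n".join(rendered)

history = [
    {"role": "user", "content": "Add a jump to the player."},
    {"role": "assistant", "content": "<think>Need to edit player.py</think>Done."},
]
print(render_history(history, preserve_thinking=False))  # <think> block stripped
```

So the OP is right that this shouldn't matter for a single-turn loop; it only changes what later turns see.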
milpster@reddit
That is really weird, it works great for me. Do you have a recent build of llama.cpp? Do you quantize kv cache at all?
mr_il@reddit (OP)
I used the latest version available in LM Studio (2.13.0) for the GGUF version and the latest mlx-lm build for oMLX. I also tried various quants between 6-bit and BF16, as well as with and without quantized KV cache. Higher precision is more reliable but still fails.
brobits@reddit
asked qwen3.6-35b-a3b to generate 3 random numbers and choose 1 of those at random. it got into a reasoning loop around using a tool to generate the number or just picking a number itself without a generator. kept changing its mind
ixdx@reddit
I tested this on bartowski/Qwen3.6-35B-A3B-Q5_K_L. 10 attempts and not a single loop. The response was generated in less than 10 seconds.
--reasoning on --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking":true}'
gundamcs@reddit
Same setup, but it went into a reasoning loop in 1/15 attempts. Then I added min_p=0.05, and the reasoning loop hasn't happened since.
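For anyone wondering what that parameter does: min-p filtering discards tokens whose probability is below min_p times the top token's probability, which cuts the long tail of near-zero continuations that can keep a loop alive. A minimal standalone sketch of the idea (this is a hypothetical reimplementation for illustration, not llama.cpp's actual sampler code):

```python
def min_p_filter(probs, min_p=0.05):
    """Return the indices of tokens that survive min-p filtering.

    A token survives if its probability is at least min_p times the
    probability of the most likely token.
    """
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# A peaked distribution: with min_p=0.05 the cutoff is 0.05 * 0.60 = 0.03,
# so the 0.02 tail token is dropped before sampling.
probs = [0.60, 0.25, 0.10, 0.03, 0.02]
print(min_p_filter(probs, 0.05))  # [0, 1, 2, 3]
```

With min_p=0.0 (the recommended default above) nothing is filtered, so raising it to 0.05 is a real behavior change, not a no-op.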
gabbydra@reddit
Unbelievable, this has worked for me, such a small tweak. I was getting reasoning loops even with a temp of 0.8. Several prompts that reliably got stuck in infinite reasoning loops before now run to completion. I am running Bartowski's Q5_K_M GGUF (because it is smaller than Unsloth's) in LM Studio with 100k context on a 5090.
edsonmedina@reddit
In my experience tool calling has been better than 3.5 35B, but it does get stuck on infinite loops quite often
Obvious-Sea3133@reddit
I've run benchmarks on the first 100 SWE-bench Verified samples using various Unsloth quantizations. I am using mini-swe-agent with a 250-pass limit and the full context window.
Errors: the output does not start with 'diff --git', i.e. the model fails to follow the system prompt.
Incomplete: the run reached the 250-pass limit.
The benchmark for Qwen3.6-35B-A3B-UD-Q8_K_XL (Unsloth) was a disappointing surprise; it solved fewer tests and had more errors than Qwen3.5.
Has anyone else seen similar results?
I will try with other quantizations.
mr_il@reddit (OP)
Amazing, thank you so much. This kind of confirms my gut feeling, although it's surprising to see 3.6 so far down with such a high error rate. Overall it feels like 3.6 is a bit overcooked. Like a race-track car falling apart on the first pothole on a real road.
I’d love to run your setup on some bigger hardware to see how 27B Q8 and BF16 stack up. Let’s collaborate?
Obvious-Sea3133@reddit
Yes, sure. I tested Qwen3.6-35B-A3B-Q5_K_M by AesSedai, and it shows similar performance. Tell me what you need.
| Model | tests | resolved | unresolved | error | incomplete |
| -------- | -------- | -------- | -------- | -------- | -------- |
| Qwen3.6-35B-A3B-UD-Q8_K_XL (Unsloth) | 100 | 53 | 26 | 18 | 3 |
| Qwen3.6-35B-A3B-Q5_K_M (AesSedai) | 100 | 51 | 29 | 18 | 2 |
mr_il@reddit (OP)
Cool! I'd love to see Qwen3.5-27B:Q8_K_XL vs Qwen3.5-27B:BF16 (Unsloth is fine), and it would be great to see how the TurboQuant 3.5/4-bit KV cache actually affects performance. Then also comparing Gemma4-26B MoE and Gemma4-31B dense in the same quants against the equivalent Qwens (35B and 27B).
That aside, it'll be interesting to test low-bit quants of MiniMax-M2.5/M2.7, ones that can be realistically run locally: Q3_K_XL for example.
Finally, trying to run everything in 5 passes instead of 1 to see if there's variability in outcomes.
I know it's a lot to ask, but these are the questions I have :D
Obvious-Sea3133@reddit
The MiniMax-M2.5-UD-Q3_K_XL model achieved a score of 71/100 with no errors. However, I am experiencing memory leaks with this model, and token generation speed degrades more rapidly as context length increases compared to the Qwen3.5-122B-A10B-UD-Q5_K_XL. Same problems with MiniMax-M2.7.
So far, I have been choosing the Qwen3.5-122B-A10B.
If you have the hardware resources, I can help you run the benchmarks.
mr_il@reddit (OP)
Thanks, I'll DM you
AnickYT@reddit
Is this with the updates and bugfixes?
Obvious-Sea3133@reddit
Yes, it's related to the latest versions of Qwen3.5. Qwen3.6 has not yet received any updates.
Interesting_Key3421@reddit
Yes, tested q4
jingtianli@reddit
where is this correct chat template?
milpster@reddit
i think they mean this (for llama.cpp):
--chat-template-kwargs '{"preserve_thinking":true}'
Fin5ki@reddit
THIS! After adding this, when using llama.cpp, qwen3.6 has stopped looping for me. Working really well, actually. I'd say it feels even a little better than 3.5 27b at agentic stuff.
willraceforbeer@reddit
Yeah, I've seen this multiple times. It's pretty fantastic when it's not getting stuck in reasoning loops though.
I ended up going back to qwen3-coder-next for my primary coding model. It's a bit slower but easily solved challenges that were stumping 3.6.
RateRoutine2268@reddit
yup facing same issue, tried with different params , it goes into extended thinking loops for complex tasks
GregoryfromtheHood@reddit
Yeah, it's been worse for me too, using Q5 and Q8 quants from Unsloth. It regularly bombs out in agentic loops or returns just a tool call inside a thinking block. I've tried combinations of preserved thinking, no reasoning, and so on, but it's super unreliable in my experience.
blaknite12@reddit
The tool call inside thinking blocks has been a persistent problem for me with many qwen3.5 variants. So much so I got it to write a proxy that fishes them out of the response while it’s streaming. Tool call accuracy is pretty much 100% now
RS_n@reddit
Can you share proxy code please?
blaknite12@reddit
Here you go https://github.com/blaknite/opencode-plugins
There’s a few different opencode plugins in there to improve tool usage. If you’re not using opencode the server in ./lib can be started on its own.
It’s designed for the specific limitations of running qwen3.5 in LM Studio.
It does two things:
- if reasoning_content ends with a tool call, it pulls it out and creates a correct tool call response
- if reasoning_content is populated and content is not, it automatically re-prompts the model to continue and transparently streams the response as part of the original response
I’ve found this eliminates the issue entirely.
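The first behaviour is easy to picture as code. This is a hypothetical minimal sketch, not the linked plugin's actual implementation; the `<tool_call>...</tool_call>` wrapper matches the format Qwen's chat template emits, but the function name and return shape here are made up for illustration:

```python
import json
import re

# If the model emits its tool call at the end of the thinking block
# (reasoning_content) instead of in the content field, fish it out and
# return it as structured data alongside the cleaned reasoning text.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>\s*$", re.DOTALL)

def extract_trailing_tool_call(reasoning_content: str):
    """Return (cleaned_reasoning, tool_call_dict) or (original, None)."""
    m = TOOL_CALL_RE.search(reasoning_content)
    if not m:
        return reasoning_content, None
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return reasoning_content, None
    return reasoning_content[: m.start()].rstrip(), call

reasoning = (
    'I should read the file.\n'
    '<tool_call>\n{"name": "read_file", "arguments": {"path": "main.py"}}\n</tool_call>'
)
cleaned, call = extract_trailing_tool_call(reasoning)
print(call["name"])  # read_file
```

A real streaming proxy also has to buffer the tail of the stream so the tag isn't split across chunks, which is presumably where most of the plugin's complexity lives.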
RS_n@reddit
So..you are the bot, ok.
blaknite12@reddit
huh?
anzzax@reddit
It works very well for me, vllm and FP8. I'm replacing 120b with 3.6 35b for my non-coding agentic tasks. Here is my vllm recipe, I use recommended sampling parameters and 'preserve_thinking':
https://gist.github.com/anzax/b1c56a459ce5e6557fbb8b5de396342b
robertpro01@reddit
Not for me, it has been great for my tests.
Euphoric_Emotion5397@reddit
ya.. like gemma 4 .. i think we will see patches done to fix this and that.
lqvz@reddit
Compared to Qwen3.5 27b, it's been more loopy.
Compared to Qwen3.5 35b A3b, it's completing way more tool calls successfully.
RottenPingu1@reddit
How is 3.6 for inference?
Cosmicdev_058@reddit
It's common for model behavior to shift between versions, especially in RAG setups. I'd double-check your chat template and vLLM config for 3.6, as well as your prompt engineering. If it's still struggling, sometimes routing to a different model via an AI router like ORQ AI or even just trying a different provider can help, along with systematic evaluation to confirm the changes.
mr_Owner@reddit
Did you try other harnesses like cline etc?
And can you post your llama.cpp flags, please?
Wise-Hunt7815@reddit
great, thanks bro!
ilintar@reddit
No, zero problems in OpenCode up to 140k context. In fact, it's positively surprised me at how good it is, I'd say it's really close to being a local Gemini Flash 3 equivalent.
H_DANILO@reddit
not for me, for me, this has been the most consistent model ever.
Give it context; the only problems happen when it has to extrapolate missing context mid thinking chain.
CreamPitiful4295@reddit
You have to find out why the tools failed. They fail a lot and it’s not the model’s fault. Unless you think they should just stop what they’re doing and fix it. Wouldn’t that be neat.
Sticking_to_Decaf@reddit
No problems in Hermes Agent so far
somerussianbear@reddit
Had the same issues here; not sure how these releases ship with so many issues. The last one was from Google, for crying out loud!
vevi33@reddit
Odd. I always had a reasoning loop problem with long context on Gemma 26B4E, and sometimes with 3.5 35B, but not with the 3.6 version. I am very surprised how good it is. Way above everything I've tried, especially at this speed...
woolcoxm@reddit
i cant get it to stop getting stuck in reasoning loops, if you figure it out let me know. atm it will go for a prompt or 2 then get stuck in a loop.
benevbright@reddit
The same. Invalid input happens too often on tool calling. The reasoning flow is good. Too bad. Not usable.