Qwen3.6-35B is worse at tool use and reasoning loops than 3.5?
Posted by mr_il@reddit | LocalLLaMA | View on Reddit | 46 comments
Been running the new model all evening in different quants and coding tasks with OpenCode. Used oMLX and LM Studio. Used the recommended settings for precise tasks (temp 0.6, top-k 20, etc.) and the OpenCode agent. So far my finding is that the model goes into infinite reasoning loops more often than 3.5, and I sometimes see failed tool calls. The latter could be parser bugs, but the former is the model itself.
It’s ok on basic apps, but really struggles to move ahead on something more complex like a simple 3D game even when the context is nearly empty, as if it tries to be super defensive and rechecks itself continuously.
Does anyone else have similar observations?
milpster@reddit
Did you enable this?
https://www.reddit.com/r/LocalLLaMA/comments/1sne4gh/psa_qwen36_ships_with_preserve_thinking_make_sure/
SheepherderDense7888@reddit
After I added this piece of code to LM Studio, I haven't seen a loop in two days.
mr_il@reddit (OP)
Haven’t tried this yet. Although I’m not sure how that may help as the flag apparently allows thinking tokens to be preserved across turns, but the model sometimes gets stuck even with just one turn.
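For context, the flag changes how prior assistant turns are rendered into the prompt. A rough illustrative sketch of the idea (the real Qwen chat template is Jinja inside the GGUF, and this is not its actual code):

```python
import re

# Without preserve_thinking, prior-turn <think>...</think> blocks are stripped
# from the conversation history before the next generation; with it, they are
# kept. Field names and format here are illustrative only.

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def render_history(messages, preserve_thinking=False):
    rendered = []
    for msg in messages:
        content = msg["content"]
        if msg["role"] == "assistant" and not preserve_thinking:
            content = THINK_RE.sub("", content)
        rendered.append(f'{msg["role"]}: {content}')
    return "\n".join(rendered)

history = [
    {"role": "user", "content": "Add a jump to the player."},
    {"role": "assistant", "content": "<think>Need to edit player.py</think>Done."},
]
print(render_history(history, preserve_thinking=False))  # <think> block stripped
```

So the OP is right that this shouldn't matter for a single-turn loop; it only changes what later turns see.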
milpster@reddit
That is really weird, it works great for me. Do you have a recent build of llama.cpp? Do you quantize kv cache at all?
mr_il@reddit (OP)
I used the latest version available in LM Studio (2.13.0) for the GGUF version and the latest mlx-lm build for oMLX. I also tried various quants between 6-bit and BF16, as well as with and without quantized KV cache. Higher precision is more reliable but still fails.
brobits@reddit
asked qwen3.6-35b-a3b to generate 3 random numbers and choose 1 of those at random. it got into a reasoning loop around using a tool to generate the number or just picking a number itself without a generator. kept changing its mind
ixdx@reddit
I tested this on bartowski/Qwen3.6-35B-A3B-Q5_K_L. 10 attempts and not a single loop. The response was generated in less than 10 seconds.
--reasoning on --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --chat-template-kwargs '{"preserve_thinking":true}'
gundamcs@reddit
Same setup, but it went into a reasoning loop in 1/15 attempts. Then I added min_p=0.05, and the reasoning loop hasn't happened since.
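For anyone wondering what that parameter does: min-p filtering discards tokens whose probability is below min_p times the top token's probability, which cuts the long tail of near-zero continuations that can keep a loop alive. A minimal standalone sketch of the idea (this is a hypothetical reimplementation for illustration, not llama.cpp's actual sampler code):

```python
def min_p_filter(probs, min_p=0.05):
    """Return the indices of tokens that survive min-p filtering.

    A token survives if its probability is at least min_p times the
    probability of the most likely token.
    """
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

# A peaked distribution: with min_p=0.05 the cutoff is 0.05 * 0.60 = 0.03,
# so the 0.02 tail token is dropped before sampling.
probs = [0.60, 0.25, 0.10, 0.03, 0.02]
print(min_p_filter(probs, 0.05))  # [0, 1, 2, 3]
```

With min_p=0.0 (the recommended default above) nothing is filtered, so raising it to 0.05 is a real behavior change, not a no-op.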
gabbydra@reddit
Unbelievable, this has worked for me, such a small tweak. I was getting reasoning loops even with a temp of 0.8. Several prompts that reliably got stuck in infinite reasoning loops before now run to completion. I am running Bartowski's Q5_K_M GGUF (because it is smaller than Unsloth's) in LM Studio with 100k context on a 5090.
edsonmedina@reddit
In my experience tool calling has been better than 3.5 35B, but it does get stuck on infinite loops quite often
Obvious-Sea3133@reddit
I've run benchmarks on the first 100 SWE-bench Verified samples using various Unsloth quantizations. I am using mini-swe-agent with a 250-pass limit and the full context window.
Errors: the output does not start with 'diff --git', i.e. the model fails to follow the system prompt.
Incomplete: the run reached the 250-pass limit.
The benchmark for Qwen3.6-35B-A3B-UD-Q8_K_XL (Unsloth) was a disappointing surprise; it solved fewer tests and had more errors than Qwen3.5.
Has anyone else seen similar results?
I will try with other quantizations.
mr_il@reddit (OP)
Amazing, thank you so much. This kind of confirms my gut feeling, although it's surprising to see 3.6 so far down with such a high error rate. Overall it feels like 3.6 is a bit overcooked. Like a race-track car falling apart on the first pothole on a real road.
I’d love to run your setup on some bigger hardware to see how 27B Q8 and BF16 stack up. Let’s collaborate?
Obvious-Sea3133@reddit
Yes, sure. I tested Qwen3.6-35B-A3B-Q5_K_M by AesSedai, and it shows similar performance. Tell me what you need.
| Model | tests | resolved | unresolved | error | incomplete |
| -------- | -------- | -------- | -------- | -------- | -------- |
| Qwen3.6-35B-A3B-UD-Q8_K_XL (Unsloth) | 100 | 53 | 26 | 18 | 3 |
| Qwen3.6-35B-A3B-Q5_K_M (AesSedai) | 100 | 51 | 29 | 18 | 2 |
mr_il@reddit (OP)
Cool! I'd love to see Qwen3.5-27B:Q8_K_XL vs Qwen3.5-27B:BF16 (Unsloth is fine), and it would be great to see how the TurboQuant 3.5/4-bit KV cache actually affects performance. Then also comparing Gemma4-26B MoE and Gemma4-31B dense in the same quants against the equivalent Qwens (35B and 27B).
That aside, it'll be interesting to test low-bit quants of MiniMax-M2.5/M2.7, ones that can be realistically run locally: Q3_K_XL for example.
Finally, trying to run everything in 5 passes instead of 1 to see if there's variability in outcomes.
I know it's a lot to ask, but these are the questions I have :D
Obvious-Sea3133@reddit
The MiniMax-M2.5-UD-Q3_K_XL model achieved a score of 71/100 with no errors. However, I am experiencing memory leaks with this model, and token generation speed degrades more rapidly as context length increases compared to the Qwen3.5-122B-A10B-UD-Q5_K_XL. Same problems with MiniMax-M2.7.
So far, I have been choosing the Qwen3.5-122B-A10B.
If you have the hardware resources, I can help you run the benchmarks.
mr_il@reddit (OP)
Thanks, I'll DM you
AnickYT@reddit
Is this with the updates and bugfixes?
Obvious-Sea3133@reddit
Yes, it's related to the latest versions of Qwen3.5. Qwen3.6 has not yet received any updates.
Interesting_Key3421@reddit
Yes, tested q4
jingtianli@reddit
where is this correct chat template?
milpster@reddit
i think they mean this (for llama.cpp):
--chat-template-kwargs '{"preserve_thinking":true}'
Fin5ki@reddit
THIS! After adding this, when using llama.cpp, qwen3.6 has stopped looping for me. Working really well, actually. I'd say it feels even a little better than 3.5 27b at agentic stuff.
willraceforbeer@reddit
Yeah, I've seen this multiple times. It's pretty fantastic when it's not getting stuck in reasoning loops though.
I ended up going back to qwen3-coder-next for my primary coding model. It's a bit slower but easily solved challenges that were stumping 3.6.
RateRoutine2268@reddit
yup facing same issue, tried with different params , it goes into extended thinking loops for complex tasks
GregoryfromtheHood@reddit
Yeah, it's been worse for me too, using Q5 and Q8 quants from Unsloth. It regularly bombs out in agentic loops or returns just a tool call inside a thinking block. I've tried combinations of preserved thinking, no reasoning, and so on, but it's super unreliable in my experience.
blaknite12@reddit
The tool call inside thinking blocks has been a persistent problem for me with many qwen3.5 variants. So much so I got it to write a proxy that fishes them out of the response while it’s streaming. Tool call accuracy is pretty much 100% now
RS_n@reddit
Can you share proxy code please?
blaknite12@reddit
Here you go https://github.com/blaknite/opencode-plugins
There’s a few different opencode plugins in there to improve tool usage. If you’re not using opencode the server in ./lib can be started on its own.
It’s designed for the specific limitations of running qwen3.5 in LM Studio.
It does two things:
- if reasoning_content ends with a tool call, it pulls it out and creates a correct tool call response
- if reasoning_content is populated and content is not, it automatically re-prompts the model to continue and transparently streams the response as part of the original response
I’ve found this eliminates the issue entirely.
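The first behaviour is easy to picture as code. This is a hypothetical minimal sketch, not the linked plugin's actual implementation; the `<tool_call>...</tool_call>` wrapper matches the format Qwen's chat template emits, but the function name and return shape here are made up for illustration:

```python
import json
import re

# If the model emits its tool call at the end of the thinking block
# (reasoning_content) instead of in the content field, fish it out and
# return it as structured data alongside the cleaned reasoning text.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>\s*$", re.DOTALL)

def extract_trailing_tool_call(reasoning_content: str):
    """Return (cleaned_reasoning, tool_call_dict) or (original, None)."""
    m = TOOL_CALL_RE.search(reasoning_content)
    if not m:
        return reasoning_content, None
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return reasoning_content, None
    return reasoning_content[: m.start()].rstrip(), call

reasoning = (
    'I should read the file.\n'
    '<tool_call>\n{"name": "read_file", "arguments": {"path": "main.py"}}\n</tool_call>'
)
cleaned, call = extract_trailing_tool_call(reasoning)
print(call["name"])  # read_file
```

A real streaming proxy also has to buffer the tail of the stream so the tag isn't split across chunks, which is presumably where most of the plugin's complexity lives.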
RS_n@reddit
So..you are the bot, ok.
blaknite12@reddit
huh?
anzzax@reddit
It works very well for me, vllm and FP8. I'm replacing 120b with 3.6 35b for my non-coding agentic tasks. Here is my vllm recipe, I use recommended sampling parameters and 'preserve_thinking':
https://gist.github.com/anzax/b1c56a459ce5e6557fbb8b5de396342b
robertpro01@reddit
Not for me, it has been great for my tests.
Euphoric_Emotion5397@reddit
ya.. like gemma 4 .. i think we will see patches done to fix this and that.
lqvz@reddit
Compared to Qwen3.5 27b, it's been more loopy.
Compared to Qwen3.5 35b A3b, it's completing way more tool calls successfully.
RottenPingu1@reddit
How is 3.6 for inference?
Cosmicdev_058@reddit
It's common for model behavior to shift between versions, especially in RAG setups. I'd double-check your chat template and vLLM config for 3.6, as well as your prompt engineering. If it's still struggling, sometimes routing to a different model via an AI router like ORQ AI or even just trying a different provider can help, along with systematic evaluation to confirm the changes.
mr_Owner@reddit
Did you try other harnesses like cline etc?
And can you post your llama.cpp flags, please?
Wise-Hunt7815@reddit
great, thanks bro!
ilintar@reddit
No, zero problems in OpenCode up to 140k context. In fact, it's positively surprised me at how good it is, I'd say it's really close to being a local Gemini Flash 3 equivalent.
H_DANILO@reddit
not for me, for me, this has been the most consistent model ever.
Give it context; the only problems happen when it has to extrapolate missing context mid thinking chain.
CreamPitiful4295@reddit
You have to find out why the tools failed. They fail a lot and it’s not the model’s fault. Unless you think they should just stop what they’re doing and fix it. Wouldn’t that be neat.
Sticking_to_Decaf@reddit
No problems in Hermes Agent so far
somerussianbear@reddit
Had the same issues here; not sure how these releases ship with so many issues. The last one was from Google, for crying out loud!
vevi33@reddit
Odd. I always had a reasoning loop problem with long context on Gemma 26B4E, and sometimes with 3.5 35B, but not with the 3.6 version. I am very surprised how good it is. Way above everything I've tried, especially at this speed...
woolcoxm@reddit
i cant get it to stop getting stuck in reasoning loops, if you figure it out let me know. atm it will go for a prompt or 2 then get stuck in a loop.
benevbright@reddit
The same. Invalid input happens too often on tool calling. The reasoning flow is good. Too bad. Not usable.