Qwen3.6-35B-A3B released!
Posted by ResearchCrafty1804@reddit | LocalLLaMA | View on Reddit | 717 comments
Meet Qwen3.6-35B-A3B: Now Open-Source! 🚀🚀
A sparse MoE model, 35B total params, 3B active. Apache 2.0 license.
- Agentic coding on par with models 10x its active size
- Strong multimodal perception and reasoning ability
- Multimodal thinking + non-thinking modes
Efficient. Powerful. Versatile.
Blog: https://qwen.ai/blog?id=qwen3.6-35b-a3b
Qwen Studio: chat.qwen.ai
HuggingFace: https://huggingface.co/Qwen/Qwen3.6-35B-A3B
ModelScope: https://modelscope.cn/models/Qwen/Qwen3.6-35B-A3B
ThePirateParrot@reddit
Here we go again with hours of testing and optimisation. But i wont complain!
PassengerPigeon343@reddit
Exactly my thought too! Love to see it but now I have work to do…
Borkato@reddit
Not just that, but updates to llama cpp, then unsloth will say “ok we fixed it”, then that one guy will say “actually I found a bug at layer 73927228, please update” and unsloth will say “ok guys we fixed it for real” so we download, and then qwen will release a new template and a token will be changed, and then unsloth will say ok guys we fixed it for realsies I promise, and then we download and then llama.cpp comes out and says that actually tool calls are broken and…
Zeeplankton@reddit
Lmao
contrebandeco@reddit
I think we're at the point just before llama.cpp finally admits their GBNF autoparser broke the tool call JSON output and they are still trying to blame CUDA 13.2 for it. Yay ? Nay ?
Borkato@reddit
I have no idea what a GBNF auto parser is, but sure :p
contrebandeco@reddit
I'm talking about this: https://www.reddit.com/r/LocalLLaMA/comments/1rmp3ep/llamacpp_now_with_automatic_parser_generator/
And it does break tool calling: https://github.com/ggml-org/llama.cpp/issues/21771
But they've still not confirmed it. I've hit that problem myself, however. With the firecrawl MCP server.
dabiggmoe2@reddit
That's why I waited for bartowski's quants before downloading lol
ab2377@reddit
😂👆
PassengerPigeon343@reddit
This is so accurate it hurts
rm-rf-rm@reddit
Please share once you've optimized! This is crucial to broader adoption as many won't spend the time / don't have the time.
viperx7@reddit
I don't think it would require testing because it's exactly the same model as 3.5 just trained for a bit longer
Joozio@reddit
Running a 35B model locally is genuinely viable now. I swapped my Mac Mini M4 from Gemma 3 to a different 35B config last month and the difference in reasoning depth for structured tasks was noticeable. The memory bandwidth on M4 is the real unlock - 120GB/s means you're not CPU-bottlenecked at these sizes anymore. Wrote about the whole setup and what the swap changed: https://thoughts.jock.pl/p/local-llm-35b-mac-mini-gemma-swap-production-2026
mattabott@reddit
I'm waiting for the tiny ones!
Whole-Net-8262@reddit
Do people use these open models as a personal coding assistant too?
gajesh2007@reddit
Is it actually better than 27B dense model?
gamerendres@reddit
And how does it compare with Llama?
Kodix@reddit
Well this seems absolutely lovely. What a good couple months for local LLMs, huh?
astral_crow@reddit
It’s been a good couple years.
FaceDeer@reddit
Indeed, the pre-2023 models were kinda crappy. They've been much more capable since then.
No_Afternoon_4260@reddit
What models are you talking about pre-2023? Gpt2?
FaceDeer@reddit
Of course not, GPT-2 is far too dangerous to release to the general public.
I was using Markov chain generators back then.
metamec@reddit
> Of course not, GPT-2 is far too dangerous to release to the general public.
https://i.redd.it/a9k6lxer9qwg1.gif
TheToi@reddit
Well in a way, he was right.
Just look at the crap it’s caused with the data centers and the shortages. 😅
aeroumbria@reddit
More like "too dangerous to be hoarded and sold as snake oil to people with no business using them"...
oulu2006@reddit
what an interesting post!! thanks for referencing that historical prediction
No_Afternoon_4260@reddit
Really cool, what did you use them for?
FaceDeer@reddit
For laughs. I had an archive of all of my Reddit comments, I created an automatic FaceDeer Comment Generator. Haven't touched it in a few years, but I found some old outputs from it:
The sad thing is I can remember what kinds of comments it was pulling from for much of that. :)
the__storm@reddit
I started on Flan-T5 - an encoder-decoder model. It was not smart, but at the time still felt basically like black magic for NLP.
fuck_cis_shit@reddit
BERT, GPT-J, GPT-NeoX
Zolty@reddit
Before the 1900s you wouldn't believe how bad they were.
Borkato@reddit
I’m so fucking happy. This was the ONE model I wanted the 3.6 version of the most, because 3.5 is so absolutely insanely mindblowingly good for me with local coding that I’m thinking it can’t possibly get better and still be local.
(Not that it’s the best model, obviously, just that it’s the model that had me say “damn, i can actually do agentic coding now, even if it flubs sometimes”)
IrisColt@reddit
mother of God...
ea_nasir_official_@reddit
If that's the case, imagine how good the 27b 3.6 is gonna be :o
rumblemcskurmish@reddit
We're on a timeline where a 0b model will eventually 1 shot everything. I can't wait.
power97992@reddit
qwen 3.5 27b was okay, still way worse than glm 5.1/5.0 and sonnet 4.6, even worse than gemini 3.0 flash and minimax 2.7.... In fact, qw 3.5 397b is probably worse than minimax 2.7/2.5
Borkato@reddit
This is like comparing a tank to a revolver and complaining it sucks…
IrisColt@reddit
This
Fit-Palpitation-7427@reddit
So glm 5.1 seems to be your preferred LLM? I really like codex 5.4 xhigh, but for local inference I'm on qwen 27b because of the 24GB on my 4090. Think there is something better at my disposal for 24GB?
Rare_Potential_1323@reddit
Goonies ending 😳
jax_cooper@reddit
capable like 27b but fast as 35b? dayum
layer4down@reddit
27B probably won’t be as fast… unless DFlash or DDTree is used which would indeed be insane! Right now I got DFlash working for 27B and it’s a genuine 2-4x performance boost with no tradeoffs so far.
po_stulate@reddit
Did you try it with long context or agentic coding? How's the acceptance rate for these? I saw their github issue saying that the model isn't trained on these data so the acceptance rate will be low?
layer4down@reddit
Personally I’m finding the acceptance rate to be high at lower ctx_length. (38.5tps DFlash vs 10.5tps baseline). My OpenCode sysprompt + tools alone is like ~30k at the moment so I’m going to give Qwen3.6-35B-A3B-BF16 a try instead. Frankly Qwen3.5-27B-BF16 works fine for me at ~10tps TG so it’s not a loss I’m just trying to see if I can improve my bang for buck without quality loss.
Main_Secretary_8827@reddit
What's DFlash?
layer4down@reddit
DFlash (roughly speaking) is a flavor of speculative decoding that helps a model predict more tokens in a single forward pass to improve token generation speed without losing any quality. Up to 3.6x faster IIRC:
https://www.emergentmind.com/videos/dflash-block-diffusion-for-speculative-decoding-f31dc322
Caffdy@reddit
big if true
ArtfulGenie69@reddit
I'm sure this random guy who wants one of the small ass models knows exactly how good it is in the first 10m of the release.
my_name_isnt_clever@reddit
There is no way that after only a couple months and within the same training run they were able to get this MoE with 3b active to perform on-par with 27b active. It's like claiming a new bike can match a sports car, it just doesn't make sense.
Mart-McUH@reddit
I am sure it is not, those are some strange benchmarks where it leads. I suppose it might be better at some narrow tasks but in general no chance.
Key-Contact-6524@reddit
I heard somewhere that deepseek is coming too
A while ago a new unknown 1T-param model (pretty sure my ThinkPad can't run it) appeared for free on OpenRouter. There is good speculation that it is a new deepseek model.
Also judging by the timeouts, it seems like a deepseek model.
Sufficient_Prune3897@reddit
Deepseek has been a week away from releasing for 4 months
power97992@reddit
Deepseek will release v4/3.5 when it is ready to serve at scale cheaply, has finished training on Ascends, and is almost as good as the newest best publicly available gpt/claude model on benchmarks (actual performance might be worse).
Key-Contact-6524@reddit
Probably some issues with the Chinese government, I believe.
They probably want them to run on some locally developed compute chip (speculation btw).
The best we can do is wait till next week lmao
Fit-Palpitation-7427@reddit
I heard it’s gonna be released next week 🫣
Worth_Contract7903@reddit
As soon as Iran gets a nuclear weapon.
Long_comment_san@reddit
Dude, the whole of Iran doesn't write as many Iran comments as you do
Thomas-Lore@reddit
They've been cut off from the internet by the regime for more than a month now. So almost no comments, unless they smuggle in a Starlink, for which they may be killed if found out.
QuinQuix@reddit
So what was it then?
Key-Contact-6524@reddit
Mimo (Xiaomi)
Fantastic-Emu-3819@reddit
Could be KIMI K2.6
Key-Contact-6524@reddit
Apparently it was one of the Xiaomi models
Middle_Bullfrog_6173@reddit
The 1T model ended up being MiMo-V2-Pro, a closed model. Deepseek rumors abound, but so far it's been a cycle of "imminent release" followed by a few weeks of silence.
Key-Contact-6524@reddit
Ahh fuck
AppealSame4367@reddit
If llama.cpp adds dflash one day, it's game over for cloud coding agents.
Pyros-SD-Models@reddit
Wait, wasn't the top thread yesterday how the golden age of LLMs is now over
my_name_isnt_clever@reddit
That's been claimed since 2024.
BassNet@reddit
Open source stable diffusion is still getting wrecked
ansmo@reddit
ERNIE, anima preview 3, LTX distill 1.1 this week?
BassNet@reddit
I think Klein is better than Ernie. Also LTX is still terrible with motion, no LoRA has been able to fix that yet. There is still nothing like Kling or Seedance open source.
RebekkaMikkola@reddit
Benchmarks look solid but I’m always a bit cautious with MoE models. They tend to shine on evals more than real-world workflows. Curious if anyone’s tried this in an actual dev setup yet.
PlanetPhaelon@reddit
Qwen is quickly becoming my favorite to run locally... just really need a better GPU. Have a 3080 now, but I need a 3090 to really run Qwen with enough params to really cook.
Inside-Cantaloupe233@reddit
The model is pretty bad like most local models; it makes way too many mistakes and the hallucinations are non-stop in LM Studio. Coding with it is like fighting with a senior coder who intentionally sabotages your code so you won't get better.
ChoiceLeft1686@reddit
wow! it's very impressive!!
AndreVallestero@reddit
I hope they release 3.6 122B to pressure Google to release their 124B model as well.
RedParaglider@reddit
Exactly. That's my assumption as well, that gemma 124b was held back because it outcompeted flash in some ways. A qwen 3.6 122b would be my daily driver for sure. I'd for sure switch from my qwen 3 coder next 80b.
year2039nuclearwar@reddit
What hardware are you running to fit a 120b? I've got a consumer mobo, so I think I'm stuck with max 48GB
AndreVallestero@reddit
Consumer boards can fit 2x RTX 6000 Pro, that would be 188GB of VRAM, which is enough for Q8 and a good amount of context
year2039nuclearwar@reddit
That's useful to know, thanks but in my country, that card is £9000 and there doesn't seem to be a used market yet. I think buying it at £6000 would work out as "sweet spot" for me so I guess I'll wait until then.
Then again buying it used for such a high value item is such a risk
Far-Low-4705@reddit
dude thats $20k...
Practical-Collar3063@reddit
I think the point he was making is that the limitation is not the mobo
Borkato@reddit
Right?! “Consumer board” lol
Still-Wafer1384@reddit
You should be able to stick more RAM in a consumer Mobo. I have rtx3090 and 64GB system RAM. I can run qwen3.5 122b Q4, be it at a lowly 8 or so tokens/s.
mxmumtuna@reddit
RTX6k on any motherboard will run 122b at max context with full gpu offload.
arbv@reddit
31B IMO, feels more like Pro as far as smartness goes, but of course, it has far less knowledge.
Blues520@reddit
I'm also looking for something to switch from qwen3-coder-next
Still-Wafer1384@reddit
How do you rate it vs qwen3.5 27b for coding?
Blues520@reddit
I've tried qwen 3.5 27b and gemma4 31b but still get better results with qwen3-coder-next on web dev tasks. I am hoping that a 3.6 coder model emerges.
Still-Wafer1384@reddit
How do you rate qwen3 coder next 80B to qwen3.5 27B for coding?
RedParaglider@reddit
I've never really had a reason to use 27b. I have a strix Halo so it's not very fast.
Far-Low-4705@reddit
if u can run qwen 3.5 122b, you should have already switched from qwen 3 coder next tbh
RedParaglider@reddit
Not as good. It's 10 t/s slower and doesn't keep up on agentic tasks as well in my use case. It's close though.
redditorialy_retard@reddit
damn bro I don't got another 3090 for that hahahaha
stoppableDissolution@reddit
31b is outcompeting flash already
TechnoByte_@reddit
Gemma 4 31B is quite close to Gemini 3 Flash so I'd be surprised if the 124B didn't outperform it
Daniel_H212@reddit
I doubt they'll be dangerously close to frontier models. There's like a 5x size gap. They will probably put the final nail in the coffin for all GLM air models and gpt-oss 120b though
VoiceApprehensive893@reddit
the biggest question: is the yapping fixed
Statcat2017@reddit
Give it a rest, Qwen is so obviously neurodiverse.
Ask Qwen to solve global warming and it will have you an answer in three minutes.
Say hello to it and you'll be waiting days for an answer as it argues with itself endlessly.
gmork_13@reddit
I never had this problem with 3.5, but 3.6 is an infinite looper on a lot of settings. fingers crossed something is wrong with this first release in some template or setting.
No_Swimming6548@reddit
I tried it on Qwen chat, still thinks a lot after a simple hello. It crushed a logic question sonnet failed though. I think Qwen models are tuned for agentic use and coding, not for rp or assistant purposes.
BreakfastAdept9758@reddit
god bless china
Asceny@reddit
Dang.. I hope they would release lower param versions..
Dependent-Aardvark32@reddit
wow, it is an impressive benchmark! :)
Key_Extension_2501@reddit
I don't understand, if this model is only 35b and 3b active, then why is it over 3x as expensive on the API than gpt-oss-120b which has 5b active?
wtfihavetonamemyself@reddit
Has anybody tried using a draft model with this, like a qwen 2b or 0.8b? Has it worked in llama.cpp? Noticeable gains?
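For reference, a rough sketch of what a llama-server launch with a draft model could look like (the model paths are placeholders, and the exact speculative-decoding flag names have shifted between llama.cpp versions, so check `llama-server --help` on your build):

```python
# Hypothetical llama-server launch with a small draft model for speculative decoding.
# Model paths are placeholders; flag names vary between llama.cpp versions.
import subprocess

subprocess.run([
    "llama-server",
    "-m",  "Qwen3.6-35B-A3B-Q4_K_M.gguf",   # main model (placeholder path)
    "-md", "Qwen3.5-0.8B-Q8_0.gguf",        # small draft model (placeholder path)
    "--draft-max", "16",                    # max tokens drafted per step
    "--draft-min", "1",                     # minimum draft tokens before speculating
    "-ngl", "99",                           # offload main model layers to GPU
    "-ngld", "99",                          # offload draft model layers to GPU
    "--port", "8080",
])
```

Gains depend heavily on how often the big model accepts the small model's drafts, so it's worth benchmarking on your own prompts.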
julianmatos@reddit
Can confirm, the jump from 3.2 to 3.6 is noticeable. I've been using it for code review and doc summarization tasks that used to feel like a stretch for local models.
If anyone's wondering whether their setup can handle it before committing to the download, localllm.run is handy for checking hardware compatibility with specific models and quant levels.
ResearchCrafty1804@reddit (OP)
LM Performance: Qwen3.6-35B-A3B outperforms the dense 27B-param Qwen3.5-27B on several key coding benchmarks and dramatically surpasses its direct predecessor Qwen3.5-35B-A3B, especially on agentic coding and reasoning tasks.
Long_comment_san@reddit
Holy shit. This looks more like 4.0
dampflokfreund@reddit
It's just benchmarks. Gemma is not obsolete, it has a ton of other qualities specifically for creative writing and european languages.
Potential-Gold5298@reddit
Even the 26B-A4B model outperforms the Qwen3.5-27B in real-world tasks. The Qwen is better suited for tasks like coding or image analysis, while the Gemma 4 is better at almost everything else.
phazei@reddit
Qwen3.5 35B gives 57t/s
Gemma 4 26B is super close in quality, but gives 130t/s
So depending on task, hard to beat that Gemma speed.
edsonmedina@reddit
130 t/s??? What's your setup?
On my Strix Halo Gemma is significantly slower than Qwen3.5 35B at the same quantization.
phazei@reddit
I have a RTX3090 + 128gb DDR5, but I had everything loaded only on the 24gb VRAM, so system RAM didn't make a difference there.
The MoE models are much faster.
lemondrops9@reddit
I'm getting close to the same speed between Qwen3.6 35B A3B and Gemma 4 26B A4B
phazei@reddit
:o
Hmm, I'll have to play with it. i haven't downloaded Qwen3.6 yet. If it's over 100t/s, then it's my winner.
lemondrops9@reddit
It should be over 100t/s. I was vibe coding today, at +50k context it was still over 100t/s
phazei@reddit
I'm using LMStudio. You?
lemondrops9@reddit
Same here but it's running on Linux Mint. Linux helped a lot with issues and it just runs well now. I should try llama.cpp but LM Studio is easy and works great.
edsonmedina@reddit
Qwen3.5 35B is A3B MoE too... should be even faster
phazei@reddit
Why even faster? It's much bigger, so it takes more VRAM, which is why I presumed it was half the speed of Gemma. Still much, much faster than either model's dense version.
edsonmedina@reddit
Faster because it activates less params (3B versus Gemma's 4B). At least in theory.
Are you comparing them with the same quantization?
phazei@reddit
Ah, that does make sense... hmmm, I could have totally remembered shit wrong... let me look at my notes... All the models are Q4_K_*. They were all tested with a context length set to 64K or greater, except Gemma 4 31B, which I had to lower to 16K to get ok speeds.
Qwen3.5 27B: 36t/s
Gemma 4 31B: 31t/s (very small context available; if I increased it too much it quickly went to 12t/s)
Gemma 4 26B-A4B: 124t/s
Qwen3.5 35B-A3B: 57t/s
rumblemcskurmish@reddit
I run Gemma4 on my 4090 and while I love Qwen3.5-35b, Gemma is insanely fast
MeateaW@reddit
Gemma (q8, awfully slow) failed reading text in some of my image tests. (I have a couple prompts that I just feed straight into the models as my quick and dirty self benchmark.) I know the expected output, since its source is reading and comprehending data I know the answer to, and I know the vision-reasoning "trouble spots" in the content, so I get to see how it works around the issues.
Qwen 27/34 never got the text reading wrong (just the analysis).
I'm still sticking with qwen 122b though on my strix system, as it seems to not get stuck in logic loops and reads all the text perfectly, and even has good enough (not great) performance.
Last_Mastod0n@reddit
I have a similar experience on my 4090. You just cant beat gemma's performance
Last_Mastod0n@reddit
In my experience Gemma 4 has been the better vision model. Ill have to check out qwen 3.6 and report back
po_stulate@reddit
Coding and image analysis ARE real world tasks.
YanderMan@reddit
Gemma sucks for tool calling
Significant_Fig_7581@reddit
I think that was more of a llama.cpp problem, they fixed it in the update though
coder543@reddit
No... all Gemma 4 models are very bad at following instructions, and very lazy about calling tools. I have spent days fighting this issue. The tool calls work fine when it feels like making them. Gemma 4 will usually make one tool call, then decide that is "good enough" if there's even a hint of a partial answer in the result, even if the instructions specifically say that it needs to make multiple tool calls, and even if the tool it called says it MUST follow up with calling another specific tool.
Qwen3.5 is much stronger at following instructions and knowing when to make tool calls. I haven't had enough time with Qwen3.6 to form a strong opinion yet, but it seems to be more of the same.
BrianJThomas@reddit
What inference stack are you using for Gemma? I'm still seeing wildly different results between different implementations. It's been interesting to dig into. I didn't realize how many layers there were for templates, tool calling, etc.
coder543@reddit
Just standard llama-server. I do a git pull and recompile almost every day.
arman-d0e@reddit
Honestly I'm not one to push my models, but the opus-trained one uploaded by TeichAI (v2) has actually been very strong with instruction following and tool calling, though I'm sure performance got affected elsewhere
coder543@reddit
No... it is very bad at following instructions, and very lazy about calling tools. I have spent days fighting this issue. The tool calls work fine when it feels like making them.
SummarizedAnu@reddit
It doesn't ? What are you even using?
DoorStuckSickDuck@reddit
He's right, Gemma 4 is substantially less reliable in tool calls compared to Qwen 3.5.
SummarizedAnu@reddit
Don't know about qwen 3.5, but the IQ2_XXS Gemma 4 run on llama.cpp turboquant with the Gemma 4 chat template makes about no wrong tool calls when running in llama.cpp server with MCPs like searxng, fetch, time, exec etc., and the free cloud provider is even better at reasoning.
The Gemma 4 26B a4b is very bad even in its full weights and iq2 quants.
So idk. Maybe you are using it wrong? Cause it works for me.
SummarizedAnu@reddit
I meant 31B for the first one and 26B for the second one. Especially running with the Nous Hermes agent.
Western_Courage_6563@reddit
3 probably, that one wasn't great, unless it was Instruction tuned...
swagonflyyyy@reddit
The tool calling implementation wasn't added properly initially. It works now.
VoiceApprehensive893@reddit
i tried some image recognition on 27b qwen and gemma 31b
qwen was worse
Borkato@reddit
Me when I lie
(Unless you consider “real world tasks” to not include tool calling, in which case you’re not wrong)
Potential-Gold5298@reddit
Real tasks aren't tool calling. I meant answering questions/explaining topics, solving problems (such as budget planning), writing a letter etc. What an ordinary person would ask a chatbot about. Agentic coding, tool calling etc. is work.
Borkato@reddit
I will say though, you are correct in that Gemma is better for general non-agent tasks
Borkato@reddit
Lol, with that (wrong) definition, you’re correct.
Potential-Gold5298@reddit
My English is bad, so I'm very sorry)
Significant_Fig_7581@reddit
How about this one 3.6 for real world tasks?
Potential-Gold5298@reddit
I don't know - it just came out and I haven't seen any tests with it yet. (I don't attach much importance to benchmarks like those on Artificial Analysis - judging by them, Qwen3-4B-Thinking-2507 is equal to GPT-4o, but this is obviously not entirely true). Qwen3.5 and Qwen3 are excellent models, including for many real-world problems. I think Qwen3.6 will be even better, but will likely be worse than Gemma 4 in some scenarios (e.g., languages other than Latin/Chinese, RP/Creative writing).
F1yoz1k@reddit
Chinese labs with benchmaxxing will always be 3 steps ahead of everyone... except the final user.
send-moobs-pls@reddit
Lmao if Gemma was from a random Chinese lab instead of Google it would be largely ignored as mid
Borkato@reddit
Random question but do people actually send you moobs
send-moobs-pls@reddit
I wish 😔
Due-Memory-6957@reddit
Nah, gemma 4 34b is really good. There's a reason it is the first Gemma version to actually get loved, all the others were largely ignored.
rkoy1234@reddit
eh, it's great as a chatgpt at home kinda deal, just not as good for coding imo.
also, multilingual is far above qwen. qwen's non cn/en languages sound like gpt3.5 level awkwardness.
j_osb@reddit
Yep. Like Gemma4 is... nice to talk to. good at like, translation too.
But for what it matters most, like agentic loops or coding, qwen3.5/6 is just better.
Healthy-Nebula-3603@reddit
Exactly my observations.
Gemma 4 31b dense is a great translator.
Qwen is better at coding, especially the 27b dense version
Borkato@reddit
Gemma is also an EXCELLENT coding teacher and summarizer and writer.
Basically if I need someone to fix my code or be an agent, I call qwen. If I need someone to explain something to me or write prose, I call Gemma. :D
BassNet@reddit
Gemma 4 e4b with VL is the best overall tiny model I’ve used, I’ll give it that
Velocita84@reddit
That's what matters the most for you, i use LLMs to translate and ~~gooning~~ writing much more than coding or agent-ing
draconic_tongue@reddit
there's no way you actually believe that right
Healthy-Nebula-3603@reddit
What are you talking about?
In programming even Qwen 3.5 27b is much better than Gemma 4.
I'm waiting for new qwen 3.6 27b
F1yoz1k@reddit
Sadly, there are 20 more use cases outside of coding, and the tradeoff in coding (which is not even that big) is worth it to get exceptional performance from such a model in many different tasks.
Healthy-Nebula-3603@reddit
I'm using Gemma, for instance, as a translator for books. Here Qwen is worse.
IrisColt@reddit
This.
Ifihadanameofme@reddit
I'm crying tears of joy XD It hasn't even been a week since I downloaded the gemma MOE, and before I could even think about switching over, qwen goes "hello there, it's been a while" 🙂 like they didn't just release qwen3.5 less than 2 months ago
jeansec@reddit
I'm curious, what are you doing with your local llm to be so excited ?
Bakoro@reddit
I have an experiment going right now where I define a long horizon goal, task the LLM with breaking it down into high level phases and steps that can be accomplished in a mostly greedy fashion, and put it in an eternal agentic loop.
I define a development pipeline like: mark subgoal as "in progress" -> state the goal and acceptance criteria -> research -> plan the implementation -> implement plan -> review and verify work -> log the stage's work and mark the task as complete.
So every feature in the list gets its own development pipeline, and the model just keeps going. I have a scheduled task at the operating system level to make sure the LLM server is running, and to restart the server and agentic loop automatically if something happens to close it.
It's a bit cheaty to the ultimate dream of having the local model be fully autonomous and self-directed, but I also have a heartbeat to trigger a proprietary model to examine the state of the project, report on the quality of the work the local model is doing, read the logs and identify if the model appears to be stuck on something, or is falling into trivial solutions, or otherwise failing to follow the protocol that has been set out (like not updating the logs, despite doing work), and the proprietary LLM takes corrective action, which is usually sending a message to the local LLM to do XYZ, and updating the system prompt and Agents.md file with instructions.
It's almost embarrassing, but I also made a basic ticketing system, so if I want to inject work into the middle of the plan, or elevate the priority of something, I put the work order in, and that gets priority in the next development loop.
So far I've had the model running for a few days straight, and it's slowly but surely making progress on its own.
At some point I will try to add in additional capacity, like a tool for allowing the LLM to control the mouse and keyboard, so it can use GUI apps and verify visual work. I don't have a ton of confidence in that, because even the biggest LLMs don't have very good visual reasoning yet, but it's worth trying for straightforward visual tasks.
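For anyone curious what that loop could look like in code, here is a minimal sketch assuming a local OpenAI-compatible server; the endpoint URL, stage names, and goals file are illustrative placeholders, not the actual setup described above:

```python
# Rough sketch of an "eternal" agentic pipeline loop against a local server
# (e.g. llama-server). Everything named here is a placeholder assumption.
import json, time, urllib.request

SERVER = "http://localhost:8080/v1/chat/completions"  # assumed local endpoint
STAGES = ["mark subgoal in progress", "state goal and acceptance criteria",
          "research", "plan implementation", "implement plan",
          "review and verify work", "log work and mark complete"]

def ask(messages):
    # Send the running conversation to the local model and return its reply text.
    req = urllib.request.Request(
        SERVER,
        data=json.dumps({"messages": messages, "temperature": 0.7}).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def run_pipeline(subgoal):
    # One development pipeline: walk the subgoal through every stage in order.
    history = [{"role": "system", "content": "You are a coding agent. Follow the pipeline."}]
    for stage in STAGES:
        history.append({"role": "user", "content": f"Subgoal: {subgoal}\nStage: {stage}"})
        history.append({"role": "assistant", "content": ask(history)})

while True:  # the eternal loop; an OS-level scheduled task would restart this script
    for subgoal in open("goals.txt").read().splitlines():
        run_pipeline(subgoal)
    time.sleep(60)
```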
Borkato@reddit
Wow this is cool as hell. So it’s actually working?!
Bakoro@reddit
So far so good.
I'm running the experiment on a secondary laptop I have, so it's not the most speedy process, but the loops are running fairly well, the model is making meaningful progress, and it answers the tickets I put into the system.
The recovery script I made has restarted the llama.cpp server a few times, I don't know what causes the server to crash at this point, but the system recovers.
I have had to add a lot of reminders and instructions for the model to actually test and verify its work. It has a bad habit of changing the API and then not updating the callers.
The proprietary model is doing a fairly good job of course correcting the local model, but it tends to step in and do the work itself, more often than I'd like.
I've reframed the proprietary model's task as an optimization problem to improve the agentic environment of the local model, so now it's reviewing the local model's work, but also trying to improve the meta environment whenever it has to fix errors the local model made.
The concept of models running models seems to be sound. I can't say that the end results will be professional quality, it's a fairly small model after all, but it is making real stuff.
If I had three or four GPUs that had an appreciable amount of VRAM, I think I could really cook up something.
Borkato@reddit
That is really really cool. Thanks for sharing!!
Spectrum1523@reddit
Sir this is LocalLLaMA we dont actually use the models here
Borkato@reddit
I made a wrapper around api calls so it’s private unlike things like opencode. I was annoyed that opencode seems so complex and supports tons of online models and has telemetry (even if it claims it can be disabled, it still pings their site to get the model list) and I just didn’t like that. So I had Gemma and qwen write me a local version and it just pings my endpoint :D
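A minimal sketch of that kind of wrapper, assuming an OpenAI-compatible server on localhost (the base URL and model name are placeholders):

```python
# Minimal private wrapper: the only network traffic is to your own local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # assumed local server

def chat(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="qwen3.6-35b-a3b",  # whatever name your server exposes
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(chat("Refactor this function to avoid the extra allocation: ..."))
```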
Healthy-Nebula-3603@reddit
Connect your Gemma 4 / Qwen 3.5/3.6 to opencode via llama.cpp server and you actually have codex-cli / claude-cli at home :)
awesomeunboxer@reddit
I like to use local models for boilerplate code, then have a flagship online one check the work. Gemma 4 is very good and the qwen line has been very good for this.
Long_comment_san@reddit
I'm actually quite scared because in 6 months we're gonna have models maybe 20-30% better than these. I might need to resume my cooking lessons because that's about the only job left for me.
wipeoutbls32@reddit
Don't worry, You can still be a cook for two more years. Robots are going to do that next
Makers7886@reddit
chillinewman@reddit
There is a robot for that.
autoencoder@reddit
Are the economics worth it?
chillinewman@reddit
20k household robot
https://www.reddit.com/r/singularity/s/E0JUR4ZUTf
MerePotato@reddit
That's the price of the 1X Neo isn't it, not that?
autoencoder@reddit
Oof. And I have to pre-crack its eggs? Still a way to go.
jld1532@reddit
Nothing is worth eating that a chef hasn't tasted
Borkato@reddit
This is absolutely not even close to true
chillinewman@reddit
Probably there is going to be a robot for that too.
SummarizedAnu@reddit
Nice . Gonna use this for everything now.
Ifihadanameofme@reddit
Glum-Atmosphere9248@reddit
Well qwen outputs 3x the tokens so..
Due-Memory-6957@reddit
On the benchmarks that IMO matter the most (LiveBench and MMLU-Pro) it wasn't slapping Gemma, it seemed quite equivalent, in fact.
Mart-McUH@reddit
Gemma4 is harder to configure to run well (it also has a quite complicated chat template and there were a lot of errors when it launched). Could be those benchmarks were done with sub-optimal settings/template.
I can only speak for dense, but Gemma4 31B works notably better for me than Qwen 3.5 27B (though Qwen is good too). But it took me a long time of tuning the prompt to get out of Gemma4 what I wanted.
IMO, at least with complicated tasks, that is a vital flaw of current benchmarks - I assume they use the same prompt for every LLM. But that just does not tell much, because each LLM has different strengths/weaknesses and requires different prompting to get around them.
Both_Opportunity5327@reddit
This does not look correct in my tests, Gemma 4 31b dense wipes the floor with Qwens of similar size.
Borkato@reddit
It’s also ridiculously slow. I hate using it solely for that reason! So excited to have a 35B-A3B as good as the 27B dense of 3.5!
Both_Opportunity5327@reddit
I agree with you the dense model is slow, but it is so good.
The new Qwen is a good upgrade, its fast q8 has very good reasoning(if a bit token hungry).
tavirabon@reddit
Likewise, for most things Gemma is just better than the 27B; it does what you tell it. Even some of these comparisons with the 3.5 models feel wrong, 35B hallucinates worse and gives worse quality outputs in many cases.
biogoly@reddit
It’s called benchmaxxing…
Recoil42@reddit
And Gemma is no slouch!
jazir55@reddit
Uhhhhh..... 3.6 35B-A3B loses against 27B dense in almost every one of these benchmarks. Are we reading the same chart?
Ryba_PsiBlade@reddit
Yes, reading the same charts, but getting something on par with the dense version from an MoE is super valuable. Just like q4 is worse than f8 or f16, but it is super valuable to be able to make it reasonably usable for some of us. Token speed especially.
Big_Mix_4044@reddit
Can't wait to see what new 27b is capable of.
cafedude@reddit
Or the 122b
StyMaar@reddit
This please!
122b has been insane on my Strix Halo. I'm using it all the time and I completely stopped using any closed LLMs since then.
devil_ozz@reddit
4 bit? or 16?
StyMaar@reddit
Q4_K_M, fits with the maximum context.
KriKraKrischi@reddit
Whats your Token per second ?
StyMaar@reddit
pp: 150-200 depending on context length; tg: 18-20
What's super impressive with this model is how little the performance degrades with context length. You get 150/18 with 60000 tokens in the context.
The Qwen-Next architecture is incredible.
jwpbe@reddit
at least one or two
Caffdy@reddit
16 would be impossible, 8 I'd imagine could barely fit (you still need some memory left for the system), so probably 4-bit to 6-bit quant
More-Curious816@reddit
Damn you google for not releasing the 122 version
Glazedoats@reddit
WHAT?! 🤯
johnfkngzoidberg@reddit
With limited VRAM (24GB) is it better to use Qwen3.5 26B A4B at Q4_K_M, or Qwen3.6 35B A3B, but at IQ4_XS?
I’m not sure how the math works out. I know Qwen3.6 is expected to be smarter, but does using a smaller quant but larger parameter size wash each other out?
ZeitgeistArchive@reddit
in many regards it seems a bit worse than qwen3.5?
Borkato@reddit
Wait, what are you comparing it to? 3.5 35B seems worse in almost every metric
Cold_Tree190@reddit
Look at 3.5 27B
Aromatic_Bed9086@reddit
That's not an apples to apples comparison. MoE vs Dense. Hopefully a dense 3.6 drops.
jazir55@reddit
It is apples to apples in real world performance, not architecture.
snomile2@reddit
it's nonsense to compare an A3B model to a 27B dense in technical terms; there will be a Qwen3.6-27B, which will make sense to compare with Qwen3.5-27B
Cold_Tree190@reddit
Idk man that’s just what I figured the other guy was talking about
phazei@reddit
You're comparing it to a dense model, which makes no sense. The fact that it's so close to it with 3B active params at a time is crazy good.
Daniel_H212@reddit
It definitely has lost a few points here and there but it's almost across the board significantly better in coding. I'm surprised this was a general release and not a Qwen3.5 coder.
curious_ab0ut_stuff@reddit
Yeah... it's strange... it underperforms
PinkySwearNotABot@reddit
and further breakdown of your chart into something even more useful
PinkySwearNotABot@reddit
PinkySwearNotABot@reddit
Based purely on what the benchmarks show:
Qwen3.5-27B — General workhorse / agentic coding Best default choice. Use it for agentic coding tasks (SWE-bench style autonomous bug fixing, repo-level tasks), STEM reasoning, math competition problems, and anything requiring broad knowledge. If you don't have a specific reason to use another model, start here.
Qwen3.6-35B-A3B — Frontend & web UI The clear pick for front-end code generation — its QwenWebBench score (1397) is a significant jump above the field. Also solid for terminal/CLI agent tasks and holds up well on coding broadly. If you're building web apps, components, or anything visual/browser-facing, reach for this one first.
Qwen3.5-35B-A3B — General agent tasks Where it edges out Qwen3.5-27B is in agentic workflows: TAU3-Bench, MCP-Atlas (tool use). If you're building multi-step agents that call external tools or APIs, this is worth considering over the 27B. Coding ability is close to the 27B too, so it's a reasonable all-rounder if you need slightly better tool-use behavior.
Gemma4-31B — Multimodal / visual agents + knowledge retrieval The only model that wins VITA-Bench (multimodal/visual agent tasks), and it leads on MMLU-Redux and SuperGPQA. If your use case involves processing images, visual understanding in an agent context, or you need strong general knowledge recall, Gemma4-31B has a genuine edge. It's also competitive on TAU3-Bench, so it's not a bad general agent either.
Gemma4-26BA4B — Cost-sensitive, low-stakes tasks Honestly hard to recommend on performance grounds. The only realistic case for it is if you're extremely cost/compute constrained and the task is simple enough that raw benchmark performance doesn't matter much. Don't use it for anything agentic or coding-heavy.
Quick reference:
OmarBessa@reddit
Amazing
Temporary-Roof2867@reddit
If this is true, I'm so happy 🤩🤩🤩🤩
God bless the MoEs!
l_eo_@reddit
Awesome stuff.
Gotta try it immediately.
Because of 27B, I was a bit jealous of the folk that were able to run dense models with good performance.
AvocadoArray@reddit
Let’s fucking go. Too bad I’m headed out of town and won’t be able to play with this until next week.
Durian881@reddit
Awesome!
Thrumpwart@reddit
Oh hell yeah.
garloebx@reddit
How much ram do I need to run this locally on a Mac mini/studio?
Long_comment_san@reddit
I HOPE they fixed that atrocious 1.5 presence penalty.
apeapebanana@reddit
what's up with the 1.5 presence penalty?
Long_comment_san@reddit
It's absolutely murderous for any sort of long-term chat, roleplaying for example. Presence penalty is great on paper but has a nasty drawback of not having a decay range, so it persists over the entire context range. To my knowledge, Qwen uses presence penalty to counter the looping that they currently have, which is an absolutely idiotic way of doing so, because DRY exists for this very purpose and is far superior, and it can also be configured to decay over time. I don't know how DRY isn't a staple penalty at this point. It's far superior to the trio of penalties most are familiar with.
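For anyone wanting to experiment, a rough sketch of a llama.cpp /completion request that leans on DRY instead of the heavy presence penalty; the parameter names below are the DRY options llama.cpp exposes as far as I know, but defaults and availability vary by build, so verify against your server's docs:

```python
# Sketch: sampling with DRY instead of a blanket presence penalty.
# Parameter names assume llama.cpp's server API; double-check your build.
import json, urllib.request

payload = {
    "prompt": "...",                # your chat-formatted prompt
    "presence_penalty": 0.0,        # drop the heavy 1.5 presence penalty
    "dry_multiplier": 0.8,          # enable DRY; 0 disables it
    "dry_base": 1.75,
    "dry_allowed_length": 2,        # allow short natural repeats
    "dry_penalty_last_n": 4096,     # limit how far back DRY looks
    "temperature": 0.7,
}
req = urllib.request.Request("http://localhost:8080/completion",
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
print(json.loads(urllib.request.urlopen(req).read())["content"])
```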
Robot1me@reddit
Similar reasons why SillyTavern doesn't have a node system with prompt chaining yet: People will do what they personally think is best and not acknowledge other approaches, but that narrow mindset leaves a lot on the table
Long_comment_san@reddit
Sillytavern really needs some coordinated help. At this point it has far too many things glued together randomly.
apeapebanana@reddit
I asked 3.6 to duplicate the sillytavern memory system with pi-coding-agent; can't quite verify if the memory system works, but it built and verified easily..
apeapebanana@reddit
thanks for explaining! maybe they 'don't fix it if it's working'?
I tried loading 3.6 into sillytavern; it feels like my character is speaking as if on speed and crack lol
while on the other hand gemma crashes from time to time but comes with great prose.
now running 3.6 for website building, and it's absolutely nailing it!
Long_comment_san@reddit
No, they haven't! Arghh!!!!
H_DANILO@reddit
I just tested this model, and yes, this is my new favorite.
I was running Qwen3.5 397b before (Q2) and I'm running this at Q8 with 60tps tg, and its agentic capabilities are REALLY up there. I set it on a somewhat complicated task and it has been ping-ponging and implementing the solution for 8 minutes straight, no stopping, no asking, just doing the stuff.
AWESOME.
tremblerzAbhi@reddit
What hardware are u using? Because 8 minutes could mean different time horizons depending upon your t/sec
H_DANILO@reddit
Rtx 5090 128gb ram ryzen 9900x3d
80 tps because I had not optimized anything up to that point
bernzyman@reddit
Has an oddly early knowledge cutoff date of 2024: Qwen3.5 and Gemma4 could identify an image of the current U.S. president, whereas Qwen3.6 only identifies him as a former president. Interesting curiosity
Due_Net_3342@reddit
the cutoff is around end of 2024 from my testing. A model will never "know" its cutoff unless it is specified clearly in the training data.
bernzyman@reddit
When I pushed the model, it stated that Trump is the current President and was re-elected for a 2nd term. It's some sort of weird glitch, not necessarily tied to a censorship filter, as I'm getting the same result using the Uncensored-HauhauCS-Aggressive version of the model
Due_Net_3342@reddit
it could also be that it was trained with more data from 2024 and earlier (which makes sense) and less recent data… this would cause it to give conflicting information for specific events and topics that changed recently (most of the time taking the route with more established information and other times the more novel path)
autoencoder@reddit
So strange. I wonder if the heretic version also does this.
RNSsports@reddit
I'm pretty new to LLMs... is anyone else having an issue running this model? I keep getting a "Failed to load model". I'm using a 5080. All other models work fine if I download them from LM Studio. This is the first one I've manually added into the models folder. I followed the same folder structure as the other models I downloaded inside LM Studio.
c64z86@reddit
I'm loving it!! Running it at Q8 quant from the RAM on my 64GB laptop at 30-35 tokens a second with 128k context, and it really punches above the older Qwen 3.5 27B and 35B and even gemma 4 26B. It created an entire beach with moving animals, moving clouds, accurate palm trees and even generated sounds, all in one webpage and in one go!!!
c64z86@reddit
I came back to update my experience of it.
When it works, it works beautifully and brilliantly and produces things much, much better than 3.5 or even Gemma 4 could, but when it fails, and for me it fails often, the result is much worse than any Gemma 4 prompt.
I'm only talking about HTML coding; I haven't tried Python coding or anything else, so I don't know what it's like there. But for me Gemma 4 one-shots things much better than 3.6 can.
I'd rather have a lower quality output and the thing actually working than a higher quality output and the thing not even working at all... so I'm going back to Gemma 4.
TexasBryan14@reddit
Do you have thinking on or off?
c64z86@reddit
Thinking on
year2039nuclearwar@reddit
Why does this show Qwen3.5 dense absolutely blowing gemma4 dense out of the water? In practice, that is not what I have noticed. Gemma4 seems to be a lot more capable at understanding long essay text
Sadman782@reddit
They generalize much better
R_Duncan@reddit
Sadly, gemma 4 can't code, like a lot of people just after graduating.
Sadman782@reddit
Not true, it writes better code for me. Can you give any example?
R_Duncan@reddit
I use C++ and some Python. In both these languages gemma4 was a tie with qwen3.5.
pneuny@reddit
That's for Qwen 3.5, not 3.6
CYTR_@reddit
What's the source ?
Sadman782@reddit
https://kaitchup.substack.com/p/gemma-4-31b-vs-qwen35-27b-inference
Kodix@reddit
Very, very interesting. Seconding the call for source, please.
Sadman782@reddit
https://kaitchup.substack.com/p/gemma-4-31b-vs-qwen35-27b-inference
Holiday_Bowler_2097@reddit
Quick quantization brain damage test. MMLU-Pro computer science (temperature 0.7, top-p 0.8, top-k 20, min-p 0, presence-penalty 1.5, enable_thinking false):
Unsloth's Q8_0 - 84.88
Q6_K - 83.41
Q4_K_XL - 82.93
R_Duncan@reddit
check mxfp4_moe please .... these Hybrid models are where that format shines.
Holiday_Bowler_2097@reddit
83.17. Knowing Unsloth's tendency to rush, I'm gonna redownload and retest their and alternative ggufs a couple days from now. That was just a quick test to check there is nothing unexpected with the new model
R_Duncan@reddit
Not unexpected for me that mxfp4 positions nearly like Q5_K_M. As I said, hybrid models seem to have issues with low quants (<6) and mxfp4 does not.
ResearchCrafty1804@reddit (OP)
VLM Performance: Qwen3.6 is natively multimodal, and Qwen3.6-35B-A3B showcases perception and multimodal reasoning capabilities that far exceed what its size would suggest, with only around 3 billion activated parameters. Across most vision-language benchmarks, its performance matches Claude Sonnet 4.5, and even surpasses it on several tasks. Its strengths are particularly evident in spatial intelligence, where it achieves 92.0 on RefCOCO and 50.8 on ODInW13.
TechySpecky@reddit
Can anyone check whether it's fixed the overthinking problem? I tried it before with thinking and it took SO long I had to turn thinking off
rpkarma@reddit
At least if you're running it locally, you have to set the parameters exactly as their model card suggests. It isn't trained with repetition_penalty, only presence, and that has to be set right amongst other things.
finevelyn@reddit
We have 20 replies with workarounds to the overthinking issue in Qwen 3.5, but no one checked if Qwen 3.6 fixed the issue. 💀
rpkarma@reddit
Mine's not a workaround so much as the actual setting you're supposed to use for the model, that it's trained on, shrug
I’ve not tried 3.6 35B yet because my 122B deploy on my Spark is honestly great and I can’t be assed to tear it down right now lol
Due-Project-7507@reddit
The overthinking is often caused by quantization, according to https://kaitchup.substack.com/p/qwen35-quantization-similar-accuracy. But I found that e.g. Gemma 4 with the same quantization method always thinks shorter and still gets good results compared to Qwen 3.5.
TechySpecky@reddit
weird, I use the FP8 instruct versions from Hugging Face via vLLM
Due-Project-7507@reddit
Then it is really the "normal" overthinking, it would be even worse with a smaller quantized version.
Skyline34rGt@reddit
Maybe:
"Qwen3.6 Highlights
This release delivers substantial upgrades, particularly in
fragment_me@reddit
What does that even mean? (the thinking preservation) Can someone spell it out?
DistanceAlert5706@reddit
The model sends reasoning_content in addition to the answer; on the client side you must return reasoning_content back. Same as how GLM models work, and I guess some others.
Familiar_Wish1132@reddit
just add --chat-template-kwargs "{\"preserve_thinking\":true}"
it should see its own thinking process, from what I understood
waitmarks@reddit
Does your setup give it access to any tools? I have noticed that as long as it has access to at least a few tools, it wont overthink.
Borkato@reddit
Similarly, if you don’t need tools, just send it a fake tool like “calculate_distance_to_the_sun” lol
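A sketch of what that dummy-tool trick could look like against an OpenAI-compatible local server (the tool name is the joke one above, and the base URL and model name are placeholders):

```python
# Giving the model one harmless dummy tool so it has an "out" and stops overthinking.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # assumed local server

dummy_tool = {
    "type": "function",
    "function": {
        "name": "calculate_distance_to_the_sun",
        "description": "Returns the current distance from Earth to the Sun in km.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",            # whatever name your server exposes
    messages=[{"role": "user", "content": "hello"}],
    tools=[dummy_tool],
)
print(resp.choices[0].message.content)
```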
Several-Tax31@reddit
This is so accurate
waitmarks@reddit
Even just a current timestamp tool is useful, especially with gemma 4, which won't believe you when you tell it the date.
Borkato@reddit
Good point! I just give it access to bash because I’m crazy 🤪
trycatch1@reddit
At least it no longer goes into a Dostoevsky-level self-reflection spin when you say "hi!" to it.
Nicking0413@reddit
The model card was really helpful and fixed it
Borkato@reddit
Just the sampler settings or?
Kodix@reddit
Can't you fix that to your liking yourself using a reasoning budget? Not *as* good as a model that's optimized for brevity in thought, but seems like a decent workaround.
FinBenton@reddit
Last time I tried reasoning budget, it just cuts the reasoning cold turkey right at the limit so it becomes incomplete, idk how much that affects the result though.
Borkato@reddit
Yeah, use something like
--reasoning-budget-message "…\nHmm, I've thought enough about this. I'll respond to the user now."
The …'s show it that it's switching gears better than just cutting it off.
NFSO@reddit
you could try injecting a message at the end of the cut with
--reasoning-budget-message ". Okay enough thinking. Let's just jump to it."
Although this can fail too, if for example you were inside of a parens, like: Aegyptus (western Nile delta
FinBenton@reddit
Yeah, I've got a pretty good system so I just used an unlimited budget and it was no problem.
Kodix@reddit
That's how it does that, yeah. And - given my limited understanding - it should be fine.
Reasoning works by stuffing the current context with tokens that align the model's output generation more closely to what is desired. Meaning that partial reasoning should absolutely be effective, still.
Borkato@reddit
I get the feeling updates will come out over a few days, but also use llama cpp’s “reasoning-budget” flag and “reasoning-message”.
AvidCyclist250@reddit
It converges fast enough for me. Does exactly what MoE is supposed to do. It's not a chatty, narratively driven chatbot.
keepthepace@reddit
Ok, I guess I need to get on the train with this one.
Will a q4 fit on 24GB VRAM?
Longjumping-Sweet818@reddit
On my machine it takes 30gb.
Imaginary-Unit-3267@reddit
Well, given that 3.5-35B struggles to understand that if objects on the screen are moving right, the viewport must be moving left (real coding problem I've had), I hope this is true...
CryptoLamboMoon@reddit
The benchmark positioning is interesting — they're explicitly comparing MoE vs Dense at the same active parameter count (3B), and the MoE is winning by a meaningful margin on agentic coding (Terminal-Bench 2, MCPMark). That's the architecture proof point they needed after Qwen3-30B got some criticism for inconsistent tool-use reliability.
Apache 2.0 on a model this capable is going to accelerate a lot of production deployments that were waiting on licensing clarity. The 35B total / 3B active footprint on consumer hardware is basically the threshold where it becomes viable for solo builders running inference servers on their own machines without cloud dependency.
This one's going in the stack for sure.
Useful-Shift-3688@reddit
It is strange that they haven't released any more models yet.
Is this all?
Technical-Earth-3254@reddit
Nice, I would like to know if it's able to surpass Qwen 3 Coder Next 80B in coding benchmarks. Have to test it later on
FoldOutrageous5532@reddit
Yeah that's my daily driver for a while now.
dabiggmoe2@reddit
Wait, correct me if I'm wrong, but I thought the Qwen3.5 27b and 35B-A3B already surpassed Qwen 3 Coder Next 80B in coding benchmarks?
Beginning-Window-115@reddit
the older Qwen 3.5 35b was on par with qwen3 coder next already
soyalemujica@reddit
That is wrong. 35B was behind Coder-Next by a big margin. I ran both on C++ and Coder-Next was amazing. 27B is superior though.
R_Duncan@reddit
Well, this should be approximately good as 3.5-27B, so it might be time to put Qwen3-Coder-Next on a retirement plan.
Several-Tax31@reddit
Don't think so, coder is significantly better imo.
Sensitive_Worry4633@reddit
The dense 27b model is far superior
spoonfulofchaos@reddit
Really huh? Have I been wasting money renting gpus for Coder Next? Thankful I found this!
Several-Tax31@reddit
Yeah, 27b > coder > 35b
ItsNoahJ83@reddit
In benchmarks, but absolutely not in practice.
grumd@reddit
Yep, 3.5 122B >= 3.5 27B > 3-Coder-next > 35B-A3B > 9B
benevbright@reddit
many people including me are seeing better experience with qwen3-coder-next in real tasks.
RedParaglider@reddit
Yea, I haven't found anything a 128GB machine can run that really beats qwen 3 coder next 80b in actual use, specifically for coding tasks. Benchmarks don't hold up to real use.
RedParaglider@reddit
But that was non MOE correct? So different beast.
dinerburgeryum@reddit
Nah both models in this case are MoE. Coder-Next was tragically underbaked, I have every expectation that continued training on the 3.5 models will yield better results even with a smaller total parameter count.
Beginning-Window-115@reddit
no they are both MOE
SourceCodeplz@reddit
No way, but I guess each with his own tests.
the__storm@reddit
I know it's unlikely to happen, but I would love an 80B-A3B 3.5/3.6.
m_mukhtar@reddit
Man, I would love to have an 80b but with a bit more active parameters; something in the 6b to 9b active range would be amazing. One can dream I guess
the__storm@reddit
The reason I'd want a very sparse model is because my DDR4 is slow - otherwise I'd just bump up slightly to 122B-A10B. (We all have slightly different hardware and would like an exact fit I guess lol.)
Hopefullyanonymous2@reddit
How much RAM does a 80 billion model take? Thought at that size you would be incredibly slow
the__storm@reddit
~50 GB, depending on the quant, plus 16 GB VRAM (you could make it work with 8).
The 80B-A3B is just about as fast as a 35B-A3B with fully offloaded experts - the compute and bandwidth requirements per token are pretty much the same, there are just more possible experts it can route to. Like 10-15 tok/s. (Of course with a 35B-A3B and a 16 GB GPU you could fit a good chunk of the experts into VRAM, so it'd be faster. On an 8 GB card you'd probably have to offload most/all of the experts and so would see ~the same performance.)
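For reference, a rough sketch of that experts-on-CPU split using llama.cpp's tensor-override flag (the model path is a placeholder and the flag spelling varies across versions; newer builds also offer --n-cpu-moe, so check --help first):

```python
# Sketch of "attention/shared layers on GPU, MoE expert tensors on CPU" with llama.cpp.
# Model path is a placeholder; flag names differ across llama.cpp versions.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "some-80B-A3B-Q4_K_M.gguf",   # placeholder model path
    "-ngl", "99",                        # put all layers on the GPU...
    "-ot", r"\.ffn_.*_exps\.=CPU",       # ...then push the MoE expert tensors back to CPU
    "-c", "32768",
])
```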
Hopefullyanonymous2@reddit
That makes sense, thanks for the explanation. Very much a noob on all this stuff.
When we are talking about tokens per second, early days that was "how quickly does a llm respond to you" but that was before reasoning. Now with reasoning, I assume there is a big difference between "read" TPS and write TPS? For instance I did some CC work the other day that took 850\~k tokens of input and did 10k tokens of output. If it was 100 TPS that would take 6 days? lol. Is that accurate, or are read tokens going to be "taken" faster than output? I don't even know if this question makes sense.
KURD_1_STAN@reddit
I hope they make it 80b a6b
ParaboloidalCrest@reddit
Yes please. It's been the unsung hero of coding for the last 75 days.
Interesting_Key3421@reddit
+1 let us know!
ItsNoahJ83@reddit
I would love that, but there's zero chance that is the case
aelma_z@reddit
3090 with Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf and 256k context - 110 tokens/s - speeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeed
koygocuren@reddit
Q4_K_XL doesn't fit with 256k?
aelma_z@reddit
That was misinformation on my end. 72k is the max with Qwen3.6-35B-A3B-UD-Q3_K_XL.gguf
aelma_z@reddit
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 Off | N/A |
| 75% 51C P8 19W / 370W | 22392MiB / 24576MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
Middle_Bullfrog_6173@reddit
Did no one read the blog to the end?
MuDotGen@reddit
24gb of vram is still out of my scope for now, so I hope they release smaller variants like 3.5
Objective-Stranger99@reddit
I'm running it with 8 GB VRAM (IQ4_XS unsloth)
MuDotGen@reddit
I thought the smallest quantized size was 4-bit precision and still required like 17gb? To my understanding, 3gb of active parameters would run it at like 3-4gb at any given time, but it still requires the standby expert parameters to be loaded in memory too, hence the extra space in VRAM.
I'm probably mistaken, but if it's something I could try running in llama.cpp at 8gb of VRAM, I'd love to hear more info.
Objective-Stranger99@reddit
My bad, forgot to mention RAM offloading (around 15 GB). My cpu supports avx512, which speeds up inference
MuDotGen@reddit
Ah, I figured it had to have some kind of offloading. Still worth trying maybe, but it would be slow on mine. It's a 16gb shared VRAM but 32gb total, so it can technically load up to 16gb (not realistically for a decent context window of course), but I doubt it would go at any decent speed.
PaceZealousideal6091@reddit
With 8gb vram and 32 gb ddr5 ram, I can run it with tg at 30 tps and pp at 400-500 tps with 32k context. About 24 tps for 128k context. These are very usable numbers.
Objective-Stranger99@reddit
I am getting around 20 tps with 256k context. Seems to match your numbers as well.
Don't you frequently hit the 32K ceiling though?
PaceZealousideal6091@reddit
But why would you use 256k context on your system? Btw, even 20 tps is still a usable number! I don't use 32k; I wrote 32k just for context. 128k is what I use, and 24 tps is a very usable number. It is more than enough for most use cases. You just need to be smart with how you manage your sessions. Ofc, it's possible that you have a longer-context use case. So you do you. But getting 20 tps at 256k is still a damn workable situation.
Objective-Stranger99@reddit
I spent several hours tweaking compile and runtime flags to get it to this point. I would have been pissed if I didn't get at least this.
Objective-Stranger99@reddit
How much RAM do you have? The only tensors which must stay on the GPU are the attention ones, which are small relative to the size of the model. How much RAM do you have for offloading?
tteokl_@reddit
Sorry I'm new but isn't that worse performance?
Objective-Stranger99@reddit
What do you mean?
Gloomy_Butterfly7755@reddit
With the unified memory of Apple Silicon 24gb is very doable.
ghostrmor@reddit
apex-mini quant for qwen 3.5 35 a3b was ~12gb, so if your gpu has 16gb vram, consider waiting for the apex-mini quant
droptableadventures@reddit
The HuggingFace page also refers to it as "the first open-weight variant of Qwen3.6" (emphasis mine), implying there will be more.
FrogsJumpFromPussy@reddit
Oh god this is wonderful news, thank you
harpysichordist@reddit
Let me bring attention to what they stated: "Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead."
This is a big deal because it can resolve a lot of the cache misses people were experiencing. Having to reprocess large parts of the prompt was destroying performance, since the prompt could change significantly from turn to turn due to the missing reasoning context.
finevelyn@reddit
Thinking preservation might be a good option to have, but I wouldn't consider it a fix to the cache miss issue, because it also has other tradeoffs. The cache misses can and should be fixed at the chat template and caching logic level, and it can be done without thinking preservation.
Only the latest assistant message and subsequent tool calls need to be reprocessed even without thinking preservation, when the caching logic is implemented correctly.
harpysichordist@reddit
Yup, I agree with that.
This isn't a complete fix for cache misses people were experiencing, and like I said it depends on how tools like OpenCode are messing with your prompt, some make more of a mess than others, but this change looks like it helps out in some situations (from my early testing with OpenCode).
cunasmoker69420@reddit
do we know how to enable this in llama.cpp yet?
harpysichordist@reddit
Looking at their instructions for the Chat Completions API, you would pass something like: "chat_template_kwargs": {"preserve_thinking": true}
harpysichordist@reddit
Specifically for CLI you would pass:
--chat-template-kwargs '{"preserve_thinking": true}'
If using a .ini file for router mode, use: `chat-template-kwargs = {"preserve_thinking": true}`
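As an illustration, here's a minimal Python sketch of passing that option through the Chat Completions endpoint. It assumes a llama-server instance on localhost:8080 whose build supports chat_template_kwargs (per the instructions above); the model name is just a placeholder.
import requests

payload = {
    "model": "qwen3.6-35b-a3b",   # placeholder; a single-model llama-server ignores this
    "messages": [{"role": "user", "content": "Continue the refactor from the previous turn."}],
    # keep reasoning context from earlier turns instead of stripping it
    "chat_template_kwargs": {"preserve_thinking": True},
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])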
human-rights-4-all@reddit
for llama-swap I use this config:
LinkSea8324@reddit
Nothing related to preserve_thinking in the qwen-code repo, besides being cited in this issue: https://github.com/QwenLM/qwen-code/pull/2820#issuecomment-4175593805
Nothing related to preserve_thinking in vllm code.
harpysichordist@reddit
It's in the chat template itself: https://huggingface.co/Qwen/Qwen3.6-35B-A3B/blob/main/chat_template.jinja
So that's why things like llama.cpp's chat-template-kwargs should work out of the box.
cunasmoker69420@reddit
yeah turns out I needed to RTFM. That seems to have done it for me
Imaginary-Unit-3267@reddit
*RTFLLM (get an AI to read the manual for you) :P
pkailas@reddit
I've been testing Qwen_Qwen3-32B-Q3_K_M.gguf against Qwen3.5-27B-Q4_K_M.gguf in performing code reviews of various projects.
1. RTX PRO 4000 Blackwell
2. 3.6 with a 64K context window is all I dared try
3. 3.5 with 128K context fit nicely
Results:
3.6 was 85 t/s but hallucinated and lied about results, got things wrong. But it did do well if I took the results it had and ran a deep dive on them as a second pass.
3.5 was slower at about 20 t/s, but didn't make hallucinations and didn't require a second pass.
The major difference was that I was unable to provide a big enough context window for the task at hand, and MoE is a "Jack of all trades, Master of none".
wowsers7@reddit
How much RAM do I need to run this model on CPU only on Windows 11?
autoencoder@reddit
On my Linux box without a GPU, llama.cpp is using 19G shared memory for the Q4_K_M
wowsers7@reddit
Ok thanks!
EnzioKara@reddit
With 16 GB RAM, low context, and a Q3 XS/KS quant. If you can use some of your VRAM, maybe 4GB, you can put the active layers on it and gain some context.
Icy_Anywhere2670@reddit
It will be so slow.
Iory1998@reddit
To be honest, Gemma-27B-A4B is not really good. The 31B variant is, though.
Local-Cardiologist-5@reddit
I wanted to love that model so much, but when I compare it with my Qwen3.5 35B-A3B, it's so lazy.
I asked it to change a date setting in our project and it literally only changed the date environment file, searched everything, and called it a day.
Qwen knew to change the environment file and 9 other file templates and service files, as well as a PDF generator which had its own date format, fixed it, and created a custom date parser, which is exactly what I needed.
Which is honestly so impressive compared to Claude, which had the same solution as Qwen but also thought to remove material date formatters.
The Gemma model was such a disappointment, and I ran the request at least 6 times and all 6 times it's been terrible.
Downloading the 3.6 model now, I'm extremely excited.
DOAMOD@reddit
You're the first person I've seen who thinks the same as me. It surprises me how much I read about people defending Gemma 4 when it's so lazy. It's exactly the definition I've been thinking about for days every time I use it: it doesn't want to do anything. It even admitted it to me, saying it's more of a conversational model. What a surprise, it's a chat model, yes, very intelligent, and writes very well, but it's not your coworker. You know what else it told me? It told me to go to YouTube or Google and search for the information myself. OMG, never in my life has a model told me to look up the information myself, hahaha, this was incredibly fun.
CircularSeasoning@reddit
Niiiiice!
With my limited download capacity, I was about to download Gemma 4 but now I will get this instead.
This time I will go with Q6 because Q4 on Qwen3.5 35B feels a little bit like I am disrespecting the power of the original weights.
DistanceAlert5706@reddit
Q6 was the only working quant for me. Q4, Q5, any quant I tried on the old model was looping and failing tool calls; only Q6 worked.
CircularSeasoning@reddit
Good to know! I get occasional loops sometimes on Q4 but it's been otherwise quite usable. I think I will happily trade a bit of speed for more surety in the quality of responses, now that I've battle-tested the model and very much enjoy it.
Corosus@reddit
"E:\dev\git_ai\llama.cpp\build\bin\Release\llama-server" -m D:\ai\llamacpp_models\unsloth\Qwen3.6-35B-A3B-UD-Q4_K_XL_v1.gguf --host 0.0.0.0 --port 8080 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ngl 99 -ts 28,20 -sm layer -np 1 --fit on --fit-target 2048 --flash-attn on -ctk q8_0 -ctv q8_0 -c 50000
latest llama.cpp, opencode 1.4.0
Its actually doing its job and not endlessly failing tool calls like every other moe ive tried. Hell yeah.
90tps for quick test and 75tps for opencode with my 5070ti/5060ti setup
Ancient-Celery-4293@reddit
Thank you. I was struggling to find settings that wouldn't make the model crash. This one seems to be going strong at a task for about 30 min, which previously I was having trouble getting it to do; I had to break my tasks into very small pieces for it not to crash. This one just handled 22k lines of code in under a minute and is now building a plan. Thank you again for sharing this.
SensitiveVariety@reddit
I love how the standard is "doing its job and not endlessly failing tool calls" but it really does sum up the experience well. So far, it has been really reliable compared to Gemma 4 with tool calling and looping.
Borkato@reddit
Shiiiiit I’m so excited. Trying it right this second
Furacao__Boey@reddit
Didn't qwen 3.6 - 27b win the voting to be open source?
DisturbedNeo@reddit
Better to release 35B first, get all the template / llama.cpp issues ironed out, and *then* release 27B. If that's the model everyone's planning to download, it's also the model the Qwen team are going to want to make sure works flawlessly.
butihardlyknowher@reddit
I mean logically, why would you expect them to release the one with the most demand for free?
I understand that was their implication, but the autist in me has to point out the incentives.
AvocadoArray@reddit
That poll was just for marketing and engagement. I’m sure we’ll get all of them in due time.
No-Refrigerator5998@reddit
I know which model I am using tonight !
waddaplaya4k@reddit
What kind of hardware do I need for this?
So I can run it locally—with good performance :)
pneuny@reddit
Depends on quant. 16GB is minimum viable at low quants (2-3 bit) but also REAP can reduce parameter count, so when REAP quants come out, Q4 might be viable on 16GB.
waddaplaya4k@reddit
According to the website, you need about 32 MB of VRAM for it to run somewhat smoothly.
But what does “smoothly” mean? I use Claude Code Opus a lot—is the speed comparable?
If I want to use it almost exclusively for programming, how does the quality compare to Claude Code Opus?
Or is it unfortunately much worse, because Claude Code Opus thinks much more clearly?
pneuny@reddit
32 MB? There's no way you could run any modern language model on that. Perhaps they meant 32 GB. Generally, 32 GB is recommended for this model, but with some compromise (quantization and perhaps reap), 16 GB is doable.
waddaplaya4k@reddit
ahhhh sry :) 32 GB :)
RouterAPI@reddit
Run the Same AI Models at ~80% Lower Cost
Craftkorb@reddit
Interesting deviation from the previous status quo. Will have to check if that means they fixed overthinking, otherwise it'll eat even more tokens than ever before.
Robot1me@reddit
That and the tendency of the Qwen models to interpret anything and everything as a "possible jailbreak", makes it (in my view) a poor cost choice compared to Gemma 4
Plus_Two7946@reddit
Interesting release. The 3B active params with 35B total is exactly the architecture I've been waiting to see more of from Qwen, because it means you can run this on hardware that would otherwise choke on a dense 35B model.
I've been running my own agent infrastructure on Hetzner with Docker, and a model at this active-parameter footprint could realistically sit alongside a Fastify backend without needing an A100. The multimodal thinking/non-thinking toggle is also a smart call for agentic pipelines, because you don't want a reasoning loop firing on every trivial tool call, only on the steps that actually need it.
What I'd want to test first is how it holds up as an orchestrator in a multi-agent setup, specifically whether the sparse activation causes any latency spikes under concurrent requests compared to a dense model of similar active size. If the agentic coding benchmark numbers hold in practice, this could be a serious local alternative to hitting the Claude API for code-generation subtasks.
Empty_Bus9742@reddit
Hardware requirements to run locally?
baddhabbits@reddit
they even managed to make fake numbers for gemma 4 31b
relmny@reddit
And yet many people were saying, just a few days ago, that the last open weight model from qwen was 3.5 and so on...
Go Qwen!!!
Early_Play_1259@reddit
We use it in theranger.ai
Super strong and effective
One_Key_8127@reddit
"Across most vision-language benchmarks, its performance matches Claude Sonnet 4.5, and even surpasses it on several tasks"
Well, it surpassed Sonnet 4.5 on all the quoted benchmarks. Benchmarks are crap, but it looks very promising. Anyone knows if MLX fixed prompt caching for Qwen3.5? It was bugged before, making it a bad option for agentic use on Mac.
Dry_Syllabub_7570@reddit
I don't think the mlx prompt caching has been fixed yet. Was super bummed to encounter it last week. Tried running Qwen 3.5 and Gemma4 through MLX in Opencode, had to process the same 11K token prefix on every single call
SilentScribe42@reddit
Prompt caching issue got fixed in LM studio mlx engine recently.
mr_il@reddit
I use Qwen3.5 on MLX and also tried 3.6 just now with OpenCode, didn't notice any problems.
tredbert@reddit
I thought Gemma4 was pretty unimpressive for coding compared to Qwen3.5. Nice to see that validated here.
Looking forward to trying out Qwen3.6!
Are there are benchmarks on how it compares to the latest Sonnet and Opus?
Foreign-Bedroom-3063@reddit
Why still no 14b? It would be the sweet spot for my inference pipeline.
Predictor12@reddit
How can I run this locally? I want to use Hermes but it said I needed a provider (OpenRouter, to be exact). How can I run it fully offline on my PC? (I'm just starting with LLMs.)
danigoncalves@reddit
I am running out of disk guys!!
DominusIniquitatis@reddit
A3B is the key here. I ran Qwen Next 80B A3B without much problem at 12GB 3060.
danigoncalves@reddit
Really? What config do you use, and how much system RAM do you have? With the latest llama.cpp and the common parameters I can get 15 t/s. It's not bad, but I guess for long coding tasks it could be a little slow.
DominusIniquitatis@reddit
Default config (aside from the recommended sampling parameters), 32GB DDR4-2666, various 4-bit quants, 64k context. Yep, I've also been getting around 15-25 t/s, but that's not an issue for my use cases, given that I don't use LLMs for coding (too messy for my taste, so still doing everything by myself for now).
danigoncalves@reddit
I use it as a rubber duck and an inverted AI pair programmer, but yes, I have to review every single line of it, and for the most complex tasks these models still don't grasp the quality demands so well.
pneuny@reddit
Maybe REAP version at a low quant when it arrives?
danigoncalves@reddit
Hum... you are right maybe I can make something from that.
Hugi_R@reddit
Gave it a try, and I'm not impressed.
Plugged it into a code agent, in "ask" mode. Simple prompt, "How to add a second lib to this Rust project", and it immediately tried editing files. In code mode, it hallucinated the answer, desperately trying to debug invalid Cargo syntax (it tried a bonkers [[lib]] syntax, which is valid for [[bin]] but not lib, and got mad at cargo when it failed, convinced it was correct).
Gave the same task to Gemma 4, which handled the task like a champ. (properly set up a cargo workspace).
Qwen: Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Gemma: gemma-4-26B-A4B-it-UD-Q4_K_M.gguf (smaller model, bigger quant while keeping the same context)
Due_Net_3342@reddit
It is expected; qwen has only 3B active vs 4B, so it is more prone to quantisation quality loss. Use Q8 and tell us what you get.
Hugi_R@reddit
Can't use Q8, that won't fit on the GPU. All models I use must fit in 24GB of VRAM with 100k context. Otherwise I have no use for them.
kmp11@reddit
All morning I have been trying to get Hermes and Gemma 4 31B to look at the menu of my local sandwich shop and tell me the daily specials, and they failed over multiple tries. Qwen3.6 was able to list the specials and place the order on the first try.
It allows me to use a much higher precision model while getting ~120 tk/s instead of ~15 tk/s (average).
Is this a scientific test? No, but a sandwich manifested itself, and that's already a win Gemma never had. It's worth using for the next day or two until the next better model drops.
computehungry@reddit
I also saw it messing up text recognition with default settings. I think, by default, gemma uses less tokens per image. You can set this to be higher. If you're using llama.cpp, try: --image-min-tokens N (I like N=300). image max tokens has to be set to be higher than N too. Gemma has nailed my ocr tasks after I changed this.
kiwibonga@reddit
Anthropic and OpenAI are so cooked.
It's so hard not to gloat in the "boohoo claude ate my tokens" threads when 99.99% of what they use it for can be achieved by 27B on $1000 worth of GPU.
Qual_@reddit
Cooked nothing, you mean.
People who will spend a thousand dollars' worth of GPU instead of using the SOTA models are so niche that it's a rounding error in their revenue streams.
Awkward-Reindeer5752@reddit
The only revenue stream that matters is enterprise customers. Most already use public cloud providers offering long-term leases on dedicated GPU instances capable of running models like qwen 3.6-397b for their many users at < 1/10th their Anthropic API bill. Anthropic and OpenAI’s only hope for a moat is regulatory capture.
Piyh@reddit
I regularly watch people at work run Claude Code queries that individually cost $10 to $15.
bnightstars@reddit
I was actually testing Claude Code to build an AI web research agent in Python using Sonnet 4.6 with AWS Bedrock, and the end result cost $10 in tokens.
Spectrum1523@reddit
This is what you think most enterprise customers are doing? Show me even a single one doing this.
Awkward-Reindeer5752@reddit
Where do I say enterprise customers are currently doing this? Sparse attention all over this bitch.
Spectrum1523@reddit
You literally said this
Awkward-Reindeer5752@reddit
Most enterprises do use one or more of AWS, Azure, or GCP. Because they are already doing so, it will become enticing for them to roll their own inference as part of their existing infrastructure when the cost savings vs. potential time/accuracy loss is undeniably favorable.
yaboyyoungairvent@reddit
I don't know what location or industry you're in but all I ever see at companies is either people running claude or copilot.
Awkward-Reindeer5752@reddit
I was being forward looking based on the progress open models are showing, especially at >200b parameters. I’m using claude code at work at opus api rates right now, but regularly test agentic coding with open weight models. We will have opus 4.7 equivalent open models in 2027 and I don’t see us paying over raw compute costs at that point.
jld1532@reddit
Buddy, I work for a multi-billion dollar entity, and we run nothing but oss models. Kimi K2.5 and MiniMax 2.7 are the big ones. There'll be a learning curve for people, but open weights are going to make a dent, and I suspect a big one.
kiwibonga@reddit
We're talking about customers that are paying $2400/year, in some cases, paying that much to both Anthropic and OpenAI, or holding multiple accounts, or accepting the insane premium API billing.
It's like if renting a ferrari for 4 months cost the same as buying a honda civic.
And it's like ferraris are expensive because ferrari gives away 10,000 free test rides for every 1 person that buys a car.
Main_Secretary_8827@reddit
Sadly not true; people who go out and buy Claude plans usually know what they need and do. Maybe for GPT users, perhaps.
TinyZoro@reddit
What’s the lowest spec Mac mini this could comfortably run on?
jacek2023@reddit
Fantastic news. 27B won the voting so let's hope all sizes will be released
coder543@reddit
yeah, it won the voting by a wide margin... yet this is what they chose to focus on? Funny/concerning.
I really want a new Qwen3.6 122B A10B model for my Spark.
mrrizzle@reddit
You assume the vote meant they would release the most anticipated first
nullmove@reddit
The 80B qwen3 next hybrid MoE took 9.3% of the compute to train compared to the 32B qwen3 dense model.
If they were doing both with equal priority, this one was always going to be far quicker to finish training. Obviously I don't know their plans, just casting doubt on the notion that they "focused" on this instead just because it came out the door first.
ROS_SDN@reddit
I would think the compute time would be a function of total parameters and active parameters, not just active.
Could you explain this to me a bit more since your number is nearly exactly 3/32?
Very interesting. I just didn't realise it was that computationally efficient, because I've felt small-to-medium MoEs have been a bit too sparse as of late, or that they should have a less sparse counterpart.
Like, a 35B A6B would be lovely, but when you say that would take 2x the compute for real, albeit marginal, gains in intelligence, I'm very likely to back track on this thought a little.
nullmove@reddit
I nicked the numbers from their blog post.
I think that's just a weird coincidence.
The standard function is C = f(D, F), where F is FLOPs per token, which is (roughly) a function of total params for dense models but only active params for MoEs (since each token is only routed to a subset of experts). The other important variable we must consider is D, the total number of tokens processed. Qwen3 32B was trained on 36T tokens, but Qwen3-next was only trained on 15T tokens. I presume they stopped there because it was an experiment and had already achieved the desired training goals and performance on downstream tasks.
A rough rule of thumb is C = 6 * D * P (P_active for MoE). Which means Qwen3-next A3B should have been ~4% of the compute of the 32B dense model (assuming some of the other factors like number of layers, sequence length etc. are the same), but there is also a significant MoE overhead that depends on a bunch of other things, and here total parameters can play a role (say you might need to split training across a number of GPUs, then you have communication overhead). Also how well the experts learn matters (load balancing with aux loss), and here the hybrid arch probably played a role too. No clue about the exact breakdown of numbers, but that's probably how it was ~9%.
But anyway, the broad picture is that MoEs are appealing because compute cost grows only with the active params and top-k experts, but model capacity grows with the active params and the total number of experts (E). Since usually E > k, it's a good trade-off. That said, in practice training MoEs is pretty hard as they introduce MoE-specific issues, like needing to load balance the router carefully or else expert collapse happens, and this might be one of the areas where frontier companies have secret sauce/experience which makes a massive difference. I haven't trained any model though, so I wouldn't really know lol.
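To make that arithmetic concrete, here's a tiny sketch using the C = 6 * D * P rule of thumb (token counts taken from the comment above; it ignores the MoE and communication overhead, which is why the real figure lands nearer 9% than 4%):
def train_flops(tokens, params):
    # rule-of-thumb training compute: C = 6 * D * P
    return 6 * tokens * params

dense_32b = train_flops(36e12, 32e9)   # Qwen3 32B dense, 36T tokens
moe_a3b   = train_flops(15e12, 3e9)    # Qwen3-Next, 15T tokens, ~3B active params
print(f"MoE / dense compute ~ {moe_a3b / dense_32b:.1%}")   # ~3.9%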
Borkato@reddit
This makes me think that we could have such insane MoE models if they really try even harder haha
nullmove@reddit
MoE training has a bunch of complications though, it's not just about compute. The gated network/router needs careful tuning for load balancing, otherwise experts collapse. There are other uniquely MoE related training instabilities to solve, and these challenges increasingly get harder the bigger the model you are training.
Qwen is really good at getting very good bang per buck up to a size. But every time they try to scale beyond that, it turns out pretty suboptimal so they stop. They still make decent progress each generation, probably through data pipeline refinement alone. But arguably the 3.5 series MoEs were kinda underwhelming at big sizes due to those issues.
That being said it does feel like they are pushing the envelope with 3.6 again. The big one (that they decided to close off) seems to be competing favourably with GLM-5 which is twice its size. Which makes me bullish about them, but again up to a limit.
cafedude@reddit
The voting thing was just a marketing move.
Borkato@reddit
I’m so happy, I wanted 35B 😂
soyalemujica@reddit
35B is easier to make than 27B dense.
stan4cb@reddit
If they released 27b first and then 35b, the 35b would be underwhelming; this way they keep us waiting.
I'd prefer 27b tho
Beginning-Window-115@reddit
probably training it longer since it's the most anticipated one
Beginning-Window-115@reddit
obviously we're gonna get both if not more
coder543@reddit
That is not obvious at all. The new management has to prove themselves after the old ones left.
Darkoplax@reddit
waiting for 9b
Much-Researcher6135@reddit
YES gimme that density
coder543@reddit
yeah, by a wide margin... yet this is what they chose to focus on? Funny/concerning.
I really want a new Qwen3.6 122B A10B model for my Spark.
genzpepega@reddit
I'm a noob. What program should I use to run this? Does it actually matter?
Future-Coffee8138@reddit
LM studio has GUI so could be a good starter. Just ask any AI to guide you through.
iMrParker@reddit
I daily 122b. I'll give it a shot and see how it compares
raveschwert@reddit
What's is your machine made of ?
Late_Film_1901@reddit
I'm running it on Strix Halo and it is very much usable. Qwen 3.5 122B beats every other smaller model for my use cases. I didn't have time to compare to MiniMax 2.7 at Q3.
SpicyWangz@reddit
Are you doing Q4? Mine never seems to work on opencode, it endlessly triggers compaction as soon as it tries to edit
Late_Film_1901@reddit
I haven't used opencode. The description sounds like context running out. I can try over the weekend.
AvocadoArray@reddit
Not OP, but 122b is very capable at 4bpw. 48GB VRAM + some CPU offloading will get you there, or 72GB+ in full VRAM with a good amount of context.
That said, I still prefer 27b at FP8 when it comes to complex tasks or coding.
spoonfulofchaos@reddit
Damn ! Seriously?! Is 27b that good? I can’t believe I haven’t tried it yet
Low-Boysenberry1173@reddit
I'm also using Qwen3.5 122B as a daily driver, with openclaw and for coding. But would you still prefer the 27b over the 122b MoE? In my head I want to utilize my hardware, so I feel uncomfortable using just the 27b. Any experience with agentic tasks?
AvocadoArray@reddit
Yeah, I thought the same and started out with 122b using Roo Code and Pi coding agent. It was great, but got hung up on a long complex task that required a lot of pre-planning. It sort of got it working, but made a mess of the codebase and it wasn’t very elegant at all.
I decided to throw it at 27b to see how it stacked up, and the difference was night and day, at least in the planning phase. The planning document was much more detailed and broken down into more rational steps so it didn’t have to figure things out on the fly.
As far as pure coding ability, it’s roughly the same. But the planning and reasoning is much cleaner, and it’s able to get itself out of loops or dead ends easier.
Since then, I run it in FP8 with ~130k context and it only takes up 60% of my 96GB VRAM, which is great because it leaves plenty of room for STT/TTS, image gen or whatever else I’m playing with at the time.
Late_Film_1901@reddit
Did you test capabilities between Q4 and Q8 ? I'm wondering how much can be gained by going with finer quantization.
iMrParker@reddit
I run it with experts offloaded to the CPU on a 5080 with 96 GB of DDR5. And I run Qwen3.5-27b on a 3090. I built this machine in Feb of 2025. If I had known what was coming...
I run Q4 quants for both because I use a 100k context window. 122b runs around 15tps and 27b runs at ~40tps.
I'm strongly considering getting an RTX Pro 5000 48gb to replace the 5080
grunt_monkey_@reddit
Im running 4x9700 and getting pp 230 t/s and decode 31 t/s on llama.cpp running the UD-Q6_K_XL.
sdexca@reddit
Curious as well.
Borkato@reddit
They answered in reply to their comment
colemab@reddit
RAM and the souls of the dead /s
Its-all-redditive@reddit
I’m using 122b nvfp4 in running projects so I would love to know your opinion.
AppealSame4367@reddit
I knew it was Christmas already. Saw a deer yesterday!
Euphoric_Emotion5397@reddit
LOL. I was just starting to download the Gemma 4 Opus distilled mixed. Hope this comes out in LM studio fast.
jadbox@reddit
Will there be a smaller parameter version for the GPU poor? (using q2 bits is generally unstable)
c64z86@reddit
If you have a fast enough CPU and enough RAM(64GB) you can load it in llama.cpp and run it from the RAM instead of VRAM. And with it being MoE with 3B active it will use less VRAM than a full dense 35B model would anyway.
Caffdy@reddit
DDR4 or DDR5? Full RAM or with VRAM into the mix?
c64z86@reddit
DDR5, it also fills up nearly 12GB of my RTX 4080 mobile so it's being used as well, but the CPU and RAM are doing most of the heavy lifting.
Free_Change5638@reddit
already abliterated this one if anyone wants an uncensored version: https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated
Free_Change5638@reddit
been playing with it today. ran abliteration on it already — https://huggingface.co/wangzhang/Qwen3.6-35B-A3B-abliterated. quality feels preserved, refusals mostly gone.
work__reddit@reddit
Seemed to overthink a bit more than 3.5, I will stick with 3.5
=== Testing: qwen3.5:35b-a3b-q8_0 ===
🔥 Warming up model (may take 2-7 minutes)... ✅ Ready
reasoning [1/4]: A rectangular pen is built with one side against a barn, 200...
Run 1 → 36.229s | tps: 59.29 | answer: correct | code: n/a
Run 2 → 51.336s | tps: 59.49 | answer: correct | code: n/a
Run 3 → 36.458s | tps: 59.33 | answer: correct | code: n/a
reasoning [2/4]: Janet's ducks lay 16 eggs per day. She eats 3 for breakfast ...
Run 1 → 13.041s | tps: 58.05 | answer: correct | code: n/a
Run 2 → 16.199s | tps: 58.52 | answer: correct | code: n/a
Run 3 → 15.749s | tps: 58.35 | answer: correct | code: n/a
reasoning [3/4]: How many letter r's are in the word 'strawberry'?...
Run 1 → 6.701s | tps: 56.86 | answer: correct | code: n/a
Run 2 → 6.714s | tps: 56.75 | answer: correct | code: n/a
Run 3 → 6.709s | tps: 56.79 | answer: correct | code: n/a
reasoning [4/4]: Alice rolls a fair n-sided die (faces 1 to n) and Bob rolls ...
Run 1 → 138.022s | tps: 59.35 | answer: incorrect | code: n/a
Run 2 → 132.685s | tps: 59.39 | answer: correct | code: n/a
Run 3 → 130.022s | tps: 59.40 | answer: correct | code: n/a
coding [1/3]: Write a single example of runnable Python code to reverse th...
Run 1 → 21.041s | tps: 58.98 | answer: n/a | code: correct
Run 2 → 19.842s | tps: 58.97 | answer: n/a | code: correct
Run 3 → 19.748s | tps: 58.99 | answer: n/a | code: correct
coding [2/3]: Create a single runnable Python script with a function that ...
Run 1 → 6.828s | tps: 56.39 | answer: n/a | code: correct
Run 2 → 6.760s | tps: 56.80 | answer: n/a | code: correct
Run 3 → 6.975s | tps: 56.92 | answer: n/a | code: correct
coding [3/3]: Inside a single executable example usage python script that ...
Run 1 → 24.306s | tps: 59.04 | answer: n/a | code: correct
Run 2 → 26.988s | tps: 59.25 | answer: n/a | code: correct
Run 3 → 23.345s | tps: 59.16 | answer: n/a | code: correct
✅ Benchmark complete → benchmark_results.csv
=== Testing: qwen3.6:35b-a3b-q8_0 ===
🔥 Warming up model (may take 2-7 minutes)... ✅ Ready
reasoning [1/4]: A rectangular pen is built with one side against a barn, 200...
Run 1 → 35.836s | tps: 59.30 | answer: correct | code: n/a
Run 2 → 46.526s | tps: 59.49 | answer: correct | code: n/a
Run 3 → 37.122s | tps: 59.34 | answer: correct | code: n/a
reasoning [2/4]: Janet's ducks lay 16 eggs per day. She eats 3 for breakfast ...
Run 1 → 13.727s | tps: 58.13 | answer: correct | code: n/a
Run 2 → 13.932s | tps: 58.35 | answer: correct | code: n/a
Run 3 → 20.359s | tps: 58.94 | answer: correct | code: n/a
reasoning [3/4]: How many letter r's are in the word 'strawberry'?...
Run 1 → 11.986s | tps: 58.40 | answer: correct | code: n/a
Run 2 → 9.344s | tps: 57.90 | answer: correct | code: n/a
Run 3 → 6.174s | tps: 56.53 | answer: correct | code: n/a
reasoning [4/4]: Alice rolls a fair n-sided die (faces 1 to n) and Bob rolls ...
Run 1 → 137.788s | tps: 59.45 | answer: incorrect | code: n/a
Run 2 → 137.781s | tps: 59.46 | answer: incorrect | code: n/a
Run 3 → 137.806s | tps: 59.45 | answer: incorrect | code: n/a
coding [1/3]: Write a single example of runnable Python code to reverse th...
Run 1 → 24.354s | tps: 59.25 | answer: n/a | code: correct
Run 2 → 37.164s | tps: 59.52 | answer: n/a | code: correct
Run 3 → 36.608s | tps: 59.52 | answer: n/a | code: correct
coding [2/3]: Create a single runnable Python script with a function that ...
Run 1 → 37.335s | tps: 59.57 | answer: n/a | code: correct
Run 2 → 39.637s | tps: 59.49 | answer: n/a | code: correct
Run 3 → 38.342s | tps: 59.52 | answer: n/a | code: correct
coding [3/3]: Inside a single executable example usage python script that ...
Run 1 → 59.513s | tps: 59.60 | answer: n/a | code: correct
Run 2 → 62.388s | tps: 59.66 | answer: n/a | code: correct
Run 3 → 58.374s | tps: 59.63 | answer: n/a | code: correct
✅ Benchmark complete → benchmark_results.csv
🏆 MODEL RANKING (Based on LAST 3 RUNS)
Score = CorrectAnswers + 10/Latency
=====================================================
Rank Model Reasoning Coding Latency(s) Avg TPS Score
──────────────────────────────────────────────────────────────────────────────────────────
1 qwen3.5:35b-a3b-q8_0 11 9 35.509 58.38 20.3
1 qwen3.6:35b-a3b-q8_0 9 9 47.719 59.07 18.2
Here is the question qwen3.6 failed with.
"Alice rolls a fair n-sided die (faces 1 to n) and Bob rolls a fair m-sided die (faces 1 to m). n is the smallest composite number. m is the smallest composite number greater than n that is coprime to n. What is the probability that the sum of their rolls is a prime number? Express the answer as a simplified fraction a/b, and output the final answer as the value of a+b."
2 answers were blank, was the last correct or incorrect?
# cat outputs/qwen3.6_35b-a3b-q8_0_reasoning_p3_run1.txt
# cat outputs/qwen3.6_35b-a3b-q8_0_reasoning_p3_run2.txt
# cat outputs/qwen3.6_35b-a3b-q8_0_reasoning_p3_run3.txt
To find the probability that the sum of the rolls is a prime number, we first determine the values of $n$ and $m$ and then analyze the possible outcomes.
1. Determine $n$ and $m$
* $n$ (Smallest composite number): The smallest composite number is 4. Thus, $n = 4$.
* $m$ (Smallest composite number $> n$ coprime to $n$): We check integers greater than 4:
* 5 is prime.
* 6 is composite, but $\gcd(6, 4) = 2$.
* 7 is prime.
* 8 is composite, but $\gcd(8, 4) = 4$.
* 9 is composite and $\gcd(9, 4) = 1$.
Thus, $m = 9$.
2. Sample Space and Favorable Outcomes
Alice rolls a 4-sided die ($A \in {1, 2, 3, 4}$) and Bob rolls a 9-sided die ($B \in {1, 2, \dots, 9}$).
* Total outcomes: $4 \times 9 = 36$.
* Sum range: $1+1=2$ to $4+9=13$.
* Prime sums in range: 2, 3, 5, 7, 11, 13.
We count the pairs $(A, B)$ that result in these prime sums:
* Sum = 2: (1, 1) — 1 outcome
* Sum = 3: (1, 2), (2, 1) — 2 outcomes
* Sum = 5: (1, 4), (2, 3), (3, 2), (4, 1) — 4 outcomes
* Sum = 7: (1, 6), (2, 5), (3, 4), (4, 3) — 4 outcomes
* Sum = 11: (2, 9), (3, 8), (4, 7) — 3 outcomes
* Sum = 13: (4, 9) — 1 outcome
Total favorable outcomes: $1 + 2 +
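For reference (not part of the original comment), a quick brute-force check of what the answer should be: n = 4 is the smallest composite, m = 9 is the smallest composite above 4 coprime to 4, and enumerating all 36 outcomes gives 15 prime sums, i.e. 5/12, so a+b = 17.
from fractions import Fraction

def is_prime(x):
    return x > 1 and all(x % d for d in range(2, int(x**0.5) + 1))

n, m = 4, 9   # smallest composite; smallest composite > 4 that is coprime to 4
hits = sum(is_prime(a + b) for a in range(1, n + 1) for b in range(1, m + 1))
p = Fraction(hits, n * m)
print(p, p.numerator + p.denominator)   # 5/12 17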
Thrumpwart@reddit
This looks really interesting - Qwen3.6 RYS model with 10 duplicated layers.
https://huggingface.co/DJLougen/Ornstein3.6-35B-A3B-RYS-GGUF
Claims significant reasoning improvement. Downloading now.
FoundationFirm6934@reddit
Great work
wakaokami@reddit
Not sure if this is the right place to ask, but I used to be a heavy user of Claude Code. Recently, the usage limits have made me reconsider, so I’m looking into running local LLMs.
Qwen 3.6 looks promising, and I don’t necessarily need something state-of-the-art.
Do you have any recommendations on what kind of hardware I should be looking at, or any useful resources to get started?
My main use cases are coding and generating research ideas.
PlatypusMobile1537@reddit
c64z86@reddit
I'm updating my experience of it. Qwen 3.6 is fantastic, when it gets something right it really gets it right and the quality is well beyond even Gemma 4 26B.. but it also gets things wrong a lot of the time.
AdUnlucky9870@reddit
3b active params doing coding on par with models 10x the size is wild. been running qwen3.5 for a few weeks and it already punches way above its weight, cant wait to see what the quants look like on this one
Phaelon74@reddit
Their benchmarks are a bit misleading, as that looks to be gemma4-31b non-it. Would love to see where gemma4-31b-it is on that graph.
fastlanedev@reddit
Just tried it on Qwen Chat, very disappointed. Endless thinking loops, can't do a simple comparison pulling in benchmark data on itself, more thinking loops, etc., doing things I explicitly said not to do, spending thinking tokens on lecturing me about model capabilities. It couldn't even find Qwen 3.6 35b A3b when I spelled it out.
Maybe it's the chat harness, but that's pretty disappointing considering the team that developed it should have that under control.
May try it later on a simple harness like pi
paq85@reddit
It works really good, but I'm facing lots of tool calling issues when used via opencode and used context goes above 100k... Anyone solved this perhaps?
jstraj@reddit
I am having really good results on my Nvidia 4070 Super (12 GB) with 32 GB RAM. I've only tested it lightly, but I am getting somewhere between 43-52 t/s depending on the prompt.
Here's my config:
Although, I am getting the best performance of about 52 t/s with --n-cpu-moe=17, but that is only possible with a short context size (16k).
Xyrus2000@reddit
I can second this. I'm running on a 4080 super with an Intel Ultra 7. I have a similar setup in LM studio, and I'm hitting around 66 t/s sustained. I use the "experimental" option of forcing the expert layers to the CPU (set to 20).
I'm going to put it through its paces with some coding tests and see how it does.
jstraj@reddit
Since your VRAM is bigger (16 GB vs 12 GB), you can also test using lower values for `n-cpu-moe`. I think you'll get better t/s.
Try between 10 - 18.
GeorgeTheGeorge@reddit
What are you running it with? I've been thinking things couldn't get much better than Gemma 4 26b A4 with LM Studio on a 4080, so what you're saying is very exciting.
jstraj@reddit
I am running this using llama.cpp using ini file. The config is provided above.
--Rotten-By-Design--@reddit
I would be more excited if the smallest useful version was not 24GB.
evilbarron2@reddit
But it’s an MOE model, so it’ll only have a subset of layers active at any given moment and should be runnable on <24gb, right?
--Rotten-By-Design--@reddit
I am fully aware of that, but the qwen3.5-35b-a3b is not as big. And yes, you can run it on 24gb, but that leaves no VRAM for context, not even a little for a Chrome tab on a 24GB card, making it much slower and, for me, useless at the speeds I get on my CPU/RAM.
evilbarron2@reddit
Hmm. I’m running Qwen3.5-35B-A3B-UD-Q4_K_M.gguf on a 3090 with 132k context using llama.cpp and the llama-serve WebUI reports 150t/s - I haven’t seen any reason not to expect the same from 3.6. I’ve disabled reasoning entirely and I don’t miss it. This works pretty well as a backend for Hermes agent and openclaw. Not so hot for multiuser, but this is my personal home lab and development box.
--Rotten-By-Design--@reddit
3.6 is much bigger, so it can't be the same, unless they changed its efficiency also.
You get better speeds than me, but I use LM Studio, and also your RAM is most likely much faster than my 64GB DDR4 3600MHz, so that might help you
ea_man@reddit
LM Studio is probbly your problem, try llama.cpp
--Rotten-By-Design--@reddit
Yeah could be, I just like the simplicity of using LM studio, but I may test it at some point
ea_man@reddit
Windoze +LM Studio gave me some 1/2 performance with some MoE models, YMMV
--Rotten-By-Design--@reddit
Nice knowing. Will have to test at some point. Thx
Top-Rub-4670@reddit
They're literally the same size at the same quant, as you'd expect them to be.
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/tree/main
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/tree/main
--Rotten-By-Design--@reddit
Not quite from my perspective. The ones you find in LM Studio are 22GB or slightly less in the q4_k_m quant. Dunno if that means the other software downloads something extra that LM Studio already has built in.
But it could mean that Qwen3.6 will also be 22GB in LM Studio, in which case I will be happy.
TexasBryan14@reddit
How do you enable the non thinking mode when using this with openclaw? Thanks!
tombhayya@reddit
Do we have MLX version of it?
FaceDeer@reddit
Ooh. I've been putting off switching my local workflows over to Gemma4 due to all the churn about its template format and so forth, looks like I might actually be skipping it instead.
FormalAd7367@reddit
i thought alibaba won’t release any open source model anymore?
GregoryfromtheHood@reddit
Using llama.cpp I'm running into an issue with it sometimes just outputting a thinking block with a tool call and nothing else, which breaks things. I've had to run it with reasoning off where I can run 3.5 with it on.
viperx7@reddit
3.6 27B will be gold. What happened to the poll on Twitter? 3.6 27B when?
ea_man@reddit
I mean ain't a dense 27B slower to test / train than a MoE?
dkeiz@reddit
Nope, it could actually be the opposite; dense testing is more predictable.
DistanceSolar1449@reddit
False, training/distilling scales linearly on active params * number of tokens trained
mr_il@reddit
I really hope so! If it hikes on SWE-Bench Pro as much as the MoE model, it'll basically be a GLM-5 level model that you can actually run on Spark DGX or M5 Max. If with DFlash and DDTree it could run at something like 40 tok/s, it'll be pretty much the dream coder.
pigeon57434@reddit
I think maybe they're gonna do the other models first because those are the most desperately in need of upgrading and super undercooked, whereas 3.5 27b is already extremely well cooked.
lemon07r@reddit
Im excited for the 27b model. I think gemma 4 has some rough competition.
DingyAtoll@reddit
Did they fix 3.5’s issue of using tens of thousands of thinking tokens for no reason?
Sakatard@reddit
Holy fucking fuck
somerussianbear@reddit
Countdown to Qwen3.5-A3B-Opus-4.7-Reasoning-Heretic-Abliterated-Uncensored-GGUF
de4dee@reddit
dont forget "ahuahuahua"
Familiar_Wish1132@reddit
Maybe also excited about the Qwen3.6 RYS type of models? Have you read about them?
https://dnhkng.github.io/posts/rys-ii/
https://huggingface.co/mradermacher/Qwen3.5-27B-heretic-RYS-XL-i1-GGUF
openclaw-lover@reddit
AI is experiencing an exponential growth. My linear brain just cannot keep track of all the great releases!
vex_humanssucks@reddit
The 3B active parameter count is what makes this really compelling. Running 35B-class reasoning on consumer hardware with that efficiency ratio is a big deal. The Apache 2.0 license is the cherry on top — looking forward to seeing what fine-tunes emerge.
Fit-Palpitation-7427@reddit
How is it compared to qwen 27b?
ComfyUser48@reddit
I am getting 166 tok/sec on my 5090, Q5 unsloth, with 215040 context. And it works so well!
Tough_Frame4022@reddit
Jack Rong is on this
FrogsJumpFromPussy@reddit
So no smaller models? 😭
olearyboy@reddit
Ok a few tiny tips
Small changes but makes it so much easier to read
vogelvogelvogelvogel@reddit
yessss thank you alibaba!!
Industrialman96@reddit
Is it possible to use it via cli for free?
StateSame5557@reddit
I started running performance metrics, it scores considerably higher in instruct mode, even at lower quant
Qwen3.6-35B-A3B-qx86-hi
https://huggingface.co/nightmedia/Qwen3.6-35B-A3B-qx86-hi-mlx
Beautiful-Floor-5020@reddit
IVE BEEN WAITING
Icy_Anywhere2670@reddit
For a girl like you
Beautiful-Floor-5020@reddit
3.6B 27B gona be game changer.
To be honest this is the one model I would spend $ on better hardware to run the highest size 🤣
New-Inspection7034@reddit
I've tested both the Qwen 3.5 27b and the Qwen 3.6 35b-a3b, both in my Visual Studio extension that I've written to do agentic coding. They both seem pretty comparable in how smart they are, but the 3.6 MoE is a lot faster. I'm going to be interested when I get my beast with that RTX 6000 with 96 GB of VRAM; I will be able to use the Q8 version of the 3.6 MoE. Maybe an unlobotomized version will work better.
FatheredPuma81@reddit
Community: "We're most excited for Qwen3.6 27B!"
Qwen team: "Okay here's Qwen3.6 35B!"
Well I for one am still happy.
Iory1998@reddit
Wait what! Didn't the 27B win the most votes? WTH?
Fault23@reddit
122B please
MindRuin@reddit
does it normally get dropped after?
Desther@reddit
Where is the swe-bench gemma result? Cant find it in official Gemma press or on swe-bench results table. Did they hallucinate it lol?
IronColumn@reddit
Base model m1 max studio with 32gb of ram:
between 25-32 t/s
3.5 27b: 9-12 t/s
unbannedfornothing@reddit
Guys at Holy Qwen Mother of LLM models, please 397b 3.6!
kl__@reddit
Great work Qwen team
Acu17y@reddit
❤️❤️🔥😍
_derpiii_@reddit
What’s the Apple memory requirements?
realmosai@reddit
190 t/s on Pro 5000? Holy Moly, am I doing something wrong? isnt this too fast?
ProfessorWar001@reddit
Guys, because I am a bit new to this field, can someone tell me why it looks spectacular? When I look, I only see nearly no improvement over the 27B model and nearly the same as Gemma. What is the most important thing in those benchmark results that would make you choose the 3.6 35B rather than the 27B version?
Fit-Pattern-2724@reddit
More impressive than opus 4.7 lol
SirSod@reddit
Fucking awesome at code. In all the rest it loses to Gemma 4 26B.
Alarming-Contest3736@reddit
Can someone explain like I’m 5. Is this comparing running all of those locally? I see google there; is that web based Gemini? What about considering the local hardware?
Zealousideal_Fill285@reddit
The compared models are local models of similar size. Those models are Qwen 3.5 (the previous version) and Gemma 4 (also a local model; it's not Gemini).
seppe0815@reddit
For German writing stuff, what is better, this or Gemma 4? Please, I don't understand the benchmarks.
Gleethos@reddit
Oh my goodness! Please tell me this is no bench maxing!!!
Ok_Study3236@reddit
I don't want to suggest Google is some panacea of benchmaxxing, but aren't such huge contrasts in benchmarks between equivalent-size models at least a little suspicious? My initial thought looking at the post was "overfitting", especially after spending some time with Gemma.
Sadman782@reddit
As I always said, a little benchmaxxed. Not directly, it is indirect. But anyway, they are quite good for some tasks too, though overall Gemma 4 is better for most tasks.
pneuny@reddit
That's for the older Qwen. Not 3.6
Naiw80@reddit
Now this is a model that appears to work just fine with openclaude... Unlike gemma4 which still is completely useless for agentic work.
uniVocity@reddit
I got the BF16 quant to run on my M4 Macbook Pro Max with 128gb of ram. LMStudio runs this at 40 tokens/sec which is not bad.
I asked it to refactor some non-trivial Java code that had a bit of overlap into something cleaner, and it did a better job of giving me clean, less cognitively loaded code than Gemini Pro; it just had a compilation error that was easily fixed.
One thing that keeps me using online models is the time to wait before the model spits an answer out. I wonder if there are any recommended settings specifically for coding tasks.
Nutty_Praline404@reddit
Running A3B-UD-Q4_K_M well at ~50 tok/s on my RTX4060 Ti 16GB (Win11 i7-13700F 64GB) with the following:
Sticking_to_Decaf@reddit
Tool calling in Hermes Agent is very good. A couple minor hiccups but both were things Hermes Agent recovered from immediately without user intervention.
duebina@reddit
Does anyone have any direct experience with Qwen Next Coder that could offer a comparison with this model? I use the 8-bit quant version with OpenCode every day and it performs great. But this is a smaller model, which would be even better if it exceeds it in capability.
BumblebeeParty6389@reddit
I'm so glad they didn't listen to that BS twitter poll
mr_il@reddit
On M5 Max at 32k context prompt processing 2047 tok/s, token generation 62 tok/s. Sweet!
Ferilox@reddit
Is it possible to run this on 12G VRAM with decent performance and speed? Whats the minimum VRAM thats usable?
Kodix@reddit
We don't know yet. There isn't even a GGUF. But due to it being MoE it likely handles CPU offloading pretty well, and the previous Qwen models quantized very very well.
Meaning: when it comes out, try the UD-IQ3_XXS quant or something like it, and see for yourself.
trying4k@reddit
At Q3, isn't 9b the better option (for 3.5)? How do lower quants impact things like code quality?
Kodix@reddit
Dunno, couldn't tell ya exactly. But what I *can* tell you is that, according to Oogabooga using this research methodology, Qwen3.5 A3B at UD-Q8_K_XL (so the largest quant available) has a KL divergence of 0.093 and top-1 of 96%. While UD-IQ3_XXS has a KL divergence of 0.262 and top-1 of 89%.
Top-1 is the more illustrative statistic here, I think - the IQ3 quant's most likely token to pick is 89% the same as the completely unquantized model, and the Q8's is 96% the same. That difference seems tiny to me (and is significantly larger for his Gemma benchmarks, that's why I say Qwen quantized well).
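For anyone curious what those two numbers measure, here's a minimal sketch (not the actual methodology referenced above, just an assumed setup): compare the quantized model's next-token probability distributions against the full-precision model's, position by position.
import numpy as np

def kl_and_top1(p_full, p_quant, eps=1e-10):
    # mean KL divergence KL(full || quant) and top-1 token agreement
    kl = np.sum(p_full * (np.log(p_full + eps) - np.log(p_quant + eps)), axis=-1)
    top1 = (p_full.argmax(axis=-1) == p_quant.argmax(axis=-1)).mean()
    return kl.mean(), top1

# toy distributions just to show the shapes involved: (positions, vocab)
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))
p_full = np.exp(logits); p_full /= p_full.sum(-1, keepdims=True)
p_quant = np.exp(logits + 0.1 * rng.normal(size=logits.shape))
p_quant /= p_quant.sum(-1, keepdims=True)
print(kl_and_top1(p_full, p_quant))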
trying4k@reddit
Thank you for the information!
ea_man@reddit
Hey you can even run 27B IQ3 on 12GB ;)
lolwutdo@reddit
just cpu offload, it's fast as hell even with cpu
DefNattyBoii@reddit
Sadly not unless you are willing to go to the 2 bit category. I'm able to run 27b gemma4 moe model with a very small cache with IQ3_XXS on my 3080ti. 12 gb bros rise up.
sagiroth@reddit
I ran 3.5 on 8GB VRAM at 44 tkps with 64k context and 32GB RAM.
Live-Possession-6726@reddit
For folks with a DGX Spark/GB10, this model is ridiculously fast for Atlas Inference (~115 tok/s). We've got run commands on our website atlasinference.io and Discord, and plan to open source this week!
ResearcherFantastic7@reddit
Someone tag jack for the qwopus gguf version now!
feverdoingwork@reddit
We are all waiting for this guy to get to work
Eyelbee@reddit
I really hope this doesn't mean they won't release the 27B size class version.
9gxa05s8fa8sh@reddit
I have been using the full qwen 3.6 and it works REALLY well, like mimo and glm. very close to the big names, and good enough to not pay for the big names...
Ecstatic_Country_610@reddit
When will we see light weight versions of this? I wanted to try using it with Void (VS Code Fork)
Far-Low-4705@reddit
OH MY GOSH THEY FIXED THE OVER THINKING!!!!
"hi" -> only 200 output tokens (down from like 4-8k tokens)
_BigBackClock@reddit
oh helll yeah, I used to pray for times like this. Thank you alibaba daddy
MaCl0wSt@reddit
> Thinking Preservation: we've introduced a new option to retain reasoning context from historical messages, streamlining iterative development and reducing overhead.
what's this mean
LinkSea8324@reddit
iirc thinking content is supposed to be stripped when moving to a new message, now you can keep it and it will use it (could be kept but ignored before ?)
MaxKruse96@reddit
gguf where
vladlearns@reddit
it should have been gguf when
Opteron67@reddit
no gguf, FP8 is the way
Specter_Origin@reddit
Unsloth already released it
mintybadgerme@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF :)
Genebra_Checklist@reddit
Here. Unsloth posted 3 min ago. That was fast lol
MaxKruse96@reddit
u got baited, only repo is there, no files yet
the__storm@reddit
It's up now
Ok_Technology_5962@reddit
But redownload 3 times after... Lol
Genebra_Checklist@reddit
Yeah, I saw that the second I posted. Kind sloth of me actually
Long_comment_san@reddit
Unsloth being totally-not-a-sloth
Lowkey_LokiSN@reddit
Same model type as qwen3_5_moe. Should be there soon!
MaxKruse96@reddit
i was joking sir, i know
hyrulia@reddit
A new Qwen (3.5)
The Gemma (4) Strikes Back
Return of the Qwen (3.6)
Best trilogy ever!
Fault23@reddit
There's one left...
xignaceh@reddit
Mistral :(
LingonberryMore960@reddit
So I downloaded it, tested it, and deleted it, so I can say this is the most stupid model I've ever tried for the things I use AI for. Gemma 4 31b is a literally godlike model compared to this.
LingonberryMore960@reddit
To give a bit more context on why this model is bad for me: I use AI for 3D work (in SideFX Houdini). Everything mostly happens in a Python environment and is based on HScript. Strong models clearly understand that environment and produce excellent results; it's actually a very good testing ground for the "intelligence" of an AI model. So far, I've tested quite a few models, and none have performed as well as Gemma 4 31B, which delivers impressive results even though I didn't expect that at all. Usually, smaller models up to 35B parameters don't perform well in that environment. When I tested Qwen 3.6, it literally had no idea where it was or what it was supposed to do; it produced so many random outputs that didn't even match the prompt.
Similar_Sand8367@reddit
Wasn’t this the Model Generation which doesn’t run offline in ollama anymore?
funding__secured@reddit
Where's 397b?
Reddit_User_Original@reddit
Greatest 2 months of human history
Dangerous_Bad6891@reddit
THE GOAT!
Blues520@reddit
How do they keep on winning?
Omnimum@reddit
Oof I feel bad for Google
jmakov@reddit
No comparison to GLM-5.1?
TurnUpThe4D3D3D3@reddit
Wow
nabeelkh5@reddit
Excited to try this, downloading now :)
Dany0@reddit
201 tok/s on an rtx 5090. ud q4 in ik llamacpp with the ggml graph reuse patch manually applied
apeapebanana@reddit
just got qwen3.5 started running yesterday with 180k context, now I'm using qwen3.6 that just released by unsloth, its FLYING!!
so far been using with pi to code out wordpress, oh my, the terminal is running smoothly so far without hitches
hakanavgin@reddit
These are great and all, but what happened to nearly all the major labs releasing 14 to 22B models? There used to be a time when consumer-grade GPUs with 16 GB of VRAM could fully offload them with non-quantized KV and q4-k-m or even iq6_k. Nowadays it is ALWAYS either heavy quantization, so your model is lobotomized, or heavy RAM/VRAM partitioning, so your model is painfully slow.
If there was an architectural reason, these labs shouldn't be able to create 0.6B to 9B models either, so it is more of a decision, and their decision is to support unified memory and "I've just got 20 more H100's and running an obscure stack of circular validation for my bouncing balls benchmark" bros, seems like. It is very disappointing
audioen@reddit
Quick test says that this model is solid. Locally, on Strix Halo, this is going to replace 3.5 122B-A10B at least for today. Maybe tomorrow the 3.6 122B-A10B replaces it again, but I'm not sure. Speed also matters, and this seems to be working well and doing agentic tasks correctly. At least it's much better than the 3.5 was, I'm sure of that.
I'm using Q8_0_XL size for initial testing, though in reality I'll probably re-download the Q6_K_XL for slightly more speed. (About 1000 tok/s pp, 50 tok/s gen on Strix Halo.)
renczzz@reddit
So far I've got some good results.
Running unsloth/Qwen3.6 35B A3B IQ4_XS fully in the VRAM on AMD RX7900XTX on Ubuntu with 128k context.
It looks like it gets the job done faster than the 3.5 A3B model. Prompt processing is way faster; tok/s on the response is around the same as the 3.5 model in my experience.
Don't forget to configure the LLM with the right parameters if you use the unsloth models to prevent repetition of thinking: https://unsloth.ai/docs/models/qwen3.6
Happy so far with this update!
mumblerit@reddit
Unsloth q8_0 gguf in llama.cpp
mohammed_28@reddit
Qwen never ceases to impress me.
aschroeder91@reddit
This MoE is 256 experts with 8 active experts: that's a 1:32 ratio, giving nice speed. Given how wide people's computation requirements and goals are, I still think there is space for a 1:8 ratio with quality closer to the dense model but still enough of a speed bump to make agentic/reasoning work fast enough to make sense. Just verbalizing my wishlist; the Qwen team is giving us so much already, I can't complain.
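For anyone unfamiliar with what "8 of 256 experts" means in practice, here's a toy sketch of top-k routing (illustrative only; real Qwen routing, expert sizes, and load balancing are more involved). Only k of E expert weight matrices are touched per token, which is where the 1:32 speed and memory-bandwidth ratio comes from.
import numpy as np

E, k, d = 256, 8, 64                          # total experts, active experts, toy hidden size
rng = np.random.default_rng(0)
router_w = rng.normal(size=(d, E))            # router projection
experts = rng.normal(size=(E, d, d)) * 0.02   # one toy weight matrix per expert

def moe_layer(x):
    scores = x @ router_w                     # score every expert
    top = np.argsort(scores)[-k:]             # pick the k highest-scoring experts
    w = np.exp(scores[top]); w /= w.sum()     # softmax over the selected experts only
    return sum(wi * (x @ experts[i]) for i, wi in zip(top, w))

out = moe_layer(rng.normal(size=d))
print(out.shape, f"active expert fraction ~ {k / E:.3f}")   # ~0.031, i.e. 1:32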
Temporary-Roof2867@reddit
On languages, "gemma-4-26b-a4b" is superior to all the Qwens imaginable. Let's not joke!
This data is fake!
AvocadoArray@reddit
This is exciting, but I have to wonder how long before we’ll see another open model push the boundaries like this. We might not see another open release from Qwen at all, and I don’t see any other teams competing in this size range in the near future.
The 3.6 series might be the king for a long time.
ea_man@reddit
What I'd like to see is a full vertical optimization of something like OpenCode with a free QWEN, so that we have reliable tool calls.
Training + jinja templates in the GGUF + prompts / roles in the IDE.
gurilagarden@reddit
Sweet, hopefully have Q4s before dinner.
Late_Film_1901@reddit
before dinner? I'm at 79% download already!
gurilagarden@reddit
damn...that was fast, went from empty repo to packed in less than an hour.
No_Mango7658@reddit
Strix halo:
total duration: 29.425241762s
load duration: 97.931413ms
prompt eval count: 471 token(s)
prompt eval duration: 653.007273ms
prompt eval rate: 721.28 tokens/s
eval count: 1259 token(s)
eval duration: 28.336498498s
eval rate: 44.43 tokens/s
Dry_Yam_4597@reddit
May the good heavens of AI bless Qwen.
Temporary-Roof2867@reddit
Qwen3.6-35B-A3B is certainly an interesting model but... some of these data seem a bit rigged to me, I don't trust them at all! 👀🤔
Healthy-Nebula-3603@reddit
So we are waiting for quem 3.6 27b dense :)
StupidityCanFly@reddit
Qwhen Qwen?
Healthy-Nebula-3603@reddit
Ups .. autocorrected
Sticking_to_Decaf@reddit
Running the Qwen official FP8 on a single Pro 6000 max-q gpu in vLLM: ~200 tps decode for 1 request ~300 tps decode for 2 concurrent requests
Tool calling in Hermes Agent is working well so far but needs more robust testing.
PlainPrecision@reddit
Can this run on a 16GB 3090?
the__storm@reddit
There's a 16GB 3090 ?
Anyways at 16 GB you'll need to offload to system memory (will still be useably fast). 24 GB you could squeeze it in but I'd probably run 3.5 27B instead.
90hex@reddit
According to this it beats Gemma4 in all benches. Can’t wait to give it a go.
Kahvana@reddit
Awesome! Can't wait to test its vision encoder and see if it still reports non-Genshin anime characters as Genshin characters.
Hope they'll do the other models too, especially 122B-A10B and 2B.
cr0wburn@reddit
Qwen 3.5 is amazeballs, i cant wait to test this one! Thank you qwen team!
Big_Mix_4044@reddit
We feast today!
Speedping@reddit
Does anyone know why the mlx-community version is so big? 90GB for 4 bits, while 3.5 was 20GB for 4 bits with the same parameters (35B A3B).
Kaljuuntuva_Teppo@reddit
Noice, looking forward to Qwen3.6-27B the most.
I thought that one won the poll they did to gauge which model to release first, but I didn't keep track until the end 😅
bithatchling@reddit
This looks like a really interesting release! I'm always excited to see new models come out that can potentially help us all build cooler things. Thanks for sharing the news!
Direct_Technician812@reddit
Qwen 3.6 💀👑. Gemma is outdated.
Manaberryio@reddit
Around 30 tps with my RX6800. So glad!
No_Doc_Here@reddit
I'm hoping for 122B.
Not mad if they don't release it (we're owed nothing), but the 3.5 FP8 of that one is our current workhorse.
AlreadyBannedLOL@reddit
Didn’t the 27b dense model win the poll last week? Now we get a MoE.
I mean, I am going to take it but I was waiting for 27b.
This_Maintenance_834@reddit
It will come, be patient and enjoy.
JLeonsarmiento@reddit
Oh gosh, just when I started to go with gemma4 for everything…
This_Maintenance_834@reddit
Now, we will be waiting for the dense one.
sagiroth@reddit
Gguf , 27b when ?
ea_man@reddit
And don't forget Omnicoder 3.6
dinerburgeryum@reddit
Yeah 3.6 27B will be the one to beat if the 3.5 model is any indication.
Serious-Log7550@reddit
Qwen 3.5 passes that test :(
One_Key_8127@reddit
Guys, I liked this test prompt but it's probably cooked by this point. Qwen3.6 35B A3B passes it even without thinking. What's interesting is that "Qwen 3.6 Plus" fails without thinking. It might have made it into the training data...
FinBenton@reddit
I mean, that's pretty much a pass.
Serious-Log7550@reddit
You're right, my bad. Tried the `I want to wash my car. The car wash is only 100m away from my house, should i walk or drive?` prompt and it works well:
Kodix@reddit
What do you mean? That's basically a pass. It says if you want to wash it you need to drive it there.
No_Swimming6548@reddit
The question doesn't even say "I want to wash my car" lol
Serious-Log7550@reddit
My bad, just omitted it :(
One_Key_8127@reddit
Right. The response is kind of awkward, but this version of the question is poorly worded too. Still, the "(in which case, you'll obviously need to drive it)" indicates that the model grasps the concept of needing the car itself at the car wash in order to wash it.
And it seems ~800 tokens were used for this response - which is great, 3.5 usually used way more tokens.
One_Key_8127@reddit
Where did you test it? Is it quantized?
Qwen3.5 35B A3B answers it correctly after producing ~5k+ thinking tokens. Gemma4 (local, quantized at ~Q4) answers it correctly producing ~500 tokens.
alexx_kidd@reddit
3.6 9b when?
Ok-Measurement-1575@reddit
So is this the 2507 moment for 3.5?
DefNattyBoii@reddit
Someone pls make a turboquant,polarquant,bonsai,howevertheynamedthenextgenquant weights sub 4 bit!
StrikeOner@reddit
I'm working on the new 0.1-bit quant at the moment. Stay tuned!
DefNattyBoii@reddit
no bits llm
--Rotten-By-Design--@reddit
3.6 is much bigger, so it can't be the same, unless they changed its efficiency also.
You get better speeds than me, but I use LM Studio, and also your RAM is most likely much faster than my 64GB DDR4 3600MHz, so that might help you
Blaze6181@reddit
So I don't need to buy a PRO 6000? Thank you 😭😭😭
henk717@reddit
Eagerly waiting for the GGUF (and the 27B version). I didn't like the last 35B since it wasn't good at my use cases, and I suspect the same will be true here, but I'd be happy to be pleasantly surprised. Its coding being on par with the 27B would solve at least one of those.
I expect the 27B to be in the works too since it won their Twitter poll; if it's like 3.5 but without the looping bug, I'd be very happy.
Zc5Gwu@reddit
I haven’t run into the looping bug recently with the 27b. I’ve seen it with the 35b though.
henk717@reddit
It's not a constant thing. It probably doesn't help that I prefer heretic models, which may be more prone to it. For me a reliable way to test it is playing 20Q with the model where the answer is electricity. I can't make it through the 20 turns without it looping.
Another easy one for me was this one: Hey Gemma! Is concedo cooked or boiled?
(It's based on an in-joke, but because it makes no sense the model will endlessly think about it)
LegacyRemaster@reddit
it's a beautiful day
mtmttuan@reddit
Yeah, the model seems better than its competition, but now even Qwen is doing the bullshit charts that start the axis at a value just a bit lower than the competitors' score to make their model look way better, huh. That's kind of low.
DOAMOD@reddit
I'm testing it out and it's thinking a lot, but it seems very intelligent. I think I'm going to like it. I'm really looking forward to seeing the 27b and what it can do.
No_Lingonberry1201@reddit
Ahh, just what the doctor ordered!
morphlaugh@reddit
I'm downloading now... gonna try the Q4_K_XL quant since the full BF16 is just a little too big to run on my setup at 71GB *cough*
iamagro@reddit
Are these benchmarks for the 8-bit quant or BF16?
JHShim1@reddit
Wow, if 35b a3b got that better, then the 27b... hoping for it to come out soon!
westsunset@reddit
I know 🤞
popsumbong@reddit
That was fast
DOAMOD@reddit
Gemma 4 is dead?
caetydid@reddit
uses less vram, so no!
DragonfruitIll660@reddit
Haven't tested the new Qwen yet but I wouldn't think so. Gemma 4 I'd argue is likely to stand out for this generation/release cycle.
Kodix@reddit
Relax. Benchmarks never tell the whole story. Actual community reactions and personal testing are king.
Willing-Toe1942@reddit
heretic when?
Side note: in my benchmarks for agentic workflows and coding, I found the heretic versions (1.2 ara method) of any model are waaaay better in performance and token efficiency, and tend to use the right amount of thinking without going crazy in loops.
This applies to both Gemma4 and Qwen3.5, so hopefully the heretic for Qwen3.6 will be better too.
bartskol@reddit
So now we are waiting for GGUFs, a llama.cpp update, and the best settings for it. Will be ready for the weekend :)
ustas007@reddit
anyone tested against gemma4:27B?
celsowm@reddit
Surpassed Gemma 4 31b?
Kodix@reddit
Benchmaxxed to do so. We'll see the reality soon enough. Hopefully yes, though!
KageYume@reddit
I doubt it would surpass Gemma 4 (26B A4B and 31B) in translation but I won't complain if it actually does.
Kodix@reddit
Gemma 4 is also amazing for roleplay, to the point where you don't even really need to uncensor it.
But yeah. This is a win-win either way.
Septerium@reddit
I just can't believe this thing... can't wait to test it for myself... is it possible without benchmaxing?
Serious-Log7550@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
GGUFs incoming!
KringleKrispi@reddit
That's the one!
Prestigious-Use5483@reddit
Amazing. I was starting to get used to MoEs with Gemma 4 26B A4B. This should hopefully be a nice upgrade.
Beginning-Window-115@reddit
if we got this much of a performance improvement imagine the qwen3.5 27b version
Paradigmind@reddit
*3.6
Beginning-Window-115@reddit
thx
ciprianveg@reddit
when 3.6 397b? glm 5.1 is just too big for my local machine..
root_klaus@reddit
So amazing, I hope we get a 27B and a 9B model. The 9B is good for extraction tasks and so convenient, and a 4B would be fantastic. I hope they release all the small models! LET'S GO!!
Fun-Farm-452@reddit
I can't wait!!!!! When will the GGUF be released!!
throwaway957263@reddit
Do you reckon any decent quant of it could hit decent tps (30+) on a 5060 Ti 16GB + DDR5 rig?
I got ~30 tps for Q4 26B Gemma and ~85 for one that specifically fits a 16GB-VRAM GPU.
Mashic@reddit
Happy to see it, hopefully they publish qwen3.6 27B and 9B too.
inaem@reddit
gptq-int4 when
(They usually release it in a day or two.)
Impressive-Sir9633@reddit
Wow! Multimodal, open source Qwen 3.6 and better performance than Gemma4 on benchmarks. Here goes all my recreational time for the next week
bakawolf123@reddit
Nice, like I thought they wanted to trample gemma4.
Competition is good
jacek2023@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
volleyneo@reddit
No way!
segmond@reddit
From a quick eyeball, the benchmark chart is trash. If it's better than the 3.5 variants, that's good. I suppose these benchmarks are for the non-technical crowd.
appakaradi@reddit
I am worried that they are comparing to 3.5 27B Dense. Does that mean we are not getting 3.6 27B dense?
rpkarma@reddit
Probably. To me it looks like all the big open-weights labs are keeping back anything that's too good.
ArugulaAnnual1765@reddit
Qwen are they releasing 27B?
Chance-Studio-8242@reddit
on ollama?
DeedleDumbDee@reddit
I’ve been using 3.5 35B Q6 since release and it has performed extremely well. GGUF soon hopefully.
Ifihadanameofme@reddit
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
LET'S GO!
xXprayerwarrior69Xx@reddit
Bro is very sparse
Sarayel1@reddit
uhu. we wanted dense not MoE
dampflokfreund@reddit
Speak for yourself. This is the perfect size for low to mid end PCs.
iamapizza@reddit
Agree, I am hoping this is just one of many steps towards commoditization. People should be able to run models on any hardware.
Thrumpwart@reddit
An MoE model with improved agentic and coding use is a godsend. 27B was smarter but 35B much faster.
Recoil42@reddit
Go get your money back.
appakaradi@reddit
Yes. It looks like we are not getting one..
VoiceApprehensive893@reddit
gguf when
NaN_Loss@reddit
Holy
MaCl0wSt@reddit
Sweet, just earlier I was playing around with 3.5 35B and it's damn good for something I can run on my gaming rig at decent speeds.
moahmo88@reddit
WTF!