Qwen3.6-35B-A3B solved coding problems Qwen3.5-27B couldn’t
Posted by simracerman@reddit | LocalLLaMA | View on Reddit | 114 comments
Yeah, another one of those "new shiny model is better than previous SOTA" posts, and I understand why you'd roll your eyes. I ignored Qwen3.6 for the first 24 hours thinking it was overhyped like the last one, but yesterday I decided to put the doubts aside and test it only against the issues Qwen3.5-27B simply couldn't solve, no matter how I tackled them.
Qwen3.5-27B-Q4_K_M helped me build a customized budgeting app to replace a cloud-based one I'd used for almost a decade. It tracks expenses and income, builds dynamic budgets, imports/exports bank account data, has built-in charts, a modern interface, and a bunch more little features.
While it worked great, I found that 27B was introducing technical debt as I kept adding features. Once a week I'd do a few cleanups here and there, but at some point it hit a wall. I was 100% sure it was an Opencode limitation, since 27B was eating up all the requirements that Qwen3-Next, Gemma4-31B, and even Qwen3.5-122B couldn't handle.
When Qwen3.6-35B-A3B dropped, I recalled my time testing the previous Qwen3.5-35B-A3B, and that was a giant waste of time, at least for my project needs. Then yesterday, I broke down after all the positive posts in this sub and decided to dive in again.
The new 35B SLAPS! I pit it against all the failed implementations and bugs its 27B predecessor introduced, and it kept solving them either 1-shot or, at worst, 2-shot. Feeling motivated, I promoted it to review and tackle all code inefficiencies and potential security risks. I asked it to use subagents to split the work and never go above the 128k context window. About 20 minutes later it produced a pristine report of what to do; flipping the agent to Build mode, it took another 30 minutes to address everything.
On my 5070 Ti 16GB, the Q5_K_XL is pretty good: ~320 t/s processing and 50 t/s generation. It thinks too much but rarely goes into loops. It still has some wrinkled areas, like not respecting Plan mode in Opencode and writing files anyway, but I prompted around that for now. If you had doubts or thought "this ain't for me", just give it a shot. It won't be a waste of time at the least.
If the new Qwen team can improve so much upon the last 35B, how would the new 27B do?!
Clean_Initial_9618@reddit
On 16GB VRAM, how are you using Q5_K_XL? Can you please share your command?
simracerman@reddit (OP)
Here:
${llamasvr} -m ${mpath}\Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf --no-mmap -c 128000 -np 1 -ncmoe 22 --chat-template-kwargs "{\"preserve_thinking\":true}" --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0
It offloads to 64GB DDR5 memory.
relmny@reddit
Thanks!
I have a 4080 Super 16GB VRAM and 128GB DDR4 RAM, and I only get 37 t/s at most.
Didn't know the 5070 and DDR5 could make such a difference...
Most-Trainer-8876@reddit
Bro, I've got a 5070 Ti + 64GB DDR4 RAM and I only get 25 tk/s, lol. I'm using Q6_K_M at 131K context.
HOW ARE YOU GUYS GETTING SUCH SPEEDS!?
GrungeWerX@reddit
It's probably your settings, which are?
Most-Trainer-8876@reddit
GrungeWerX@reddit
I'm on my phone, so I must be brief. If you want an immediate speed boost, change the KV cache to Q8, which costs almost zero intelligence but frees up memory for faster processing. If you don't need that much context, lower it until you find the sweet spot; I'm currently at 100K. If your GPU offload is at max, maybe dial it back a few layers until you find the balance between prompt processing and output speed.
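For llama.cpp users, the tweaks above map roughly to these server flags (a sketch; the model path and layer counts are placeholders to tune for your own hardware):

```shell
# Sketch: Q8 KV cache (-ctk/-ctv, needs flash attention via -fa on),
# a lowered context size (-c), and max GPU offload with expert
# layers pushed back to the CPU (--n-cpu-moe).
llama-server -m ./Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
    -c 100000 -fa on -ctk q8_0 -ctv q8_0 \
    -ngl 99 --n-cpu-moe 22
```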
EternalVision@reddit
Maybe use a draft model; it can speed things up a lot (1.7x-2x if set up right) if you have a little memory remaining. Definitely worth exploring if you haven't already.
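In llama.cpp, a draft model is wired up with the speculative-decoding flags (a sketch; the draft model filename is hypothetical, and the actual speedup depends on how often the main model accepts the draft's tokens):

```shell
# Speculative decoding sketch: a small draft model proposes tokens,
# the big model verifies them in a single pass.
llama-server -m ./Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
    --model-draft ./Qwen3.6-0.8B-Q8_0.gguf \
    --draft-max 16 --draft-min 1 \
    -ngld 99   # keep the whole draft model on the GPU
```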
relmny@reddit
Look at the command OP shared in the thread. With prompts like "hi" (just as a test), I could get 33 t/s with Q6_K_L. And my GPU is "very old" now...
MutantEggroll@reddit
It's the DDR5 that makes most of the difference here, since inference speed is limited by memory bandwidth. Assuming you've got DDR4-3200 and OP has DDR5-4800, that's a 50% bandwidth increase, and 50 tk/s is about 35% more than 37 tk/s, which is in the right ballpark.
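The scaling argument above can be sketched numerically (assuming dual-channel kits at standard speeds; theoretical peak, not effective, bandwidth):

```python
# Theoretical dual-channel bandwidth: MT/s * 8 bytes per channel * 2 channels.
def dual_channel_gbs(mts):
    return 2 * mts * 8 / 1000  # GB/s

ddr4 = dual_channel_gbs(3200)  # 51.2 GB/s
ddr5 = dual_channel_gbs(4800)  # 76.8 GB/s
print(f"DDR5-4800 vs DDR4-3200: {ddr5 / ddr4:.2f}x bandwidth")  # 1.50x
print(f"Observed: {50 / 37:.2f}x generation speed")             # ~1.35x
```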
KURD_1_STAN@reddit
VRAM bandwidth is also 20% faster, and that is just as important, if not more so.
relmny@reddit
thanks for the explanation!
simracerman@reddit (OP)
Mine is the Ti version, so it gets about 20% better performance.
relmny@reddit
and is Blackwell, right? so it might have instructions that mine doesn't... mine got old very fast...
simracerman@reddit (OP)
Correct. Blackwell is optimized more for AI workloads, and Nvidia seems to prefer building on it over previous hardware. It's honestly one of the reasons I didn't get a 3090 with more VRAM. Once this mess of hardware prices settles a bit, I'll pick up a 5090, or hopefully the foretold 5070 Ti Super with 24GB they promised before the VRAM shortage.
use_your_imagination@reddit
I think you can just use
--reasoning on/off instead of the --chat-template-kwargs, if you are using a recent llama.cpp version
simracerman@reddit (OP)
preserve_thinking is not a reasoning on/off switch. It tells the model to keep reasoning tokens in context when computing the next token.
use_your_imagination@reddit
Oh, I didn't notice the param content; I wrongly assumed you were switching the reasoning.
I learned something new!
Do you notice an improvement in quality when you include the reasoning tokens?
simracerman@reddit (OP)
I haven't, because I turned it on as soon as I downloaded the model, but others say it helps with coding and reasoning tasks. No speed degradation, for sure.
Willing-Car-4010@reddit
With my PC (Ryzen 5 7500F / RX 9060XT 16GB / 32GB DDR5 6000MHz RAM), I'm able to achieve 25 t/s.
This is the command:
llama-server -m ~/Development/llamacppmodels/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf --no-mmap -c 64000 -np 1 -ncmoe 22 --chat-template-kwargs '{"preserve_thinking":true}' --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0
Still-Wafer1384@reddit
Don't use -ncmoe, -fit on works better (it's on by default btw)
simracerman@reddit (OP)
I was able to squeeze out about 4-5 t/s more in TG with this. -fit is a bit conservative for my taste.
jwpbe@reddit
use -fitt to set a different remaining vram target, i experiment with 384 to 640 until something works
Still-Wafer1384@reddit
Interesting, I found the exact opposite. What about prompt processing?
simracerman@reddit (OP)
That I’ll have to check, but likely was in the same ballpark for me to keep this parameter in.
BigYoSpeck@reddit
If you notice, they're only getting 50 t/s generation speed. That's still way faster than the 27B dense model, but nowhere near the throughput an MoE can achieve when it's entirely in VRAM. They'll be using -ncmoe to offload some expert layers to the CPU, taking a performance hit because the weights in those layers are read from regular RAM.
If you have the system RAM for it, even the big 120B+ models can run this way at still-usable speeds. I get 20 t/s from Qwen3.5 122B.
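The arithmetic behind this can be sketched: each generated token only reads the model's active parameters, and total time is bytes read divided by each memory pool's bandwidth. All numbers below are assumptions (roughly 3B active params, Q5 at ~5.5 bits/weight, 60% of active weights resident in VRAM), not measurements:

```python
# Rough generation-speed ceiling for a partially offloaded MoE.
def tokens_per_sec(active_params, bits_per_weight, vram_frac, vram_gbs, ram_gbs):
    bytes_per_token = active_params * bits_per_weight / 8
    secs = (bytes_per_token * vram_frac) / (vram_gbs * 1e9) \
         + (bytes_per_token * (1 - vram_frac)) / (ram_gbs * 1e9)
    return 1 / secs

# ~3B active params, Q5 ~5.5 bits/weight, 5070 Ti ~896 GB/s VRAM,
# DDR5 at ~80 GB/s effective, 60% of the active weights in VRAM.
est = tokens_per_sec(3e9, 5.5, 0.60, 896, 80)
print(f"~{est:.0f} t/s upper bound")  # real-world overheads land lower
```

The RAM term dominates the total, which is why system memory bandwidth, not the GPU, sets the pace once experts spill to the CPU.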
chocofoxy@reddit
Bro, how are you using Q5? I tried Q4 on my 5060 Ti 16GB (offloaded) and the max I get is 19 t/s, even when offloading 4 of the 8 layers to the CPU. I tried Q2, which fits, and I get 80 t/s, but I don't trust it. How are you loading Q5 and getting 50 t/s?
relmny@reddit
I also would like to know, I have a 16gb VRAM + 128gb RAM and I only get less than 33t/s.
orangejake@reddit
try looking at the discussion on this post
https://www.reddit.com/r/LocalLLaMA/comments/1sor55y/rtx_5070_ti_9800x3d_running_qwen3635ba3b_at_79_ts/
it helped me; I was able to get ~55 tokens/s on a 3080.
MmmmMorphine@reddit
Seems he'd have to have a pretty high end system (mostly in regard to pcie5 and fast ddr5 ram) to do this
Though there's lots of little optimizations you can make to improve things
orangejake@reddit
I have a PC from 2021 with a 3080 and DDR4 RAM; I'm getting ~55 t/s with Qwen 3.6.
It's slower than what my macbook gets w/ 48GB of unified memory, but much more usable than I was expecting. Might have convinced me to put off getting a 3090.
JLeonsarmiento@reddit
Not necessarily; I have it on a 2-year-old mid-tier Apple machine, and a Q5 of Qwen3.6 is outputting ~50 t/s too.
MmmmMorphine@reddit
Ah yeah, that is the other possibility. I assumed it wasn't a unified-memory system, given the GPU.
SmallHoggy@reddit
Have you tried ik_llama? Try offload just some experts, not the full layers.
Caffdy@reddit
is this for llama.cpp?
BigYoSpeck@reddit
Yes
Clean_Initial_9618@reddit
I have an RTX 3090 and 64GB system RAM. What's the max model I can fit??
mr_il@reddit
You can fit the new Qwen3.6 MoE, look for Q8_K_XL quant
Cute_Obligation2944@reddit
https://www.reddit.com/r/LocalLLaMA/s/Pkd2nwITXm
Cold_Tree190@reddit
Bro you have the same specs as me, thank you for asking this 😭🙏
BigYoSpeck@reddit
Qwen3.5 122b
But, unless they release a 3.6 version I wouldn't bother. Qwen3.6 35b is faster and from my general testing more capable
And Gemma 4 or Qwen3.5 27b can fit entirely in VRAM
My opinion is all of the chonky-boy 120B+ models are currently obsolete next to Qwen3.5 27B, Gemma 4 31B, and Qwen3.6 35B, unless you have enough memory capacity to run Minimax 2.7 without dipping into the low-bit quants. The Q3 of that I tested was slow, spent forever reasoning, and still came out behind Gemma 4 or Qwen3.6.
Guanlong@reddit
The core idea for running MoE models in LM Studio is to just max out GPU offload (ignore VRAM) and then push enough MoE weights back to the CPU. E.g., these are my settings with 32GB system RAM and 12GB VRAM:
https://i.imgur.com/AufVNaG.png
Runs at 40t/s.
With 16GB VRAM you probably need to force fewer MoE weights to the CPU; maybe try 24.
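For anyone on llama.cpp rather than LM Studio, the equivalent of those sliders is a sketch like this (model path is a placeholder; the expert-layer count needs tuning per card):

```shell
# Same idea as the LM Studio settings: offload everything (-ngl 99),
# then push N layers' worth of MoE expert weights back to the CPU.
# Raise --n-cpu-moe until the model fits in VRAM; lower it for speed.
llama-server -m ./model.gguf -ngl 99 --n-cpu-moe 24
```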
undernetman@reddit
Much faster than 3.5-A3B, but no match for the 27B on complex tasks and repeated series of operations.
odikee@reddit
It failed to build a simple iOS Swift application that 3.5 27B and Gemma 31B could build with the same prompt.
RnRau@reddit
How many times did you try the prompt on each model?
simracerman@reddit (OP)
Do you mind sharing the prompt?
GrungeWerX@reddit
Yeah, but you’re using the q4 27b. The q5/q6 are so much better! Those are the only ones I use. The q6 is too slow for regular use, but the q5 is definitely doable at 26 tok/sec at 100k ctx.
Main_Secretary_8827@reddit
What hardware?
GrungeWerX@reddit
RTX 3090 TI
Main_Secretary_8827@reddit
On my 4080s, how many tok/s do you think I can get
simracerman@reddit (OP)
Definitely, though the consensus in this sub was that MoEs benefit the most from higher quants. Dense models usually do fine at Q4.
GrungeWerX@reddit
Ultimately gotta test it for yourself. I heard similar, but found the Q5 and up a noticeable leap in quality, so I switched and never looked back. :)
simracerman@reddit (OP)
Putting your advice to the test as we speak... Downloading Qwen3.5-27B-Q5_K_M from Unsloth. Let's hope this does better than Qwen3.6-35B-A3B.
GrungeWerX@reddit
I'd recommend the Qwen 3.5 27B Q5 UD K XL. that's what I use. Should squeeze out a bit more quality, with minor size/ram penalty (if any).
simracerman@reddit (OP)
I tried the Q5_K_M variant and it tanked my TG to 8 t/s at 50k context vs 14 t/s for the Q4_K_M. Remember, my 5070 Ti has 16GB, and every offloaded layer takes a big toll on speed. I know what Unsloth's dynamic quants do, and while they're pretty good for general knowledge, they don't impact coding all that much. Check out their performance studies; the Q5_K_M and Q5_K_XL are quite close in size/performance.
alchninja@reddit
Your numbers seem kinda low; you can actually claw back a good chunk of the performance by offloading more layers to the CPU and increasing the batch size. I'm on a Ryzen 5700X, 32GB DDR4, 5070 Ti, and slightly hamstrung by my motherboard's PCIe 3.0. I'm able to get around 25-35 tps with the Q5 and 40-50 tps with the Q4, both at 120k context size (tps starts high and degrades linearly as the context fills up). Try something like:
-b 4096 -ub 4096 -ngl 999 --n-cpu-moe 20 -fa on -ctk q8_0 -ctv q8_0 --no-mmap
simracerman@reddit (OP)
I was talking about the dense 27B. You're running the Q5 of an MoE, which runs at a minimum of 50 t/s.
alchninja@reddit
Ah my bad, I misread your comment.
IrisColt@reddit
I was expecting this...
jadbox@reddit
Q5_K_S should be pretty close to Q5_K_M if you want to go smaller.
ga239577@reddit
I have to say, after using 3.6 for a while... its instruction following and/or understanding of what I'm saying seems worse than 27B's.
The model comes up with solutions that don't make sense and show that it doesn't really understand my prompt. I have had to send multiple prompts and really hold its hand to get it on the right path.
Waiting for 3.6 27B instead ...
simracerman@reddit (OP)
Someone mentioned that going from Q4 to Q5 with Qwen3.6 makes a big difference. Try that. Mine is Q5_K_XL, and I have zero issues with prompt comprehension.
ShreeyanxRaina@reddit
Have you tried openclaude
simracerman@reddit (OP)
Nope. I run opencode server on my PC, use the desktop version on PC for a nice UI, and web version on mobile.
julianmatos@reddit
Can confirm, the jump from 3.2 to 3.6 is noticeable. I've been using it for code review and doc summarization tasks that used to feel like a stretch for local models.
If anyone's wondering whether their setup can handle it before committing to the download, localllm.run is handy for checking hardware compatibility with specific models and quant levels.
misha1350@reddit
3.2? You mean LLaMa 3.2?
tecneeq@reddit
localllm.run numbers are strange. I get 170 t/s on a 5090 with qwen 3.6 Q5_K_XL.
It says 80 at Q4.
tarruda@reddit
This 3.6 release is so much better that it makes me think the 3.5 releases were rushed.
ag789@reddit
I tried playing with MCP, giving it access to some shell commands, e.g. df, du, etc.
It did a fairly nice summary of disk usage: critically low-space mounts/partitions etc., with some hints about freeing up space and guesses about what the directories probably contain.
Quite useful for a "lazy sysadmin" :)
YouthMoist328@reddit
I am glad to hear this model is worth the download and trial run. Has anyone here run the Q8 model? Is it any good?
philmarcracken@reddit
I only have a 12gb vram system with 32gb ram free, can I squeeze in a Q4?
simracerman@reddit (OP)
Absolutely, and maybe with a 128k context window, since Qwen's context doesn't take up that much space.
Jolly-Parsley-989@reddit
How about vs. Gemma-4-31B?
simracerman@reddit (OP)
Sadly, that ran too slow on my machine, even at Q4, to do anything productive.
Radiant_Condition861@reddit
I'm having mine reverse engineer a financial transactions database. It's different, and I'm still getting used to it. Here's my vLLM docker setup for dual 3090s with NVLink if anyone needs a leg up. It's not fully optimized, but it's working and stable; some tool-calling issues remain in Opencode.
havnar-@reddit
Switch to pi. I had many tool call issues with opencode and Claude
Radiant_Condition861@reddit
Switched to pi. I like it, but I had to remove Qwen3.6-35B-A3B from the rotation; it's underpowered for my work. Reverse engineering a financial database (GnuCash) requires 4-6 disciplines, and it's just not up for the job. Most software projects generalize and gather abstractions; financial data cannot abstract, and therefore gathers exceptions. At some point, it can't keep all the software-engineering abstractions and the financial-engineering exceptions in context for a data pipeline. Good test though.
Radiant_Condition861@reddit
whoops.... Time to update the guardrails ... It's really the first time it's apologized. That's interesting; accountability?
ag789@reddit
it is "strange" that the llm executed the commands and subsequently discover the "mistake".
in a striped down Qwen 3.5 REAP 28B A3B model I used, I've seen it say that it "updated' a fragment of codes and actually just copied the old codes verbatim.
LLMs has 'obscure' bugs, some of those 'mistakes' that 'shouldn't happen' happens.
Radiant_Condition861@reddit
had to increase the output tokens in opencode.
AfterShock@reddit
I would hope so
ArtifartX@reddit
How much is being offloaded to RAM to get any meaningful context length for coding on a card like that? On a 24GB card I am using a Q4 XS and it barely fits in the card with a large context window.
simracerman@reddit (OP)
Here's the command, and breakdown of memory:
The actual weights are 26.6 GB. It gets more interesting for the offloaded weights: my DDR5 system memory is fast at 8000 MT/s, but this is an eGPU setup via Oculink, which is capped at 64Gb/s, so my memory speed isn't served to its potential.
Coding is an absolute delight at 50 t/s generation. The processing could be better, of course, but it chews through the first 10k tokens Opencode feeds at a cold start in ~30 seconds, then llama.cpp does its best to minimize reprocessing via caching. I rarely have to sit for more than 2 minutes to reprocess an old session. For a single user at home, it's quite sufficient. I got slightly better processing speeds with the 27B, but 15 t/s generation, and that was fine too.
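The caching behavior described here can be nudged further with llama-server's cache flags (a sketch; the flag values are illustrative, not tuned):

```shell
# --cache-reuse N: reuse cached KV chunks of at least N tokens when a new
# prompt's prefix partially matches, instead of reprocessing from scratch.
# --slot-save-path lets slot KV caches be saved/restored via the API.
llama-server -m ./Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
    --cache-reuse 256 --slot-save-path ./kv-cache
```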
If I needed this for Production, I'd hook another similar card or get a 4090 to speed things up.
ArtifartX@reddit
That's awesome, thanks for the info.
sicutdeux@reddit
48GB VRAM rich here. It's been passing my tests since I downloaded it (simple, medium, and hard) with fewer errors than others, and my bench includes JavaScript, Go, Python, C++ and more.
simracerman@reddit (OP)
Wild isn't it!
2x 3090 or different setup?
Lorian0x7@reddit
In my opinion, 3.6 35B is just an overtrained slop machine, capable of regurgitating overused code. It's not capable of any kind of abstraction outside its boundaries.
It keeps getting stuck in loops while filling context with hundreds of thousands of trash tokens and tool calls.
For example, it wasn't capable of creating a wiki from a 300-page document, and every attempt was full of hallucinations. On the other hand, 3.5 27B at Q3 did the work, staying under 60k tokens, with correct information.
JMowery@reddit
This has literally been reported by the Unsloth team and others as an issue with using CUDA 3.12, as it is broken. Also acknowledged by Nvidia, and will be fixed in CUDA 3.13.
To fix, either revert to CUDA 3.11 (Unsloth has posted guides) or use higher quants.
Zarzou@reddit
AFAIK, the issue is with the IQ* models
ag789@reddit
A guess is that overfitting is rather real, given that new models show progressively "better benchmarks", i.e. "fitting" the benchmarks.
Various others have reported it getting into long "thinking". In a prior case, a code refactor went into long, "endless" thinking when I used a stripped-down Qwen 3.5 28B REAP model; it got past that with Qwen 3.5 35B A3B Q4_K_M.
A guess: if things that "worked" earlier in, say, Qwen 3.5 35B now go into long, "endless" thinking, then beyond genuinely "difficult" problems, "overfitting" could be a possibility too.
Comacdo@reddit
I've had this exact same experience... it feels frustrating :'(
JMowery@reddit
This has literally been reported by the Unsloth team and others as an issue with using CUDA 3.12, as it is broken. Also acknowledged by Nvidia, and will be fixed in CUDA 3.13.
To fix, either revert to CUDA 3.11 (Unsloth has posted guides) or use higher quants (but you should probably revert in either case).
Comacdo@reddit
Didn't know, thanks! And may Unsloth be blessed.
Lorian0x7@reddit
I'm experiencing this issue using the Vulkan build of llama.cpp. Does that matter in some way? My assumption is that with Vulkan the CUDA issue doesn't affect me, and any limitations are just the model itself. Am I wrong?
JMowery@reddit
Sounds like you weren't having the same experience then. This is to the original commenter's mention of getting trash/nonsense tokens.
If you're not seeing complete gibberish, you have another issue, which I can't help with because I only run CUDA. Might be worth mentioning in a different post to get better help.
Lorian0x7@reddit
I'm the original commenter. By trash tokens I don't mean completely nonsensical, just looping around a problem without any real resolution.
Awwtifishal@reddit
what quant did you use?
Lorian0x7@reddit
Tried Q3_K_S and Q4_K_M, same issue. Meanwhile, 3.5 27B is rock solid at Q3.
Awwtifishal@reddit
Maybe it needs to keep the thinking sections in context to perform well. Add
--chat-template-kwargs '{"preserve_thinking": true}'
simracerman@reddit (OP)
I know the feeling, but it seems our experiences are wildly different. I agree on the stability issues, but that's a basic task. Something might be off with your samplers? Qwen's latest models are extremely sensitive. I posted my llama.cpp command in a different comment; see if that helps.
Lorian0x7@reddit
My llama.cpp settings are the recommended ones. I think our experiences differ simply because your use case is very common; if you try asking for something slightly out of the ordinary, especially with defined information, like extrapolating data from a document, it goes wild with hallucinations.
Bingo-heeler@reddit
Yo, what's your llama.cpp config? 320 t/s is dope.
simracerman@reddit (OP)
Here:
${llamasvr} -m ${mpath}\Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf --no-mmap -c 128000 -np 1 -ncmoe 22 --chat-template-kwargs "{\"preserve_thinking\":true}" --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence-penalty 0.0 --repeat-penalty 1.0
It offloads to 64GB DDR5 memory.
ga239577@reddit
One thing that stood out to me in the official published benchmarks: 3.6 35B-A3B is nearly on par with 3.5 27B on SWE-bench Pro and SWE-bench Verified (just a bit worse) but absolutely thrashes 27B on Terminal-Bench 2.0.
relmny@reddit
"Yeah, another one of those new shiny model is better than previous SOTA, and I understand why you’d roll your eyes."
Because some people here like to whine about everything, especially if it comes from China.
tecneeq@reddit
No whining from many this time. The model is a great upgrade to 3.5 and in general very pleasant to work with.
relmny@reddit
There are some posts about 3.6 where the most upvoted comment is one whining about "another post like this, every time a new OW model is out"... so I'd say yes, the whining is real and continues (funny thing is, I don't see that when it's about a non-Chinese model).
This_Maintenance_834@reddit
My personal feeling is that Qwen3.6-35B-A3B has difficulty following instructions. In particular, it always does the thing you specifically asked it to hold off on and wait. It's particularly terrible when you're trying to figure out how to tweak an openclaw config: when Qwen3.6-35B-A3B does things its own way, it craps out in the config and openclaw dies during restart. Now I have to fix it by hand. Qwen3.5-27B doesn't regularly do this.
tecneeq@reddit
Could be a matter of options. Here are mine for working with Hermesagent:
havnar-@reddit
For me, Qwen3.6-35B-A3B couldn't solve issues Qwen3.5-35B-A3B-opus-4.6-distilled could fix, and it's slower.
So until the distilled MLX model is available, I'm sticking with 3.5.
simracerman@reddit (OP)
Fair. There have been quite a few issues with recent releases in general. The Unsloth Q5 quant did well for me the first time around for 3.6.
alew3@reddit
Running Qwen3.6 35B-A3B with vLLM on an RTX 5090; it's working great with Claude Code!
Makers7886@reddit
Same, I didn't bother to try it until last night. It's trading blows with the 122B and the 27B at 8-bit on my benches at bf16, and I'm about to compare an 8-bit version now. The 3.5 35B did very poorly on the same suite of benches where the 3.6 just scored on par with both the 122B and 27B. So much so that it makes me want to re-bench 3.5 to make sure I didn't mess up parameters/settings, because it's such a massive gap. Like the difference between a chat-only bot for fun and getting work done.
simracerman@reddit (OP)
You have your numbers right. 3.5-35B was such a letdown. I'm really excited to see if they'll release a 3.6-27B.