Dense vs. MoE gap is shrinking fast with the 3.6-27B release
Posted by Usual-Carrot6352@reddit | LocalLLaMA | 82 comments
27B Dense vs. 35B-A3B MoE:
- Dense still holds the crown: It still wins out on most tasks overall.
- The gap is closing: In 7 out of 10 benchmarks, the MoE model is quietly creeping up and closing the distance.
- Coding is getting a massive boost: MoE is making serious strides here. For example, the dense model's lead on the SWE-bench Multilingual benchmark dropped from +9.0 down to just +4.1.
- The one weird outlier: Terminal-Bench 2.0. For whatever reason, the dense model absolutely pulled ahead here, widening its lead from +1.1 to a massive +7.8.
TL;DR: Dense is still technically better, but MoE is catching up fast—especially for coding. If you're running on 24GB VRAM and want massive context windows, the trade-off for MoE is looking better than ever right now.
Thoughts?
Anyone tested the 256k context on the MoE yet?
More details in the link: https://x.com/i/status/2047004358500614152
mindwip@reddit
I think it's better to compare the 122b to the 27b.
At the normal high end, you either have a nice 24gb to 32gb gpu or an apple/strix halo 128gb+.
Can't wait to compare 3.6 27b to 3.6 122b!
NNN_Throwaway2@reddit
There will be no 3.6 122b.
mindwip@reddit
How do you know that?
NNN_Throwaway2@reddit
The blog post for the 3.6 27b implied that they are done releasing models in the 3.6 family.
HadHands@reddit
Where did they imply that - I just read it https://qwen.ai/blog?id=qwen3.6-27b and it even ends with "Stay tuned for more from the Qwen team!"
Different article maybe?
relmny@reddit
Nowhere, or the same place from where some people here, just a few weeks ago, were claiming that qwen was done releasing OW models. And they were pretty sure about that.
NNN_Throwaway2@reddit
In other words, they view the 3.6 family as "comprehensive," which essentially means complete. "Range" also implies an even distribution without gaps that need to be filled. "Now offers" implies that these qualities weren't satisfied prior to the 27b release.
Compare this with what they said in the 35b blog:
A very unambiguous statement of intent to release more 3.6 models.
Again, re-stating that there will be more 3.6 models.
I suppose you could argue that the 27b blog post doesn't explicitly rule out more 3.6 model releases, but the shift in language is absolutely there.
If they were planning to release more 3.6 models, you'd think they would say so. Instead, their phrasing very much implies the opposite.
mindwip@reddit
Thanks, sadness
Expensive-Paint-9490@reddit
Qwen is following its plans to abandon FOSS contributions quite fast.
sn2006gy@reddit
yeah, i really want a qwen3.6-coder-next 80b
paperbenni@reddit
Isn't qwen 3.5 basically qwen-next but in different sizes?
sn2006gy@reddit
i mean, it's qwen... but 80b is drastically different than 35b.. by like 45b.
AppleBottmBeans@reddit
Big isn't always better. Sometimes it is, and it's why my ex left me. But not always
sn2006gy@reddit
It's MoE, more experts for more languages beyond being good at Python for example.
kurtcop101@reddit
That's not how the experts work!
sn2006gy@reddit
i oversimplified it, but 80b definitely does golang better than 35b, where both do python fairly well; 80b having more experts helps out. Specialization is an emergent property, and you can train an expert in isolation and add experts, or double them, add noise to the new experts, train, and get new experts based on the training set.
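The "double the experts and add noise" idea this commenter describes can be sketched as a toy, with each expert reduced to a plain weight vector (the function name and shapes are illustrative, not any real MoE implementation):

```python
import random

def widen_experts(experts, copies=2, noise=0.01, seed=0):
    """Toy sketch of expert expansion: keep each original expert and
    add perturbed copies, so the copies can diverge (specialize)
    under further training."""
    rng = random.Random(seed)
    new_experts = []
    for w in experts:
        new_experts.append(list(w))  # keep the original weights
        for _ in range(copies - 1):
            # noisy copy: same weights plus small Gaussian perturbation
            new_experts.append([x + rng.gauss(0.0, noise) for x in w])
    return new_experts

# two tiny "experts", each just a weight vector here
experts = [[0.5, -0.2], [0.1, 0.9]]
widened = widen_experts(experts)
print(len(widened))  # 4 experts after doubling
```

In real systems the router also has to be expanded to address the new experts; this sketch only covers the weight duplication step.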
AvocadoArray@reddit
Honestly, a 122b fine tune would probably perform better and be cheaper to train.
Blues520@reddit
I too, also want this as well
ionizing@reddit
122B club member here, hoping
ElementNumber6@reddit
All this means is that we need better tests.
Embarrassed_Adagio28@reddit
After running my own limited coding and agentic coding tests, I honestly can't tell the difference in quality between 3.6 35b q5 and 3.6 27b q5, but the 35b is 3x faster. The MoE model is so good and fast that I just canceled my Claude Pro subscription because I am getting better results than Sonnet.
Usual-Carrot6352@reddit (OP)
happy to hear that you saved a lot of money. Here's a Pelican for you from today's Qwen3.6-27B-GGUF:Q4_K_M
uutnt@reddit
No doubt model providers are benchmaxing on this.
Fantastic-Balance454@reddit
They definitely are, I got pretty much the exact same SVG, bird positioning is the same, clouds and everything. GLM 5.1 has the exact same layout as well, tho it did add nice gradients and animations to it.
Internal_Werewolf_48@reddit
I don't think so, or at least not this specific pelican on a bike prompt. Ask it weird riffs on this idea (lizards on skateboards, pigs on a pogostick, a cheeseburger taking homework notes, a hotdog army marching in formation, use your imagination) and it's dramatically better at anything you can think of than models were capable of 6 months ago.
krzyk@reddit
So Claude forgot about this.
Sir-Draco@reddit
I think this is one where, even if they are, the model will gain a bit of generalizable spatial reasoning, even if just a little. So not too mad about it
havnar-@reddit
My qwen 3.5 and 3.6 both 35b a3b mlx drew identical pelicans
DOAMOD@reddit
The 27 is actually quite a bit better. I've been working with it for several hours, and the difference is noticeable in something you realize very quickly: the 27 doesn't have to exert much effort; it works well and makes almost no mistakes, while the 3.6-A3 has to struggle, consuming an overwhelming amount of context and making many more simple errors. They're both truly incredible, and I love them, but clearly the a3 reaches its level through a lot of effort, and that's no small feat.
IrisColt@reddit
Absolutely this... The 27B's thought process operates with unrelenting, confident energy, heh
lemon07r@reddit
Sonnet 4.6 and opus 4.7 are both garbage, so not high bars to clear sadly. Not sure what happened; they had good models, then decided to start shafting their users. At least you found better alternatives. I like kimi k2.6 but I can't run it on my pc, and GPT is also still good, but those all cost money so I haven't really found a way to save yet.
ionizing@reddit
I'm noticing 3.6-27B seems to understand the system prompts a bit better vs 3.6-35B. I usually use 122B for real work, but the 27B figured out parallel tool execution, which is mentioned in the prompts, whereas the 35B likes to send tool calls one at a time.

The screenshot shows 27B making batched tool calls (which are executed in parallel and returned to the llm as one return); you can see it by the timestamps. If this were 35B, it would send singular tool calls and you would see different timestamps for each call.

So far that is the most interesting observation I have. I need to put it to some real tests next, but it's a promising start: the 27B can 'reason' enough to understand when to batch tool calls, whereas the moe tends to ignore that most of the time. But yeah, I like the moe typically; I may need to simplify the note about parallel tool calls in the system prompts so the moe makes more use of it.
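A minimal sketch of what "batched tool calls executed in parallel and returned as one result" could look like on the harness side. The tool names and dispatch table here are hypothetical stand-ins, not this commenter's actual setup:

```python
from concurrent.futures import ThreadPoolExecutor

# hypothetical tools; a real harness would dispatch to file I/O, shell, etc.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "grep": lambda pattern: f"<matches for {pattern}>",
}

def run_batch(calls):
    """Execute a batch of tool calls concurrently and return one
    combined result, keyed by call index, to hand back to the model."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[c["tool"]], c["arg"]) for c in calls]
        return {i: f.result() for i, f in enumerate(futures)}

batch = [
    {"tool": "read_file", "arg": "main.py"},
    {"tool": "grep", "arg": "TODO"},
]
print(run_batch(batch))
```

The point of batching is exactly what the timestamps show: one round trip to the model covers several tool executions, instead of one round trip per call.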
Mistercheese@reddit
I'm curious if you've tested them at longer-horizon tasks and larger context sizes like 100k. Anecdotally I heard this is where the dense pulls ahead, and I'm curious if that's really true in your experience too.
eclipsegum@reddit
Fantastic news for Mac owners. Need to get one now before everyone decides to get one
Mr_Hyper_Focus@reddit
lol. Too late buddy.
eclipsegum@reddit
Is it too late to casually pick up a 512 at the Apple Store?
IrisColt@reddit
it's ogre
paryska99@reddit
The 512 ones were discontinued from what I recall
eclipsegum@reddit
We didn’t know how good we had it in the good old days
Mr_Hyper_Focus@reddit
Dude I was gonna pull the trigger on the $400 Mac mini. Those days are gone
Cold_Tree190@reddit
Yeah they were quietly pulled like a month or two ago, and then recently I think they pulled the 256. My guess is they want to save them for the M5’s that are rumored to have been pushed back to the Fall, but idk
WeGoToMars7@reddit
People got their orders of 256 GB one cancelled, so it might be on the way out too...
IrisColt@reddit
heh
cmclewin@reddit
Could you explain why this is good for Mac owners? My initial assumption was that this was a good sign for high RAM / "low" VRAM setups but evidently not haha
Evening_Ad6637@reddit
Oh yes, that’s exactly it. Macs aren’t quite comparable, since they use Unified RAM, but for simplicity’s sake, you can think of it as very fast RAM (which is essentially what it is). So the bandwidth is there, but unlike NVIDIA GPUs, for example, they lack computational power.
Prompt processing therefore remains a bottleneck on Macs, which is why MoEs are more attractive to Mac users.
eclipsegum@reddit
LLM inference is memory-bandwidth bound during token generation. The formula is simple:
tokens/sec ≈ memory bandwidth (GB/s) ÷ model size (GB)
So a 70B Q4 model (~40 GB) on an M4 Max (~546 GB/s): 546/40 ≈ 13-14 tok/s theoretical max (real-world: 11-12 tok/s).
Massive headroom: unified memory lets you load models that won't fit on consumer GPUs (70B+ on 64-128GB Macs vs. the 24GB VRAM limit of an RTX 4090).
MoE speeds things up via sparse activation. On Apple Silicon:
- The GPU has decent compute but can't match high-end NVIDIA GPUs (H100, A100).
- MoE helps here since it needs fewer FLOPs per token during generation.
Basically, Apple Silicon's unified memory provides exceptional bandwidth, and bandwidth is the primary bottleneck for LLM inference. Token generation still remains bandwidth-limited, capping speeds at ~15 tok/s for dense 70B models. Mixture of Experts architectures dramatically improve this by activating only 2-10% of parameters per token, effectively reducing the bandwidth requirement and enabling faster inference, or allowing larger models to run at the same speed.
NairbHna@reddit
Never thought I'd see the day "moe gap" would be used in an AI setting
Mart-McUH@reddit
I do not know those coding/agentic benches, as that is irrelevant to me. But the main advantage of dense was always intelligence and long-context understanding of subtleties/relations etc. I think neither of these benchmarks tests for that. Whenever I try a small-active-params MoE it is still the same story: in a long multi-turn chat it just gets confused and inconsistent quickly.
IMO the gap is real, and you can't really remove it as long as you improve both dense and MoE. Dense is simply mathematically better; MoE is just an attempt to approximate it as well as possible with less compute, but it is far from lossless.
flavio_geo@reddit
Important to consider how MoE vs Dense respond to quantization, which is not the same; MoE models are more sensitive to quantization
TechySpecky@reddit
Fp8 should be fine though right?
MDSExpro@reddit
That's my findings. 120b at int4 was failing on coding, but on int8 it nailed it in one go.
flavio_geo@reddit
Yes.
Also, that is where special quants like unsloth UD make a difference. They preserve certain weights at higher precision
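A toy illustration of why lower bit-widths hurt: simple round-to-nearest quantization with one scale per tensor loses more precision at 4 bits than at 8. This is a deliberately simplified scheme, not what GGUF K-quants or unsloth UD actually do:

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization with a single
    per-tensor scale, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def mean_sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.731, -0.248, 0.102, 0.893, -0.517, 0.064]
err4 = mean_sq_error(weights, quantize_rtn(weights, 4))
err8 = mean_sq_error(weights, quantize_rtn(weights, 8))
print(err4 > err8)  # 4-bit loses more precision
```

Mixed-precision quants exploit exactly this: spend more bits on the layers where rounding error hurts output quality most, fewer elsewhere.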
AeroelasticCowboy@reddit
Doesn't seem that bad to me? Though I don't/won't run anything less than Q4KM on any model. I was running Q6 on this model, but after seeing this graph I moved to Q5KM and increased the context window further, to 180k
Healthy-Nebula-3603@reddit
The MoE version has a big problem with looping and following instructions.
Dense is much better at following instructions and doesn't loop (and even if it starts looping, it can recognize it and get back to normal operation, whereas the MoE can't do that)
Lesser-than@reddit
I can't run the 27b, but I can say I have never had any looping or instruction-following problems.
SadBBTumblrPizza@reddit
Did you use the "preserve thinking" chat template kwargs?
Xamanthas@reddit
Slop written comment and self promotion. Gtfo
Shifty_13@reddit
Going just as I predicted in my GPU post.
MoE is the future.
Another prediction of mine was low-parameter-count models closing in on big models in performance.
So big VRAM pools won't be needed that much.
ItilityMSP@reddit
MoE is not the future. It's more difficult to fine-tune than dense models.
rorowhat@reddit
Is there an easy way to run all these benchmarks?
Usual-Carrot6352@reddit (OP)
Here's a Q5 that fits fully in 24GB VRAM with 65K context: https://huggingface.co/spaces/KyleHessling1/qwen36-eval
ItilityMSP@reddit
Well, I think the instructions said you should have 124k of KV cache or you will hamper reasoning.
sleepy_quant@reddit
Running the 35B-A3B Q8 fp16 on M1 Max 64GB at ~26 tok/s, haven't pulled the 27B dense yet. Anyone A/B'd both on Apple Silicon? Curious where MoE's memory edge stops being worth the quality trade. On flavio's quant sensitivity point, Q8 feels fine for my day-to-day but I haven't run coding-heavy benches. Anyone know a rough floor where MoE coding degrades faster than dense at same bits? Would love a rule of thumb
ambient_temp_xeno@reddit
But the dense uses less vram, and is less damaged by quanting too.
defensivedig0@reddit
If you have any system ram, you can generally offload quite a lot of the MoE onto system ram while still getting substantially faster speeds than the dense model. So you can run at a higher quant and faster speeds.
ambient_temp_xeno@reddit
I guess the massive context window is what I didn't absorb. I forgot just how gigantic that can get.
CountlessFlies@reddit
Right. I’m able to run the 35b-a3b with full 256k context on my 24g GPU. The 27b runs out of memory at around 192k context
Edenar@reddit
The memory usage for context is much higher with the dense one (almost 10x!), so I think the 35B MoE is a better choice for smaller memory pools unless you need very low context.
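A rough sketch of where a context-memory gap like that comes from: KV cache per token scales with layers × KV heads × head dim, and models differ a lot in how aggressively they share KV heads. The architecture numbers below are made-up placeholders for illustration, not the actual Qwen configs:

```python
def kv_cache_gb(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """KV cache size: K and V (factor of 2) per layer per token,
    fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per / 1e9

CTX = 256_000

# hypothetical configs: a dense model with 8 KV heads vs an MoE
# with more aggressive grouped-query attention (2 KV heads)
dense_kv = kv_cache_gb(CTX, n_layers=48, n_kv_heads=8, head_dim=128)
moe_kv = kv_cache_gb(CTX, n_layers=48, n_kv_heads=2, head_dim=128)

print(f"dense ~{dense_kv:.1f} GB vs MoE ~{moe_kv:.1f} GB at 256k ctx")
```

With these placeholder numbers the dense model needs 4x the KV memory at the same context; quantizing the KV cache or using sliding-window layers shrinks both further.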
ambient_temp_xeno@reddit
Is that context or just the checkpoints saved to system ram though? The context vram use seemed very low for me on 3.5 27b.
NNN_Throwaway2@reddit
I tried the 35b when it released and had major issues getting it to understand and follow instructions. Both at full precision. I stick with the 27b.
AvidCyclist250@reddit
when mow is the real hero. with a harness
FissionFusion@reddit
I'd really like to see something in the range of a 30B-A10B MoE. Seems like such a waste when MoEs only use <10% of their total params.
Fantastic-Concern173@reddit
for coding with full context, moe is so much better than dense, especially for 1 gpu
mr_zerolith@reddit
Dense models can be amazing. Before I moved up to Step 3.5 Flash, I used to run SEED OSS 36B, and that thing was a banger for coding even at IQ4_XS size. If it didn't lack breadth in its knowledge base, I'd still be using it
RDSF-SD@reddit
I only ever used it with 256k context. No problem at all.
Accomplished_Ad9530@reddit
Did you quant the models for your test?
def_not_jose@reddit
What kind of tasks though? One-shotting flappy bird is one thing, working with >100k context of spaghetti code is whole other thing
stormy1one@reddit
Exactly - this is why hyping benchmarks only goes so far. People need to use both, and then make a decision. Personally, I am sticking with 27B for coding. 35B-A3B spends a bit too much time recovering from mistakes it makes, which negates the speed up IMO. Running Qwen’s own FP8 variants to compare, no KV cache quant.
Alarming-Ad8154@reddit
Differences in scores aren't really linear: the difference between 40% correct and 50% correct isn't the same as between 80% correct and 90% correct in terms of ability. You'd want to model the probability of getting questions correct using something like a logistic curve, which is frequently done with human test scores.
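One way to see this point: map accuracies through a logit transform, as item-response-theory-style analyses of test scores do, and equal percentage gaps stop being equal:

```python
import math

def logit(p):
    """Log-odds: the natural ability scale in logistic test models."""
    return math.log(p / (1 - p))

# the same 10-point accuracy gain means a bigger ability gain
# the closer you are to the ceiling
low_gap = logit(0.50) - logit(0.40)   # ~0.41 logits
high_gap = logit(0.90) - logit(0.80)  # ~0.81 logits

print(round(low_gap, 2), round(high_gap, 2))
```

So a benchmark gap that "shrinks" from +9.0 to +4.1 in raw percentage points can mean more or less than it appears, depending on where on the curve both models sit.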
Healthy-Nebula-3603@reddit
Where is 3.6 dense ?
RetiredApostle@reddit
Raw scores per bench would be useful (for that rare case when someone doesn't remember them all).
LegacyRemaster@reddit
Interesting analysis. The MoE architecture is becoming increasingly efficient!