Dense vs. MoE gap is shrinking fast with the 3.6-27B release
Posted by Usual-Carrot6352@reddit | LocalLLaMA | 82 comments
27B Dense vs. 35B-A3B MoE:
- Dense still holds the crown: It still wins out on most tasks overall.
- The gap is closing: In 7 out of 10 benchmarks, the MoE model is quietly creeping up and closing the distance.
- Coding is getting a massive boost: MoE is making serious strides here. For example, the dense model's lead on the SWE-bench Multilingual benchmark dropped from +9.0 down to just +4.1.
- The one weird outlier: Terminal-Bench 2.0. For whatever reason, the dense model absolutely pulled ahead here, widening its lead from +1.1 to a massive +7.8.
TL;DR: Dense is still technically better, but MoE is catching up fast—especially for coding. If you're running on 24GB VRAM and want massive context windows, the trade-off for MoE is looking better than ever right now.
Thoughts?
Anyone tested the 256k context on the MoE yet?
More details in the link: https://x.com/i/status/2047004358500614152
mindwip@reddit
I think it's better to compare the 122b to the 27b.
At the normal high end, you either have a nice 24gb to 32gb gpu or an apple/strix halo 128gb+.
Can't wait to compare 3.6 27b to 3.6 122b!
NNN_Throwaway2@reddit
There will be no 3.6 122b.
mindwip@reddit
How do you know that?
NNN_Throwaway2@reddit
The blog post for the 3.6 27b implied that they are done releasing models in the 3.6 family.
HadHands@reddit
Where did they imply that - I just read it https://qwen.ai/blog?id=qwen3.6-27b and it even ends with "Stay tuned for more from the Qwen team!"
Different article maybe?
relmny@reddit
Nowhere, or the same place from where some people here, just a few weeks ago, were claiming that qwen was done releasing OW models. And they were pretty sure about that.
NNN_Throwaway2@reddit
In other words, they view the 3.6 family as "comprehensive," which essentially means complete. "Range" also implies an even distribution without gaps that need to be filled. "Now offers" implies that these qualities weren't satisfied prior to the 27b release.
Compare this with what they said in the 35b blog:
A very unambiguous statement of intent to release more 3.6 models.
Again, re-stating that there will be more 3.6 models.
I suppose you could argue that the 27b blog post doesn't explicitly rule out more 3.6 model releases, but the shift in language is absolutely there.
If they were planning to release more 3.6 models, you'd think they would say so. Instead, their phrasing very much implies the opposite.
mindwip@reddit
Thanks, sadness
Expensive-Paint-9490@reddit
Qwen is following its plans to abandon FOSS contributions quite fast.
sn2006gy@reddit
yeah, i really want a qwen3.6-coder-next 80b
paperbenni@reddit
Isn't qwen 3.5 basically qwen-next but in different sizes?
sn2006gy@reddit
i mean, it's qwen... but 80b is drastically different than 35b.. by like 45b.
AppleBottmBeans@reddit
Big isn't always better. Sometimes it is, and it's why my ex left me. But not always
sn2006gy@reddit
It's MoE, more experts for more languages beyond being good at Python for example.
kurtcop101@reddit
That's not how the experts work!
sn2006gy@reddit
i oversimplified it, but 80b definitely does golang better than 35b, where both do python fairly well; 80b having more experts helps out. Specialization is an emergent property, and you can train an expert in isolation and add experts, or double them, add noise to the new experts, train, and get new experts based on the training set.
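The "double the experts and add noise" idea this commenter describes can be sketched as a toy, with each expert reduced to a plain weight vector (the function name and shapes are illustrative, not any real MoE implementation):

```python
import random

def widen_experts(experts, copies=2, noise=0.01, seed=0):
    """Toy sketch of expert expansion: keep each original expert and
    add perturbed copies, so the copies can diverge (specialize)
    under further training."""
    rng = random.Random(seed)
    new_experts = []
    for w in experts:
        new_experts.append(list(w))  # keep the original weights
        for _ in range(copies - 1):
            # noisy copy: same weights plus small Gaussian perturbation
            new_experts.append([x + rng.gauss(0.0, noise) for x in w])
    return new_experts

# two tiny "experts", each just a weight vector here
experts = [[0.5, -0.2], [0.1, 0.9]]
widened = widen_experts(experts)
print(len(widened))  # 4 experts after doubling
```

In real systems the router also has to be expanded to address the new experts; this sketch only covers the weight duplication step.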
AvocadoArray@reddit
Honestly, a 122b fine tune would probably perform better and be cheaper to train.
Blues520@reddit
I too, also want this as well
ionizing@reddit
122B club member here, hoping
ElementNumber6@reddit
All this means is that we need better tests.
Embarrassed_Adagio28@reddit
After running my own limited coding and agentic coding tests, I honestly can't tell the difference in quality between 3.6 35b q5 and 3.6 27b q5, but the 35b is 3x faster. The MoE model is so good and fast that I just canceled my Claude Pro subscription because I am getting better results than Sonnet.
Usual-Carrot6352@reddit (OP)
happy to hear that you saved a lot of money. Here's a Pelican for you from today's Qwen3.6-27B-GGUF:Q4_K_M
uutnt@reddit
No doubt model providers are benchmaxing on this.
Fantastic-Balance454@reddit
They definitely are, I got pretty much the exact same SVG, bird positioning is the same, clouds and everything. GLM 5.1 has the exact same layout as well, tho it did add nice gradients and animations to it.
Internal_Werewolf_48@reddit
I don't think so, or at least not this specific pelican on a bike prompt. Ask it weird riffs on this idea (lizards on skateboards, pigs on a pogostick, a cheeseburger taking homework notes, a hotdog army marching in formation, use your imagination) and it's dramatically better at anything you can think of than models were capable of 6 months ago.
krzyk@reddit
So Claude forgot about this.
Sir-Draco@reddit
I think this is one where, even if they are, the model will gain a bit of generalizable spatial reasoning, even if just a little. So not too mad about it
havnar-@reddit
My qwen 3.5 and 3.6 both 35b a3b mlx drew identical pelicans
DOAMOD@reddit
The 27 is actually quite a bit better. I've been working with it for several hours, and the difference is noticeable in something you realize very quickly: the 27 doesn't have to exert much effort; it works well and makes almost no mistakes, while the 3.6-A3 has to struggle, consuming an overwhelming amount of context and making many more simple errors. They're both truly incredible, and I love them, but clearly the a3 reaches its level through a lot of effort, and that's no small feat.
IrisColt@reddit
Absolutely this... The 27B's thought process operates with unrelenting, confident energy, heh
lemon07r@reddit
Sonnet 4.6 and opus 4.7 are both garbage, so not high bars to clear sadly. Not sure what happened; they had good models, then decided to start shafting their users. At least you found better alternatives. I like kimi k2.6 but I can't run it on my pc, and GPT is also still good, but those all cost money so I haven't really found a way to save yet.
ionizing@reddit
I'm noticing 3.6-27B seems to understand the system prompts a bit better vs 3.6-35B. I usually use 122B for real work, but the 27B figured out parallel tool execution, which is mentioned in the prompts, whereas the 35B likes to send tool calls one at a time.

The screenshot shows 27B making batched tool calls (which are executed in parallel and returned to the llm as one return); you can see it by the timestamps. If this were 35B, it would send singular tool calls and you would see different timestamps for each call.

So far that is the most interesting observation I have. I need to put it to some real tests next, but it's a promising start: the 27B can 'reason' enough to understand when to batch tool calls, whereas the moe tends to ignore that most of the time. But yeah, I like the moe typically; I may need to simplify the note about parallel tool calls in the system prompts so the moe makes more use of it.
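A minimal sketch of what "batched tool calls executed in parallel and returned as one result" could look like on the harness side. The tool names and dispatch table here are hypothetical stand-ins, not this commenter's actual setup:

```python
from concurrent.futures import ThreadPoolExecutor

# hypothetical tools; a real harness would dispatch to file I/O, shell, etc.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "grep": lambda pattern: f"<matches for {pattern}>",
}

def run_batch(calls):
    """Execute a batch of tool calls concurrently and return one
    combined result, keyed by call index, to hand back to the model."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(TOOLS[c["tool"]], c["arg"]) for c in calls]
        return {i: f.result() for i, f in enumerate(futures)}

batch = [
    {"tool": "read_file", "arg": "main.py"},
    {"tool": "grep", "arg": "TODO"},
]
print(run_batch(batch))
```

The point of batching is exactly what the timestamps show: one round trip to the model covers several tool executions, instead of one round trip per call.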
Mistercheese@reddit
I'm curious if you've tested them at longer-horizon tasks and larger context sizes like 100k. Anecdotally I heard this is where the dense pulls ahead, and I'm curious if that's really true in your experience too.
eclipsegum@reddit
Fantastic news for Mac owners. Need to get one now before everyone decides to get one
Mr_Hyper_Focus@reddit
lol. Too late buddy.
eclipsegum@reddit
Is it too late to casually pick up a 512 at the Apple Store?
IrisColt@reddit
it's ogre
paryska99@reddit
The 512 ones were discontinued from what I recall
eclipsegum@reddit
We didn’t know how good we had it in the good old days
Mr_Hyper_Focus@reddit
Dude I was gonna pull the trigger on the $400 Mac mini. Those days are gone
Cold_Tree190@reddit
Yeah they were quietly pulled like a month or two ago, and then recently I think they pulled the 256. My guess is they want to save them for the M5’s that are rumored to have been pushed back to the Fall, but idk
WeGoToMars7@reddit
People got their orders of 256 GB one cancelled, so it might be on the way out too...
IrisColt@reddit
heh
cmclewin@reddit
Could you explain why this is good for Mac owners? My initial assumption was that this was a good sign for high RAM / "low" VRAM setups but evidently not haha
Evening_Ad6637@reddit
Oh yes, that’s exactly it. Macs aren’t quite comparable, since they use Unified RAM, but for simplicity’s sake, you can think of it as very fast RAM (which is essentially what it is). So the bandwidth is there, but unlike NVIDIA GPUs, for example, they lack computational power.
Prompt processing therefore remains a bottleneck on Macs, which is why MoEs are more attractive to Mac users.
eclipsegum@reddit
LLM inference is memory-bandwidth bound during token generation. The formula is simple:
tokens/sec ≈ memory bandwidth (GB/s) ÷ model size (GB)
So a 70B Q4 model (~40 GB) on an M4 Max (~546 GB/s): 546/40 ≈ 13-14 tok/s theoretical max (real-world: 11-12 tok/s).
Massive headroom: unified memory lets you load models that won't fit on consumer GPUs (70B+ on 64-128GB Macs vs. the 24GB VRAM limit of an RTX 4090).
MoE speeds things up via sparse activation. On Apple Silicon:
- The GPU has decent compute but can't match high-end NVIDIA GPUs (H100, A100).
- MoE helps here since it needs fewer FLOPs per token during generation.
Basically, Apple Silicon's unified memory provides exceptional bandwidth, and bandwidth is the primary bottleneck for LLM inference. Token generation still remains bandwidth-limited, capping speeds at ~15 tok/s for dense 70B models. Mixture of Experts architectures dramatically improve this by activating only 2-10% of parameters per token, effectively reducing the bandwidth requirement and enabling faster inference, or allowing larger models to run at the same speed.
NairbHna@reddit
Never thought I'd see the day "moe gap" would be used in an AI setting
Mart-McUH@reddit
I do not know those coding/agentic benches, as that is irrelevant to me. But the main advantage of dense was always intelligence and long-context understanding of subtleties/relations etc. I think neither of these benchmarks tests for that. Whenever I try a small-active-params MoE it is still the same story: in a long multi-turn chat it just gets confused and inconsistent quickly.
IMO the gap is real, and you can't really remove it as long as you improve both dense and MoE. Dense is simply mathematically better; MoE is just an attempt to approximate it as well as possible with less compute, but it is far from lossless.
flavio_geo@reddit
Important to consider how MoE vs Dense respond to quantization, which is not the same; MoE models are more sensitive to quantization
TechySpecky@reddit
Fp8 should be fine though right?
MDSExpro@reddit
That's my findings. 120b at int4 was failing on coding, but on int8 it nailed it in one go.
flavio_geo@reddit
Yes.
Also, that is where special quants like unsloth UD make a difference. They preserve certain weights at higher precision
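A toy illustration of why lower bit-widths hurt: simple round-to-nearest quantization with one scale per tensor loses more precision at 4 bits than at 8. This is a deliberately simplified scheme, not what GGUF K-quants or unsloth UD actually do:

```python
def quantize_rtn(weights, bits):
    """Symmetric round-to-nearest quantization with a single
    per-tensor scale, then dequantize back to floats."""
    qmax = 2 ** (bits - 1) - 1           # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def mean_sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.731, -0.248, 0.102, 0.893, -0.517, 0.064]
err4 = mean_sq_error(weights, quantize_rtn(weights, 4))
err8 = mean_sq_error(weights, quantize_rtn(weights, 8))
print(err4 > err8)  # 4-bit loses more precision
```

Mixed-precision quants exploit exactly this: spend more bits on the layers where rounding error hurts output quality most, fewer elsewhere.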
AeroelasticCowboy@reddit
Doesn't seem that bad to me? Though I don't/won't run anything less than Q4KM on any model. I was running Q6 on this model, but after seeing this graph I moved to Q5KM and increased the context window further, to 180k
Healthy-Nebula-3603@reddit
The MoE version has a big problem with looping and following instructions.
Dense is much better at following instructions and doesn't loop (and even if it starts looping, it can recognize it and get back to normal operation, whereas the MoE can't do that)
Lesser-than@reddit
I can't run the 27b, but I can say I have never had any looping or instruction-following problems.
SadBBTumblrPizza@reddit
Did you use the "preserve thinking" chat template kwargs?
Xamanthas@reddit
Slop written comment and self promotion. Gtfo
Shifty_13@reddit
Going just as I predicted in my GPU post.
MoE is the future.
Another prediction of mine was low-parameter-count models closing in on big models in performance.
So big VRAM pools won't be needed that much.
ItilityMSP@reddit
MoE is not the future. It's more difficult to fine-tune than dense models.
rorowhat@reddit
Is there an easy way to run all these benchmarks?
Usual-Carrot6352@reddit (OP)
Here's a Q5 that fits fully in 24GB VRAM with 65K context: https://huggingface.co/spaces/KyleHessling1/qwen36-eval
ItilityMSP@reddit
Well, I think the instructions said you should have 124k of KV cache or you will hamper reasoning.
sleepy_quant@reddit
Running the 35B-A3B Q8 fp16 on M1 Max 64GB at ~26 tok/s, haven't pulled the 27B dense yet. Anyone A/B'd both on Apple Silicon? Curious where MoE's memory edge stops being worth the quality trade. On flavio's quant sensitivity point, Q8 feels fine for my day-to-day but I haven't run coding-heavy benches. Anyone know a rough floor where MoE coding degrades faster than dense at same bits? Would love a rule of thumb
ambient_temp_xeno@reddit
But the dense uses less vram, and is less damaged by quanting too.
defensivedig0@reddit
If you have any system ram, you can generally offload quite a lot of the MoE onto system ram while still getting substantially faster speeds than the dense model. So you can run at a higher quant and faster speeds.
ambient_temp_xeno@reddit
I guess the massive context window is what I didn't absorb. I forgot just how gigantic that can get.
CountlessFlies@reddit
Right. I’m able to run the 35b-a3b with full 256k context on my 24g GPU. The 27b runs out of memory at around 192k context
Edenar@reddit
The memory usage for context is much higher with the dense one (almost 10x!), so I think the 35B MoE is a better choice for smaller memory pools unless you need very low context.
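A rough sketch of where a context-memory gap like that comes from: KV cache per token scales with layers × KV heads × head dim, and models differ a lot in how aggressively they share KV heads. The architecture numbers below are made-up placeholders for illustration, not the actual Qwen configs:

```python
def kv_cache_gb(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    """KV cache size: K and V (factor of 2) per layer per token,
    fp16 (2 bytes) by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per / 1e9

CTX = 256_000

# hypothetical configs: a dense model with 8 KV heads vs an MoE
# with more aggressive grouped-query attention (2 KV heads)
dense_kv = kv_cache_gb(CTX, n_layers=48, n_kv_heads=8, head_dim=128)
moe_kv = kv_cache_gb(CTX, n_layers=48, n_kv_heads=2, head_dim=128)

print(f"dense ~{dense_kv:.1f} GB vs MoE ~{moe_kv:.1f} GB at 256k ctx")
```

With these placeholder numbers the dense model needs 4x the KV memory at the same context; quantizing the KV cache or using sliding-window layers shrinks both further.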
ambient_temp_xeno@reddit
Is that context or just the checkpoints saved to system ram though? The context vram use seemed very low for me on 3.5 27b.
NNN_Throwaway2@reddit
I tried the 35b when it released and had major issues getting it to understand and follow instructions. Both at full precision. I stick with the 27b.
AvidCyclist250@reddit
when mow is the real hero. with a harness
FissionFusion@reddit
I'd really like to see something in the range of a 30B-A10B MoE. Seems like such a waste when MoEs only use <10% of their total params.
Fantastic-Concern173@reddit
for coding with full context, moe is so much better than dense, especially for 1 gpu
mr_zerolith@reddit
Dense models can be amazing. Before I moved up to Step 3.5 Flash, I used to run SEED OSS 36B, and that thing was a banger for coding even at IQ4_XS size. If it didn't lack breadth in its knowledge base, I'd still be using it
RDSF-SD@reddit
I only ever used it with 256k context. No problem at all.
Accomplished_Ad9530@reddit
Did you quant the models for your test?
def_not_jose@reddit
What kind of tasks though? One-shotting flappy bird is one thing, working with >100k context of spaghetti code is whole other thing
stormy1one@reddit
Exactly - this is why hyping benchmarks only goes so far. People need to use both, and then make a decision. Personally, I am sticking with 27B for coding. 35B-A3B spends a bit too much time recovering from mistakes it makes, which negates the speed up IMO. Running Qwen’s own FP8 variants to compare, no KV cache quant.
Alarming-Ad8154@reddit
Differences in scores aren't really linear: the difference between 40% correct and 50% correct isn't the same as between 80% correct and 90% correct in terms of ability. You'd want to model the probability of getting questions correct using something like a logistic curve, which is frequently done with human test scores.
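One way to see this point: map accuracies through a logit transform, as item-response-theory-style analyses of test scores do, and equal percentage gaps stop being equal:

```python
import math

def logit(p):
    """Log-odds: the natural ability scale in logistic test models."""
    return math.log(p / (1 - p))

# the same 10-point accuracy gain means a bigger ability gain
# the closer you are to the ceiling
low_gap = logit(0.50) - logit(0.40)   # ~0.41 logits
high_gap = logit(0.90) - logit(0.80)  # ~0.81 logits

print(round(low_gap, 2), round(high_gap, 2))
```

So a benchmark gap that "shrinks" from +9.0 to +4.1 in raw percentage points can mean more or less than it appears, depending on where on the curve both models sit.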
Healthy-Nebula-3603@reddit
Where is 3.6 dense ?
RetiredApostle@reddit
Raw scores per bench would be useful (for that rare case when someone doesn't remember them all).
LegacyRemaster@reddit
Interesting analysis. The MoE architecture is becoming increasingly efficient!