If Dense Models are better for Coding, why are Qwen-Coders MoE?
Posted by LocalLLaMa_reader@reddit | LocalLLaMA | View on Reddit | 43 comments
Hi all,
have been reading here for over two years and finally have a question I can't find an answer to.
Qwen 3.5 27B and Gemma 4 31B have been the latest examples of dense models performing much more accurately in general tasks requiring higher precision, where vast knowledge isn't the highest priority. Hence, I wonder what specifically made Qwen (as the only known developer of coding-specific models) choose their 30B MoE, and the subsequent 80B A3B super-sparse MoE, as the architecture to fine-tune into a coding model. What are these models using the experts for? I certainly don't think each expert is its own language/syntax...
Why did they not proceed on the 27B for example? Or even the 9B dense?
I can only assume it has to do with inference speed; both PP and TG are certainly much slower on dense models. I am hence even more sad that they didn't release a 14B successor, something that could run quantised on 16GB VRAM with ample room for context.
Any insight would be highly appreciated.
Mashic@reddit
It will take much more time and energy to run inference on an 80B dense model than on an 80B-A10B MoE one.
If you can get 95% of the same result while being 8 times cheaper and faster, it'll be worth it.
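The "8 times cheaper and faster" estimate follows from per-token compute scaling with *active* parameters. A minimal sketch, assuming the common ~2 FLOPs per active parameter per generated token rule of thumb (my assumption, not from the thread):

```python
# Rough FLOPs-per-token comparison: 80B dense vs 80B-A10B MoE.
# Assumes ~2 FLOPs per active parameter per generated token.

def flops_per_token(active_params: float) -> float:
    """Approximate decode-time compute for one generated token."""
    return 2 * active_params

dense_80b = flops_per_token(80e9)   # all 80B parameters are active
moe_a10b  = flops_per_token(10e9)   # only the routed 10B are active

print(f"{dense_80b / moe_a10b:.0f}x less compute per token")  # 8x
```

Memory footprint stays the same (you still store all 80B weights), which is why the win shows up in speed and energy rather than VRAM.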
LocalLLaMa_reader@reddit (OP)
fully agree, and of course, but why train an 80B MoE if you could have trained a 15B dense? Not a rhetorical question, genuinely curious...
DinoAmino@reddit
Great question. Training for a specific task, like text-to-query on a specific schema would be better on the 15B in terms of time and resources. The extra parameters won't buy you much for most tasks. So I say, why not use 7B or 8B instead of 15B and cut training by half?
Mashic@reddit
A model's knowledge (coding, science, languages...) is tied to its number of parameters. So there is a higher chance that the 80B model has the information you ask for internally, or was trained on the type of problem you're trying to solve.
KickLassChewGum@reddit
That doesn't really help too much when the number of activated parameters per forward pass is hardcapped. 3 billion activations will, by definition, not be able to add as much data to the residual stream as 30 billion, even if you route different experts for each layer. Inference is still a one-way pass - you pass through each layer once - and the difference between a dense model and an MoE is that a 48-layer 30B dense model will feature ~600 million activations per layer while the equally sized MoE at 3B can only ever contribute a tenth of that to the residual stream.
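That per-layer arithmetic can be sketched directly with the thread's numbers (sharing the 48-layer count between the two models is an illustration assumption):

```python
# Per-layer active-parameter budget: 48-layer 30B dense model vs an
# equally sized MoE with only 3B active parameters per forward pass.

def active_per_layer(active_params: float, n_layers: int) -> float:
    """Parameters that actually write into the residual stream per layer."""
    return active_params / n_layers

dense = active_per_layer(30e9, 48)  # all 30B participate in every pass
moe = active_per_layer(3e9, 48)     # only the routed 3B participate

print(f"dense: ~{dense / 1e6:.0f}M/layer, moe: ~{moe / 1e6:.1f}M/layer")
# dense: ~625M/layer, moe: ~62.5M/layer -- a tenth of the dense budget
```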
HOWEVER, it's also basically a proven fact that dense models are wasteful, and certain parameters might actually hurt certain outputs rather than helping them. MoEs aren't inherently better than dense models, nor vice versa; the optimum architecture is probably somewhere in the middle. They've been testing hybrids (dense layer checkpoints with MoE layers in between), and that sure looks promising. So who knows?
LocalLLaMa_reader@reddit (OP)
Hmm fair enough, thank you!
yensteel@reddit
Then it loses the knowledge it needs to reference. The more knowledge it has, the more flexible it is in accomplishing your task. Model size gives it capability in more programming languages, for example.
FullstackSensei@reddit
Remember Devstral 2 123B? Looking at Unsloth's GGUFs, it has been downloaded 6.2k times. Four or five of those downloads have been me.
Compare and contrast with Qwen 3.5 397B, which was released two months later and is over 3x larger, yet the Unsloth GGUF has 93k downloads.
Q4 vs Q4, you can run Qwen 3.5 397B even with a single 24GB GPU and still get faster-than-reading speeds (>10t/s) if you have 192-256GB of DDR4 on an Epyc or Xeon Scalable. Devstral, meanwhile, is pretty much unusable on the same setup.
a_beautiful_rhind@reddit
Mistral has shit marketing and the template I got with it was broken. I think many people gave up on it.
Even with 4 GPUs, Qwen is much, much slower than Devstral. It's easier to buy a system that can offload Devstral fully than one for the Qwen; Q3K alone is almost 200GB. Without reasoning, my 22t/s scrapes the bottom of the barrel for agentic coding. With reasoning, forget it. Let alone the prompt processing hit.
I think not all is so rosy in MoE land. If you can fit the dense model fully on GPU, it's generally better. Now that RAM is more expensive, it's not that big of a bargain either.
FullstackSensei@reddit
I already have the RAM and TBH I'm quite happy even with 15t/s with Qwen.
I honestly don't understand this need for high t/s for agentic coding. The whole point, for me at least, is offloading work the same way I'd do with a junior dev. If I need to babysit the thing, it kind of defeats the whole purpose of it.
I give Qwen a ton of documentation about the project, give it a specific task with clear boundaries, and let it do its thing for half an hour to over an hour. I don't check on it at all, and go live my life. When it's done, I review the code the same way I'd review a PR from a junior dev. I do quick fixes myself, or give it detailed instructions on how to fix things if it's not something that takes a minute. Either way, I don't want to babysit.
I can run minimax at 30t/s fully in VRAM on my Mi50s, but now I prefer to run two instances of Qwen 3.5 397B in parallel, one on each CPU + three Mi50s. Here I get ~13t/s, but now I can let it loose on two projects in parallel, three if I also run my epyc + 3090 machine.
RAM is a lot more expensive, but you can still get 192GB DDR4-2666 for ~1k to pair with our trusty Xeon ES. Add in a triplet of P40s and some bits and bobs, and you can still make a rig capable of running Qwen 3.5 397B at Q4 for ~$2k, or less than a single 4090 without anything else. It won't win any speed records, but it'll definitely get shit done.
a_beautiful_rhind@reddit
Sadly you do have to approve some things and there's a bunch of context swapping and compacting.
Once I got used to 40t/s on big models, that low t/s doesn't look so good anymore. It would have to be a lot smarter to make up for it, and between the 397B and Devstral it kinda isn't.
2k is being optimistic now. The X11 boards went up. Mi50s went up.
FullstackSensei@reddit
I let it read and edit files without approval, and the task I give is limited to changing the code. It's enough to get all the changes done unattended.
I'm building my own harness using this method, where I can automate things like running tests (without the LLM) on task completion and report errors back in a new task. I'm even extending that to a sub-agent that can Google, download and compile relevant info (e.g. documentation).
FullstackSensei@reddit
P40s are down. Three are enough for 397B plus 180k context.
Boards can still be found for under 200 if you're not very picky. Cast your net wider than supermicro, or look for boards with minor defects. I got an asrock rack a few months ago for 70 on ebay because it has a broken VGA connector. I use IPMI anyway, so it doesn't make a difference for me.
Don't forget that LGA3647 was also used for Xeon-W, which gets a lot less attention.
Rim_smokey@reddit
My Qwen3.5 35B-A3B solves bugs more reliably than the 27B dense. Don't ask why. I don't get it either. But it's consistent. It's also faster and less resource intensive. It's really just a win-win
oxygen_addiction@reddit
What language/framework, etc. ?
For C++ it's not even close. The 27B destroys the 35B-A3B in my use cases, which is to be expected considering the almost 10x number of active parameters at any given time.
Rim_smokey@reddit
My tests have only been Python and JavaScript. I've used Agent-Zero, OpenHands and Opencode. The result across the board is that the 27B model at Q6 is marginally more precise with tool calls and remembering to use the correct case ("A" vs "a"), but the 35B-A3B at Q6 has a clearer goal of the task at hand and therefore makes fewer mistakes about what actually needs to be done, resulting in fewer attempts needed.
Whereas the 27B makes fewer syntactical errors, the 35B-A3B makes fewer architectural errors. I run the 35B at Q6 to get rid of most of the syntactical errors. Gonna try Q8 soon to see if the reduced quantization might get rid of all of them, albeit for a disproportionate increase in resource usage.
I am astonished by these results myself, given the >3x fewer active parameters. It could also be that the 35B handles quantization better, which would also be counter-intuitive.
I'm working on making a more formalized test suite to verify these results.
mbrodie@reddit
I'm running the 35B at Q8, currently at 70tps on 48GB of VRAM, and it's honestly pretty fantastic. It's caught multiple edge cases in my C++ codebase that frontier models missed. I'm actually surprised how well it functions!
anzzax@reddit
You have to compare the 27B dense to a 120B MoE with 10B active. Yes, the 120B requires much more VRAM, but if you have the hardware, it gives twice the PP and TG throughput in multi-user setups and requires half the electricity. Qwen3-Next-Coder has only 3B active parameters, so it's even more efficient.
Momsbestboy@reddit
See that "in my case"? That is the correct answer here. Whoever looks at the models should forget about most of the official stats and test them. Like here: in some cases the 27B is better, in others the 35B-A3B.
g-nice4liief@reddit
This is the way imho
Lissanro@reddit
A dense model is always a bit smarter than an MoE of the same size, but also more than an order of magnitude slower. For small models, saving memory may be important for certain use cases, but generally, performance is what matters most.
For example, Qwen 3.5 397B is usable thanks to being MoE: even mostly offloaded to RAM, it may be only 2-3 times slower at Q5 than the 27B at 8-bit in full VRAM (I have 96 GB made of 4x3090), while being much smarter, especially with longer prompts and complex instructions. But the 27B is a very good choice if you have little memory, especially given current RAM prices.
oxygen_addiction@reddit
What speed are you getting with Q5 397B on the 4x3090's?
Lissanro@reddit
In short, Qwen3.5 397B Q5_K_M with llama.cpp (CPU+GPU): prefill 572 t/s, generation 17.5 t/s.
If you're interested in knowing more, I have shared details about my rig here, and here I shared my performance numbers for various models (including an ik_llama.cpp vs llama.cpp comparison).
oxygen_addiction@reddit
Thanks.
Embarrassed_Adagio28@reddit
It is way more complicated than "dense is better for coding". Dense models are better at one-shotting code. However, MoE models can usually achieve the same or better results if you prompt them step by step and fix issues as they arise. This helps them focus on the correct experts. This is also why they look worse in benchmarks, because of how they are prompted. Qwen 3 Coder Next is still better in many situations for me than Qwen 3.5 27B.
Imaginary-Unit-3267@reddit
That's actually really interesting. I have noticed that coder next tries to one shot everything, but it's not very good at it. Do you have a specific workflow that you use? The best I've found so far is telling the model to first make an overall plan without code, then focus on one small atomic commit-sized initial task and then tell me how to test it, and ignore everything else until the test passes, then move on to the next small step, etc.
Looz-Ashae@reddit
That's a useful piece of intel
havnar-@reddit
MoE is 10x faster
jacek2023@reddit
Dense and MoE are different architectures.
27B dense means that at each step, all 27B parameters are used in the final calculation.
26B A4B means that the model has 26B parameters in total, but only 4B are used at each step.
MoE is a way to run models faster while still keeping big knowledge encoded
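The dense/MoE distinction above can be sketched as a toy routing step. The expert count, top-k, and sizes here are made-up illustration numbers, not any real model's config:

```python
# Toy illustration: an MoE stores parameters for *all* experts, but each
# token only runs through the top-k experts its router picks.

N_EXPERTS = 8          # experts per MoE layer
TOP_K = 2              # experts that actually run for each token
PARAMS_PER_EXPERT = 1_000_000

total_params = N_EXPERTS * PARAMS_PER_EXPERT   # what sits in (V)RAM
active_params = TOP_K * PARAMS_PER_EXPERT      # what you compute per token

def route(router_scores):
    """Return the indices of the top-k experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda e: router_scores[e], reverse=True)
    return sorted(ranked[:TOP_K])

scores = [0.1, 0.7, 0.2, 0.9, 0.05, 0.3, 0.4, 0.15]  # made-up router logits
print(route(scores))                                  # [1, 3]
print(f"active/total = {active_params / total_params:.0%}")  # 25%
```

A dense model is the degenerate case TOP_K == N_EXPERTS == 1 with one giant "expert": everything you store is also everything you compute.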
As I said before, I do not really understand what “better” means in the context of LLMs, because they are too complex to compare directly. People trust benchmarks, I don't. I just test models on my own use cases.
Minute_Attempt3063@reddit
I think they both have uses.
Depending on the situation, dense is better, and vice versa
LocalLLaMa_reader@reddit (OP)
The whole point is that, from what I see, you'd rather want cohesion, great "needle-in-a-haystack" performance, and good reasoning to yield a strong coder; "big knowledge" isn't necessarily what you desire. This is exactly why I wonder why MoEs were used.
Of course I wouldn't compare an A3B vs a 27B on coding tasks, and I am not proposing 123B-sized dense models, cough, but I would have thought to train a coding model on the basis of a 15-25B dense model for those objectives above. The other answers shed further insight, for which I am very thankful...
Minute_Attempt3063@reddit
LLM's can be used for so much more.
Mass image captioning, helping people learn in their own way (hint: most ways that schools teach do not work for many people), and, perhaps best of all, extracting data from thousands of files and making a quick summary of it.
Mashic@reddit
LLMs are not just for coding. Some are for data extraction. Like for example you feed them a ton of notes on a topic and you need them to summarize it, or convert some text information to a table or json file...
In that case, the Gemma-4-26B-A4B will be as fast as the Gemma-4-E4B while being more accurate.
H_DANILO@reddit
The one thing that most people don't understand is...
Yes, dense uses all parameters, but that doesn't mean all parameters contribute to the final outcome. Some parameters, actually many, are zeroed out, so there's a natural clustering happening inside dense models AS WELL.
This clustering might not be explicit, but in practical terms it is happening.
Limp_Classroom_2645@reddit
Locally I use MoE for coding because they are more efficient with resources, I can fit more context, and they are very consistent at using tools and solving problems. I've tried dense models locally as coding assistants and never had any success.
Adventurous-Paper566@reddit
You need high speed and high context.
Olbas_Oil@reddit
Because they want their models to be used, and dense models require specific hardware that most people do not have or cannot afford. It's now turning into a popularity race...
shing3232@reddit
that's not necessarily the case
qwen_next_gguf_when@reddit
Next Coder runs 52 tk/s on my rig and the 27B 42 tk/s. They are in most cases interchangeable, except with Opencode.
Specter_Origin@reddit
3 Reasons:
Inference: MoE is cheaper to serve
ROI: You can train a larger MoE to have on-par quality with a dense model, which on top of the first point makes better financial sense
Hardware: With China being restricted on the latest Nvidia products, they would prefer MoE to achieve higher inference speeds
MaybeIWasTheBot@reddit
the reason is almost purely economical. MoE models are cheaper and faster to both train and serve, while achieving results similar to those of dense models. the initial issue with MoE was training instability, but i assume most labs have made enough progress on that to stop it from being a roadblock.
Qwen's 30B models were basically the go-to for anyone with limited VRAM, which is by far most people. for example i have 8GB of VRAM and 32GB of RAM, which means running most dense models (even quantized) at decent context length is out of the question if i want anything more than a few tokens per second. MoE gives me intelligence approaching that of a 30B model, running at 16-20tps at low context.
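The "will it fit" arithmetic behind that can be sketched like this. The ~4.5 bits/weight figure for a typical Q4 GGUF is my assumption, and KV cache overhead is ignored:

```python
# Back-of-the-envelope weights-only size estimate for a quantized model,
# e.g. to sanity-check fitting into 8GB VRAM + 32GB RAM.

def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of the weights alone (no KV cache)."""
    return n_params * bits_per_weight / 8 / 1e9

# A 30B model at ~4.5 bits/weight (roughly a Q4_K_M-class quant):
print(f"{weight_size_gb(30e9, 4.5):.1f} GB")
# ~16.9 GB: far too big for 8GB of VRAM alone, but it fits in 32GB of RAM,
# and an MoE keeps it fast there because only a few B params are active.
```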
the 80B-A3B MoE is kind of experimental. Qwen3-Next as a whole was experimental, and my guess is that once the Qwen team saw some success in it they decided to try a coding finetune next considering that a good amount of people were running it. if their new architecture at the time couldn't handle programming tasks, it was something they needed to know ASAP
tl;dr dense != better && MoE is just much more economical + the Qwen team needed data points
DarkArtsMastery@reddit
Performance. Agents can self-iterate with tools quite well to achieve more or less the same as dense; the bottleneck is how fast. That's where MoE shines, as it can deliver way more tokens per second.