If Dense Models are better for Coding, why are Qwen-Coders MoE?

Posted by LocalLLaMa_reader@reddit | LocalLLaMA

Hi all,

I have been reading here for over two years and finally have a question I can't find an answer to.

Qwen 3.5 27B and Gemma 4 31B are the latest examples of dense models performing noticeably more accurately on general tasks that require high precision, where vast knowledge isn't the top priority. Hence, I wonder what specifically made Qwen (as the only well-known developer of coding-specific models) choose their 30B MoE, and the subsequent super-sparse 80B A3B MoE, as the architecture to fine-tune into a coding model. What are these models using the experts for? I certainly don't think each expert handles its own language/syntax...
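For what it's worth, in a standard top-k MoE layer (the general design Qwen's MoE models follow), a learned router scores each token independently and sends it to the few best-scoring experts, so any specialisation emerges per token during training rather than being assigned per language. A minimal pure-Python sketch of that routing, with all names and sizes purely illustrative:

```python
import math
import random

def moe_forward(x, gate_w, experts, k=2):
    """Toy top-k MoE layer: route one token's hidden state to k experts.

    x:       list[float], the token's hidden state
    gate_w:  one weight row per expert; dot(gate_w[i], x) scores expert i
    experts: list of expert weight matrices (list of rows of floats)
    """
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    scores = [dot(row, x) for row in gate_w]               # one score per expert
    topk = sorted(range(len(scores)), key=scores.__getitem__)[-k:]
    weights = [math.exp(scores[i]) for i in topk]
    total = sum(weights)
    weights = [w / total for w in weights]                 # softmax over chosen experts
    # weighted sum of only the selected experts' outputs —
    # the other experts' weights are never touched for this token
    out = [0.0] * len(x)
    for w, i in zip(weights, topk):
        expert_out = [dot(row, x) for row in experts[i]]
        out = [o + w * e for o, e in zip(out, expert_out)]
    return out

random.seed(0)
hidden, n_experts = 8, 4
rand_vec = lambda n: [random.gauss(0, 1) for _ in range(n)]
x = rand_vec(hidden)
gate_w = [rand_vec(hidden) for _ in range(n_experts)]
experts = [[rand_vec(hidden) for _ in range(hidden)] for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts, k=2)
print(len(y))  # 8
```

The key point for the question above: the router operates on hidden states, not on surface features like language or syntax, which is why interpretability work generally doesn't find one-expert-per-language structure.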

Why did they not build on the 27B, for example? Or even the 9B dense?

I can only assume it has to do with inference speed: both PP (prompt processing) and TG (token generation) are certainly much slower on the dense models. That makes me even more sad that they didn't release a 14B successor, something that could run quantised on 16GB of VRAM with ample room for context.
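The speed argument can be made concrete with back-of-the-envelope arithmetic, using the parameter counts from the post and the common approximation of ~2 FLOPs per active parameter per generated token (memory bandwidth for reading the active weights scales the same way):

```python
# Rough decode-time compute per token: ~2 FLOPs per *active* parameter.
dense_active = 27e9   # a 27B dense model activates every weight for every token
moe_active   = 3e9    # an "A3B" MoE activates only ~3B of its 80B per token

dense_flops = 2 * dense_active
moe_flops   = 2 * moe_active
ratio = dense_flops / moe_flops
print(f"dense 27B needs ~{ratio:.0f}x the per-token compute of an A3B MoE")
```

So the 80B A3B can carry far more total knowledge than a 27B dense model while generating each token with roughly a ninth of the compute, which is presumably attractive for coding workloads with long prompts and long generations.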

Any insight would be highly appreciated.