I like my models dense. Can model makers please bring back or update the dense models from like 2 years ago? A nice 39b or 72b maybe?
Posted by Porespellar@reddit | LocalLLaMA | 27 comments
Seriously, Qwen3.6 27b is mopping the floor with models like 5 times its size right now. It doesn’t take a rocket scientist to figure out that maybe the whole a2b and a3b MoE thing isn’t the best solution after all. I mean sure, MoEs let you run a larger model really fast on a potato PC, but I think we’re learning that there is no free lunch.
As a person who has been on this sub for well over 2 years, I can tell you that despite what benchmarks say, the dense models we shifted away from (because we wanted fast models that run on shitty hardware), those old 35b’s and 72b’s, just seemed way smarter when you were talking with them than the benchmaxed crop we have now.
And yes I know access to tools can offset knowledge density to a degree, I know we have tool chains now, and harnesses, and MCP, and web search, but giving a toddler access to Google search or handing it a bash shell doesn’t make it smarter if it doesn’t really know what to do with those tools or understand the output it gets back from them.
Anyways, I’ve tested a ton of models over the last 3 years or so, and I can say without a doubt that a lot of big MoE’s with low active parameter counts don’t seem nearly as “smart” next to even a small or medium sized dense model. Sure, the speed of MoE’s is great on low-resource hardware, but don’t act shocked when a well-trained 27b comes in and leapfrogs the whole pack, and don’t be mad that it’s slow AF either. Show that turtle some respect.
For real though, I would love to see more dense models back in the lineup, they’ve obviously shown their potential and value lately.
Song-Historical@reddit
My understanding is that this works because you can do tricks with KV caching that you can't do with sparse models. There was no way to know about or optimize for that before.
Middle_Bullfrog_6173@reddit
What tricks? As long as the models use a similar attention architecture, there shouldn't really be any difference between MoE and dense KV cache.
In practice dense models tend to have larger KV caches since they use the extra active parameters on increasing model depth (more attention layers) and/or embedding dimension (larger KV cache entries).
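To put rough numbers on that (everything below is a made-up illustrative config, not any specific model): per token, each attention layer stores one key and one value vector per KV head, so only depth and head dimensions matter, never the total (or inactive) parameter count.

```python
# Toy KV-cache sizing; configs are illustrative assumptions, not real models.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values; fp16/bf16 entries by default
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

dense_70b = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32_768)
moe_a3b   = kv_cache_bytes(n_layers=48, n_kv_heads=4, head_dim=128, seq_len=32_768)

print(f"hypothetical dense 70B: {dense_70b / 1e9:.1f} GB")  # ~10.7 GB
print(f"hypothetical MoE A3B:   {moe_a3b / 1e9:.1f} GB")    # ~3.2 GB
```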
Song-Historical@reddit
Because sparse models dynamically decide what parts of the network to turn on or off based on the KV cache, KV cache eviction/offloading becomes a lot harder to get good performance out of, because you're going back and forth over the cache with low cache hits and a lot of overlap/contention. The router that picks which part of the network (the subnetworks we're calling 'experts') to activate makes it hard to predict what part of the KV cache is needed next.
So now you have an active routing problem that has to follow some black-box behavior: which subnetwork needs access to which part of the cache at the same time, which recently used parts of the cache to keep in high-speed memory, and so on, all while the model still needs access to the full KV cache of every token so far in order to maintain context.
In a dense model you have much simpler and predictable strategies for optimizing the KV cache, because every token activates the entire network in a predictable way, so you can make some reasonable guesses like 'least recently used'.
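As a toy illustration of that "least recently used" idea (a sketch, not how any real engine implements it), assuming the cache is split into fixed-size blocks that can be moved between fast and slow memory:

```python
from collections import OrderedDict

class LruKvOffloader:
    """Toy LRU offloading for KV-cache blocks (illustrative sketch only)."""

    def __init__(self, max_hot_blocks, load_fn, offload_fn):
        self.max_hot_blocks = max_hot_blocks
        self.load_fn = load_fn        # moves a block back into fast memory
        self.offload_fn = offload_fn  # moves a block out to slow memory
        self.hot = OrderedDict()      # block_id -> block held in fast memory
        self.cold = {}                # block_id -> block held in slow memory

    def put(self, block_id, block):
        # A new block produced during decoding goes straight into fast memory.
        self.hot[block_id] = block
        self._evict()

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)  # hit: mark as most recently used
        else:
            self.hot[block_id] = self.load_fn(self.cold.pop(block_id))  # miss
            self._evict()
        return self.hot[block_id]

    def _evict(self):
        while len(self.hot) > self.max_hot_blocks:
            victim, block = self.hot.popitem(last=False)  # least recently used
            self.cold[victim] = self.offload_fn(block)
```

With a dense model every token touches every layer, so recency is a decent proxy for what gets reused; the claim above is that MoE routing makes that proxy much weaker.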
I have some hardware experience and was looking at code for some SmartSSDs in research projects (HILOS from last year), specifically at KV cache eviction, and that's not that far off from what we're trying to do with RAM on normal computers.
I could be wrong, I'm not an expert, but this is what the problem is as I understand it.
Middle_Bullfrog_6173@reddit
Sorry, I still don't understand the issue. You always need the KV cache for the attention module anyway and its output is needed for the next MLP layer. It's the same whether that next layer is a dense layer or MoE.
Maybe there are further complexities if you use expert parallelism or something, but there it is the input and output of the experts that need to be shuffled around. I don't really see why KV cache would have to care specifically.
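To make that concrete, here's a toy block in plain PyTorch (an illustrative sketch, not any real model's code): attention reads and appends to the KV cache and produces an output, and only then does the MLP, dense or routed, consume that output, so the cache handling is identical either way.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Toy transformer block; n_experts=0 means a dense MLP, >0 means MoE."""

    def __init__(self, d_model=64, n_experts=0):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.n_experts = n_experts
        if n_experts == 0:
            self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                     nn.Linear(4 * d_model, d_model))
        else:
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)])

    def forward(self, x, kv_cache):
        # Attention consumes (and extends) the full KV cache either way.
        # Simplification: this toy cache stores past token states; a real one
        # stores the projected keys/values.
        keys = torch.cat([kv_cache["k"], x], dim=1)
        values = torch.cat([kv_cache["v"], x], dim=1)
        kv_cache["k"], kv_cache["v"] = keys, values
        attn_out, _ = self.attn(x, keys, values)

        if self.n_experts == 0:
            return self.mlp(attn_out)                   # dense path
        expert_idx = self.router(attn_out).argmax(-1)   # routing happens here,
        out = torch.zeros_like(attn_out)                # after attention is done
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            out[mask] = expert(attn_out[mask])
        return out

cache = {"k": torch.zeros(1, 0, 64), "v": torch.zeros(1, 0, 64)}
y = ToyBlock(n_experts=4)(torch.randn(1, 8, 64), cache)  # same cache flow as n_experts=0
```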
Valuable-Run2129@reddit
Not even an RTX 6000 Pro can serve a dense 70B model at decent speeds in a harness like Claude Code. What hardware do you have?
Porespellar@reddit (OP)
My DGX Spark gets decent enough speeds on 70b models. Not full precision of course, but q8 and q4.
Valuable-Run2129@reddit
What harness do you use?
misanthrophiccunt@reddit
Seriously, what do you guys mean when you say "what HARNESS do you use"?
Valuable-Run2129@reddit
Claude Code, Hermes agent, open code… models without harnesses are useless
misanthrophiccunt@reddit
so you mean agents then
NNN_Throwaway2@reddit
Nope.
Valuable-Run2129@reddit
The agent is the result of an LLM using a harness. There’s no agent without a harness.
Porespellar@reddit (OP)
I'm running vLLM for inference via Sparkrun. I use both Claude Code and Hermes Agents mainly.
Valuable-Run2129@reddit
What is the speed like in them?
FullOf_Bad_Ideas@reddit
You can do TP (tensor parallelism) if you have a few 3090s, 4090s, or 5090s.
Even Devstral Large 123B can be run locally at usable speeds, if you don't mind feeling like a jet is taking off in the next room.
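For example, with vLLM's offline Python API (the model name, quantization, and GPU count below are placeholders, just a sketch of the TP setup):

```python
from vllm import LLM, SamplingParams

# Shard a hypothetical quantized 70B across 4 consumer GPUs with tensor parallelism.
llm = LLM(
    model="some-org/some-70b-instruct-awq",  # placeholder model id
    tensor_parallel_size=4,                  # one shard per GPU
    gpu_memory_utilization=0.90,
    max_model_len=16384,
)

out = llm.generate(["Explain tensor parallelism in one paragraph."],
                   SamplingParams(max_tokens=256, temperature=0.7))
print(out[0].outputs[0].text)
```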
exact_constraint@reddit
Let’s get hot-swappable macro MoE models going. Think something like 1T A35B with a 9B orchestrator. Keep the 1T parameters on an NVMe drive. The orchestrator eats the prompt and swaps in whatever 35B parameters make the most sense to do the work. I’d take the 10-second hit to dump the parameters into VRAM from the NVMe.
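Quick back-of-envelope on that 10-second hit (every number below is an assumption: ~4-bit weights, a fast PCIe 4.0 NVMe):

```python
# Rough swap-time estimate; all figures are assumptions for illustration.
active_params = 35e9      # 35B parameters swapped in per prompt
bytes_per_param = 0.5     # ~4-bit quantization
nvme_gb_per_s = 7.0       # sequential read, fast PCIe 4.0 NVMe

weight_gb = active_params * bytes_per_param / 1e9
print(f"~{weight_gb:.1f} GB per swap, ~{weight_gb / nvme_gb_per_s:.1f} s to stream it")
# roughly 17.5 GB and 2-3 s on a fast drive, so a 10 s budget looks plausible;
# the catch is paying it again every time the orchestrator changes its pick
```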
ea_man@reddit
MoEs are way easier to train: that's why they prefer those, and why the 35B A3B was released before the 27B (I guess).
Yet I agree with you: as a user I like it dense. I'd like a ~24b dense that runs more comfortably at Q_4_M on a 16GB GPU, and a new release of a 14B coder.
DinoAmino@reddit
MoEs are NOT easier to train. What you are quoting are resource efficiencies, which are well understood. But they are less stable due to expert-routing shift issues if the data is too narrow. You really need even more diverse data when your data is more complex. Training MoEs adds more complexity: they are harder to train than dense models.
FullOf_Bad_Ideas@reddit
In terms of FLOPs, training Qwen 3 32B probably used about as many FLOPs as training Kimi K2 1T did. Dense models are expensive for what you get; MoEs allow companies with fewer GPUs to make decent models. That's why big dense models are rare. If you were the company training a model, would you rather train a 32B dense model or an ambitious 1T model?
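Rough math with the common C ≈ 6 · N_active · D training-compute rule (the token counts below are approximate public figures, so treat this as a ballpark, not a measurement):

```python
# Ballpark training compute via the 6*N*D approximation; token counts are
# approximate public figures and should be treated as assumptions.
def train_flops(active_params, tokens):
    return 6 * active_params * tokens

qwen3_32b_dense = train_flops(32e9, 36e12)    # ~36T pretraining tokens
kimi_k2_1t_moe  = train_flops(32e9, 15.5e12)  # 1T total, ~32B active, ~15.5T tokens

print(f"Qwen3 32B dense: {qwen3_32b_dense:.1e} FLOPs")  # ~6.9e24
print(f"Kimi K2 1T MoE:  {kimi_k2_1t_moe:.1e} FLOPs")   # ~3.0e24
```

Same active parameter count means similar compute per token; the dense run just doesn't buy you a trillion total parameters for it.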
jopereira@reddit
Between being able to run a MoE on a "potato PC" and no model at all... let me think about it for a while. I'll come back.
CalligrapherFar7833@reddit
No one is saying that MoEs should not also be released, but we are lacking dense models.
jopereira@reddit
"It doesn’t take a rocket scientist to figure out that maybe the whole a2b and a3b MoE thing isn’t the best solution after all"
It is "the best" solution for specific cases. Dense models also are "the best" solution for specific cases.
OP already said this, in other words. I think we are on the same page after all ;)
Porespellar@reddit (OP)
I totally respect that perspective. Small MoE’s have their place and I’m glad they exist; they will democratize AI for a lot of people and use cases.
ttkciar@reddit
Check out K2-V2-Instruct from LLM360 when you have a chance. It's a 72B dense trained from scratch with a 512K context limit.
It's not great at creative writing, but very smart with logical problem-solving.
nomorebuttsplz@reddit
Indeed. It would be nice to have a Gemma 70B dense.
However, the 31B is so good that it almost seems like the training pipelines are the current constraint rather than the number of parameters. It seems better than last year’s Gemini, and actually better than Gemini 3 for creative writing, to me.
It seems like whatever Google and Qwen are doing could result in a 70-billion-parameter dense model with approximately the intelligence of Sonnet.
CalligrapherFar7833@reddit
Being better than Gemini right now is not saying much tho
ProfessionalSpend589@reddit
I’ll be trying out Gemma 4 26 A4B in BF16 precision tonight, mainly to see if that fixes an annoying grammar mistake which the UD-Q8_K_XL quant seems to make, and which it can detect when I point it out.
50GB weights. Might post some tests later :)