How do you think a Qwen 72B dense would perform?


ttkciar@reddit

I kept hoping it would kick ass, and watched QuixiAI's project here, waiting for them to finish theirs up -- https://huggingface.co/QuixiAI/Qwen3-72B-Embiggened

**Before you get excited, note that that is *not* a useful model!** They needed to perform a final distillation, as noted in the model card, and never did, I think due to lack of compute resources.

That's the bad news. The good news is that K2-V2-Instruct (72B) is basically everything I ever hoped Qwen3-72B might be. It is *astoundingly* competent at a wide variety of tasks, especially at long context -- https://huggingface.co/LLM360/K2-V2-Instruct

Its main drawback is that as context grows long, it becomes excruciatingly slow.

I've stopped watching QuixiAI's Qwen3-72B project, and have been trying K2-V2-Instruct at various tasks. It continues to impress me anew.


AvocadoArray@reddit

Is this model trained from scratch, or some kind of distillation of Kimi K2?


ttkciar@reddit

It is trained from scratch by LLM360, which is an open-source R&D lab. The model is fully open-source, with all of its training datasets and software available:

* https://huggingface.co/collections/LLM360/txt360 -- datasets
* https://github.com/llm360 -- software

The lab itself is a cooperative venture between Cerebras, Petuum, and MBZUAI, which all have different reasons for participating in the venture. Petuum is broadly trying to push a common ecosystem for industrial LLM technology, MBZUAI is part of the UAE's industrial diversification effort, and Cerebras hopes to demonstrate to potential customers that their hardware is worth buying.

The K2-V2 architecture is plain old llama; all of their innovation is in the augmented datasets and training methodology.


AvocadoArray@reddit

I've been playing with an FP8 dynamic quant and wow, this thing behaves quite differently. I asked it to "tell me a long story about a cat and a dog", and it had two hilarious meltdowns in the middle.

>Milo could not coil his very own person; the cat’s human was physically left a small remains of maintenance. The cat’s pain. The cat's head. **It would be enough t… t+** **Conclusion** (—> Actually, I realized I got tangled and must restructure the story. To comply with your request, I will continue the narrative coherently with a proper exposition, conflict, resolution, and an emotional conclusion that ties both characters together in a lasting, heartwarming way.)

And then

>Milo held nothing but silence. Cats never talk to dogs... Let alone to puppies with his \[multiple lines of gibberish\] Thus Okay… writings Sad (The assistant appears to have gotten lost mid‑type, blending an unfinished narrative with format notes.)

Other models tend to deteriorate into more meaningless nonsense when this happens to them, but this one actually recovered mid-response and gave a sensible story at the end.

Still not sure how useful it will be for real work, but oh man it's funny to watch it go off the rails. I wonder if it's a quantization issue?

Full output here: [https://pastebin.com/ipMemjbC](https://pastebin.com/ipMemjbC)


ttkciar@reddit

That is really bizarre :-( yeah, maybe it is a quant issue. When I asked K2-V2-Instruct to infer a Murderbot Diaries fanfic, it did a fair job (though a bit saccharine): http://ciar.org/h/murderbot.1774024842.k2v2.txt


Current_Ferret_4981@reddit

That would probably be the best selling point for 6000 Pros. Right now you can get pretty much the full performance of the 27B at Q5, which fits on a 5090, and scaling up from there gives pretty diminishing returns, or is better suited to multi-agent setups. A 72B at Q5 with a good ratio of DeltaNet connections would likely still have decent speed, but would really fill out a 6000 Pro's VRAM and performance.
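As a rough sanity check on the sizing claims above, here is a minimal sketch that estimates weight-only memory from parameter count and bits per weight. The bits-per-weight figures are approximate assumptions for common llama.cpp quant types, the 32 GB and 96 GB VRAM sizes are assumed for the 5090 and 6000 Pro, and KV cache and runtime overhead are ignored, so real headroom will be smaller.

```python
# Approximate average bits per weight for some llama.cpp quant types.
# ASSUMPTION: rough figures only; actual GGUF sizes vary by model and quantizer.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "BF16": 16.0,
}


def weight_gb(params_billions: float, quant: str) -> float:
    """Estimated size of the weights alone, in GB (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # (params_billions * 1e9 weights) * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billions * bits / 8


def headroom_gb(params_billions: float, quant: str, vram_gb: float) -> float:
    """VRAM left for KV cache and overhead after loading the weights."""
    return vram_gb - weight_gb(params_billions, quant)


# ASSUMPTION: 32 GB for a 5090 and 96 GB for a 6000 Pro.
setups = [("27B on a 32 GB card", 27, 32), ("72B on a 96 GB card", 72, 96)]
for label, params, vram in setups:
    for quant in ("Q5_K_M", "Q6_K"):
        print(
            f"{label}, {quant}: ~{weight_gb(params, quant):.1f} GB weights, "
            f"~{headroom_gb(params, quant, vram):.1f} GB left"
        )
```

Under these assumptions a 72B model at Q5 or Q6 lands in the 50-60 GB range, which leaves a 96 GB card with room for long context or a secondary model.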


Spicy_mch4ggis@reddit

You reckon Q5 is good? I’d get more context window. I use Q6 and get around 80k context on a headless 5090.


Lucis_unbra@reddit

Q5 is fine, but Q6 is better. At and above Q5, at this size, it's generally about the margin of error you're willing to accept.

If the BF16 makes 100 errors per 1000 prompts, then the Q8 might make 110, Q6 115, Q5 130, and so on. Obviously not a perfect illustration, and it depends on the use case and the definition of an error. An error might be small, it might be something that's just suboptimal, or it can be fatal for the task.

So the question at Q5-Q8 is: if you've accepted not having the perfect example of the model, you've accepted at least Q8. How much more are you willing to give up for speed and context?

My tests show that something starts happening at Q4 for both the 27B and 35B, where they can deviate 100% on a token, with no chance to match the reference output on that token. That doesn't happen on any Q5 quants.

So if we say that Q5 is the last stop where the quant is likely a good representation of the capabilities of the model, but will make more errors due to the increased uncertainty... Are you willing to accept a higher risk of a few tokens messing up the output? Or do you want the model to be safer?
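To make the margin-of-error framing concrete, the sketch below turns the illustrative error counts from the comment above into relative increases over BF16. The counts are the commenter's hypothetical figures, not measurements, and the function name is made up for this example.

```python
# Hypothetical error counts per 1000 prompts, copied from the illustration
# in the comment above. These are NOT benchmark results.
errors_per_1000 = {"BF16": 100, "Q8": 110, "Q6": 115, "Q5": 130}


def relative_increase(quant: str, baseline: str = "BF16") -> float:
    """Fractional increase in errors for `quant` compared with `baseline`."""
    base = errors_per_1000[baseline]
    return (errors_per_1000[quant] - base) / base


for quant, errors in errors_per_1000.items():
    print(f"{quant}: {errors} errors per 1000 prompts, "
          f"{relative_increase(quant):+.0%} vs BF16")
```

Read this way, the choice between Q8 and Q5 in the illustration is a choice between roughly 10% and 30% more errors than the unquantized model, traded against speed and context.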


Current_Ferret_4981@reddit

It also depends on whose quant you use, but I agree. That being said, I also use it for code that I can typically write and have unit tests for, so error correction takes less than an hour or two even if the errors are fatal. I do think speed, context, and accuracy are the tradeoffs. But you can pretty much have all of those with a 6000 Pro and still have room for more, which would suggest a nice 40-70GB version could be perfect.


Lucis_unbra@reddit

Sure, the quant itself can matter. It's more a generalization regarding any competent quant in that range.

I'd say that if you have even one 6000 Pro... That's a serious step up. At that point I'd probably forget all about the 27B model and jump to the 122B at Q4-5 for most tasks if possible. The model is a step above, and will run faster than the 27B. You gain a lot of knowledge, even if it can't be run at max quality.

The 27B at native precision is suddenly also a very interesting option, served with a high-performance engine. You've got a lot of room there for some secondary models: image generation, voice, if that's something one cares about, or subagent / task models.

For the 35B and 27B, however? You can have your cake and eat it too.


Current_Ferret_4981@reddit

Q5 doesn't seem significantly worse than Q8 from what I see. You could do Q6 with a somewhat smaller context, or Q5 with any realistic context. But yeah, Q6 and 64k context is a sweet spot if that works for your use cases, or Q5 and a larger context.


fulgencio_batista@reddit

With a 5090, NVFP4 is worth a shot. I've heard it performs much more similarly to Q8.


-Ellary-@reddit

A Qwen 3.5 72B dense would perform about the same as a Qwen 3.5 \~220B A20B. I'd say it would be around the old Qwen 3 235B A22B plus 20-30%.


fractalcrust@reddit

it would be too powerful


qubridInc@reddit

A Qwen 72B dense model would probably provide solid, reliable reasoning and coding performance that's similar to top-tier closed models, but it'll come with higher computing costs and be less efficient than MoE setups.


Rich_Artist_8327@reddit

But what I have noticed is that with a multi-GPU setup using tensor parallelism, dense models work much better than MoE for some reason. Do MoE models require more bandwidth between GPUs?


a_beautiful_rhind@reddit

MoE models are usually larger, so you have to offload. Dense 70B and 120B models run circles around an MoE where half of it is sitting in RAM. MoE is a tradeoff of memory footprint for compute.
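The memory-for-compute tradeoff can be sketched with two rough rules of thumb: weight memory scales with total parameters, while per-token compute scales with active parameters. The example below assumes about 2 FLOPs per active parameter per token and a flat 5 bits per weight; the class, the 96 GB card size, and the two model shapes (a 72B dense and a 235B-A22B MoE) are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions

    def weight_gb(self, bits_per_weight: float) -> float:
        """Weight-only memory in GB; every expert must be stored somewhere."""
        return self.total_b * bits_per_weight / 8

    def gflops_per_token(self) -> float:
        """ASSUMPTION: ~2 FLOPs per active parameter per generated token."""
        return 2 * self.active_b


dense = Model("72B dense", total_b=72, active_b=72)
moe = Model("235B-A22B MoE", total_b=235, active_b=22)

VRAM_GB = 96  # ASSUMPTION: a single 96 GB card
BITS = 5.0    # ASSUMPTION: flat 5 bits per weight

for m in (dense, moe):
    size = m.weight_gb(BITS)
    placement = "fits in VRAM" if size <= VRAM_GB else "needs offload to system RAM"
    print(f"{m.name}: ~{size:.0f} GB weights ({placement}), "
          f"~{m.gflops_per_token():.0f} GFLOPs per token")
```

In this sketch the MoE needs far less compute per token but about three times the memory, which is why it ends up partly in system RAM on a single card while the dense model stays entirely in VRAM.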


toothpastespiders@reddit

I really mourn the near extinction of 70b dense models.


JustFinishedBSG@reddit

Slowly


Admirable-Star7088@reddit

Generally, I personally like quality more than speed, so I would absolutely dig a Qwen3.5 \~70b dense model. With speculative decoding, 70b dense is not unreasonably slow on RAM (imo), at least not for more casual use cases such as chatting/Q&A.




El_90@reddit

Yes please 94GB please lol


ForsookComparison@reddit

A Qwen3.5-72B dense would have potential to be SOTA-at-home in a lot of use-cases. But it doesn't always work that way. Qwen2.5-72B really only beat Qwen2.5-32B in knowledge-depth. It's not an automatic win.


mrpkeya@reddit

I like "SOTA-at-home"


Expensive-Paint-9490@reddit

Like 397B-A17B, roughly.


StrikeOner@reddit

not as good as the 328B dense model!


jacek2023@reddit

Qwen said during the release of Qwen 3 that they have no plans to build dense models bigger than 32B (and now the largest is just 27B).


colin_colout@reddit

It's gotta run reasonably well on Chinese AI accelerators. MoE is the only real option.


Gohab2001@reddit

5tps lol


nacholunchable@reddit

In terms of quality, probably better, but I use a dumber 20-40 tps model over my smarter 10 tps model, so I can't imagine what a modern 72B would give me. I, like you, wish we had access though, because for some agentic stuff I run it overnight anyway. The problem is that bigger dense models are harder and take longer to train, so getting a big dense model doesn't necessarily mean you've got a modern, capable, well-trained big dense model. If we did, I'd love it for non-chat, hands-off stuff. But if I could only pick one flavor, I'd pick an MoE so I could use the damn thing.


SillyLilBear@reddit

Really well but would be slow as hell and would be really hard to run.


rhinodevil@reddit

On the low end, at least for my use cases, the dense 9B model performs way better than the 30B and 35B(-A3B) MoE models. The dense 27B model is unfortunately too slow on my consumer-grade hardware. So I guess a 72B dense model would perform much better than 9B and 27B.
