TheaterFire

How do you think a Qwen 72B dense would perform?

Posted by OmarBessa@reddit | LocalLLaMA | View on Reddit | 32 comments

Got this question in my head a few days ago and I can't shake it off.

ttkciar@reddit

I kept hoping it would kick ass, and watched QuixiAI's project here, waiting for them to finish theirs up -- https://huggingface.co/QuixiAI/Qwen3-72B-Embiggened

**Before you get excited, note that that is *not* a useful model!** They needed to perform a final distillation, as noted in the model card, and never did. I think due to lack of compute resources.

That's the bad news. The good news is that K2-V2-Instruct (72B) is basically everything I ever hoped Qwen3-72B might be. It is *astoundingly* competent at a wide variety of tasks, especially at long context -- https://huggingface.co/LLM360/K2-V2-Instruct

Its main drawback is that as context grows long, it becomes excruciatingly slow.

I've stopped watching QuixiAI's Qwen3-72B project, and have been trying K2-V2-Instruct at various tasks. It continues to impress me anew.
View on Reddit #81403973

AvocadoArray@reddit

Is this model trained from scratch, or some kind of distillation of Kimi K2?
View on Reddit #81446398

ttkciar@reddit

It is trained from scratch by LLM360, which is an open-source R&D lab. The model is fully open-source, with all of its training datasets and software available:

* https://huggingface.co/collections/LLM360/txt360 -- datasets
* https://github.com/llm360 -- software

The lab itself is a cooperative venture between Cerebras, Petuum, and MBZUAI, which all have different reasons for participating in the venture. Petuum is broadly trying to push a common ecosystem for industrial LLM technology, MBZUAI is part of the UAE's industrial diversification effort, and Cerebras hopes to demonstrate to potential customers that their hardware is worth buying.

The K2-V2 architecture is plain old llama; all of their innovation is in the augmented datasets and training methodology.
View on Reddit #81447721

AvocadoArray@reddit

I've been playing with an FP8 dynamic quant and wow, this thing behaves quite differently. I asked it to "tell me a long story about a cat and a dog", and it had two hilarious meltdowns in the middle.

> Milo could not coil his very own person; the cat’s human was physically left a small remains of maintenance. The cat’s pain. The cat's head. **It would be enough t… t+** **Conclusion** (—> Actually, I realized I got tangled and must restructure the story. To comply with your request, I will continue the narrative coherently with a proper exposition, conflict, resolution, and an emotional conclusion that ties both characters together in a lasting, heartwarming way.)

And then

> Milo held nothing but silence. Cats never talk to dogs... Let alone to puppies with his \[multiple lines of gibberish\] Thus Okay… writings Sad (The assistant appears to have gotten lost mid‑type, blending an unfinished narrative with format notes.)

Other models tend to deteriorate into more meaningless nonsense when this happens to them, but this one actually recovered mid-response and gave a sensible story at the end.

Still not sure how useful it will be for real work, but oh man it's funny to watch it go off the rails. I wonder if it's a quantization issue?

Full output here: [https://pastebin.com/ipMemjbC](https://pastebin.com/ipMemjbC)
View on Reddit #81454693

ttkciar@reddit

That is really bizarre :-( yeah, maybe it is a quant issue. When I asked K2-V2-Instruct to infer a Murderbot Diaries fanfic, it did a fair job (though a bit saccharine): http://ciar.org/h/murderbot.1774024842.k2v2.txt
View on Reddit #81458685

Current_Ferret_4981@reddit

Probably would be the best selling point for 6000 Pros. Right now you can get pretty much the full performance of the 27B at Q5, which fits on a 5090, and scaling up from there is mostly diminishing returns, or better suited to multi-agent setups. A 72B at Q5 with a good ratio of DeltaNet connections would likely still have decent speed but would make full use of a 6000 Pro's VRAM and performance.
View on Reddit #81395132
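
As a rough check on the sizing in the comment above, here is a minimal back-of-envelope sketch of weight memory at different quant levels. The bits-per-weight figures are approximate assumptions for typical GGUF-style quants, not measured values, and the estimate ignores KV cache and runtime overhead. The helper name `weight_gb` is hypothetical.

```python
# Approximate effective bits per weight for common quant levels.
# These numbers are assumptions for illustration; real files vary by quant recipe.
APPROX_BITS_PER_WEIGHT = {"BF16": 16.0, "Q8": 8.5, "Q6": 6.6, "Q5": 5.5, "Q4": 4.8}


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB (10^9 bytes), excluding KV cache and overhead."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # (params_billion * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billion * bits / 8


if __name__ == "__main__":
    for size in (27, 72):
        for quant in ("Q5", "Q6", "Q8"):
            print(f"{size}B @ {quant}: ~{weight_gb(size, quant):.1f} GB of weights")
```

Under these assumptions a 72B at Q5 comes to roughly 50 GB of weights, which would not fit in a 32 GB card but leaves room for context on a 96 GB one.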

Spicy_mch4ggis@reddit

You reckon Q5 is good? I’d get more context window. I use Q6 and get around 80k context on a headless 5090.
View on Reddit #81401696

Lucis_unbra@reddit

Q5 is fine, but Q6 is better. At and above Q5, at this size, it's generally about the margin of error you're willing to accept. If the BF16 makes 100 errors per 1000 prompts, then the Q8 might make 110, Q6 115, Q5 130 and so on. Obviously not a perfect illustration, and it depends on the use case and the definition of an error. An error might be small, it might be something that's just suboptimal, or it can be fatal for the task.

So the question at Q5-Q8 is, if you've accepted not having the perfect example of the model, you've accepted at least Q8. How much more are you willing to give up for speed and context?

My tests show that something starts happening at Q4 for both 27B and 35B where they can deviate 100% on a token, with no chance to match the reference output on that token. It doesn't happen on any Q5 quants.

So if we say that Q5 is the last stop where the quant is likely a good representation of the capabilities of the model, but will make more errors due to the increased uncertainty... Are you willing to accept a higher risk of a few tokens messing up the output? Or do you want the model to be safer?
View on Reddit #81422175
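
To make the illustration above concrete, here is a small sketch that turns the commenter's hypothetical error counts into relative increases over BF16. The numbers are the illustrative ones from the comment, not benchmark results, and the function name is made up for this example.

```python
# Hypothetical errors per 1000 prompts, taken from the illustration in the comment.
ERRORS_PER_1000 = {"BF16": 100, "Q8": 110, "Q6": 115, "Q5": 130}


def relative_increase(quant: str, baseline: str = "BF16") -> float:
    """Fractional increase in errors for `quant` relative to `baseline`."""
    base = ERRORS_PER_1000[baseline]
    return (ERRORS_PER_1000[quant] - base) / base


if __name__ == "__main__":
    for quant in ("Q8", "Q6", "Q5"):
        extra = ERRORS_PER_1000[quant] - ERRORS_PER_1000["BF16"]
        pct = relative_increase(quant) * 100
        print(f"{quant}: {extra} extra errors per 1000 prompts ({pct:.0f}% more than BF16)")
```

Framed this way, the choice between Q6 and Q5 in the example is a choice between 15% and 30% more errors than the full-precision model, traded against speed and context.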

Current_Ferret_4981@reddit

Also depends on whose quant you use, but I agree. That being said, I also use it for code that I can typically write and have unit tests for, so error correction takes less than an hour or two even if the errors are fatal. I do think speed, context, and accuracy are the tradeoffs. But you can pretty much have all of those with a 6000 Pro and still have room for more, which would suggest a nice 40-70GB version could be perfect.
View on Reddit #81433681

Lucis_unbra@reddit

Sure, the quant itself can matter. It's more a generalization regarding any competent quant in that range. I'd say that if you have even one 6000 Pro... That's a serious step up. At that point I'd probably forget all about the 27B model and jump to the 122B at Q4-5 for most tasks if possible. The model is a step above, and will run faster than the 27B. You gain a lot of knowledge, even if it can't be run at max quality. The 27B at native precision, served with a high-performance engine, is suddenly also a very interesting option. You've got a lot of room there for some secondary models: image generation, voice, if that's something one cares about, or subagent / task models. For the 35B and 27B however? You can have your cake and eat it too.
View on Reddit #81445374

Current_Ferret_4981@reddit

Q5 doesn't seem significantly worse than Q8 from what I see. Could do Q6 with semi smaller context or Q5 with any realistic context. But yeah Q6 and 64k context is a sweet spot if that works for use cases, or Q5 and larger context
View on Reddit #81412037

fulgencio_batista@reddit

With a 5090, nvfp4 is worth a shot, heard it performs much more similarly to q8
View on Reddit #81406679

-Ellary-@reddit

A Qwen 3.5 72b dense will perform the same as a Qwen 3.5 220b A20b\~. I'd say it will be around the old Qwen 3 235b A22b +20-30%.
View on Reddit #81436927

fractalcrust@reddit

it would be too powerful
View on Reddit #81436707

qubridInc@reddit

A Qwen 72B dense model would probably provide solid, reliable reasoning and coding performance that's similar to top-tier closed models, but it'll come with higher computing costs and be less efficient than MoE setups.
View on Reddit #81402751

Rich_Artist_8327@reddit

But what I have noticed is that with multi GPU setup using tensor parallelism, dense models do work much better than moe for some reason. Do moe models require more bandwidth between gpus?
View on Reddit #81413892

a_beautiful_rhind@reddit

MoE models are usually larger so you have to offload. Dense 70 and 120b run circles around MoE where half of it is sitting on ram. MoE is a tradeoff of memory footprint for compute.
View on Reddit #81435436
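
The memory-versus-compute tradeoff described above can be sketched with two rough rules of thumb: weight memory scales with total parameters, while per-token compute scales with active parameters (about 2 FLOPs per active parameter per generated token). The 72B dense model is hypothetical, the MoE sizes are the 235B-A22B figures mentioned elsewhere in this thread, and the 5.5 bits-per-weight value is an assumed Q5-like quant.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions

    def weight_gb(self, bits_per_weight: float) -> float:
        """Approximate weight size in GB; all parameters must be stored."""
        return self.total_b * bits_per_weight / 8

    def gflops_per_token(self) -> float:
        """Rule-of-thumb compute: ~2 FLOPs per active parameter per token."""
        return 2 * self.active_b


dense = Model("hypothetical 72B dense", total_b=72, active_b=72)
moe = Model("235B-A22B MoE", total_b=235, active_b=22)

if __name__ == "__main__":
    for m in (dense, moe):
        print(
            f"{m.name}: ~{m.weight_gb(5.5):.0f} GB of weights at 5.5 bpw (assumed), "
            f"~{m.gflops_per_token():.0f} GFLOPs per token"
        )
```

Under these assumptions the MoE needs roughly three times the memory but about a third of the compute per token, which is why it tends to win when it fits in VRAM and lose when much of it has to sit in system RAM.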

toothpastespiders@reddit

I really mourn the near extinction of 70b dense models.
View on Reddit #81416603

JustFinishedBSG@reddit

Slowly
View on Reddit #81394331

Admirable-Star7088@reddit

Generally, I personally like quality more than speed, so I would absolutely dig a Qwen3.5 \~70b dense model. With speculative decoding, 70b dense is not unreasonably slow on RAM (imo), at least not for more casual use cases such as chatting/Q&A.
View on Reddit #81414605

El_90@reddit

Yes please 94GB please lol
View on Reddit #81414250

ForsookComparison@reddit

A Qwen3.5-72B dense would have potential to be SOTA-at-home in a lot of use-cases. But it doesn't always work that way. Qwen2.5-72B really only beat Qwen2.5-32B in knowledge-depth. It's not an automatic win.
View on Reddit #81393847

mrpkeya@reddit

I like "SOTA-at-home"
View on Reddit #81405967

Expensive-Paint-9490@reddit

Like 397B-A17B, roughly.
View on Reddit #81401091

StrikeOner@reddit

not as good as the 328B dense model!
View on Reddit #81398501

jacek2023@reddit

Qwen said during the release of Qwen 3 that they have no plans to build dense models bigger than 32B (and now it's just 27B)
View on Reddit #81394057

colin_colout@reddit

it's gotta run reasonably well on Chinese AI accelerators. MoE is the only real option.
View on Reddit #81397719

Gohab2001@reddit

5tps lol
View on Reddit #81397145

nacholunchable@reddit

In terms of quality, probably better, but I use a dumber 20-40 tps model over my smarter 10 tps model, so I can't imagine what a modern 72B would give me. I, like you, wish we had access though, because for some agentic stuff I run it overnight anyways. The problem is that bigger dense models are harder and take longer to train, so getting a big dense model doesn't necessarily mean you've got a modern, capable, well-trained big dense model. If we did, I'd love it for non-chat, hands-off stuff. But if I could only pick one flavor, I'd pick an MoE so I could use the damn thing.
View on Reddit #81396390

SillyLilBear@reddit

Really well but would be slow as hell and would be really hard to run.
View on Reddit #81394332

rhinodevil@reddit

On the low end, at least for my use cases, the dense 9B model performs way better than the 30B and 35B(-A3B) MoE models. The dense 27B model is unfortunately too slow on my consumer-grade hardware. So I guess a 72B dense model would perform much better than 9B and 27B.
View on Reddit #81393408