How the heck is Qwen3-Coder so fast? Nearly 10x faster than other models.
Posted by CSEliot@reddit | LocalLLaMA | View on Reddit | 26 comments
My Strix Halo with 64GB allocated as VRAM (the other half as system RAM) runs Qwen3-Coder at roughly 30 t/s. And that's the Unsloth Q8_K_XL 36GB quant.
Others of SIMILAR SIZE AND QUANT perform at maybe 4-10 tok/s.
How is this possible?! Seed-OSS-36B (Unsloth) gives me 4 t/s (although it does produce more accurate results given a system prompt).
You can see results from benchmarks here:
https://kyuz0.github.io/amd-strix-halo-toolboxes/
I'm speaking from personal experience, but this benchmark tool backs it up.
horribleGuy3115@reddit
What does your hardware setup look like for running it?
CSEliot@reddit (OP)
It's just the ROG FlowZ13 2025 w/ 128GB of Unified RAM.
Mean_Employment_7679@reddit
Considering getting a 5090 for local coding.
Does the speed make up for the shortcomings Vs something like opus 4.5? Or should I just use the money for 2 years of Claude max?
OcelotMadness@reddit
Depends on if Data security matters to you. If your using an API then you should assume that the company is reading your requests and seeing your code when Claude code edits it.
Mean_Employment_7679@reddit
Yeah I've held back on sending anything private, but that peace of mind would be valuable.
But will it actually work or is it throwing money at a solution that won't pay off?
OcelotMadness@reddit
It will work but temper your expectations. I would load up your chosen models over API and use them like that at first so you get a good idea of the quality of the answers your going to get. Then look up benchmarks and a token per second visualizer so you can see it will be a little slower than API.
If your looking for a superintelligent AI that will write all your code for you, your definitely not gonna get that, even with claude, but if you have stackoverflow esque questions and want some boilerplate, you will be served well by qwen3 30b or any similar. (Ive heard the new dense 32B VL is actually good at coding, especially if you like to draw your UIs before implementing them, I would look into that)
Overall its a huge purchase dude, im sorry to pull this card but your going to need to research a little.
social_tech_10@reddit
Your answers seem intelligent and thoughtful, but every time I read "your" when it should be "you're", it makes a sound in my head like fingernails grating on a blackboard. That triggers me to think "don't listen to this guy, he's an idiot", and every single time I have to claw myself back from the edge and try to convince myself, "no, don't worry, this is a perfectly reasonable answer, don't be a grammar nazi, just let it go". And so I try to do that, but it's still a huge distraction.
Artistic_Okra7288@reddit
I can run gpt-oss-20b at about 200 tps which is insane to me. I wish we could optimize Qwen3 MoE to get that fast.
cafedude@reddit
BTW: as a strix halo owner I really appreciate your comprehensive spreadsheet with all of your test results for various models and quants. Thank you!
CSEliot@reddit (OP)
Not my work, just a fan.
Btw, there are two variants of Strix Halo: mobile and desktop. On my ROG FlowZ13 I'm getting about 50% of the performance of desktop builds. The chart doesn't show this.
AlbeHxT_1@reddit
It's a mixture-of-experts (MoE) model: 30B parameters total, but only ~3B activated per token.
Seed-OSS-36B is a dense model, so all parameters are used for every token; that's why it's slower.
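The speed gap tracks that active-parameter gap. A back-of-envelope sketch (the bandwidth and quantization figures below are illustrative assumptions, not measurements) of decode speed for a memory-bandwidth-bound setup:

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound setup.
# All numbers here are illustrative assumptions, not measurements.
bandwidth_gb_s = 256      # assumed unified-memory bandwidth
bytes_per_param = 1.0     # roughly 1 byte/param at Q8-ish quantization

def tokens_per_second(active_params_billions):
    # Generating one token requires streaming every *active* weight
    # through the compute units once.
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(round(tokens_per_second(3), 1))   # → 85.3  (MoE: ~3B weights read/token)
print(round(tokens_per_second(36), 1))  # → 7.1   (dense: all 36B read/token)
```

The ~12x ratio between the two lines is the same order of magnitude as the 30 t/s vs 4 t/s numbers reported above.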
CSEliot@reddit (OP)
OOooooooh that makes perfect sense. I feel dumb for not realizing that. Thank you!
XiRw@reddit
Thanks for explaining that in a clear concise way. Someone else mentioned the differences between the two but it sounded very convoluted or possibly inaccurate.
aiueka@reddit
How does the 30b moe compare to the 32b non moe for performance?
mantafloppy@reddit
As the other comments explain, the answer is MoE.
You created the confusion by shortening the model name to something that doesn't actually exist...
You can easily see that it's MoE from the name.
Qwen/Qwen3-Coder-480B-A35B-Instruct
Qwen/Qwen3-Coder-30B-A3B-Instruct
Qwen/Qwen3-Next-80B-A3B-Instruct
Qwen/Qwen3-32B
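The naming convention in that list can be decoded mechanically: `30B` is total parameters, `A3B` is active parameters, and dense models have no `A…B` part. A small sketch (the function and regexes are my own illustration, not an official parser):

```python
import re

def parse_qwen_name(name):
    """Extract total and active parameter counts (in billions) from a
    model name. Returns (total_B, active_B); active is None for dense."""
    total = re.search(r"-(\d+)B", name)
    active = re.search(r"-A(\d+)B", name)
    return int(total.group(1)), int(active.group(1)) if active else None

print(parse_qwen_name("Qwen/Qwen3-Coder-30B-A3B-Instruct"))  # → (30, 3)
print(parse_qwen_name("Qwen/Qwen3-32B"))                     # → (32, None)
```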
decrement--@reddit
So 30B is the total size and A3B is the active weights?
Medium_Chemist_4032@reddit
Isn't qwen3-coder simply an A3B MoE variant? So, it's a set of 3B experts?
Medium_Chemist_4032@reddit
Downvoters - care to explain why the same answer below is upvoted? huh
chibop1@reddit
You don't know this sub has downvoting bots? A comment that just says "thanks" gets downvotes.
iron_coffin@reddit
It might be reddit's antibot algo too
PeithonKing@reddit
don't worry, I downvoted the other comment
AlbeHxT_1@reddit
bruh
iron_coffin@reddit
I didn't downvote (but I'm a gamer that knows everything): He doesn't know enough to understand your answer based on his question.
Steuern_Runter@reddit
Actually each expert is only around 0.4B parameters, but 8 of them are active at the same time.
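A toy sketch of how that routing works (illustrative only, not Qwen's actual code; the 128-expert count is an assumption for illustration, and a real router uses a learned scoring network rather than random scores):

```python
import random

# Toy sketch of MoE top-k routing. Assumed figures: 128 small experts
# per MoE layer, 8 routed per token.
N_EXPERTS, TOP_K = 128, 8

def route(token_scores):
    # The router scores every expert, but only the TOP_K highest-scoring
    # experts actually run (i.e., have their weights read) for this token.
    ranked = sorted(range(N_EXPERTS), key=lambda i: token_scores[i], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
scores = [random.random() for _ in range(N_EXPERTS)]  # stand-in for learned router scores
active = route(scores)
print(len(active), "of", N_EXPERTS, "experts run for this token")  # → 8 of 128 ...
```

So per token you pay for 8 small experts (≈0.4B each → ~3B active), not all 128.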
getting_serious@reddit
Wait until you see the 480B-A35B one.
suicidaleggroll@reddit
It's an MoE model. Very very roughly, it has the "knowledge" of a 30b model but runs at the speed of a 3b model. A 30b-a3b MoE model is not quite as good as a dense 30b model, but is much much better than a dense 3b model, and runs roughly at the speed of a 3b model assuming you have enough VRAM to hold the whole thing (even if you don't, MoE models allow you to offload individual experts to the CPU without impacting performance nearly as much as offloading part of a dense model).
Most of the big models are MoE - MiniMax, Qwen, Kimi, Deepseek, etc. because they offer a good compromise between accuracy and speed, provided you have lots of RAM+VRAM.
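The offloading point above can be made concrete with rough arithmetic. A sketch under assumed bandwidth figures (not measurements): offloaded dense weights are read on every token, while an offloaded expert is only read when the router happens to pick it.

```python
# Why offloading experts hurts less than offloading dense layers:
# per-token decode time estimate under illustrative assumptions.
gpu_bw = 256e9   # bytes/s, fast (GPU/unified) memory - assumption
cpu_bw = 64e9    # bytes/s, slow (CPU) memory - assumption

def decode_time(fast_bytes, slow_bytes):
    # Time to stream the per-token weights from each memory pool.
    return fast_bytes / gpu_bw + slow_bytes / cpu_bw

# Dense 30B model (~1 byte/param at Q8) with a third offloaded:
# the offloaded 10B is read for EVERY token.
dense = decode_time(20e9, 10e9)

# 30B-A3B MoE with a third of the experts offloaded:
# on average only ~1/3 of the ~3B active weights come from CPU.
moe = decode_time(2e9, 1e9)

print(f"dense w/ offload: {1/dense:.1f} tok/s, MoE w/ offload: {1/moe:.1f} tok/s")
```

Under these assumptions the MoE model stays roughly 10x faster even with the same fraction of weights pushed to CPU RAM.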