Users of Qwen3-Next-80B-A3B-Instruct-GGUF, How is Performance & Benchmarks?
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 32 comments
It's been over a day since we got the GGUFs. Please share your experience. Thanks!
At first, I didn't believe that we could run this model with just 30GB of RAM (yes, RAM only). Unsloth actually posted a thread, and someone shared a stat in it:
17 t/s just with 32GB RAM + 10GB VRAM using Q4
Good for Poor GPU Club.
MutantEggroll@reddit
As others have said, Qwen3-Next seems to me like more of a proof-of-concept or pre-release to allow inference providers to update their infrastructure before the actual next-gen models are available. It's done quite poorly on my personal benchmarks (mostly coding prompts) so far, though speed is acceptable (~1000 tk/s pp, ~20 tk/s tg with DDR5-6000 and a 5090).
As an example, it cannot consistently create a working Snake game. For reference, both Qwen3-Coder-30B-A3B:Q6_K_XL and GPT-OSS-120B manage to make a working game every time.
xxPoLyGLoTxx@reddit
I feel like this is the de facto response for models that don’t perform well. “This model is more of a proof of concept”. I call BS.
And I’m not saying qwen3-next is bad, because it is a fine model and worth using. But please don’t excuse poor performance as a “proof of concept”. These providers are not spending millions of dollars to train a model for it to be bad.
An example is “Kimi-Linear”. I love Kimi-k2. Fantastic model. Kimi-Linear is far far worse. It gets the most basic things wrong. I can’t recommend its use (this model qwen3-next is better). But don’t chalk it up to “proof-of-concept”.
blbd@reddit
But both Alibaba Cloud and Kimi gave people public statements that the first releases of these two models include experimental architectural changes. If the providers are openly admitting this and don't have a track record of lying, then why shouldn't we believe them when they say these were intentional experiments with new design changes that might take time to finalize? It's basically the equivalent of a beta version of regular software.
Iory1998@reddit
Don't blame the model. You should know that the llama.cpp implementation was done without optimization. The developer clearly and openly said that. The implementation still needs work.
MutantEggroll@reddit
As I understood the PR discussions, the lack of optimization only applies to pp/tg speed, not the quality of the output. So an optimized implementation will just produce broken Snake games faster, lol.
Iory1998@reddit
Fair point. :D
MustBeSomethingThere@reddit
>"An example is “Kimi-Linear”. I love Kimi-k2. Fantastic model. Kimi-Linear is far far worse."
No sh*t? Kimi K2 is a 1T-A32B model and Kimi Linear is a 48B-A3B model.
xxPoLyGLoTxx@reddit
Of course, but it gets things wrong that a smaller model won’t. It’s just not a very good model at all. In other words, a 20-30B model will still be better than it.
MutantEggroll@reddit
Seems I've struck a nerve.
I'm not by any means "excusing" poor performance, and in fact, if you'd read past the first sentence before getting on your soapbox, you'd have seen that I explicitly called it worse than Qwen3-Coder-30B-A3B.
mantafloppy@reddit
If you have issues with Qwen3-Next-80B-A3B-Instruct-GGUF, it's because the llama.cpp integration was vibe coded.
Qwen3-Next-80B-A3B-Instruct-MLX-4bit is great.
I just tried a snake game, and it worked easily on the first try.
Give me any prompt you want to test, and I'll give you the result.
MutantEggroll@reddit
That's quite an accusation - pwilkin worked on Qwen3-Next support for quite some time, and I didn't get the sense from the PR thread that it was vibe coded.
I'd be interested to see the aider polyglot results of the MLX, although that's a pretty big ask so please don't feel obligated.
mantafloppy@reddit
Is this not the one that was merged? The author himself says he vibe coded it, so it's not an accusation...
https://www.reddit.com/r/LocalLLaMA/comments/1occyly/qwen3next_80ba3b_llamacpp_implementation_with/
I'll check what aider polyglot is and how to run it.
MutantEggroll@reddit
Ah ok, I thought you were referring to the whole PR, I follow now. And yeah, it was mentioned a few times in the PR(s) that the implementation wasn't yet optimized, so I'm not passing any judgment on current pp/tg speed. As I understand it though, the model implementation is "correct", so we shouldn't expect higher-quality outputs if it gets optimized, just better speeds.
Aider Polyglot is a pretty intense coding benchmark, and IMO its results closely match my subjective experience with the LLMs I've used. You can find the README for running it here, and I also made a post where I ran it myself, and it has the commands I used, which may be helpful as a reference.
As an FYI, my GPT-OSS-120B runs took ~8 hours at ~1000 tk/s pp and ~40 tk/s tg, and my Qwen3-Coder-30B-A3B runs took ~2.5 hours at ~4000 tk/s pp and ~150 tk/s tg, so they're great to run at night to keep your place warm :)
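For anyone wanting to try the same thing locally, the usual setup is to serve the GGUF with llama-server and point Aider at its OpenAI-compatible endpoint. A rough sketch, assuming current llama.cpp and Aider conventions (the model filename, context size, and model alias are placeholders, not exact values from this thread):

```shell
# Serve the GGUF locally (adjust -ngl and -c for your hardware).
llama-server -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 16384 --port 8080 &

# Aider can talk to any OpenAI-compatible server via these variables;
# the API key is required by the client but unused by llama-server.
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local
aider --model openai/qwen3-next-80b-a3b-instruct
```

The Aider Polyglot benchmark itself is run from the benchmark harness in the aider repo against such an endpoint; see its README (linked above) for the exact commands.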
Foreign-Beginning-49@reddit
Vibe coding that works and is done by someone with domain expertise or experience is just called enhanced coding. pwilkin worked on this for months. AI-assisted or not, we are quite lucky to have open source contributions. Best wishes
Daniel_H212@reddit
I tested all the models I had with a single use case: basically an instruction-following text summarization task (which also needed some external knowledge), about 15k tokens. I usually use few-shot prompting with a custom GPT in ChatGPT for this task, but I did zero-shot prompting to test these models to see how they'd do. Running on a Strix Halo 128 GB system.
gpt-oss: adhered to prompt format almost exactly, made critical factual error, 350 t/s pp, 35 t/s tg
glm-4.5-air @ q3_k: adhered to prompt format almost exactly, highest quality output, 250 t/s pp, 14 t/s tg
intellect-3 @ q3_k: failed to adhere to prompt format properly because it got fancy, 250 t/s pp, 14 t/s tg
qwen3-vl-32b-thinking @ q8_0: failed to adhere to prompt format by missing a small portion, summary was too long which decreased usefulness, 160 t/s pp, 5 t/s tg
qwen3-vl-30b-a3b-instruct @q8_0: adhered to prompt format almost exactly but added a tiny bit of unwanted formatting, summary also too long, 350 t/s pp, 30 t/s tg
qwen3-vl-30b-a3b-thinking @q8_0: failed to adhere to prompt format by missing a small portion, but output quality was pretty good, 350 t/s pp, 30 t/s tg
lfm2-8b-a1b @ q8_0: failed to adhere to prompt format entirely, but text summary was more or less accurate, 1200 t/s pp, 75 t/s tg
qwen3-next-80b-a3b-instruct @ q4_k_m: failed to adhere to several specifications in prompt format, response too long to be useful, 260 t/s pp, 14 t/s tg
qwen3-next-80b-a3b-thinking @ q4_k_m: adhered to prompt format perfectly (the only one to do so), used more thinking tokens than any other model (almost 2k), output was, in my opinion, perfect, 260 t/s pp, 14 t/s tg
Aggressive-Bother470@reddit
GlobalLadder9461@reddit
What is the benchmark on your machine for Qwen3-30B-A3B Q4_K_L? Only then can some comparison be made.
Aggressive-Bother470@reddit
Around 180t/s
petuman@reddit
A 30B would fit into 3090 VRAM, so 100+ t/s tg
maxwell321@reddit
Something isn't right. I used the AWQ version a couple months ago and it absolutely wiped the floor with a FP8 version of Qwen3-30B-A3B-Instruct. The fact that the GGUF version is giving issues makes me think something went wrong, either with the GGUF quants themselves or the llama.cpp implementation.
mantafloppy@reddit
If you have issues with Qwen3-Next-80B-A3B-Instruct-GGUF, it's because the llama.cpp integration was vibe coded.
Qwen3-Next-80B-A3B-Instruct-MLX-4bit is great.
LocoMod@reddit
And the MLX version wasn’t?
mantafloppy@reddit
You can go ask https://github.com/ml-explore/mlx-lm
They didn't boast about it like this one: https://www.reddit.com/r/LocalLLaMA/comments/1occyly/qwen3next_80ba3b_llamacpp_implementation_with/
MDT-49@reddit
I don't have a strong opinion (yet) on the "intelligence" of Qwen3-Next, but in my test environment its performance (t/s) is lacking compared to Qwen3-30B-A3B.
build: e072b2052 (7190)
This was done on a VPS based on a (shared) AMD EPYC Genoa CPU, so the results can be influenced by noisy neighbors, but they're pretty consistent across multiple tests.
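For anyone who wants to reproduce this kind of comparison, llama.cpp ships a benchmarking tool that prints the `build:` line above along with pp/tg throughput tables. A sketch, with placeholder model filenames:

```shell
# Benchmark two GGUFs back to back: 512-token prompt processing (-p)
# and 128-token generation (-n), using all available CPU threads (-t).
llama-bench \
  -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
  -m Qwen3-30B-A3B-Q4_K_M.gguf \
  -p 512 -n 128 -t $(nproc)
```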
texasdude11@reddit
Qwen3-Next doesn't perform as well for me as Qwen3-Coder-30B.
Qwen3-Coder-30B is just a phenomenal instruction-following and tool-calling model.
StardockEngineer@reddit
Yeah same. I asked it specifically to do a web search… and it didn’t. Had to reprompt.
Red_Redditor_Reddit@reddit
I'm getting 3.5 tk/s generation and 11.5 tk/s prompt eval on dual-channel DDR4, Q5 on 64GB. If it was a vision model it would be golden.
drwebb@reddit
Its thinking is super unique, also discombobulated and stream-of-consciousness. It seems cool, but I think it kind of loses context because of it. At first you might think it rocks for long-running agentic tasks, but then you realize it kind of lost the plot after step one.
Long_comment_san@reddit
I've always said the best part of AI would be having all these things without the cloud. Yeah, those cloud models are insane, but having a good enough model that fits in 16GB VRAM + 64GB RAM, and even better ones at 24-32GB + 128GB, is a godsend. You can do so fking much with just reasonable-grade hardware!
Salt_Discussion8043@reddit
It benchmarks really well; it's a viable model.
egomarker@reddit
You are simply using your SSD as RAM if the model doesn't fit. Luckily it's (almost) only disk reads, so at least it doesn't thrash your drive.