Users of Qwen3-Next-80B-A3B-Instruct-GGUF, How is Performance & Benchmarks?
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 32 comments
It's been over a day since we got the GGUFs. Please share your experience. Thanks!
At first, I didn't believe that we could run this model with just 30GB of RAM (yes, RAM only). Unsloth actually posted a thread, and someone shared a stat in it:
17 t/s just with 32GB RAM + 10GB VRAM using Q4
Good for Poor GPU Club.
MutantEggroll@reddit
As others have said, Qwen3-Next seems to me like more of a proof-of-concept or pre-release to allow inference providers to update their infrastructure before the actual next-gen models are available. It's done quite poorly on my personal benchmarks (mostly coding prompts) so far, though speed is acceptable (~1000 tk/s pp, ~20 tk/s tg with DDR5-6000 and a 5090).
As an example, it cannot consistently create a working Snake game. For reference, both Qwen3-Coder-30B-A3B:Q6_K_XL and GPT-OSS-120B manage to make a working game every time.
xxPoLyGLoTxx@reddit
I feel like this is the de facto response for models that don’t perform well. “This model is more of a proof of concept”. I call BS.
And I’m not saying qwen3-next is bad, because it is a fine model and worth using. But please don’t excuse poor performance as a “proof of concept”. These providers are not spending millions of dollars to train a model for it to be bad.
An example is “Kimi-Linear”. I love Kimi-k2. Fantastic model. Kimi-Linear is far far worse. It gets the most basic things wrong. I can’t recommend its use (this model qwen3-next is better). But don’t chalk it up to “proof-of-concept”.
blbd@reddit
But both Alibaba Cloud and Kimi gave people public statements that the first releases of these two models include experimental architectural changes. If the providers are openly admitting this and don't have a track record of lying, then why shouldn't we believe them when they say these were intentional experiments with new design changes that might take time to finalize? It's basically the equivalent of a beta version of regular software.
Iory1998@reddit
Don't blame the model. You should know that the llama.cpp implementation was done without optimization. The developer clearly and openly said that. The implementation still needs work.
MutantEggroll@reddit
As I understood the PR discussions, the lack of optimization only applies to pp/tg speed, not the quality of the output. So an optimized implementation will just produce broken Snake games faster, lol.
Iory1998@reddit
Fair point. :D
MustBeSomethingThere@reddit
>"An example is “Kimi-Linear”. I love Kimi-k2. Fantastic model. Kimi-Linear is far far worse."
No sh*t? Kimi K2 is a 1T-A32B model and Kimi Linear is a 48B-A3B model.
xxPoLyGLoTxx@reddit
Of course, but it gets things wrong that a smaller model won’t. It’s just not a very good model at all. In other words, a 20-30B model will still be better than it.
MutantEggroll@reddit
Seems I've struck a nerve.
I'm not by any means "excusing" poor performance, and in fact, if you'd read past the first sentence before getting on your soapbox, you'd have seen that I explicitly called it worse than Qwen3-Coder-30B-A3B.
mantafloppy@reddit
If you have issues with Qwen3-Next-80B-A3B-Instruct-GGUF, it's because the llama.cpp integration was vibe coded.
Qwen3-Next-80B-A3B-Instruct-MLX-4bit is great.
I just tried a snake game, and it worked easily on the first try.
Give me any prompt you want to test, and I'll give you the result.
MutantEggroll@reddit
That's quite an accusation - pwilkin worked on Qwen3-Next support for quite some time, and I didn't get the sense from the PR thread that it was vibe coded.
I'd be interested to see the aider polyglot results of the MLX, although that's a pretty big ask so please don't feel obligated.
mantafloppy@reddit
Is this not the one that was merged? The author himself says he vibe coded it, so it's not an accusation...
https://www.reddit.com/r/LocalLLaMA/comments/1occyly/qwen3next_80ba3b_llamacpp_implementation_with/
I'll check what aider polyglot is and how to run it.
MutantEggroll@reddit
Ah ok, I thought you were referring to the whole PR, I follow now. And yeah, it was mentioned a few times in the PR(s) that the implementation wasn't yet optimized, so I'm not passing any judgment on current pp/tg speed. As I understand it though, the model implementation is "correct", so we shouldn't expect higher-quality outputs if it gets optimized, just better speeds.
Aider Polyglot is a pretty intense coding benchmark, and IMO its results closely match my subjective experience with the LLMs I've used. You can find the README for running it here, and I also made a post where I ran it myself, and it has the commands I used, which may be helpful as a reference.
As an FYI, my GPT-OSS-120B runs took ~8 hours at ~1000 tk/s pp and ~40 tk/s tg, and my Qwen3-Coder-30B-A3B runs took ~2.5 hours at ~4000 tk/s pp and ~150 tk/s tg, so they're great to run at night to keep your place warm :)
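For anyone wanting to try the same thing locally, the usual setup is to serve the GGUF with llama-server and point Aider at its OpenAI-compatible endpoint. A rough sketch, assuming current llama.cpp and Aider conventions (the model filename, context size, and model alias are placeholders, not exact values from this thread):

```shell
# Serve the GGUF locally (adjust -ngl and -c for your hardware).
llama-server -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
  -ngl 99 -c 16384 --port 8080 &

# Aider can talk to any OpenAI-compatible server via these variables;
# the API key is required by the client but unused by llama-server.
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=sk-local
aider --model openai/qwen3-next-80b-a3b-instruct
```

The Aider Polyglot benchmark itself is run from the benchmark harness in the aider repo against such an endpoint; see its README (linked above) for the exact commands.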
Foreign-Beginning-49@reddit
Vibe coding that works and is done by someone with domain expertise or experience is just called enhanced coding. pwilkin worked on this for months. AI-assisted or not, we are quite lucky to have open source contributions. Best wishes
Daniel_H212@reddit
I tested all the models I had with a single use case: basically an instruction-following text summarization task (which also needed some external knowledge), about 15k tokens. I usually use few-shot prompting with a custom GPT in ChatGPT for this task, but I did zero-shot prompting to test these models to see how they'd do. Running on a Strix Halo 128 GB system.
gpt-oss: adhered to prompt format almost exactly, made critical factual error, 350 t/s pp, 35 t/s tg
glm-4.5-air @ q3_k: adhered to prompt format almost exactly, highest quality output, 250 t/s pp, 14 t/s tg
intellect-3 @ q3_k: failed to adhere to prompt format properly because it got fancy, 250 t/s pp, 14 t/s tg
qwen3-vl-32b-thinking @ q8_0: failed to adhere to prompt format by missing a small portion, summary was too long which decreased usefulness, 160 t/s pp, 5 t/s tg
qwen3-vl-30b-a3b-instruct @q8_0: adhered to prompt format almost exactly but added a tiny bit of unwanted formatting, summary also too long, 350 t/s pp, 30 t/s tg
qwen3-vl-30b-a3b-thinking @q8_0: failed to adhere to prompt format by missing a small portion, but output quality was pretty good, 350 t/s pp, 30 t/s tg
lfm2-8b-a1b @ q8_0: failed to adhere to prompt format entirely, but text summary was more or less accurate, 1200 t/s pp, 75 t/s tg
qwen3-next-80b-a3b-instruct @ q4_k_m: failed to adhere to several specifications in prompt format, response too long to be useful, 260 t/s pp, 14 t/s tg
qwen3-next-80b-a3b-thinking @ q4_k_m: adhered to prompt format perfectly (the only one to do so), used more thinking tokens than any other model (almost 2k), output was, in my opinion, perfect, 260 t/s pp, 14 t/s tg
Aggressive-Bother470@reddit
GlobalLadder9461@reddit
What is the benchmark on your machine for Qwen3-30B-A3B Q4_K_L? Only then can some comparison be made.
Aggressive-Bother470@reddit
Around 180t/s
petuman@reddit
A 30B would fit into 3090 VRAM, so 100+ t/s tg
maxwell321@reddit
Something isn't right. I used the AWQ version a couple months ago and it absolutely wiped the floor with a FP8 version of Qwen3-30B-A3B-Instruct. The fact that the GGUF version is giving issues makes me think something went wrong, either with the GGUF quants themselves or the llama.cpp implementation.
mantafloppy@reddit
If you have issues with Qwen3-Next-80B-A3B-Instruct-GGUF, it's because the llama.cpp integration was vibe coded.
Qwen3-Next-80B-A3B-Instruct-MLX-4bit is great.
LocoMod@reddit
And the MLX version wasn’t?
mantafloppy@reddit
You can go ask https://github.com/ml-explore/mlx-lm
They didn't boast about it like this one: https://www.reddit.com/r/LocalLLaMA/comments/1occyly/qwen3next_80ba3b_llamacpp_implementation_with/
MDT-49@reddit
I don't have a strong opinion (yet) on the "intelligence" of Qwen3-Next, but in my test environment its performance (t/s) is lacking compared to Qwen3-30B-A3B.
build: e072b2052 (7190)
This was done on a VPS based on a (shared) AMD EPYC Genoa CPU, so the results can be influenced by noisy neighbors, but they're pretty consistent across multiple tests.
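For anyone who wants to reproduce this kind of comparison, llama.cpp ships a benchmarking tool that prints the `build:` line above along with pp/tg throughput tables. A sketch, with placeholder model filenames:

```shell
# Benchmark two GGUFs back to back: 512-token prompt processing (-p)
# and 128-token generation (-n), using all available CPU threads (-t).
llama-bench \
  -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
  -m Qwen3-30B-A3B-Q4_K_M.gguf \
  -p 512 -n 128 -t $(nproc)
```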
texasdude11@reddit
Qwen3-Next doesn't perform as well for me as Qwen3-Coder-30B.
Qwen3-Coder-30B is just a phenomenal instruction-following and tool-calling model.
StardockEngineer@reddit
Yeah same. I asked it specifically to do a web search… and it didn’t. Had to reprompt.
Red_Redditor_Reddit@reddit
I'm getting 3.5 tk/s generation and 11.5 tk/s prompt eval on dual-channel DDR4, Q5 on 64GB. If it was a vision model it would be golden.
drwebb@reddit
Its thinking is super unique, also discombobulated and stream-of-consciousness. It seems cool, but I think it kind of loses context because of it. At first you might think it rocks for long-running agentic tasks, but then you realize it kind of lost the plot after step one.
Long_comment_san@reddit
I've always said the best part of AI would be having all these things without the cloud. Yeah, those cloud models are insane, but having a good enough model that fits in 16GB VRAM + 64GB RAM, and even better ones at 24-32GB + 128GB, is a godsend. You can do so fking much with just reasonable-grade hardware!
Salt_Discussion8043@reddit
It benchmarks really well; it's a viable model.
egomarker@reddit
You are simply using your SSD as RAM if the model doesn't fit. Luckily it's (almost) only disk reads, so at least it doesn't thrash your drive.