Qwen3-Next-80B-A3B vs gpt-oss-120b
Posted by bfroemel@reddit | LocalLLaMA | View on Reddit | 51 comments
Benchmarks aside - who has had the better experience, with which model, and why? Please comment with your use cases (and your software stack, if you use more than llama.cpp/vllm/sglang).
My main use case is agentic coding/software engineering (Python, see my comment history for details), and gpt-oss-120b remains the clear winner (although I am limited to Qwen3-Next-80B-A3B-Instruct-UD-Q8_K_XL; using the recommended sampling parameters for both models). I haven't tried tool calls with Qwen3-Next yet, just simple coding tasks right within llama.cpp's web frontend. For me, gpt-oss consistently comes up with a more nuanced, correct solution faster, while Qwen3-Next usually needs more shots. (Funnily, when I let gpt-oss-120b correct a solution that Qwen3-Next considers production-grade, Qwen3-Next admits its mistakes right away and has nothing but praise for the corrections.) I did not even try the Thinking version, because benchmarks (e.g., see the aider Discord) show that Instruct is much better than Thinking for coding use cases.
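For concreteness, the "recommended sampling parameters" above map to llama-server flags roughly like this (a sketch: the values are quoted from memory from each model card, and the GGUF filenames are placeholders; check the cards before copying):

```shell
# gpt-oss-120b: model card suggests essentially default sampling
llama-server -m gpt-oss-120b-mxfp4.gguf \
  --temp 1.0 --top-p 1.0

# Qwen3-Next-80B-A3B-Instruct: lower temperature, restricted nucleus/top-k
llama-server -m Qwen3-Next-80B-A3B-Instruct-UD-Q8_K_XL.gguf \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0
```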
At least in regard to my main use case, I am particularly impressed by the difference in memory requirements: gpt-oss-120b mxfp4 is about 65 GB, roughly 25% smaller than Qwen3-Next-80B-A3B (whose 8-bit quantized version still requires about 85 GB of VRAM).
Qwen3-Next might be better in other regards and/or may need to be used differently. Also, I think Qwen3-Next was intended more as a preview, so it might be more about the model architecture and training-method advances, and less about usefulness in actual real-world tasks.
KaroYadgar@reddit
GPT-120B is a pretty good model, despite the complaints.
ProtectionFar4563@reddit
I’ve found it very capable, but it argues very persistently about things (like if I mention that some software has a newer version that wasn’t available when it was trained, it’ll insist that it’s a forthcoming version). I don’t think I’ve encountered this nearly as much in any other model.
Front-Relief473@reddit
No, Gemini 2.5 Pro does the same thing if it doesn't search online. This is not a problem specific to gpt-oss.
egomarker@reddit
gpt-oss-120b is 4-bit quantized by design; that's why it uses less RAM.
Overall, despite all the grievances about censorship (I've never actually seen a refusal while using the model, but I'm not using it as a girlfriend), gpt-oss-120b (and 20b) really punch above their weight.
I think Qwen3-Next was intended to be more like a test or "dev kit" of Qwen's future model design, so everyone has time to adjust their APIs. It is not super smart.
Caffdy@reddit
is Kimi K2 Thinking quantized by design as well?
StaysAwakeAllWeek@reddit
Yes, but you'll still need a full H200 node to run it properly
__JockY__@reddit
We can run K2 Thinking with the new sglang + ktransformers integration. I’m running it at 30 tokens/sec on Blackwell/Epyc.
ResidentTicket1273@reddit
What are the minimal hardware requirements to run gpt-oss-120b?
ak_sys@reddit
I get 35 tk/sec with a 5080, 9800X3D and 64 GB RAM. If you have at least 16 GB of VRAM and 32 GB of RAM, it'll run fast enough to be usable.
I find myself using gpt 20b much more (I get 180 tk/sec), but if I NEED a better model, 120b is an option.
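A setup along these lines is what makes that work: llama.cpp can keep the attention/dense layers on the GPU and push the MoE expert weights into system RAM. A sketch, assuming a recent llama.cpp build (the expert-offload flag and the layer counts below are illustrative; check `llama-server --help` on your version):

```shell
# Sketch: treat all layers as GPU layers, but offload the MoE experts of
# the first N layers to CPU RAM so the model fits in 16 GB of VRAM.
llama-server -m gpt-oss-120b-mxfp4.gguf \
  --n-gpu-layers 99 \
  --n-cpu-moe 28 \
  --ctx-size 32768
```

Tune the `--n-cpu-moe` count down until you run out of VRAM; fewer offloaded layers means more speed.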
hieuphamduy@reddit
120b is 64 GB in size, right? Did you run the default MXFP4 quant? If not, how did you manage to fit it in that much RAM?
Illya___@reddit
gpt-oss I found to be rather garbage; it's OK for casual talk, but otherwise it hallucinates all over the place when I ask something more technical. GLM Air is much better. Qwen3-Next, idk, I didn't try it much; it felt OK but I wasn't impressed.
Dontdoitagain69@reddit
Nah, gpt 20b is my boy, but it depends on the use case. All models are garbage in, garbage out.
Illustrious-Dot-6888@reddit
Oss 120 is better, despite my nauseating aversion to Altman's crooked face
Dontdoitagain69@reddit
Yeah what’s up with their faces Altman, Musk , Google people. 😂
Holiday_Purpose_3166@reddit
Qwen3-Next is indeed a preview as they were looking for feedback on this new architecture.
Having used the Instruct version (MXFP4 from noctrex), the model needs way too much babysitting to get the task done in Kilocode. The Qwen3-30B 2507 series executes significantly better in my use cases.
For this matter, I don't use Kilocode default agents when testing the models. My system prompts are custom to ensure they operate to match the model's quality.
That being said, Qwen3-Next operated correctly with the system prompt used on my Qwen3 30B models, but kept doing unnecessary extra work, taking 340k tokens to add a statCard to a NextJS website where Qwen3-Coder-30B did it in under 60k. The job was simple enough not to require such complex guidance; even Magistral Small 1.2 did it in 37k tokens.
GPT-OSS-120B simply runs faster (PP ~900 t/s vs ~300 t/s for the same task) on my Ryzen 9 9950X + RTX 5090 at a 131072 context window.
GPT-OSS-120B definitely provides more depth in its replies by default; however, that's not something you really need in coding, unless you're dealing with sensitive data that requires precision. GPT-OSS-20B covers most coding work with identical quality, which can make the 120B an oversized worker.
By default, all else being equal, GPT-OSS-120B is more token-efficient than GPT-OSS-20B; the smaller sibling strains more to get the right answer. With a polished system prompt, the 20B executes just as efficiently. Both did the job above in <50k tokens with Medium thinking effort.
Between the Qwen and GPT-OSS architectures, I can say the latter pays off better, especially at longer context.
GPT-OSS models spend less time looking for context to accomplish a task, whereas Qwen models tend to ingest more information. Qwen inference speed also degrades very quickly, making GPT-OSS-120B look faster at 100k tokens.
Despite Qwen supporting a longer context window, I speculate it won't be a pleasant experience there. GPT-OSS models being more efficient means faster completions.
I hope that helps.
Dontdoitagain69@reddit
I use 20B 90% of the time as a background assistant, “garbage collector “ style
Aggressive-Bother470@reddit
Nothing comes close to gpt120 so far.
Is anyone even trying?
Dontdoitagain69@reddit
You don't think GLM 4.6 is up there? I haven't used either for a solid use case.
work_urek03@reddit
Not even Glm 4.5? Or intellect 3
New_Comfortable7240@reddit
In theory Intellect 3 is on par or better than OSS 120
Aggressive-Bother470@reddit
I found them very similar but gpt considerably faster.
Maybe I should redownload air and give it another shot but it would have to be significantly better to make up for the speed deficit.
I'm at the point where the basics now work so well that I just need some sort of secondary solution for schema/syntax updates, something that can correct my two best models being slightly out of date on certain things.
Odd-Ordinary-5922@reddit
GLM 4.5 is like 3x the size
work_urek03@reddit
I meant air sorry
noiserr@reddit
I spent a whole week last week trying to find the best model for agentic coding (OpenCode) on my 128GB Strix Halo machine. I tried every model I could find that fits on the machine. Iterating with different system prompts, and I couldn't find anything better than the gpt120B. Particularly on high reasoning setting.
The model follows instruction really damn well. I can leave it coding for like 20 minutes and it will just happily chug along. It's also fast due to native mxfp4 quantization.
The model does make a lot of mistakes, and for one-shot coding Qwen3-Coder may actually be better. But Qwen3 models just don't follow instructions well enough to be used in an agentic setting.
If other models could figure out instruction following, there could be a discussion, but as it is right now, nothing competes with gpt-oss-120B, at least for 128 GB machines. GLM 4.6, for instance, was pretty good when I tried it in the cloud, but it's so much bigger.
Aggressive-Bother470@reddit
This almost mirrors my experience. The only thing that comes close is 2507 Thinking, but agentically it's 'lazier' (not trained to the same degree?).
I assume its ability to follow instructions so well is what keeps it almost neck and neck with gpt120.
The speed and capability of gpt120 is unmatched at this size for me.
Dontdoitagain69@reddit
I use gpt-oss 20b with extremely strict input and structured output for C++ agentic tasks. It's just a service that runs in the background and fixes a bunch of mistakes, kind of like a smart garbage collector. As for all the models out there: without a purpose it's hard to tell which one is the best; it's up to you to see what properties of a model you need and make the most of them. I'm sure 120B is maybe the best open source model. One model that impressed me, but that I haven't kept using because it's slow on my setup, is the full GLM 4.6 with 202k context. It actually analyzed and rewrote an argon2 hashing algorithm while I was sleeping; that was a surprise. As of today I think it's better than Sonnet or Opus for unsupervised programming. TLDR: GPT 20B and GPT 120B, GLM 4.6, and Phi models for fine-tuning and experimenting.
xjE4644Eyc@reddit
I'm sticking with GPT-120b. I tried Qwen3-Next-Thinking Q8 and it spent 8 minutes thinking vs 30 seconds for GPT-120b, for the same quality answer.
Excited to see what the next iteration of Qwen-Next is though
WeekLarge7607@reddit
From my experience (running both models on vLLM), Qwen3-Next is better at tool calling than gpt-oss, at least when using the chat/completions endpoint. Tool calling with gpt-oss only works for me via the /responses endpoint.
bfroemel@reddit (OP)
I also struggled a lot with vLLM and sglang to get tool calling working reliably with gpt-oss. I ended up sacrificing some batching performance and currently use a minimally patched llama.cpp where the reasoning content ends up in the "reasoning" field (and not "reasoning_content"). With this I have maybe one or two failed tool calls per 100 (codex-cli with serena and docs-mcp-server).
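If patching llama.cpp isn't an option, the same renaming can be done client-side. A minimal sketch (the field name "reasoning_content" matches llama.cpp's OpenAI-compatible output; whether your client picks up "reasoning" instead is an assumption to verify):

```python
def normalize_reasoning(message: dict) -> dict:
    """Rename llama.cpp's 'reasoning_content' field to 'reasoning' in an
    assistant message dict, leaving everything else untouched."""
    msg = dict(message)  # shallow copy so the caller's dict is unmodified
    if "reasoning_content" in msg and "reasoning" not in msg:
        msg["reasoning"] = msg.pop("reasoning_content")
    return msg
```

The same transform applies to streamed deltas, since they carry the same field name.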
mixedTape3123@reddit
Anyone try the Q2 quants of either Qwen3 30b or 80b and find them usable?
dtdisapointingresult@reddit
GPT-OSS 120b has 5.1b active params vs 3b for Qwen. Assuming both teams are equally talented, I would expect GPT-OSS to be superior.
MaxKruse96@reddit
If you want to compare these models, compare by file size: gpt-oss is 59 GB; the size-equivalent Qwen3-Next quant would be Q5_K_XL.
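The file-size framing reduces to bits per weight, which is easy to sanity-check (the 59 GB figure is from this thread; the ~117B total parameter count for gpt-oss-120b and the quant-class ranges are my assumptions):

```python
def bits_per_weight(file_size_gb: float, params_billions: float) -> float:
    """Approximate stored bits per parameter: file bytes * 8 / param count."""
    return file_size_gb * 1e9 * 8 / (params_billions * 1e9)

# gpt-oss-120b: ~59 GB over ~117B total params -> ~4 bits/weight (mxfp4)
gpt_oss_bpw = bits_per_weight(59, 117)

# An 80B model squeezed into the same 59 GB needs ~5.9 bits/weight,
# which lands in Q5_K/Q6-class territory, hence "Q5_K_XL to match".
qwen_next_bpw = bits_per_weight(59, 80)
```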
Valuable-Run2129@reddit
The astonishing thing is that Qwen3-Next Q4 is roughly twice as slow at processing input tokens. That alone is a deal breaker for me.
Odd-Ordinary-5922@reddit
The optimizations aren't in the GitHub repo yet, but it should be faster once they land.
Valuable-Run2129@reddit
I've been using the two MLX models, and those are well optimized. Qwen3-Next is still twice as slow at processing prompts (same quant).
Odd-Ordinary-5922@reddit
oh damn
StardockEngineer@reddit
No need, if OP finds it isn't good even at a higher quant.
Dhomochevsky_blame@reddit
Been bouncing between Qwen3 and GLM 4.6 for agentic stuff lately. GLM 4.6 handles multi-step reasoning pretty well, and memory usage isn't bad, around 70-75 GB for the larger quants. Haven't pushed gpt-oss yet, but curious how it compares.
ArchdukeofHyperbole@reddit
I never found an oss 120 quant that would fit in my RAM. Even if I did, I probably wouldn't bother with it, since it has more active parameters than Qwen3-Next, which makes it slower, and that matters when it spends compute deciding whether it even wants to answer a prompt. Qwen3-Next Q4 fits in my RAM, and I use the Instruct version, so there's less waiting for the response. Next runs at 3 tokens/sec on my CPU, and I'll be trying out Vulkan eventually.
Chromix_@reddit
Long-context handling! Neither model requires much VRAM for it. gpt-oss-120b was quite a step up over other open models for correctly handling longer context. It still made mistakes, though, especially when YaRN-extended from 128k to 256k, where it would hallucinate a lot more.
Qwen3-Next, on the other hand (tested UD-Q5_K_XL), aced most of my tests, even the Instruct version, which performs a lot worse than the Thinking version at longer context sizes. My tests were targeted information extraction from texts in the 80k to 250k token range; they didn't involve pure retrieval, but required connecting a few dots to identify what was asked for.
I find that surprising, as it scored worse than gpt-oss in the NYT connections bench. My tests weren't exhaustive in any way though - maybe just luck.
zipzag@reddit
Large-context degradation is probably partly a result of quantization. I've seen this too with gpt 120b.
Chromix_@reddit
The quality, or the VRAM requirement? Both models have an attention mechanism that requires far less (V)RAM at higher context sizes than most other models, like the normal Qwen3 models for example. This works independently of model quantization.
dumb_ledorre@reddit
???
Why do you compare a 4-bit version with an 8-bit version, and then complain that the 8-bit one is bigger???
AskAmbitious5697@reddit
The point OP is making is that a 60 GB model is outperforming an 85 GB model.
The fuck you so shocked about?
bfroemel@reddit (OP)
I am not complaining about Qwen3-Next - I am impressed by gpt-oss-120b :)
OK, I could use a 4-bit quant of Qwen3-Next, and that would be smaller than gpt-oss-120b. However, for coding use cases, more aggressive quantization leads to even worse results. Also, I wanted to stick as close as possible to the originally released model versions, and gpt-oss-120b is imo superior in that regard.
audioen@reddit
A reasonable mid-tier choice is Q6_K. It is virtually indistinguishable from 8-bit quantization, but still something like 25% smaller. It comes within about 2 GB of gpt-oss-120b, so very comparable in terms of memory footprint.
gpt-oss-120b now has a "derestricted" version from ArliAI. I'm testing it, and while I didn't see refusals from the model in my normal use before, I doubt I'll see any refusals whatsoever after this. It always complies and uses the terse, tl;dr-focused writing style that I quite like, as I can usually just interrupt the response early.
twack3r@reddit
+1 on the derestricted models. Will have to give GPT-OSS-120B derestricted a whirl; GLM 4.5 Air already had me pretty speechless. Not just because of fewer refusals, but it 'feels' different: way less inference effort spent on compliance checks, way more inference available for the actual query.
WhaleFactory@reddit
I love Qwen models and use them extensively, but gpt-oss-120b is the clear winner in my experience.
zipzag@reddit
Qwen3-VL is well differentiated and very useful. But I find Qwen3 generally dumber than gpt-oss at a similar size.
Of course it all depends on the task. I do like the many flavors and sizes Qwen offers. If OpenAI doesn't update gpt-oss next year I'm sure Qwen4 can beat it.
MarkoMarjamaa@reddit
gpt-oss-120b is a good model...
https://artificialanalysis.ai/models/comparisons/qwen3-next-80b-a3b-reasoning-vs-gpt-oss-120b#intelligence
AppearanceHeavy6724@reddit
Artificial Analysis is meaningless benchmark.