Qwen3.6-27B vs Coder-Next
Posted by Signal_Ad657@reddit | LocalLLaMA | 84 comments
Burned about 20 hours of side-by-side compute on my two RTX PRO 6000 Blackwells trying to get a definitive answer on which of these two models was clearly better. As with many things in life, after many tokens and kWh, the answer was "it depends."
These models in the aggregate are actually crazy well matched against each other — scoring similarly overall across a wide range of tests and scenarios, hitting and missing on different things, failing and succeeding in different ways. Across the 4 cells I ran at N=10, Coder-Next 25/40 ships, 27B-thinking 30/40 — statistically tied with overlapping Wilson CIs.
On the face of that, it kind of makes sense. 27B is a later-gen dense model that's high on thinking. Coder-Next has roughly 3x the parameters to work with but only activates 3B at a time as it works. Depending on what you're trying to do, either could be the correct choice.
Kind of interestingly, **27B with thinking disabled was the most consistent shipper of work** — 95.8% across the full 12-cell grid at N=10 (Wilson 95% [90.5%, 98.2%]). Same model weights as 27B-thinking, just `--no-think`. A side-by-side hand-graded read on the both-ship cells found that substantive output is preserved; the difference is verbosity of reasoning prose, not output decisions. The "thinking-trace as loop substrate" mechanism turned out to be real — the documented word-trim loop on doc-synthesis halves with no-think (4/10 → 2/10).
3.6-35B-A3B pretty much fell flat on its face so often on tasks that it didn't seem worth continuing to compare it against the other two. Folder kept as failure-mode evidence.
I tossed a lot of crazy stuff at these models over the course of a few days and kept my two GPUs very warm and very busy in the process. I jumped into this mainly because, for lack of a better term, I felt like the traditional benchmarks were being gamed. So I wanted to just chuck these guys in the dirt and abuse them and see what happened.
Give them tasks they could win, tasks where they were essentially destined to fail, study how they won and failed and what that looked like. The most lopsided single result: Coder-Next 0/10 on a live market-research task where 27B was 8/10 (Wilson 95% [0%, 27.8%] for the Coder-Next collapse, reproducible). Inverse: Coder-Next ships 10/10 on bounded business-memo and doc-synthesis tasks at 60–100x lower cost-per-shipped-run than either 27B variant. Same models, very different shapes of "good at."
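For the stats-minded: the intervals quoted here are standard two-sided Wilson score intervals on ship counts. A minimal sketch (not the repo's exact code, just the textbook formula):

```python
import math

def wilson_ci(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return max(0.0, center - half), min(1.0, center + half)

# e.g. the market-research collapse: 0 ships out of 10 runs
print(wilson_ci(0, 10))  # roughly (0.0, 0.28), i.e. [0%, ~28%]
```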
There's a ton of data, I tried to make it easy to sort through, and right now this is all pretty much just about thoroughly comparing these two models.
Either way, I'm sleepy now. Let me know your thoughts or if you have any questions, and the repo is below. I'll talk more about this when I'm not looking to pass out lol.
https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests
relmny@reddit
qwen3.6-27b is great and is actually my main daily driver, but the other day, looking for some text/statement in a PDF, I kinda did a needle-in-haystack test, and 27b always said (tried multiple times) that there was no mention of it (same as qwen3.6-35b).
Then I remembered about coder-next and decided to give it a try... and it did find it, every time (tried a few times).
So coder-next did find something where 3.6-27b kept saying "no, it's not there"...
Coder-next is still pretty good, and depending on the tasks/use, it can be better than 3.6-27b
ortegaalfredo@reddit
>Burned about 20 hours of side-by-side compute on my two RTX PRO 6000 Blackwells trying to get a definitive answer on which of these two models was clearly better.
We have to stop illegal LLM fights and most forms of AI cruelty.
AutobahnRaser@reddit
https://spaw.ai
MrObsidian_@reddit
Is this a joke site?
taking_bullet@reddit
👆 At the very bottom of the website
IrisColt@reddit
B-but using renewable energy, r-right, right?
invabun@reddit
lol bro ...
dedSEKTR@reddit
We making matrices fight each other now?
CzarCW@reddit
I know kung fu.
redmctrashface@reddit
$20k kung fu
met_MY_verse@reddit
I know Gauss-Jordan elimination. I’m ready.
gidea@reddit
kung fourier, i’ll see myself out kthxbye
Stunning_Macaron6133@reddit
Show me.
Karyo_Ten@reddit
What you must learn is that these rules are no different than the rules of a computer system. Some of them can be bent. Others can be broken. Understand? Then hit me…if you can.
LegacyRemaster@reddit
fight club?
CriticismTop@reddit
Person Of Interest is seeping into real life
bksubramanyarao@reddit
lol
pminervini@reddit
My experience was vastly different tho, https://neuralnoise.com/2026/harness-bench-wip/?bare
ChocomelP@reddit
I was instantly skeptical of OP's results. If your tests get the same out of these two models, you need new tests.
Chromix_@reddit
That's quite some extensive testing and write-up. One point made there is that using Q8 over Q4 is a net loss, as the (average) results are worse and it's slower. You need more than 16 tasks, and multiple re-runs per task, to get the resulting scores a bit more stable. Otherwise you're interpreting patterns into random noise.
Speaking of which, it could be interesting to review what happened — what changed such that the Q8 got a worse score than the Q4. Maybe there was some change in action or interpretation that the harness never recovered from. That could be used to improve it. That said, it probably doesn't need to be a Q8 vs Q4 re-run; just re-running Q4 should also lead to rather diverse results.
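To put a number on that noise: a quick simulation with a hypothetical fixed "true" pass rate (0.7 here is made up, not any real model's) shows how much a 16-task score swings purely by chance:

```python
import random

# 1000 simulated benchmark runs of 16 pass/fail tasks with a fixed 0.7 pass rate
random.seed(0)
TRUE_P, TASKS, RUNS = 0.7, 16, 1000
scores = sorted(
    sum(random.random() < TRUE_P for _ in range(TASKS)) / TASKS
    for _ in range(RUNS)
)
print(scores[25], scores[975])  # central ~95% of observed scores, roughly 0.5 to 0.9
```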
pminervini@reddit
> That's quite some extensive testing and write-up. One point made there is that using Q8 over Q4 is a net loss, as the (average) results are worse and it's slower. You need more than 16 tasks, and multiple re-runs per task, to get the resulting scores a bit more stable. Otherwise you're interpreting patterns into random noise.
Totally agree. It's just that everything is being run on a single laptop when I'm not using it; that's the main bottleneck. I'm considering discarding some models, tasks, and harnesses and doing multiple seeds for just a subset of those, or it will never finish.
Chromix_@reddit
That would be nice. You already record the execution time, which is quite different between the runs. gpt-oss-120B took half a minute, while others needed 5+. You could add the token counts for prompt processing and generation to complement that value.
jashAcharjee@reddit
Let's wait for someone to milk out qwen3.6-coder-next-gguf
RedParaglider@reddit
STOP I can only get so erect.
Zya1re-V@reddit
Ohhh let me unlock your full potential 🫂
ChocomelP@reddit
milk it out
ANTIVNTIANTI@reddit
no, there is never “to erect” for Qwen, do not doubt yourself, Qwen does not doubt you!
kevin_1994@reddit
I wonder if you could frankenmerge them
florinandrei@reddit
"You see, when two LLMs love each other very much..."
fettpl@reddit
Praying there's a madman already working on that.
viperx7@reddit
For someone like you who is drowning in VRAM it might seem so, but for most people that's not how things work.
For example: even if someone has 48GB of VRAM, the choice they face is
- Qwen 3.6 27B @ Q8 with 264k unquantized context
- Qwen 3 Coder Next @ Q4, still offloading to CPU, and maybe they can do 264k context
When choosing Coder Next, prompt processing will suffer and it won't be as smart as the version you tested at Q8.
And this is the best-case scenario. A lot of people don't have 48GB of VRAM and try running these models on a 24GB machine, and then we'll talk about how far your shipping rates get.
And if you're not mentioning which quant you're using for the model and context, you can take your 20 hours on your RTX 6000 PRO and get lost, because it doesn't mean shit. Yes, those 20 hours of testing are pointless (being a little mean just because you are mean with your meme).
ProfessionalSpend589@reddit
viperx7@reddit
I believe qwen3.6 27B is small enough that it makes sense to run it at FP8.
Maybe that will do better. Given that 4-bit qwen-coder-next takes 45-49 GB, I think you should compare against 27B Q8, which is about 28 GB.
Maybe it will be a little fairer then.
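Rough weight-only math behind those numbers (assuming ~80B total params for Coder Next and 27B for the dense model; real quants spend a bit more than the nominal bits per weight on scales, and none of this counts KV cache):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight-only footprint in GB: params * bits / 8."""
    return params_billion * bits_per_weight / 8

print(weight_gb(80, 4))  # ~40 GB, in the ballpark of the 45-49 GB Q4 figure once overhead is added
print(weight_gb(27, 8))  # ~27 GB, close to the ~28 GB Q8 figure
```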
Blues520@reddit
Is FP8 better than Q8?
viperx7@reddit
Theoretically they should be the same; it's just that FP8 models have some disadvantages:
- You can't offload layers to CPU, so no mixed inference (so you better have lots of VRAM)
- vLLM takes ages to load the model into memory
Blues520@reddit
Appreciate the explanation
themule71@reddit
Yeah, while I get the "fairness" of having all models at the same quant, when comparing models of significantly different sizes the smaller one should be made to have the same memory footprint as the larger one.
Of course that's biased too. But I feel that at the extremes you have either too little VRAM, where the bigger model needs extreme quantization to run, or too much VRAM, where the larger model sits too comfortably and memory footprint becomes irrelevant.
It's also fair to choose a VRAM size in which the smaller model fits very comfortably, like Q8 with fp16 KV cache, and then fit the larger model into that.
viperx7@reddit
I think the most important and meaningful metric is: given X amount of VRAM, what's the best result one can get?
The answer to this can be something like, for 24GB of VRAM...
themule71@reddit
Correct. My point is, when comparing models to choose one for a task, you shouldn't cripple the smaller ones by using the same quantization.
PANIC_EXCEPTION@reddit
Qwen3-Coder-Next excels on unified memory systems, Qwen3.6-27B excels on beefy Nvidia cards
fettpl@reddit
May I kindly ask you to elaborate? What makes them excel on those specific aspects?
Also, assuming I have Mac mini M4 with 64GB of unified memory, should I focus on running lower quant of 3-Coder-Next or tune parameters of 3.6-27B in a higher quantization?
PANIC_EXCEPTION@reddit
Next with 4-bit MLX is substantially faster than 27B. 27B at 4-bit is already extremely slow on M series, and going to higher quants is going to make the token rate unacceptable. Dense models are compute bound, and M series simply doesn't compete with Nvidia on that front.
So, Nvidia cards with less VRAM and better compute, use 27B. M series with a ton of unified memory but poor compute, use Next.
havnar-@reddit
I should try it.. 27b 4 bit is 8tps and 8bit is 7tps on my MacBook Pro m5 pro 64gb… I keep thinking it’s me or the machine that’s broken
crantob@reddit
Not even listing the languages the tests are in?
My experience so far: big difference between churning out browser chum, Python flappy birds, and *nix systems programming in C.
Gallardo994@reddit
I found it funny that absolutely no Qwens (3.5, 3.6, Coder Next) are good at C# and Unity by default. They all confuse WPF and Avalonia, make up behaviours and methods, and blatantly lie about how something works. The only fix for all this is web access, which as a result makes all the Qwens lean heavily on their tool-calling ability: whichever finds information better wins.
ivoras@reddit
Difference in which direction? I assume C (and other highly specialised environments) would be better covered by the bigger model?
Gallardo994@reddit
I thought it was just me because Q3CN provided much better results than both 3.6 35B and 27B, especially in Hermes. 3.6 was doing weird tool calls (python + pipes) instead of just simply curling for required data or even using browser tool, whereas Q3CN was doing it with no issues.
Hardware: M5 Max MBP 16 with 128GB unified memory.
Models: Qwen3.6-35b-a3b-bf16, Qwen3.6-27b-8bit, Qwen3-coder-next-8bit
brickout@reddit
Sure, but us normies can't run it.
cato_gts@reddit
When I used Coder Next at work, I couldn't make progress because it kept repeating the same context in almost all tasks or failing tool calls. On the other hand, Qwen3.6 performed almost all tasks in one go and succeeded smoothly, except for complex tool calls.
Due_Net_3342@reddit
true story
TokenRingAI@reddit
27B and 35B are absolute dogshit on VLLM with the int4 quants I have tried.
The official FP8 quants are working far better:
https://huggingface.co/collections/Qwen/qwen36
The Unsloth GGUFs are also working very well.
I suspect your results are way off due to problems with those specific quants.
Qwen 3.6 loves to generate very long output, and with any degradation of the output quality, you will just end up with massive outputs of useless work.
Ok-Measurement-1575@reddit
Agreed.
Unsloth UD4s surpass all the vllm 4 quants this time round.
Blues520@reddit
This has been my experience as well. I initially ran the INT4 quants on vllm and could not get the same experience as other users. Then I tried a regular unsloth Q8 on llamacpp and it's way better. Might try the official FP8 quants as well.
droptableadventures@reddit
I'd definitely say 4 bits is too far quantized for a model that small.
Boring_Office@reddit
I try all the new models, but like a good husband, i always return to qwen-coder-next until her sister is old enough.
PANIC_EXCEPTION@reddit
It's just such a goated workhorse model. No thinking, no config, blazing fast, good at almost any general purpose logic task besides just coding.
pmarsh@reddit
So a really good pair programmer doing one function at a time rather than trying to architect an entire solution on its own?
Boring_Office@reddit
Correct 🙂
KURD_1_STAN@reddit
Okay, so your explanation shows 27B being better; then your image is wrong and Coder Next is indeed not better than 27B.
SailIntelligent2633@reddit
I cannot express how much pure joy brings me that you incorporated confidence intervals into your benchmarks. It’s something that even the frontier labs seem to still have not figured out yet.
patricious@reddit
TLDR plis.
audioen@reddit
This test has been run at 4-bit, which doesn't get the full quality out of these models. The decision tree also states that in virtually every case you should choose the 27B, despite the meaningless and misleading picture. I have found qwen3-coder-next to be useless for real work at every size, personally, and not even useful as a code-completion tool, despite being one of the rare models that has the fill-in-the-middle ability. It could be a harness issue (the harness was continue.dev), but the completions it proposes are distracting when they show up, and typically worthless; I have found myself ignoring them because they are usually worse than useless.
If I had to guess, no-thinking is recommended because long-context performance degrades too fast and thinking is damaged, so it just adds inference cost and might not be much of a benefit. These 4-bit inference conditions simply are not good enough for the Qwen family; I think 6 bits and beyond are reasonable for GGUF, and official FP8 is the smallest I would recommend for vLLM. I have personally tried the cyankiwi 4-bit AWQ before and had to throw it out because it simply wasn't behaving correctly. (The KV cache has not been quantized here, according to the tooling documentation, which is good, as many vLLM recipes also quantize the KV cache to FP8.)
If you can't run the bf16, then I suggest going no worse than the official FP8. It is known to be among the best: someone measured the K-L divergences of various AWQ/autoround etc. quants, and the FP8, while among the largest, was on the Pareto frontier for its size.
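The K-L measurement being referenced is conceptually simple: for each token position, compare the quantized model's next-token distribution against the bf16 reference and average the divergence over a held-out text; lower mean KL means the quant is closer to the original. A toy sketch of the per-position term (illustrative logits, not any specific tool's implementation):

```python
import math

def kl_divergence(p_logits: list[float], q_logits: list[float]) -> float:
    """KL(P || Q) for one token position, given raw next-token logits."""
    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]
    p, q = softmax(p_logits), softmax(q_logits)  # p = reference (bf16), q = quant
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Slightly perturbed logits -> small divergence; identical logits -> 0.0
print(kl_divergence([2.0, 1.0, 0.1], [1.9, 1.1, 0.1]))
```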
Witty_Mycologist_995@reddit
That’s unfair, one is dense and one is MoE. Do 35B vs 80B.
shansoft@reddit
Coder Next is MoE I thought?
stddealer@reddit
Yes, and 27B is dense. If Coder Next is already winning a matchup that's unfair towards it, it should win against the 35B MoE easily too?
Pleasant-Shallot-707@reddit
Except that dense models are better than moe’s
stddealer@reddit
Yes they are.
No_Mango7658@reddit
Both moe. It's not supposed to be fair, it's a comparison
florinandrei@reddit
It's not supposed to be fair, it's part of this world.
ndrewpj@reddit
Coder 80b is Moe too
Chromix_@reddit
I started getting useful results with Q3CN using Roo Code (RIP). Whenever there was a case where it seemed stuck, I switched between Qwen3.6-27B-UD-Q5_K_XL, Qwen3.6-35B-A3B-UD-Q8_K_XL and gemma-4-31B-it-UD-Q4_K_XL to see which one gave me a proper solution (also posted a small test for speed x tokens).
Currently the 27B model is my default model. It just doesn't give me enough reasons (failures) to switch away to the other models again. That said, there's a general overlap in coding capability, and Q3CN seems to have an edge on one sort of problem, while 3.6 27B has an edge on another. Apparently the latter overlaps more with my current use-cases.
Pablo_the_brave@reddit
TL;DR: Look at the decision tree https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests/blob/main/COMPARISON.md
Thank you! Very complex and usable!
texasdude11@reddit
Why not just use minimax M2.7?
macumazana@reddit
bitch has fucked up license
texasdude11@reddit
Ah, yes that's true.
Thomas-Lore@reddit
While we are at it, why not Flash v4. I find it more reliable than M2.7.
texasdude11@reddit
Do you? Imma have to try that.
segmond@reddit
might as well add qwen3.5-122b in the mix. but I say, use whatever model brings joy to you. i think they are all good.
facts. the larger models have more knowledge. small models are now just as smart as big models, meaning if you present a novel problem and all the data, the small models can probably solve it about as well as the big ones. however if you need prior world knowledge, the larger models are likely to have it or make the connections.
with that said, most people are not solving complex problems with these things, so the small models are very adequate.
philmarcracken@reddit
its me, im grug. I hook icue sdk to some slop exe that sleeps my monitors + keyboard backlight, grug happy man
ciprianveg@reddit
can't wait for qwen 3.6-122b-coder :)
mjsxi__@reddit
do you think they'll release something like this?
ciprianveg@reddit
one can only hope :)
No_Mango7658@reddit
Can't wait for 3.6 122b
Nobby_Binks@reddit
Why not run one model on each card and have them argue with each other? The model that outwits the other wins.