Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives
Posted by gladkos@reddit | LocalLLaMA | View on Reddit | 123 comments
Compared the Qwen 3.6 35B-A3B model with the new Qwen 3.6 27B, both boosted by TurboQuant.
MacBook Pro M5 MAX 64GB
Qwen 3.6 35B - 72 TPS
Qwen 3.6 27B - 18 TPS
Tested coding primitives. The 27B model thinks more, but the result is more precise and correct. The 35B model handled the task worse, but did it faster.
What's your experience?
Prompt: Write a single HTML file with a full-page canvas and no libraries. Simulate a realistic side-view of a moving car as the main subject. Keep the car visible in the foreground while the background landscape scrolls continuously to create the feeling that the car is driving forward. Use layered scenery for depth: nearby ground, roadside elements, trees, poles, and distant hills or mountains should move at different speeds for a natural parallax effect. Animate the wheels spinning realistically and add subtle body motion so the car feels connected to the road. Let the environment pass smoothly behind it, with repeating but varied scenery that makes the movement feel believable. Use cinematic lighting and a cohesive sky, such as sunset, dusk, or daylight, to enhance atmosphere. The overall motion should feel calm, immersive, and realistic, with a seamless looping animation.
thedatawhiz@reddit
What is the stack behind this demo?
gladkos@reddit (OP)
MacBook Pro M5 MAX 64GB
llama.cpp with turboquant via atomic.chat app
MrTacoSauces@reddit
I wonder where these models find enough training examples to imagine and visualize in code/SVG at a meaningful scale and generate a whole scene that wasn't part of the training data.
For a model that's trained on vision, I believe the vision part is separate from the LLM part of the weights. Is the vision part able to relate its world view/"attention" of imagery to drive the main part of the weights into generating a scene for the user's prompt?
I understood ChatGPT/Claude generating decent-enough-looking SVGs of simple objects by brute force of available SVG data, sort of understanding an object through chain-of-thought reasoning. But this round of small models generating scenes even at the Opus scale is confusing. It seems like a general world view is slowly but persistently being crammed into models at every scale. I'd love to be a fly on the wall of one of these training teams.
gffcdddc@reddit
Semantic links between tokens learned during training. It can generalize well.
Healthy-Nebula-3603@reddit
Vision is not a separate part.
Actually, you can even add a vision encoder (a module that simulates an eye, in simple words) to a model that was never trained on vision, and it will work!
That's the weirdest part.
Remarkable_Living_80@reddit
https://i.redd.it/w4cknkjqzmxg1.gif
Are you guys sure you're using the correct parameters?
Here's my first try with 3.6 27B iq4_xs.gguf: --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 --repeat-penalty 1 --presence_penalty 0 --multiline-input --ctx-size 32768 --no-mmap
shokuninstudio@reddit
On an M4 Mac 48GB:
qwen3.6:27b-q4_K_M consumed far more memory than the quant should have, made the system jerky, and froze.
gemma4:26b-a4b-it-q4_K_M produced a black html page and when it tried to fix it the output got stuck in a broken infinite loop with dumb comments at the end of it...
ons_list/sessions_history/sessions_send/sessions_spawn/session_status/memory_get/memory_search/web_search/web_fetch/image/edit/read/write/exec/process/subagents/sessions_list/... [the same tool list repeated over and over, with typos creeping in] ... (rest of the tool list is long and messy in my internal thinking, but you get the point) (I won't repeat that disaster again. I'm cleaning it up.)
c4short123@reddit
Can you compare with qwen 3 coder next?
Healthy-Nebula-3603@reddit
Why?
That model is very old
c4short123@reddit
Ask Claude to do an assessment of the two. Seems to think coder next still has value as a coder ahead of the new generalist models. I’m wondering what the benchmark is.
Healthy-Nebula-3603@reddit
What kind of value? That old model doesn't even work in an agent environment with tool calling.
It's obsolete.
c4short123@reddit
You usually need more than one LLM to do things anyway. You use them where they specialize. Not everyone wants a generalized LLM to handle specialized tasks. It's designed for coding, not for routing agents/tools.
Healthy-Nebula-3603@reddit
Calling tools is literally coding's most important task. The model calls commands to test code, run tests, execute functions, etc.
The new Qwen 3.6 family or the Gemini 4 family doesn't have to be specialized for coding only, since coding is built into their nature.
nikhilprasanth@reddit
Here, Qwen3-Coder-Next-MXFP4_MOE
kuhunaxeyive@reddit
Isn't the foreground moving in the wrong direction? I got the same moving-foreground result with two different models here. Or am I misunderstanding something?
gladkos@reddit (OP)
yep, seems the model doesn't understand direction properly
kuhunaxeyive@reddit
How come different models make the exact same mistake here?
NuScorpii@reddit
It could be moving just the right amount in the right direction to appear to be moving backwards.
gladkos@reddit (OP)
I think the model messed up the layouts
mrdevlar@reddit
This is better than a Turing Test. Here is my version. Qwen3.6-35B-A3B-Q5_K_M.
cato_gts@reddit
On my BC-250: Qwen all at Q2, Gemma 4 at Q3
mrdevlar@reddit
I love this picture, look at the progress.
Evening_Ad6637@reddit
Gemma 4 dense, right?
cato_gts@reddit
No, the 27B-A4B MoE model
Evening_Ad6637@reddit
Wow okay, the result is not bad at all
SkyFeistyLlama8@reddit
I ran the same test on a Snapdragon X Elite, 64 GB RAM, ARM CPU inference. Llama-server build 8890 with speculative decoding on, 10 cores active, 65° C max temperature. Power draw probably around 30 W.
Qwen3.6-35B-A3B-Q4_0.gguf from Bartowski, 13 minutes, 12 t/s.
The fact that all this ran on an ultralight office laptop is mindblowing.
gladkos@reddit (OP)
Nice try! I was also surprised by the M5 Max's power! It has 40 GPU cores; maybe that's why it gave higher TPS.
SkyFeistyLlama8@reddit
I'm running on ARM CPU inference because the Adreno GPU doesn't support larger MOEs. It started out at 25 t/s but it slowed down to 12 t/s at the end of the file.
Yeah the M5 Max is a beast. Having that much LLM power in a laptop is amazing.
misha1350@reddit
Why were you using Q4_0? Why not something like Unsloth's UD-Q4_K_XL?
SkyFeistyLlama8@reddit
ARM CPU and Adreno GPU requirements. Only a few quantization formats like Q4_0 and IQ4_NL support the fast matmul on these chips. If I use Q4_0, I can run the same model on CPU or GPU, depending on power requirements.
Unsloth's UD-Q4_K-XL keeps most tensors in other formats so it's a lot slower.
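If you only have a K-quant on disk, llama.cpp's quantize tool can convert it (a sketch; requantizing an already-quantized model loses a little quality compared to starting from the f16 weights):
llama-quantize --allow-requantize Qwen3.6-35B-A3B-Q4_K_M.gguf Qwen3.6-35B-A3B-Q4_0.gguf Q4_0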
misha1350@reddit
I'm aware of the Adreno requirements for running on GPU, but it didn't seem like it was running on the GPU, especially with total power draw around 30 W. Did you try running it on the NPU? It should have better power efficiency even if the t/s count is lower. Still, 12 t/s is pretty low for LPDDR5X-8500; I get 9-10 t/s on DDR4-3200 on my Ryzen 5 5650U laptop, running on CPU at 17 W TDP, which results in total board power of 30 W (with the screen and everything running).
SkyFeistyLlama8@reddit
The Adreno GPU has issues running larger MOEs. I think it's an OpenCL issue with allocating a large enough chunk of RAM for the entire model.
The NPU doesn't work for any recent LLM. Microsoft Foundry Local has some ancient models from 2024; Nexa AI SDK had some specially converted models up to Qwen 3 4B but that whole project was acquired by Qualcomm and things have been quiet since the acquisition. There's some support in llama.cpp for the Hexagon NPU but the build process is convoluted and you need to disable some Windows security features which I can't do on a work machine.
So I'm stuck with CPU inference for speed and for MOEs or the GPU for dense models running slowly.
CPU inference can pull up to 65 W but thermal throttling quickly brings it down to 30-45 W, even with a case fan pointed at the laptop's heatsink area. At low context, Qwen 35B runs at 25 t/s but at 10k it slows down to 12 t/s.
misha1350@reddit
So perhaps you should try running some GGUFs on CPU only. I'm not running anything on my Vega 7 iGPU because the RAM->CPU->iGPU interconnect is slow; it slows things down by 30-50% while making power efficiency worse. Running on CPU is considerably faster, to the point that local inference is actually viable for air-gapped scenarios if I ever need that. I get 7.5 t/s with Qwen 3 Next 80B at Q3_K_M (I can't use Q4_K_M because of virtual memory allocation constraints; maybe it's only possible on Linux), with 10 GB of RAM still remaining for the rest of my apps. There's still nothing that bridges the gap between Qwen3.6 35B A3B and Qwen3.5 122B A10B, and Qwen 3 Next 80B is the only sparse MoE model in that broad range, so I have to use it, and it gives me much better internal knowledge as a result. Thankfully I don't have to use it too often; I mostly live in the cloud since I don't make my living off of vibe-coding with LLMs.
CatEatsDogs@reddit
What draft model are you using?
SkyFeistyLlama8@reddit
Itself. My llama-server options:
CatEatsDogs@reddit
What's the point of using the same model as the draft model for speculative decoding? There should be no benefit at all. Is my understanding wrong?
akavel@reddit
AFAIU your understanding is right, and this is not really proper speculative decoding, but rather "self-repeat" optimization. Presumably when the server detects that a sequence of characters/tokens (?) being emitted matches one that was already emitted earlier in the session, it speculatively adds it to the output in a similar way as in proper speculative decoding. (Though I suspect it can also result in a slowdown if the speculative proposals end up being misses more often than hits?)
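For comparison, classic speculative decoding with a separate small draft model looks roughly like this in llama.cpp (a sketch; the flag names and the tiny draft model file are assumptions, check your build's --help):
# the small draft model proposes batches of tokens; the big model verifies them in one pass
llama-server -m Qwen3.6-35B-A3B-Q4_0.gguf -md Qwen3.6-0.6B-Q4_0.gguf --draft-max 16 --draft-min 4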
SkyFeistyLlama8@reddit
The Qwen 3.6 models sometimes show a significant speedup by using auto speculative decoding. Brighter minds than mine can explain why. Something to do with multiheaded attention.
kickerua@reddit
Try now with 62° C and 68° C max temperature, would it be better or worse?
rJohn420@reddit
how much context can you fit on that bad boy? I have an m5 pro with 64gb coming soon
skyyyy007@reddit
I'm getting 128k context window, qwen 3.6 35b a3b q4, about 55-70tps.
Have not tried qwen 3.6 27b yet
rJohn420@reddit
Do you find this enough for agentic coding? Are we at Sonnet 4.6 levels yet? How much RAM do you have left over? So excited to receive this machine.
skyyyy007@reddit
Personally, from my experience these few days (I just got my M5 Pro on Monday):
It doesn't feel like we're at Sonnet 4.6 yet, still quite a distance from it. It still requires monitoring for possible loops/hallucination/steering away (given the tasks; after a compact, it will just decide to change the tasks or self-invent other tasks).
RAM-wise, loading it uses about 30 GB with 128k context; it can go up to about 35-38 GB using just the OpenCode CLI + LM Studio with the model loaded, no Chrome or anything else open, with under 20 GB remaining since there's some system RAM usage too.
For reference purposes, I hit about 54 GB if I run the OpenCode CLI + LM Studio with the model loaded + SearXNG in a container + Xcode simulator + VS Code + FaceTime + 1 tab of Chrome.
gladkos@reddit (OP)
I would say up to 128K easily
Dr4x_@reddit
Way more I think, especially if you are using quantized kv cache, on 24 Gigs I can use up to 150k with kv cache at q8
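If anyone wants to try that, the llama.cpp flags look roughly like this (a sketch; q8_0 keeps the cache close to f16 quality, and note that quantizing the V cache requires flash attention):
llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf -c 150000 -fa --cache-type-k q8_0 --cache-type-v q8_0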
_derpiii_@reddit
How do you generate prompts like that? I'm always amazed people can think of these little benchmarking projects.
IronColumn@reddit
human_imagination 86b
Xyrus2000@reddit
The average human brain has an estimated capacity equivalent to about a 100T LLM.
Yes, that's a T for trillion.
IronColumn@reddit
That's an interesting Snapple fact, but we don't know enough about how the brain works to make a meaningful comparison, so the word "estimated" is doing a lot of work here in what is ultimately a meaningless estimate.
QuestionMarker@reddit
7 tokens context and a hardware offload bandwidth that's in bytes per second though
IronColumn@reddit
yeah i just use it for light creative tasks because of low inference costs
ThisWillPass@reddit
Uhh… human brain
A ~500T–2,000T parameter MoE with ~390 experts (ranging ~30B–3T each, or ~300B–30T with dendritic computation counted), activating ~25–150T parameters — roughly 5–15% of total — per ~100ms forward pass.
dannydeetran@reddit
https://i.redd.it/utfxdfdd18xg1.gif
here's mine, check out the twinkle in the stars and the exhaust pipe.
Qwopus3.6 Q8
Xyrus2000@reddit
The 90's called and want their Game Boy Color back.
dannydeetran@reddit
I guess you wanted a screenshot instead
FoxiPanda@reddit
What were your launch parameters for these two models on this? I've managed to get Qwen3.6-27b into a loop 3 times in a row with these ones:
Treq01@reddit
This really helped me run the 3.6. Thank you.
The speculative decoding, the mlock, and the kv-unified options were new to me,
and they seem to have sped up everything quite a bit on my 5090!
FoxiPanda@reddit
Happy to help. Be careful with the speculative decoding though, my settings are pretty aggressive here, so you might be better off backing off the various sizes by a factor of 2 to help avoid reasoning loops in addition to changing the presence penalty to 0.0 (for the same reason actually).
LienniTa@reddit
so what is the proper config then? also, why quantized kv cache?
FoxiPanda@reddit
--presence_penalty 1.0 was killing my model's ability to do the right thing. Setting it to 0.0 resolved my looping issue.
I also lowered the temp a bit for this particular task down to 0.8 so it was more deterministic.
Quantized cache in this case was me trading cache precision for more context (I was running a long-context workload prior to plopping this task in).
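So, concretely, the settings that stopped the looping for me look roughly like this (a sketch; exact flag spellings vary a bit between llama.cpp builds):
llama-server -m Qwen3.6-27B-Q4_K_M.gguf --temp 0.8 --presence_penalty 0.0 -fa --cache-type-k q8_0 --cache-type-v q8_0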
loadsamuny@reddit
Here are some comparisons with Gemma 4 too:
https://electricazimuth.github.io/LocalLLM_VisualCodeTest/results/2026.04.23/
Available-Craft-5795@reddit
Seems like a prompt Bijan Bowen should use lol
YashN@reddit
Adding it to the Bowenchmark! :D
gladkos@reddit (OP)
haha next we'll make a full video game!
tobias_681@reddit
"Make GTA VI from scratch, make no mistakes"
gladkos@reddit (OP)
challenge accepted!
-Ellary-@reddit
All should be in a single HTML file ofc.
DocMadCow@reddit
I saw someone did space invaders so maybe go for Frogger.
lemondrops9@reddit
I've been making the classic games for over a year now using LLMs. The new models tend to get them right with fewer tries, and they look better.
OGScottingham@reddit
I've been doing all the classics. It's been a great way to play these games without bullshit ads.
danktuteja@reddit
Qwen 3.6 35B APEX I-Quality, took 5 min 1 s @ ~38/39 tok/s generation using OpenCode
Alarmed_Wind_4035@reddit
With how much faster the 35B is, maybe it's worth allowing it to do a second pass and seeing how it handles that. The 35B lets me run 256k context at a reasonable pace with 24 GB of VRAM; the 27B can barely do 128k and it sometimes crashes.
jacobpederson@reddit
Surprised Qwen 3.6 35B even finished (it's crap).
guiopen@reddit
Shouldn't the MoE be 9 times faster? Here it's only 4x.
jkflying@reddit
There are the shared and router weights as well, not just the experts, at least so I've heard.
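Rough back-of-the-envelope, assuming generation is memory-bandwidth bound: the 27B dense model reads ~27B weights per token, while the 35B-A3B reads ~3B of expert weights plus the always-active attention/shared/router tensors, call it ~6-7B per token. 27 / 6.5 ≈ 4x, which matches the observed gap rather than the naive 27 / 3 = 9x.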
P2070@reddit
I'm late, but this was a fun experiment and the car it made is janky and my trees are floating.
Qwen3.6-35b-a3b
183.32t/s
0.45s
101___@reddit
Does the 27B give lower results at all?
sacrelege@reddit
this is what FP8 looks like
DeliciousGorilla@reddit
unsloth/Qwen3.6-27B-UD-MLX-4bit in pi
maccam912@reddit
What are you using to host that model? I'm using oMLX, but it seems like whatever I do, it doesn't preserve the reasoning and it starts looping forever.
Fantastic-Balance454@reddit
Claude 4.7 Opus attempt for reference: https://i.imgur.com/YqPz9vI.png
Though it did read this skill after it was done thinking for 10 minutes, plus whatever is in the system prompt they use in the WebUI, so I can't really call it a 1:1 prompt: https://github.com/anthropics/skills/blob/main/skills/frontend-design/SKILL.md
mister2d@reddit
Did you exhaust your weekly limit before the sun came up?
pulse77@reddit
You can't compare Claude 4.7 Opus (which uses agents in the background) with a one-shot prompt on Qwen3.6 (without any agents)! Claude 4.7 Opus can even check whether code compiles/runs and fix it afterwards with additional prompts...
A fairer comparison would be to put the same prompt into OpenCode + Qwen 3.6, where agents also check the result and make fixes...
Fantastic-Balance454@reddit
I won't pretend I know what the hell is happening in the thinking phase, but it did look like it used the agentic feature only to read the frontend-design skill and move the html file to a display folder. The thinking was shown as one shot, and when it was writing the html it did it from top to bottom in one go: https://i.imgur.com/mowyCjE.jpeg
If you do it on a model that doesn't compress/summarise its reasoning to hide it, it looks plausible to me to be as close to one-shot as possible. GLM 5.1 wrote so much in its thinking tokens that its response got cut off when I tried to make a screenshot of it: https://pastebin.com/raw/XLgjUXW7
It wasn't iterating on the same file if that's what you're talking about, it thought for a while and then wrote the html file in one go. This is more about comparing a 27B model vs I dunno a 1T? model. I just did it for reference, can't really compare these two as they're in completely different weight classes. For a 27B model it did insanely good, way better than GLM 5.1 even, which is 754B parameters!
sacrelege@reddit
Impressive, but 10 minutes :-o mine was done in ~50 sec
Fantastic-Balance454@reddit
Yeah, it did a huge amount of thinking lol. I wonder if we can squeeze more out of Qwen if we prompt it to reason more, albeit at the cost of time?
CharacterAnimator490@reddit
Qwen 3.6 27b Q5_K_M
G-R-A-V-I-T-Y@reddit
What harness are you running it in? Claude? Openclaw?
sacrelege@reddit
opencode, took about ~52s, ~125 tok/s+
JumpyAbies@reddit
I would play a game like this if it had an interesting storyline and was complete.
mrmontanasagrada@reddit
Nice! But now, what happens when we give the 35B two extra rounds to improve? (Token/time-wise that should be possible.)
I'd like to try that whenever I have a moment.
Mahrkeenerh1@reddit
The 35B is 35B-A3B, please make that clear, because right now it looks like the bigger model is faster, which doesn't make sense.
nikhilprasanth@reddit
Out of curiosity, I tested the prompt on earlier models mostly Q4 unsloth and it's great to see how far we've come!
DOAMOD@reddit
Yeah
misha1350@reddit
As always, Dense models are specifically suited for dGPUs, like the RX 7900 XT/XTX (20GB VRAM minimum) or Intel ARC Pro B60 24GB. They run on $900-1500 GPUs, which you have to pair with $600-1000 worth of computer parts anyway.
MoE models such as Qwen3.6 35B A3B (A3B is the distinction) are made to run on general purpose laptops like Macbooks, on mini-PCs, and others. You also don't have to spend much - it can be run easily on 36GB systems. The price for entry is lower.
Qwen3.6 35B A3B < Qwen3.6 27B < Qwen3.5 122B A10B. That's how it goes. 122B A10B can only be run on Macbooks and Strix Halo mini-PCs with 96GB RAM or higher.
nikhilprasanth@reddit
This is Qwen 3.5 27B Q3.
Healthy-Nebula-3603@reddit
Uh... Degradation :)
ea_man@reddit
Here's mine IQ3_XXS on a 12GB GPU:
Healthy-Nebula-3603@reddit
Nice
jaigouk@reddit
I ran a code generation test and ended up using qwen36-35b-a3b-iq4xs on an RTX 4090.
https://jaigouk.com/gpumod/benchmarks/20260423_qwen36_gemma4_comparison/
generated outputs are located in https://github.com/jaigouk/gpumod/tree/main/docs/benchmarks/20260423_qwen36_gemma4_comparison/artifacts
UDPSendToFailed@reddit
unsloth/Qwen3.6-27B-GGUF:Q4_K_M
3min 55s at 38.99 t/s on a 4090.
QuestionMarker@reddit
That time difference makes me wonder if you could just ask 35b twice, then get it to judge its own output as a third query to pick the best. Or give it a two-shot, with a second prompt of "Here's what you just produced. See what you can do to improve it". You'd still come in faster than 27b, and it would be *fascinating* to know if a chance at introspection could push it up to (or past) 27b because you can run the MoE on more restricted hardware.
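A minimal sketch of that loop against a local OpenAI-compatible server (the port, model id, and judging prompt here are all assumptions for illustration, not anything OP ran):

# two_pass_judge.py -- sample the fast MoE twice, then let it judge its own outputs
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")  # llama-server default port
MODEL = "qwen3.6-35b-a3b"  # whatever model id your server reports
TASK = "Write a single HTML file with a full-page canvas..."  # the benchmark prompt

def generate(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # nonzero so the two attempts actually differ
    )
    return resp.choices[0].message.content

a, b = generate(TASK), generate(TASK)  # two independent attempts
# third query: the model picks the better of its own two attempts
verdict = generate(
    "Two candidate solutions to the same task follow. Reply with exactly 'A' or 'B'.\n\n"
    f"TASK:\n{TASK}\n\nA:\n{a}\n\nB:\n{b}"
)
print(a if verdict.strip().upper().startswith("A") else b)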
skyyyy007@reddit
Currently using execution prompts with Qwen 3.6 35B A3B Q4, with Claude Sonnet and Codex as reviewers of the accuracy of task completion on my own ongoing project.
Averaged over 5 tasks, about 90% of the work is completed well by the time Qwen says it's done, along with tests; the remaining 10% tends to be missing parts or incorrect changes by Qwen.
Sad_Steak_6813@reddit
Verdict: don't ask Qwen for directions
gladkos@reddit (OP)
Left or right what’s the difference?)
kaisurniwurer@reddit
.em llet uoy
aeroumbria@reddit
I wonder if these vision-capable models are able to effectively figure out how to check their own animation outputs. Checking static renders or plots seems to work fine, but videos and animations are always quite tricky.
AvidCyclist250@reddit
My experience is that the moe version wants to be harnessed and bossed around. It likes that.
Yes_but_I_think@reddit
Really like these short video style things for comparison. Crisp and to the point. Thanks for my time.
Technical-Earth-3254@reddit
Nice test, what quants did you use?
gladkos@reddit (OP)
thanks! q4 GGUF quantized
RnRau@reddit
I would be interested to see if a Q6 or Q8 of the 35B would make a good bit of difference. Apparently, the smaller the active parameter count for MoEs, the more quantization hurts.
lolwutdo@reddit
That's what I thought too, but from my experience qwen 35b seemed more quant resistant than 27b.
-Ellary-@reddit
Vision takes a noticeable hit on q4 vs q6.
AppealThink1733@reddit
I think we'll have a 4B-parameter AI doing the same.
gladkos@reddit (OP)
Likely, as the 35B uses only 3B parameters simultaneously
AppealThink1733@reddit
The question is: when?
Paradigmind@reddit
Qwen?
AppealThink1733@reddit
Any 4B-parameter model. I mean: when will we have a 4B-parameter model doing the same?
TableSurface@reddit
I had the same experience. The 3-4x speed is great for easy tasks though. Another thing to try is to have the 27B model create a plan for the 35B-A3B one.
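A sketch of that split, assuming something like LM Studio serving both models behind one OpenAI-compatible endpoint (the port, model ids, and prompts are made up for illustration):

# plan_then_execute.py -- the dense 27B writes the plan, the fast 35B-A3B implements it
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="none")  # LM Studio's default port

def ask(model: str, prompt: str) -> str:
    r = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

task = "Write a single HTML file with a full-page canvas..."  # the benchmark prompt
plan = ask("qwen3.6-27b", f"Write a concise numbered implementation plan, no code:\n{task}")
html = ask("qwen3.6-35b-a3b", f"Implement exactly this plan as one HTML file:\nTASK:\n{task}\nPLAN:\n{plan}")
print(html)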
gladkos@reddit (OP)
Nice idea! I guess the reason is that the 35B-A3B uses only 3B parameters simultaneously