Compared QWEN 3.6 35B with QWEN 3.6 27B for coding primitives

Posted by gladkos@reddit | LocalLLaMA | View on Reddit | 123 comments

Compared Qwen 3.6 35B model with the New Qwen 3.6 35B boosted by TurboQuant.

MacBook Pro M5 MAX 64GB
Qwen 3.6 35B - 72 TPS
Qwen 3.6 27B - 18 TPS

Tested coding primitives. The 27B model thinks more, but the result is more precise and correct. The 35B model handled the task worse, but did it faster.

What's your experience?

Prompt: Write a single HTML file with a full-page canvas and no libraries. Simulate a realistic side-view of a moving car as the main subject. Keep the car visible in the foreground while the background landscape scrolls continuously to create the feeling that the car is driving forward. Use layered scenery for depth: nearby ground, roadside elements, trees, poles, and distant hills or mountains should move at different speeds for a natural parallax effect. Animate the wheels spinning realistically and add subtle body motion so the car feels connected to the road. Let the environment pass smoothly behind it, with repeating but varied scenery that makes the movement feel believable. Use cinematic lighting and a cohesive sky, such as sunset, dusk, or daylight, to enhance atmosphere. The overall motion should feel calm, immersive, and realistic, with a seamless looping animation.

[-]

thedatawhiz@reddit

What is the stack behind this demo?

[-]

gladkos@reddit (OP)

MacBook Pro M5 MAX 64GB
llama.cpp with turboquant via atomic.chat app

[-]

MrTacoSauces@reddit

I wonder where these models are finding enough training examples to imagine and visualize in code/SVG at a meaningful scale generate a whole scene not part of the training.

For a model that's trained on vision I believe that's a separate part of the model that is related to the LLM part of the weights. Is the vision part able to relate it's world view/"attention" of imagery to drive the main part of the weights to generate a scene to the users prompt?

I understood chatgpt/Claude generating decent enough looking svgs of simple objects by brute force of available svg data to sort of understand an object even through chain of thought reasoning. But this round of small models generating scenes even at the opus scale is confusing. It seems like a general world view is slowly but if not persistently being crammed into models at every scale. I'd love to be a fly on the wall of one these training teams.

[-]

gffcdddc@reddit

Semantic links in the actual training between tokens. It can generalize well.

[-]

Healthy-Nebula-3603@reddit

Vision is not a separate part.

Actually you can even add a vision decoder (decoder is an module that simulate an eye in simple words ) to a not vision model trained and will be working!

That's is the weirdest part

[-]

Remarkable_Living_80@reddit

https://i.redd.it/w4cknkjqzmxg1.gif

You guys sure you are using correct parameters?

Here is first try with 3.6 27B iq4_xs.gguf --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 --repeat-penalty 1 --presence_penalty 0 --multiline-input --ctx-size 32768 --no-mmap

[-]

shokuninstudio@reddit

On an M4 Mac 48GB:

qwen3.6:27b-q4_K_M consumed far more memory than the quant should have, made the system jerky, and froze.
gemma4:26b-a4b-it-q4_K_M produced a black html page and when it tried to fix it the output got stuck in a broken infinite loop with dumb comments at the end of it...

ons_list/sessions_history/sessions_send/sessions_spawn/session_status/memory_get/memory_search/web_search/web_fetch/image/edit/read/write/exec/process/subagents/sessions_list/sessions_history/sessions_send/sessions_spawn/session_status/memory_get/memory_search/web_search/web_fetch/image/edit/read/weite/exec/process/subagents/sessions_list/sessions_history/sessions_send/sessions_spawn/session_status/memory_get/memory_search/web_search/web_fetch/image/edit/read/write/exec/process/subagents/sessions_list/sessions_history/sessions_send/sessions_spawn/session_status/memory_get/memory_search/web_search/web_fetch/image/edit/read/write/exec/process/subagents/sessions_list/sessions_history/sessions_end/sessions_spawn/session_status/memory_get/memory_search/web_search/web_fetch/image/edit/read/write/exec/process/subagents/sessions_list/sessions_history/sessions_send/sessions_spawn/session_statusmemory_get/memory_search/web_search/web_fetch/image/edit/read/// (rest of the tool list is long and messy in my internal thinking, but you get the point) (I won't repeat that disaster again. I'm cleaning it up.)

[-]

c4short123@reddit

Can you compare with qwen 3 coder next?

[-]

Healthy-Nebula-3603@reddit

Why?

That model is very old

[-]

c4short123@reddit

Ask Claude to do an assessment of the two. Seems to think coder next still has value as a coder ahead of the new generalist models. I’m wondering what the benchmark is.

[-]

Healthy-Nebula-3603@reddit

What kind of value? That old model in not even working under agents environment a s tool calling.

Is obsolete.

[-]

c4short123@reddit

Usually need more than one LLM to do things anyway. You use them where they specialize. Not everyone wants a generalized LLM to handle specialized tasks. It’s designed for coding not routing agents / tools.

[-]

Healthy-Nebula-3603@reddit

Calling tools is literally a coding most important task. Model is calling commandsto test code , testing, executive functions, etc

New qwen 3.6 family or Gemini 4 family not have to be specialized to coding only as it has built-in in nature.

[-]

nikhilprasanth@reddit

Here, Qwen3-Coder-Next-MXFP4_MOE

[-]

kuhunaxeyive@reddit

Isn't the foreground moving in the wrong direction? I got the same results of moving foreground with two different models here. Or do I understand sth. wrong here?

[-]

gladkos@reddit (OP)

yep, seem model doesn't understand direction properly

[-]

kuhunaxeyive@reddit

How come different models do the exact same mistake here?

[-]

NuScorpii@reddit

It it could be moving just the right amount in the right direction to appear to be moving backwards.

[-]

gladkos@reddit (OP)

I think model messed with the layouts

[-]

mrdevlar@reddit

This is better than a Turing Test. Here is my version. Qwen3.6-35B-A3B-Q5_K_M.

[-]

cato_gts@reddit

In my bc250 Qwen all Q2 Gemma4 Q3

[-]

mrdevlar@reddit

I love this picture, look at the progress.

[-]

Evening_Ad6637@reddit

Gemma 4 dense, right?

[-]

cato_gts@reddit

No 27B-4B MoE Model

[-]

Evening_Ad6637@reddit

Wow okay, the result is not bad at all

[-]

SkyFeistyLlama8@reddit

I ran the same test on a Snapdragon X Elite, 64 GB RAM, ARM CPU inference. Llama-server build 8890 with speculative decoding on, 10 cores active, 65° C max temperature. Power draw probably around 30 W.

Qwen3.6-35B-A3B-Q4_0.gguf from Bartowski, 13 minutes, 12 t/s.

The fact that all this ran on an ultralight office laptop is mindblowing.

[-]

gladkos@reddit (OP)

Nice try! I was also surprised by M5 max power! It has 40 gpu cores, maybe that’s why gave higher tps

[-]

SkyFeistyLlama8@reddit

I'm running on ARM CPU inference because the Adreno GPU doesn't support larger MOEs. It started out at 25 t/s but it slowed down to 12 t/s at the end of the file.

Yeah the M5 Max is a beast. Having that much LLM power in a laptop is amazing.

[-]

gladkos@reddit (OP)

Nice try! I was also surprised by M5 max power! It has 40 gpu cores, maybe that’s why gave higher tps

[-]

misha1350@reddit

Why were you using Q4_0? Why not something like Unsloth's UD-Q4_K_XL?

[-]

SkyFeistyLlama8@reddit

ARM CPU and Adreno GPU requirements. Only a few quantization formats like Q4_0 and IQ4_NL support the fast matmul on these chips. If I use Q4_0, I can run the same model on CPU or GPU, depending on power requirements.

Unsloth's UD-Q4_K-XL keeps most tensors in other formats so it's a lot slower.

[-]

misha1350@reddit

I'm aware of the Adreno requirements for running on GPU, but it didn't seem like it was running on GPU, especially when the total power draw was around 30W. Did you try to run it on the NPU? It should run at better power efficiency even if the t/s count is lower. Still, 12t/s is pretty low for LPDDR5X-8500, I get 9-10t/s on DDR4-3200 on my Ryzen 5 5650U laptop running on CPU at 17W TDP which results in the total board power of 30W (with the screen and everything running).

[-]

SkyFeistyLlama8@reddit

The Adreno GPU has issues running larger MOEs. I think it's an OpenCL issue with allocating a large enough chunk of RAM for the entire model.

The NPU doesn't work for any recent LLM. Microsoft Foundry Local has some ancient models from 2024; Nexa AI SDK had some specially converted models up to Qwen 3 4B but that whole project was acquired by Qualcomm and things have been quiet since the acquisition. There's some support in llama.cpp for the Hexagon NPU but the build process is convoluted and you need to disable some Windows security features which I can't do on a work machine.

So I'm stuck with CPU inference for speed and for MOEs or the GPU for dense models running slowly.

CPU inference can pull up to 65 W but thermal throttling quickly brings it down to 30-45 W, even with a case fan pointed at the laptop's heatsink area. At low context, Qwen 35B runs at 25 t/s but at 10k it slows down to 12 t/s.

[-]

misha1350@reddit

So perhaps you should try running some GGUFs on CPU only. I'm not running anything on my Vega 7 iGPU for the reason that the RAM->CPU->iGPU interconnect is slow and it only slows things down by 30-50% and only making the power efficiency worse. Running on CPU is considerably faster to the point that local inference is actually viable for air-gapped scenarios in case I find myself needing that for anything, giving me 7.5t/s with Qwen 3 Next 80B at Q3_K_M (cannot use Q4_K_M because of virtual memory allocation constraints, maybe it's only possible with Linux), and 10GB of RAM are still remaining for the rest of the apps. There's still nothing that can bridge the gap between Qwen3.6 35B A3B and Qwen3.5 122B A10B, and Qwen 3 Next 80B is the only Sparse MoE model in that broad range, so I have to use that, and it gives me way better internal knowledge as a result. Thankfully I don't have to use it too often, I mostly live in the cloud since I don't make my living off of vibe-coding with LLMs.

[-]

CatEatsDogs@reddit

What draft model you are using?

[-]

SkyFeistyLlama8@reddit

Itself. My llama-server options:

--no-mmap --cache-ram -1 --batch-size 512 --ubatch-size 512 -c 70000 -np 1 -t 10 --temp 1.0 --top-p 0.95 --top-k 20 --chat-template-kwargs '{\"preserve_thinking\":true}' --presence_penalty 1.5 -fa 1 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48

[-]

CatEatsDogs@reddit

What's the point to use the same model as draft model for speculative decoding? Should be no beneficial at all. Is my understanding wrong?

[-]

akavel@reddit

AFAIU your understanding is right, and this is not really proper speculative decoding, but rather "self-repeat" optimization. Presumably when the server detects that a sequence of characters/tokens (?) being emitted matches one that was already emitted earlier in the session, it speculatively adds it to the output in a similar way as in proper speculative decoding. (Though I suspect it can also result in a slowdown if the speculative proposals end up being misses more often than hits?)

[-]

SkyFeistyLlama8@reddit

The Qwen 3.6 models sometimes show a significant speedup by using auto speculative decoding. Brighter minds than mine can explain why. Something to do with multiheaded attention.

[-]

kickerua@reddit

Try now with 62° C and 68° C max temperature, would it be better or worse?

[-]

rJohn420@reddit

how much context can you fit on that bad boy? I have an m5 pro with 64gb coming soon

[-]

skyyyy007@reddit

I'm getting 128k context window, qwen 3.6 35b a3b q4, about 55-70tps.

Have not tried qwen 3.6 27b yet

[-]

rJohn420@reddit

Do you find this enough for agentic coding? Are we at sonnet 4.6 levels yet? how much ram do you have left over? so excited to receive this machine.

[-]

skyyyy007@reddit

Personally from my experiences these few days (i just got my m5 pro on monday),

It doesn't feel like we are at sonnet 4.6 yet, still quite a distance from it. Still require monitoring of possible loops/hallucination/steering away(given the tasks, but after compact, it will just decide to change the tasks or self-invent other tasks)

For ram wise, loading it uses about 30gb with 128k context, it can go up to about 35-38gb using just opencode cli + lmstudio model loaded, no chrome or anything else open, under 20gb remaining as there some system ram usage also.

For reference purposes, I hit about 54gb if i run opencode cli + lm studio with model loaded + searxng on container + xcode simulator + vscode + facetime + 1 tab of chrome

[-]

gladkos@reddit (OP)

I would say up to 128K easily

[-]

Dr4x_@reddit

Way more I think, especially if you are using quantized kv cache, on 24 Gigs I can use up to 150k with kv cache at q8

[-]

_derpiii_@reddit

How do you generate prompts like that? I'm always amazed people can think of these little benchmarking projects.

[-]

IronColumn@reddit

human_imagination 86b

[-]

Xyrus2000@reddit

The average human brain has an estimated capacity equivalent to about a 100T LLM.

Yes, that's a T for trillion.

[-]

IronColumn@reddit

that's an interesting snapple fact but we don't know enough about how the brain works to make a meaningful comparison, so the word "estimated" is doing a lot of work here in what is ultimately a meaningless estimate

[-]

QuestionMarker@reddit

7 tokens context and a hardware offload bandwidth that's in bytes per second though

[-]

IronColumn@reddit

yeah i just use it for light creative tasks because of low inference costs

[-]

ThisWillPass@reddit

Uhh… human brain

A ~500T–2,000T parameter MoE with ~390 experts (ranging ~30B–3T each, or ~300B–30T with dendritic computation counted), activating ~25–150T parameters — roughly 5–15% of total — per ~100ms forward pass.

[-]

dannydeetran@reddit

https://i.redd.it/utfxdfdd18xg1.gif

here's mines, check out the twinkle in the stars and exhaust pipe.

Qwopus3.6 Q8

[-]

Xyrus2000@reddit

The 90's called and want their Game Boy Color back.

[-]

dannydeetran@reddit

I guess you wanted a screenshot instead

[-]

FoxiPanda@reddit

What were your launch parameters for these two models on this? I've managed to get Qwen3.6-27b into a loop 3 times in a row with these ones:

  --model "C:\llama.cpp\models\Qwen3.6-27B-UD-Q5_K_XL.gguf" `
  --mmproj "C:\llama.cpp\models\Qwen-3.6-27B-mmproj-BF16.gguf" `
  --no-mmproj-offload `
  --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 12 --draft-max 48 `
  --n-gpu-layers 999 `
  --ctx-size 262144 `
  --parallel 2 `
  --threads 16 `
  --temp 1.0 `
  --top-p 0.95 `
  --min-p 0.00 `
  --top-k 20 `
  --repeat-penalty 1.1 `
  --presence_penalty 1.0 `
  --chat-template-kwargs '{\"preserve_thinking\": true}' `
  --mlock `
  --flash-attn on `
  --cache-type-k q8_0 `
  --cache-type-v q8_0 `
  --kv-unified `

[-]

Treq01@reddit

This really helped me run the 3.6. Thank you.
The speculative decoding, and the mlock and the kv-unified was new to me,
and it seems to have sped up everything quite a bit on my 5090!

[-]

FoxiPanda@reddit

Happy to help. Be careful with the speculative decoding though, my settings are pretty aggressive here, so you might be better off backing off the various sizes by a factor of 2 to help avoid reasoning loops in addition to changing the presence penalty to 0.0 (for the same reason actually).

[-]

LienniTa@reddit

so what is the proper config then? also, why quantized kv cache?

[-]

FoxiPanda@reddit

--presence_penalty 1.0 was killing my model's ability to do the right thing. Setting it to 0.0 resolved my looping issue.

I also lowered the temp a bit for this particular task down to 0.8 so it was more deterministic.

Quantized cache in this case was me trading off context for quantized cache (I was running a long context workload prior to plopping this task in).

[-]

loadsamuny@reddit

heres some comparisons with gemma 4 too

https://electricazimuth.github.io/LocalLLM_VisualCodeTest/results/2026.04.23/

[-]

Available-Craft-5795@reddit

Seems like a prompt Bijan Bowen should use lol

[-]

YashN@reddit

, adding it to the Bowenchmark! :D

[-]

gladkos@reddit (OP)

haha next we'll make a full video game!

[-]

tobias_681@reddit

"Make GTA VI from scratch, make no mistakes"

[-]

gladkos@reddit (OP)

challenge accepted!

[-]

-Ellary-@reddit

All should be in a single HTML file ofc.

[-]

DocMadCow@reddit

I saw someone did space invaders so maybe go for Frogger.

[-]

lemondrops9@reddit

I've been making the classic games for over a year now using LLMs. The new models tend to do it right with less trys and look better.

[-]

OGScottingham@reddit

I've been doing all the classics. It's been a great way to play these games without bullshit ads.

[-]

danktuteja@reddit

Qwen 3.6 35B APEX I-Quality, took 5min 1s @ \~38/39 tok/s generation using Opencode

[-]

Alarmed_Wind_4035@reddit

with how faster the 35 is maybe it’s worth allowing it to do second pass and see how it handle it, the 35 allow me to run 256k context at reasonable pace with 24gb of vram the 27 can barely do 128k and it crash sometimes.

[-]

jacobpederson@reddit

Surprised Qwen 3.6 35B even finished (it's crap).

[-]

guiopen@reddit

Shouldn't the moe be 9 times faster? Here it is only 4

[-]

jkflying@reddit

There's the shared router weights as well, not just the expert, at least so I've heard.

[-]

P2070@reddit

I'm late, but this was a fun experiment and the car it made is janky and my trees are floating.

Qwen3.6-35b-a3b

183.32t/s

0.45s

[-]

101___@reddit

has the 27b lower results at all?

[-]

sacrelege@reddit

this is what FP8 looks like

[-]

DeliciousGorilla@reddit

unsloth/Qwen3.6-27B-UD-MLX-4bit in pi

[-]

maccam912@reddit

What are you using to host that model? I'm using oMLX but it seems like whatever I do it doesn't preserve the reasoning and starts looping forever.

[-]

Fantastic-Balance454@reddit

Claude 4.7 Opus attempt for reference: https://i.imgur.com/YqPz9vI.png

Tho it did read this skill after it was done thinking for 10 minutes + whatever is in the system prompt they use in the WebUI so can't really call it 1:1 prompt: https://github.com/anthropics/skills/blob/main/skills/frontend-design/SKILL.md

[-]

mister2d@reddit

Did you exhaust your weekly limit before the sun came up?

[-]

pulse77@reddit

You can not compare Claude 4.7 Opus (which uses agents in the background) with the one-shot prompt on Qwen3.6 (without any agents)! Claude 4.7 Opus can even check if code compiles/runs and fix it afterwards with additional prompts...

A more fair comparison would be to put the same prompt in OpenCode + Qwen 3.6 where agents also check the result and make fixes...

[-]

Fantastic-Balance454@reddit

I won't pretend I know what the hell is happening in the thinking phase but it did look like it used the agentic feature to only read the frontend design skill and move the html file to a display folder, thinking was shown as one shot and when it was writing the html it did it from top to bottom in one go: https://i.imgur.com/mowyCjE.jpeg

If you do it on a model that doesn't compress/summarise its reasoning to hide it, it looks plausible to me to be as close to one shot as possible. GLM 5.1 wrote so much in its thinking tokens its response got cut off when I tried to make a screenshot of it: https://pastebin.com/raw/XLgjUXW7

It wasn't iterating on the same file if that's what you're talking about, it thought for a while and then wrote the html file in one go. This is more about comparing a 27B model vs I dunno a 1T? model. I just did it for reference, can't really compare these two as they're in completely different weight classes. For a 27B model it did insanely good, way better than GLM 5.1 even, which is 754B parameters!

[-]

sacrelege@reddit

impressive, but 10minutes :-o my was done in \~50 sec

[-]

Fantastic-Balance454@reddit

Yeah it did a huge amount of thinking lol. I wonder if we can squeeze more out of QWEN if we prompt it to reason more, albeit at the cost of time?

[-]

CharacterAnimator490@reddit

Qwen 3.6 27b Q5_K_M

[-]

G-R-A-V-I-T-Y@reddit

What harness are you running it in? Claude? Openclaw?

[-]

sacrelege@reddit

opencode, took about \~52s, \~125toks/s+

[-]

JumpyAbies@reddit

I would play a game like this if it had an interesting storyline and was complete.

[-]

mrmontanasagrada@reddit

Nice! But now what happens when we give 35B 2 extra rounds to imroove? (Token/time wise that should be possible..)

I’d like to try that whenever I have a moment

[-]

Mahrkeenerh1@reddit

35b is 35ba3b, please make it clear, because now it seems like a smaller model is faster, which doesn't make sense

[-]

nikhilprasanth@reddit

Out of curiosity, I tested the prompt on earlier models mostly Q4 unsloth and it's great to see how far we've come!

[-]

DOAMOD@reddit

Yeah

[-]

misha1350@reddit

As always, Dense models are specifically suited for dGPUs, like the RX 7900 XT/XTX (20GB VRAM minimum) or Intel ARC Pro B60 24GB. They run on $900-1500 GPUs, which you have to pair with $600-1000 worth of computer parts anyway.

MoE models such as Qwen3.6 35B A3B (A3B is the distinction) are made to run on general purpose laptops like Macbooks, on mini-PCs, and others. You also don't have to spend much - it can be run easily on 36GB systems. The price for entry is lower.

Qwen3.6 35B A3B < Qwen3.6 27B < Qwen3.5 122B A10B. That's how it goes. 122B A10B can only be run on Macbooks and Strix Halo mini-PCs with 96GB RAM or higher.

[-]

nikhilprasanth@reddit

This is Qwen 3.5 27B Q3.

[-]

Healthy-Nebula-3603@reddit

Uh... Degradation :)

[-]

ea_man@reddit

Here's mine IQ3_XXS on a 12GB GPU:

llama-server \
-m /home/eaman/lm/models/mradermacher/Qwen3.6-27B.i1-IQ3_XXS.gguf \
       --host 0.0.0.0 \
       -np 1 \
       --fit-target 50 \
       -ctk q4_0 \
       -ctv q4_0 \
       -fa on \
       --temp 0.7 \
       --repeat-penalty 1.0 \
       --top-p 0.80 \
       --top-k 20 \
       --min-p 0.0 \
       --presence_penalty 1.5 \
       -b 512 \
       --jinja  \
       --reasoning-budget 1 \
       --chat-template-kwargs '{"enable_thinking":false}' \
       --no-mmap \

[-]

Healthy-Nebula-3603@reddit

Nice

[-]

jaigouk@reddit

I ran code generation test and I ended up using qwen36-35b-a3b-iq4xs on RTX4090.

https://jaigouk.com/gpumod/benchmarks/20260423_qwen36_gemma4_comparison/

generated outputs are located in https://github.com/jaigouk/gpumod/tree/main/docs/benchmarks/20260423_qwen36_gemma4_comparison/artifacts

[-]

UDPSendToFailed@reddit

unsloth/Qwen3.6-27B-GGUF:Q4_K_M

3min 55s at 38.99 t/s on a 4090.

[-]

QuestionMarker@reddit

That time difference makes me wonder if you could just ask 35b twice, then get it to judge its own output as a third query to pick the best. Or give it a two-shot, with a second prompt of "Here's what you just produced. See what you can do to improve it". You'd still come in faster than 27b, and it would be *fascinating* to know if a chance at introspection could push it up to (or past) 27b because you can run the MoE on more restricted hardware.

[-]

skyyyy007@reddit

Currently using execution prompts with qwen 3.6 35b a3b q4, with claude sonnet and codex as reviewers of the accuracy of the tasks completion on myself ongoing project

Average after running 5 tasks, getting about 90% of the work completed well after qwen says that it is done along with tests, the remaining 10% tends to be missing parts or incorrect changes by qwen.

[-]

Sad_Steak_6813@reddit

verdict : Don't ask qwen for directions

[-]

gladkos@reddit (OP)

Left or right what’s the difference?)

[-]

kaisurniwurer@reddit

.em llet uoy

[-]

gladkos@reddit (OP)

Nice try! I was also surprised by M5 max power! It has 40 gpu cores, maybe that’s why gave higher tps

[-]

aeroumbria@reddit

I wonder if these vision-capable models is able to effectively figure out how to check its own animation outputs. Checking static renders or plots seems to work fine, but videos and animation are always quite tricky.

[-]

AvidCyclist250@reddit

My experience is that the moe version wants to be harnessed and bossed around. It likes that.

[-]

Yes_but_I_think@reddit

Really like these short video style things for comparison. Crisp and to the point. Thanks for my time.

[-]

Technical-Earth-3254@reddit

Nice test, what quants did you use?

[-]

gladkos@reddit (OP)

thanks! q4 GGUF quantized

[-]

gladkos@reddit (OP)

Likely, as 35b uses only 3B simultaneously

[-]

The 35B model handled the task worse, but did it faster.

I had the same experience. The 3-4x speed is great for easy tasks though. Another thing to try is to have the 27B model create a plan for the 35B-A3B one.

[-]

gladkos@reddit (OP)

Nice idea! I guess the reason 35B-A3B uses only 3b parameters simultaneously