Qwen3.6-35B-A3B vs Gemma4-26B-A4B

[-]

Endurance_Beast@reddit

In linguistics, gemma4 26b has the upper hand; in everything else the qwen 35b is better.

[-]

I tested four models: Gemma4 dense/MoE, Qwen3.6 dense/MoE. The Gemma models were best at analyzing my art from a Jungian perspective (oddly, the MoE was better than dense) and discussing that type of thing with a bit of depth. Not as deep a comprehension as frontier models, but hey it's private. There's only so much I'm going to tell a model from OpenAI, Anthropic or Grok, no matter how clever they are.

[-]

Embarrassed-Area4652@reddit

Ditto in which direction?

[-]

Much-Researcher6135@reddit

apologies, see edit

[-]

speedb0at@reddit

qwen is overwhelmingly better

[-]

Borkato@reddit

Except for RP/writing/prose/summarizing/natural English

[-]

speedb0at@reddit

I use my shit to code not play pretend

[-]

ai_without_borders@reddit

running qwen3.6 35b q6 for most coding and agentic stuff, gemma4 26b for quick summarization where i need the throughput. main difference i notice is tool call reliability — qwen is rock solid across long sessions, gemma starts hallucinating tool schemas around context 60-80k. bartowski q6 over unsloth UD4 for me, the context degradation with MTP quants is real on longer tasks

[-]

AcrobaticChain1846@reddit

Except for the gemma part, I agree with the majority of it.

I haven't used gemma much.

For bartowski MTP, what max MTP tokens do you set? For unsloth it's suggested to use 6.

I tried same 6 for bartowski but acceptance rate is like 50-70%
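A rough way to reason about whether a 50-70% acceptance rate with 6 draft tokens is worth it: the sketch below uses the standard speculative-decoding estimate for expected tokens emitted per verification pass. It assumes each drafted token is accepted independently with the same probability, and it treats the reported acceptance rate as that per-token probability, which is a simplification. The function name is hypothetical and this is not how llama.cpp computes its statistics.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes every drafted token is accepted independently with
    probability accept_rate (an idealised model, not a measurement).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every draft accepted, plus the one token the main model adds.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    for rate in (0.5, 0.7):
        tokens = expected_tokens_per_step(rate, draft_len=6)
        print(f"accept_rate={rate:.1f}, draft_len=6 -> {tokens:.2f} tokens per pass")
```

Under this model the gain flattens quickly at lower acceptance rates, so a shorter draft length may cost less wasted drafting for a similar benefit. Actual speedup also depends on how expensive drafting and verification are on your hardware.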

[-]

pmttyji@reddit

Right now, most people are busy with Qwen3.6 models after the recent MTP feature was merged into llama.cpp. Once this PR is merged, they will play with Gemma-4 models side by side with Qwen3.6.

[-]

Much-Researcher6135@reddit

HYPE!

[-]

superdariom@reddit

Except that MTP for the 35b model really doesn't give any speed up at all. 27b different story...

[-]

10F1@reddit

I gained 20 tps on 35b using vulkan on 7900xtx

[-]

superdariom@reddit

Yes sorry I'm running 35b at Q8 on the same GPU as you. It also slows down prompt processing.

[-]

10F1@reddit

Q8 would offload to the CPU, a lot.

I'm running q4_k_s, getting ~110tps with 128k ctx.

[-]

Client_Hello@reddit

I did see about +30% when enabling MTP for 35b, definitely worth it. The people seeing little benefit likely have layers split between cpu and gpu.

27b shows a larger boost here since it has to run on the GPU. I got +80% on 27b.

[-]

TonyStark1500@reddit

I’ve been testing out different models lately, I have an RTX 4090 and have Hermes setup. How can I decide between Qwen3.6 27B vs 35B A3B? Seems so tough, both have pros and cons from the research I’ve done so far?

[-]

Borkato@reddit

27B blows 35B out of the water when coding. Only run 35B for simple coding tasks, but even then I’d say it’ll frustrate you so often you should just stick with the slower 27B because at least it’s almost always right.

35B is great for “I just need an answer, it’s really not that hard” kind of questions. Like “what are some good words to describe food for my game, random, comma separated list” or “refactor this function so that the parameters have sensible defaults” or “finish this simple bash script, I forgot how to do the image part”

27B is great for every kind of coding. I love it so much. I don’t use either with thinking btw.

And be sure you’re using MTP!

If you’re roleplaying or writing or summarizing, use Gemma instead. But for anything coding use qwen - though Gemma is ok at it, ish. I find it takes up much more space and is just overall less satisfactory for coding for some reason

[-]

superdariom@reddit

27b at Q4 hallucinated a security flaw in an AWS configuration that caused me to waste several hours trying to troubleshoot it. 35b at Q8 didn't make the same mistake.

[-]

TonyStark1500@reddit

Fascinating, thanks for the reply! I’ve been mostly using Qwen3.6:27B lately, I think I agree it’s the best, and yes I’ve been using MTP! I’ve been struggling with how much context to allow. OpenCode seems to work pretty well with like 64K, but I’m not sure Hermes likes that as much. I was wondering if context wouldn’t matter as much if I switched to the MoE model, since it’s so much faster in tokens per second.

[-]

Borkato@reddit

Hmmm that I don’t know! I roll my own incredibly simple framework, I don’t trust opencode. They don’t even let you disable telemetry completely, it still phones home to the Llm list. But that’s a me problem haha.

I’ve heard pi has shorter system prompts! I’ve never used it though

[-]

TonyStark1500@reddit

Oh wait this is great to know, I thought OpenCode was fully local and private!! I’ll have to look into Pi and maybe other alternatives!!

[-]

simon_zzz@reddit

I use both regularly on 24GB VRAM.

Qwen is better, qualitatively. Better at following instructions and calling tools.

Gemma is fast but I only use it for summarization.

[-]

FerLuisxd@reddit

Gemma is faster? How much faster?

[-]

Much-Researcher6135@reddit

Here's my speed test on a 3090 with the gemma4/qwen3.6 models:

| model | test | t/s | peak t/s |
|---|---|---|---|
| qwen36_35b_a3b | tg128 | 141.40 ± 0.83 | 142.54 ± 0.85 |
| qwen36_27b_dense | tg128 | 37.30 ± 0.10 | 38.00 ± 0.00 |
| gemma4_26b_a4b | tg128 | 128.84 ± 0.29 | 129.90 ± 0.29 |
| gemma4_31b_dense | tg128 | 35.83 ± 0.09 | 36.20 ± 0.40 |
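
To put the "how much faster" question in ratios, here is a small sketch that just divides the tg128 means from the table above. The dictionary and the `speedup` helper are illustrative names; the numbers are copied from the table and apply only to this one 3090 run.

```python
# tg128 mean tokens/s copied from the table above (single RTX 3090 run).
TG128_TPS = {
    "qwen36_35b_a3b": 141.40,
    "qwen36_27b_dense": 37.30,
    "gemma4_26b_a4b": 128.84,
    "gemma4_31b_dense": 35.83,
}


def speedup(fast: str, slow: str) -> float:
    """Ratio of generation speed between two entries of TG128_TPS."""
    return TG128_TPS[fast] / TG128_TPS[slow]


if __name__ == "__main__":
    pairs = [
        ("qwen36_35b_a3b", "qwen36_27b_dense"),
        ("gemma4_26b_a4b", "gemma4_31b_dense"),
        ("qwen36_35b_a3b", "gemma4_26b_a4b"),
    ]
    for fast, slow in pairs:
        print(f"{fast} vs {slow}: {speedup(fast, slow):.2f}x")
```

In this run the two MoE models are each several times faster than their dense siblings, and the Qwen MoE is modestly ahead of the Gemma MoE rather than behind it.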

[-]

Client_Hello@reddit

Gemma 26B can be faster on 24GB because it's easier to fit entirely in VRAM. Things flip if you can run Qwen 35B in VRAM. It becomes faster due to fewer active parameters.
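
A back-of-the-envelope sketch of why fewer active parameters helps once everything is in VRAM: token generation is largely memory-bandwidth bound, so a crude ceiling is bandwidth divided by the bytes of active weights read per token. The active parameter counts (3B and 4B) come from the model names; the 936 GB/s figure is the RTX 3090 spec bandwidth, and the 4.5 bits per weight is an assumed average for a 4-bit quant. It ignores KV cache reads, routing, and kernel overhead, so real speeds land well below this ceiling.

```python
def decode_ceiling_tps(active_params_billion: float,
                       bits_per_weight: float,
                       bandwidth_gb_per_s: float) -> float:
    """Crude upper bound on tokens/s if decoding were purely bandwidth bound.

    Only counts reading the active weights once per token; everything
    else (KV cache, activations, overhead) is ignored.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    BANDWIDTH = 936.0      # GB/s, RTX 3090 spec figure
    BITS_PER_WEIGHT = 4.5  # assumed average for a 4-bit quant

    qwen = decode_ceiling_tps(3.0, BITS_PER_WEIGHT, BANDWIDTH)   # A3B
    gemma = decode_ceiling_tps(4.0, BITS_PER_WEIGHT, BANDWIDTH)  # A4B
    print(f"A3B ceiling: {qwen:.0f} t/s")
    print(f"A4B ceiling: {gemma:.0f} t/s")
    print(f"ratio: {qwen / gemma:.2f}x")
```

The useful part is the ratio rather than the absolute numbers: at equal quantization, 3B active versus 4B active gives at most about a one-third advantage, and only if no layers spill to the CPU.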

[-]

Capable-Guide98@reddit

Newbie here, but my understanding was that Gemma was an MoE model, so its execution is faster as it does not have to evaluate all of the parameters. Is that not the case?

[-]

CooperDK@reddit

Gemma is also better for general chat and roleplay.

[-]

nickm_27@reddit

I see a number of people saying Gemma does not do well with tool calling but that is not my experience, using it for tool calling like web search, weather forecasts, place searching, device control, etc. it works flawlessly.

For me Qwen works well at this too except it is way too chatty and refuses to follow instructions about being brief / concise responses, making it much more frustrating to use for my use cases.

Gemma follows instructions much more naturally and easily.

[-]

TheHiveFather@reddit

I run both and like both, probably leaning more towards Qwen, but I run my own interface so I can't speak to off-the-shelf setups. The differences are subtle, I would say; it comes down to tasks and expectations more than capability. I try to test them based on my workflows within specified roles, so I'm essentially finding the best use case for each in my stack.

[-]

anykeyh@reddit

KV cache on Qwen is so much better.

On my setup (strix), because of that I can run 4~6 parallel workflows and saturate GPU compute instead of being memory bound, so ~85tps overall.

With Gemma, the KV cache grows way faster with context because it's a different architecture. Qwen stays under 10G for 256k context at Q8 (Gated Delta Net); I cannot do that with Gemma, it uses way more memory once context starts to fill.
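
A minimal sketch of the arithmetic behind this: only standard attention layers keep a key/value cache that grows with context, while linear-attention layers keep a fixed-size state. The layer counts, head counts, and head size below are hypothetical placeholders, not the real Qwen or Gemma configurations, and the fixed-size recurrent state is left out.

```python
def kv_cache_bytes(n_attn_layers: int,
                   n_kv_heads: int,
                   head_dim: int,
                   ctx_tokens: int,
                   bytes_per_elem: int) -> int:
    """Size of the growing KV cache for the full-attention layers only.

    The factor of 2 covers keys and values. Layers with a fixed-size
    recurrent state (linear attention) are not counted here.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


if __name__ == "__main__":
    CTX = 256 * 1024   # 256k tokens
    Q8 = 1             # roughly one byte per element, ignoring block scales

    # Hypothetical 40-layer models with identical head shapes.
    all_attention = kv_cache_bytes(40, 2, 256, CTX, Q8)
    hybrid = kv_cache_bytes(10, 2, 256, CTX, Q8)  # only 1 in 4 layers caches KV

    gib = 1024 ** 3
    print(f"all layers cache KV: {all_attention / gib:.1f} GiB")
    print(f"1 in 4 layers caches KV: {hybrid / gib:.1f} GiB")
```

With these placeholder shapes the hybrid layout needs a quarter of the cache at the same context length, which is the kind of headroom that makes several parallel long-context sessions feasible.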

[-]

Loud-Swim-2932@reddit

I am running Gemma-4-31B-IT-NVFP4 with MTP assistant on a dgx spark. Pretty fast and reliable for OCR, translations, text corrections and basic tool usage. In my simple test setup it performs better than Qwen.
For me - both failed at coding.

[-]

reddit_kwr@reddit

Tried benchmarking both on BIRD SQL in as identical a condition as we could. Gemma beats Qwen by a margin (20% vs 12%).

This is a more complex benchmark with multiple steps, exploring the environment, actively asking for feedback and making complex plans. It's not a simple QnA benchmark.

Was absolutely not expecting this. Both were on 4 bit quantized. Reasons could be:

- qwen quants degrade more
- we just didn't know how to set up qwen properly
- qwen is a good coder but simply not good when a lot of other dynamics are involved

If people are interested I can look at the data and see where qwen fails more often. But also, would love to have someone give me the best qwen config to run the eval with. I have 1x5090

[-]

Borkato@reddit

From ooba’s test, Gemma does horrendously with quantization while qwen does much better. May want to try comparing larger quant if you can.

I feel like we really need a unified place for benchmarks lol

[-]

blackhawkx12@reddit

Qwen all the way. It's actually really interesting that Chinese open-source models now really compete on par, because competition means better options for customers.

[-]

Rektile142@reddit

I use Gemma 4 26B A4B as my daily driver in 4 bit quant, and I have not had a single issue. It chains tool calls flawlessly, although I do not stress the context window (~100k max).

It trades blows with Qwen 3.6 27B for my use case, which is web research and system execution via command line use. For coding only, folks seem to prefer the Qwen models.

[-]

trentvb@reddit

Just my experience: running both MoEs with unsloth's q4x on 12gb vram, qwen3.6 is faster. However, I've found it not very good at writing Go and prefer gemma4 in that area.

[-]

Designer_Elephant227@reddit

I use 35b Q5 and 26b Q4. I got many problems with tool calls with Gemma and literally none with qwen.

[-]

TheDapperYank@reddit

How well do they do with looping? I alternate between Qwen3.6 35b and 27b, and I find that 35b tends to loop on medium length tasks (anything longer than ~32k context). Not sure if that's a feature of it being a small MoE model.

[-]

Gesha24@reddit

Try Gemma with opencode - it does work with its tools. I still prefer Qwen, but sometimes it's nice to have a different model available. That said, at least my Gemma turns into a pumpkin by 100K context and starts struggling with tools again.

[-]

okoyl3@reddit

I ran unsloth Qwen3.6 35b-a3b UD4 xl with opencode, felt like Claude code.

[-]

SlechteConcentratie@reddit

On which hardware?

[-]

jacknjill101@reddit

I’m running it on an M4 Mac mini 64gb.

[-]

AcrobaticChain1846@reddit

I switched to bartowski because I was having trouble with unsloth's MTP models: as the context filled up, the tk/s drop was significant.

The bartowski q6 doesn't have that much degradation and gives a consistent 60-90 tk/s, whereas unsloth's drop is like 30-40 tk/s as the context fills up.

[-]

ea_man@reddit

Barto IQ3 quants are pretty interesting for anyone who wants fast replies with little VRAM, like ~16GB, go check the tensors:

[-]

jacknjill101@reddit

That’s why you should compact your session as part of your workflow.

[-]

Potential-Leg-639@reddit

Correct answer. Best quant!

Running UD-Q8 recently and can't really spot a difference to UD-Q4 tbh.

[-]

Jorlen@reddit

I recently switched to Pi / Pi.dev, which is a CLI-based, super lightweight coding agent. I'm not a big CLI (command-line interface) dev yet and still prefer an IDE like VSCodium, but thankfully there is a vscode extension called "Pi Agent for Vs code" that links vscode directly to Pi once it's installed. I set it to only do edit, write and read calls, and with this, Gemma 4 seems to work perfectly.

I think Gemma4 stumbles on very complex stuff like continue.dev and so on but if you give it a minimalist setup, it works really well (so far - about one hour in tests).

[-]

fatboy93@reddit

Wow, that's the exact opposite for me. Qwen just runs into a single token print with "!" on both GGUFs and MLX for me :/

Gemma runs decently.

[-]

JsThiago5@reddit

Here i use qwen 27b dense for coding, sometimes 35b if i need speed, and gemma for everything else.

[-]

ideal2545@reddit

Gemma as a daily ai, discusser, idea generator blah blah, code with qwen

[-]

xandep@reddit

-- "Love with your Gemma, use your Qwen for everything else"

[-]

Sisaroth@reddit

i see a Captain Disillusion fan

[-]

No_Swimming6548@reddit

Gemma for RP, Qwen for everything else

[-]

ceo_of_banana@reddit

RP?

[-]

VoidAlchemy@reddit

role play (narrative chat workload as opposed to say vibe coding)

[-]

LetsGoBrandon4256@reddit

It's hilarious that Qwen, a Chinese model, writes worse Chinese prose in RP than Gemma by default (minimal prompting).

[-]

WakanaeShion@reddit

I am using the models `gemma-4-26B-A4B-it-UD-IQ2_M` and `Qwen3.6-35B-A3B-UD-IQ1_M_MTP` with stock llama.cpp.
Qwen slightly exceeds my video memory capacity (16GB).

During testing with OpenCLI calls and the enterprise knowledge base RAG, I found that Gemma has slightly better comprehension ability.
In my own examples, Gemma’s visual component (mmproj) also performed slightly better at image recognition without misidentifying things.

As for programming, I think small models are not up to the task. Even online DeepSeek struggles with complex architectural thinking — only GPT can sustain a decent conversation.

[-]