Qwen3.6-35B-A3B vs Gemma4-26B-A4B
Posted by MarcCDB@reddit | LocalLLaMA | 70 comments
Just wondering what people's experience has been with both of these models!
I've had some nice results with Qwen, but Gemma4 runs so much faster here. I'm using a Radeon 9070 XT and always the latest llama.cpp.
Endurance_Beast@reddit
In linguistics, gemma4 26b has the upper hand; in everything else the qwen 35b is better.
Much-Researcher6135@reddit
Ditto psychology.
Endurance_Beast@reddit
So u noticed?
Much-Researcher6135@reddit
I did. :)
I tested four models: Gemma4 dense/MoE, Qwen3.6 dense/MoE. The Gemma models were best at analyzing my art from a Jungian perspective (oddly, the MoE was better than dense) and discussing that type of thing with a bit of depth. Not as deep a comprehension as frontier models, but hey it's private. There's only so much I'm going to tell a model from OpenAI, Anthropic or Grok, no matter how clever they are.
Embarrassed-Area4652@reddit
Ditto in which direction?
Much-Researcher6135@reddit
apologies, see edit
speedb0at@reddit
qwen is overwhelmingly better
Borkato@reddit
Except for RP/writing/prose/summarizing/natural English
speedb0at@reddit
I use my shit to code not play pretend
ai_without_borders@reddit
running qwen3.6 35b q6 for most coding and agentic stuff, gemma4 26b for quick summarization where i need the throughput. main difference i notice is tool call reliability — qwen is rock solid across long sessions, gemma starts hallucinating tool schemas around context 60-80k. bartowski q6 over unsloth UD4 for me, the context degradation with MTP quants is real on longer tasks
AcrobaticChain1846@reddit
Except for the gemma part, I agree with most of it.
I haven't used gemma much.
For bartowski MTP, what max number of MTP tokens do you set? For unsloth it's suggested to use 6.
I tried the same 6 for bartowski but the acceptance rate is like 50-70%.
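For a rough sense of what a 50-70% acceptance rate means at 6 draft tokens, here is a minimal sketch of the usual speculative-decoding estimate. It assumes the reported rate can be treated as an independent per-token acceptance probability, which is a simplification and not necessarily how llama.cpp computes its statistic. The function name is made up for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption: every drafted token is accepted independently with the same
    probability, and drafting stops at the first rejection. The verifying
    model always contributes one token of its own, hence the "+ 1".
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_tokens
    return (1.0 - a ** (draft_tokens + 1)) / (1.0 - a)


if __name__ == "__main__":
    for rate in (0.5, 0.6, 0.7):
        for n in (2, 4, 6):
            tokens = expected_tokens_per_step(rate, n)
            print(f"acceptance={rate:.1f} draft_tokens={n} -> {tokens:.2f} tokens/step")
```

Under this assumption the extra draft tokens give diminishing returns at lower acceptance rates, so comparing a smaller draft length against 6 is a cheap experiment.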
pmttyji@reddit
Right now, most people are busy with the Qwen3.6 models after the MTP feature was recently merged into llama.cpp. Once this PR is merged, they will play with the Gemma-4 models side by side with Qwen3.6.
Much-Researcher6135@reddit
HYPE!
superdariom@reddit
Except that MTP for the 35b model really doesn't give any speed up at all. 27b different story...
10F1@reddit
I gained 20 tps on 35b using vulkan on 7900xtx
superdariom@reddit
Yes sorry I'm running 35b at Q8 on the same GPU as you. It also slows down prompt processing.
10F1@reddit
Q8 would offload to the CPU, a lot.
I'm running q4_k_s, getting ~110tps with 128k ctx.
Client_Hello@reddit
I did see about +30% when enabling MTP for 35b, definitely worth it. The people seeing little benefit likely have layers split between cpu and gpu.
27b shows a larger boost here since it has to run on gpu. I got +80% on 27b.
TonyStark1500@reddit
I’ve been testing out different models lately, I have an RTX 4090 and have Hermes setup. How can I decide between Qwen3.6 27B vs 35B A3B? Seems so tough, both have pros and cons from the research I’ve done so far?
Borkato@reddit
27B blows 35B out of the water when coding. Only run 35B for simple coding tasks, but even then I’d say it’ll frustrate you so often you should just stick with the slower 27B because at least it’s almost always right.
35B is great for “I just need an answer, it’s really not that hard” kind of questions. Like “what are some good words to describe food for my game, random, comma separated list” or “refactor this function so that the parameters have sensible defaults” or “finish this simple bash script, I forgot how to do the image part”
27B is great for every kind of coding. I love it so much. I don’t use either with thinking btw.
And be sure you’re using MTP!
If you’re roleplaying or writing or summarizing, use Gemma instead. But for anything coding use qwen - though Gemma is ok at it, ish. I find it takes up much more space and is just overall less satisfactory for coding for some reason
superdariom@reddit
27b at Q4 hallucinated a security flaw in an AWS configuration that caused me to waste several hours trying to troubleshoot it. 35b at Q8 didn't make the same mistake.
TonyStark1500@reddit
Fascinating, thanks for the reply! I’ve been mostly using Qwen3.6:27B lately, I think I agree it’s the best, and yes I’ve been using MTP! I’ve been struggling with how much context to allow, OpenCode seems to work pretty well with like 64K but I’m not sure Hermes likes that as much, but I’m not really sure. But was wondering if context wouldn’t matter as much if I switched to the MoE model since it’s so much faster tokens per second.
Borkato@reddit
Hmmm that I don’t know! I roll my own incredibly simple framework, I don’t trust opencode. They don’t even let you disable telemetry completely, it still phones home to the Llm list. But that’s a me problem haha.
I’ve heard pi has shorter system prompts! I’ve never used it though
TonyStark1500@reddit
Oh wait this is great to know, I thought OpenCode was fully local and private!! I’ll have to look into Pi and maybe other alternatives!!
simon_zzz@reddit
I use both regularly on 24GB VRAM.
Qwen is better, qualitatively. Better at following instructions and calling tools.
Gemma is fast but I only use it for summarization.
FerLuisxd@reddit
Gemma is faster? How much faster?
Much-Researcher6135@reddit
Here's my speed test on a 3090 with the gemma4/qwen3.6 models:
Client_Hello@reddit
Gemma 26B can be faster on 24GB because it's easier to fit entirely in VRAM. Things flip if you can run Qwen 35B in VRAM. It becomes faster due to fewer active parameters.
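A back-of-the-envelope sketch of that trade-off, using only the parameter counts from the model names (26B total / 4B active, 35B total / 3B active). The 4.5 bits per weight is an assumed stand-in for a roughly 4-bit quant, not a measured file size, and the estimate ignores KV cache, compute buffers and any vision projector.

```python
GIB = 2**30


def weight_gib(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of all weights in GiB (what has to fit in VRAM)."""
    return total_params_b * 1e9 * bits_per_weight / 8 / GIB


def active_gb_per_token(active_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight bytes read per generated token, in GB.

    For an MoE model only the active parameters are touched per token,
    which is why fewer active parameters tends to mean faster decoding
    once everything is resident in VRAM.
    """
    return active_params_b * bits_per_weight / 8


if __name__ == "__main__":
    bpw = 4.5  # assumption: rough average for a ~4-bit quant
    models = {
        "Gemma4-26B-A4B": (26, 4),
        "Qwen3.6-35B-A3B": (35, 3),
    }
    for name, (total_b, active_b) in models.items():
        print(
            f"{name}: ~{weight_gib(total_b, bpw):.1f} GiB of weights, "
            f"~{active_gb_per_token(active_b, bpw):.2f} GB read per token"
        )
```

With these assumed numbers the 26B model leaves noticeably more headroom on a 24GB card, while the 35B model reads less per token once it fits.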
Capable-Guide98@reddit
Newbie here, but my understanding was that Gemma was an MoE model, so its execution is faster as it does not have to evaluate all of the parameters. Is that not the case?
CooperDK@reddit
Gemma is also better for general chat and roleplay.
nickm_27@reddit
I see a number of people saying Gemma does not do well with tool calling but that is not my experience, using it for tool calling like web search, weather forecasts, place searching, device control, etc. it works flawlessly.
For me Qwen works well at this too, except it is way too chatty and refuses to follow instructions about keeping responses brief and concise, making it much more frustrating to use for my use cases.
Gemma follows instructions much more naturally and easily.
TheHiveFather@reddit
I run both and like both, probably leaning more towards Qwen, but I run my own interface so I can't speak to off-the-shelf setups. The differences are subtle, I would say; it comes down to tasks and expectations more than capability. I test them against my workflows within specified roles, so I'm essentially finding the best use case for each in my stack.
anykeyh@reddit
KV cache on Qwen is so much better;
On my setup (strix), because of that I can run 4~6 parallel workflows and get saturated GPU compute instead of being memory bound, so ~85tps overall.
With Gemma, the KV cache grows way faster with context because it's a different architecture. Qwen stays under 10G for 256k context at Q8 (Gated Delta Net); I cannot do that with Gemma, it uses way more memory once the context starts to fill.
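To illustrate why the architecture matters here, a minimal sketch of the standard KV cache size formula for full-attention layers. All layer counts and head sizes below are hypothetical and are not the real Gemma or Qwen configs; the point is only that layers which keep a fixed-size state instead of a per-token cache (as linear-attention layers do) drop out of the product. One byte per element is used as an approximation of a Q8 cache.

```python
GIB = 2**30


def kv_cache_gib(
    n_attn_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_element: float = 1.0,  # assumption: ~1 byte per element for a Q8 cache
) -> float:
    """KV cache size in GiB for layers that store keys and values per token."""
    # Factor 2 = one K tensor and one V tensor per layer.
    total_bytes = (
        2 * n_attn_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element
    )
    return total_bytes / GIB


if __name__ == "__main__":
    ctx = 256 * 1024  # 256k tokens

    # Hypothetical model where all 48 layers use full attention.
    all_attention = kv_cache_gib(48, 8, 128, ctx)

    # Hypothetical hybrid where only 12 of 48 layers keep a per-token KV cache.
    hybrid = kv_cache_gib(12, 8, 128, ctx)

    print(f"all layers cached: {all_attention:.1f} GiB")
    print(f"hybrid (1 in 4):   {hybrid:.1f} GiB")
```

Plugging in the real values from a model's config file would give an actual estimate; sliding-window layers would need a separate term capped at the window size.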
Loud-Swim-2932@reddit
I am running Gemma-4-31B-IT-NVFP4 with MTP assistant on a dgx spark. Pretty fast and reliable for OCR, translations, text corrections and basic tool usage. In my simple test setup it performs better than Qwen.
For me - both failed at coding.
reddit_kwr@reddit
Tried benchmarking both on BIRD SQL under conditions as identical as we could make them. Gemma beats Qwen by a margin (20% vs 12%).
This is a more complex benchmark with multiple steps, exploring the environment, actively asking for feedback and making complex plans. It's not a simple QnA benchmark.
Was absolutely not expecting this. Both were 4-bit quantized. Reasons could be:
If people are interested I can look at the data and see where qwen fails more often. But also, would love to have someone give me the best qwen config to run the eval with. I have 1x5090
Borkato@reddit
From ooba’s test, Gemma does horrendously with quantization while qwen does much better. May want to try comparing larger quant if you can.
I feel like we really need a unified place for benchmarks lol
blackhawkx12@reddit
Qwen all the way. It's actually really interesting that Chinese open-source models now really compete on par, because competition means better options for customers.
Rektile142@reddit
I use Gemma 4 26B A4B as my daily driver in 4 bit quant, and I have not had a single issue. It chains tool calls flawlessly, although I do not stress the context window (~100k max).
It trades blows with Qwen 3.6 27B for my use case, which is web research and system execution via command line use. For coding only, folks seem to prefer the Qwen models.
trentvb@reddit
Just my experience: running both MoEs with unsloth's q4x on 12GB VRAM, qwen3.6 is faster. However, I've found it not very good at writing Go and prefer gemma4 in that area.
Designer_Elephant227@reddit
I use 35b Q5 and 26b Q4. I got many problems with tool calls with Gemma and literally none with qwen.
TheDapperYank@reddit
How well do they do with looping? I alternate between Qwen3.6 35b and 27b, and I find that 35b tends to loop on medium length tasks (anything longer than ~32k context). Not sure if that's a feature of it being a small MoE model.
Gesha24@reddit
Try Gemma with opencode - it does work with its tools. I still prefer Qwen, but sometimes it's nice to have a different model available. That said, at least my Gemma turns into a pumpkin by 100K context and starts struggling with tools again.
okoyl3@reddit
I ran unsloth Qwen3.6 35b-a3b UD4 xl with opencode, felt like Claude code.
SlechteConcentratie@reddit
On which hardware?
jacknjill101@reddit
I’m running it on a M4 Mac mini 64gb.
AcrobaticChain1846@reddit
I switched to bartowski because I was having trouble with unsloth's MTP models: as the context filled up, the tk/s drop was significant.
The bartowski q6 doesn't have that much degradation and gives a consistent 60-90 tk/s, whereas with unsloth's the drop is like 30-40 tk/s as the context gets full.
ea_man@reddit
Barto IQ3 quants are pretty interesting for anyone who wants fast replies with little VRAM, such as ~16GB, go check the tensors:
jacknjill101@reddit
That’s why you should compact your session as part of your workflow.
Potential-Leg-639@reddit
Correct answer. Best quant!
Running UD-Q8 recently and can't really spot a difference from UD-Q4 tbh.
Jorlen@reddit
I recently switched to Pi / Pi.dev, which is a CLI-based, super lightweight coding agent. I'm not a big CLI (command-line interface) dev yet and still prefer an IDE like VSCodium, but thankfully there is a vscode extension called "Pi Agent for Vs code" that links vscode directly to Pi once it's installed. I set it to only do edit, write and read calls and with this, Gemma 4 seems to work perfectly.
I think Gemma4 stumbles on very complex stuff like continue.dev and so on but if you give it a minimalist setup, it works really well (so far - about one hour in tests).
fatboy93@reddit
Wow, thats the exact opposite for me. Qwen just runs into a single token print with "!" on both GGUFs and MLX for me :/
Gemma runs decently.
JsThiago5@reddit
Here i use qwen 27b dense for coding, sometimes 35b if i need speed, and gemma for everything else.
ideal2545@reddit
Gemma as a daily ai, discusser, idea generator blah blah, code with qwen
xandep@reddit
-- "Love with your Gemma, use your Qwen for everything else"
Sisaroth@reddit
i see a Captain Disillusion fan
No_Swimming6548@reddit
Gemma for RP, Qwen for everything else
ceo_of_banana@reddit
RP?
VoidAlchemy@reddit
role play (narrative chat workload as opposed to say vibe coding)
LetsGoBrandon4256@reddit
It's hilarious that Qwen, a Chinese model, writes worse Chinese prose in RP than Gemma by default (minimal prompting).
WakanaeShion@reddit
I am using the models `gemma-4-26B-A4B-it-UD-IQ2_M` and `Qwen3.6-35B-A3B-UD-IQ1_M_MTP` with stock llama.cpp.
Qwen slightly exceeds my video memory capacity (16GB).
During testing with OpenCLI calls and the enterprise knowledge base RAG, I found that Gemma has slightly better comprehension ability.
In my own examples, Gemma’s visual component (mmproj) also performed slightly better at image recognition without misidentifying things.
As for programming, I think small models are not up to the task. Even online DeepSeek struggles with complex architectural thinking — only GPT can sustain a decent conversation.
Ariquitaun@reddit
For chatbot functionality Gemma is nicer, but qwen is better at tool calls which you need, since at that model size their knowledge base is blurry at best and being able to search and fetch from the Internet helps to plug that gap.
bhagathgoud99@reddit
Which one is best?
OhShitOhFuckOhMyGod@reddit
For non-coding Gemma is better in my testing.
Icy-Degree6161@reddit
Yes. For IT related text translation it's miles better for me. Especially European langs.
CooperDK@reddit
I agree. There is a qwen-claude mix which I need to test for coding
Dance-Till-Night1@reddit
I use llms mainly for language learning and scientific biological/health/medical queries and gemma seems slightly smarter.
cu-pa@reddit
qwen + hermes, no problem at all
floriandotorg@reddit
In my personal experience, I’m a bit disappointed by Gemma. Qwen feels much better, not only for coding, but for everything.
leonbollerup@reddit
qwen.. gemma uses too much memory and its tool calls are unstable.. at best..
ElChupaNebrey@reddit
Qwen