Gemma 4 MoE is very bad at agentic coding. It couldn't do things Cline + Qwen can do.
Posted by Voxandr@reddit | LocalLLaMA | View on Reddit | 29 comments
Qwen 3 Coder Next never has these problems.

Gemma 4 is failing hard
Dangerous-Truth5113@reddit
Looks more like an agent failure. I'm using Zed and Gemma 4 does pretty well vs. Qwen3.5-9B
Voxandr@reddit (OP)
Looks like it's fixed now, gonna try.
Pattinathar@reddit
Running Gemma 4 26B-A4B Q4_K_M on 32GB RAM. gpu_layers=0 is mandatory; with 4GB VRAM it crashed with even 1 layer offloaded. MoE expert layers are too large for consumer GPUs. CPU-only gives ~5 min responses, but quality is solid.
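For reference, a CPU-only launch with llama-server might look like this (flag names per llama.cpp's CLI docs; the model filename here is a hypothetical placeholder for whatever GGUF you downloaded):

```shell
# -ngl 0 / --n-gpu-layers 0 keeps every layer on the CPU, avoiding the
# VRAM crash seen when even one MoE expert layer lands on a 4GB GPU.
llama-server \
  -m ./gemma-4-26b-a4b-q4_k_m.gguf \
  --n-gpu-layers 0 \
  --ctx-size 8192 \
  --threads 8
```

Trading the GPU for system RAM is what makes responses slow, but it sidesteps the expert-layer size problem entirely.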
benevbright@reddit
Same here. Tested with a coding agent yesterday on the latest LM Studio and the result was very, very disappointing. Still, qwen3-coder-next is the best... (on my 64GB Mac Studio)
Voxandr@reddit (OP)
This was merged today, let's see how it fares: https://github.com/ggml-org/llama.cpp/pull/21534
benevbright@reddit
thx for the info!
Deep_Ad1959@reddit
agentic coding is one of the hardest benchmarks for any model because it requires sustained tool-use over many turns without losing context. i've been working on desktop automation agents and the gap between models that can reliably chain 10+ tool calls vs ones that fall apart after 3 is massive. it's not just about raw intelligence, it's about how well the model was trained on the tool-use loop specifically.
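The loop described above can be sketched as a toy harness (this is an illustration, not any specific framework; `call_model` and the tool registry are hypothetical stand-ins):

```python
import json

# Hypothetical tool registry: the agent only progresses if the model
# emits well-formed tool calls turn after turn.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda: "3 passed, 0 failed",
}

def run_agent(call_model, task, max_turns=10):
    """Drive a model through repeated tool calls until it answers.

    call_model(history) returns a dict like {"tool": name, "args": {...}}
    or {"answer": text}. Each turn's result is appended to history, so a
    single malformed or misdirected call pollutes every later turn --
    which is why chains of 10+ calls are so much harder than 3.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(history)
        if "answer" in reply:
            return reply["answer"]
        tool = TOOLS.get(reply.get("tool"))
        if tool is None:
            history.append({"role": "tool", "content": "error: unknown tool"})
            continue
        result = tool(**reply.get("args", {}))
        history.append({"role": "tool", "content": json.dumps({"result": result})})
    return None  # model lost the thread before finishing
```

The point of the sketch: there is no recovery path except the model reading its own error feedback, so sustained tool-use discipline matters more than single-turn smarts.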
Neo2066@reddit
Hey man, sincerely thank you for this. I was going in circles trying to automate a browser, thank you for sharing.
Voxandr@reddit (OP)
Looks like that's why the Coder models shine.
NNN_Throwaway2@reddit
Pretty sure llama.cpp is still broken. There was just a new release, so maybe it finally works.
neverbyte@reddit
llama.cpp is still broken. I'm not sure why more people aren't talking about it. Doesn't matter if you download a release or git pull the latest; it still has some kind of tokenizer problem, IMO. Your agent will have an existential crisis trying to make sense of what is wrong and will fail tool calls. I am 100% confident Gemma 4 will be an amazing agent once proper fixes merge into llama.cpp.
Voxandr@reddit (OP)
version: 8665 (b8635075f)
Voxandr@reddit (OP)
Let me check what llama.cpp version I am using (using the latest docker pull).
llama-impersonator@reddit
I use the interleaved chat template (models/templates/google-gemma-4-31B-it-interleaved.jinja) and the 31B is working quite well after b8665's updated parser.
Voxandr@reddit (OP)
Gonna check, but 31B is too slow on Strix Halo.
llama-impersonator@reddit
it's pretty slow even on 3090s.
JohnMason6504@reddit
MoE models need different prompting for agentic workloads. The routing layer decides which experts activate per token, and tool-call JSON can land on suboptimal expert paths if your system prompt is not structured right. Try explicit XML-style tool schemas instead of free-form JSON. Qwen3 dense models avoid this because every param sees every token. Not a model quality issue, it is a routing architecture issue.
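To illustrate the suggestion above, here is a minimal side-by-side of the two prompt styles (the schema text is hypothetical, and whether XML framing actually helps expert routing is this commenter's claim, not an established fact):

```python
import json
import xml.etree.ElementTree as ET

# Free-form JSON schema dropped into the system prompt.
json_style = json.dumps({
    "name": "read_file",
    "parameters": {"path": {"type": "string"}},
})

# Explicit XML-style schema: rigid open/close tags give the model
# fixed structural anchors to reproduce in its tool calls.
xml_style = """\
<tool name="read_file">
  <param name="path" type="string" required="true"/>
</tool>"""

# Both describe the same tool; only the framing the model sees differs.
assert json.loads(json_style)["name"] == "read_file"
assert ET.fromstring(xml_style).get("name") == "read_file"
```

Either way, the cheap experiment is to swap only the schema formatting in the system prompt and compare tool-call failure rates on the same task.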
Voxandr@reddit (OP)
Any pointers on it?
RedParaglider@reddit
Nobody is beating Qwen 3 Coder Next 80B on the desktop for what it does. And if I'm honest, I can't believe Qwen released it at all. Coding is one thing these companies don't want people doing on their own; they want that sweet enterprise cash.
Voxandr@reddit (OP)
So they are really keeping it gated?? Any news source?
JohnMason6504@reddit
MOE routing is the bottleneck for agentic tasks. The model needs to pick the right expert on every token, and tool-use prompts are out of distribution for most training mixes. Total params matter less than how well the router was trained on structured output.
Finanzamt_Endgegner@reddit
Qwen 3 Coder Next is 80B and this is 26B, lol. Also, it's probably still broken in your inference engine.
Voxandr@reddit (OP)
Both are MoE.
Finanzamt_Endgegner@reddit
Sure, but it has 3x the total parameters; that's gonna help a LOT.
Simple-Worldliness33@reddit
What quant are you using? I didn't have this kind of issue much with llama.cpp (after fixing the template and VRAM). Sometimes it also happens with qwen3.5. I'm using mostly Q4 or Q6 depending on the context.
Voxandr@reddit (OP)
Bartowski Q8.
StardockEngineer@reddit
Never in Next? You just got used to it later in its existence, because it was brutal for quite a while.
Voxandr@reddit (OP)
i see , i started using it recently ( 3 weeks ago)
StardockEngineer@reddit
Yeah, you skipped all the pain and complaints. It used to miserably fail at tool calls until big patches were pushed to llama.cpp.