Gemma 4 MoE is very bad at agentic coding. It couldn't do things Cline + Qwen can do.
Posted by Voxandr@reddit | LocalLLaMA | View on Reddit | 29 comments
Qwen 3 Coder Next never has these problems.

Gemma 4 is failing hard
Dangerous-Truth5113@reddit
Looks more like an agent failure. I'm using Zed and Gemma 4 does pretty well vs. Qwen3.5-9B
Voxandr@reddit (OP)
Looks like it's fixed now, gonna try.
Pattinathar@reddit
Running Gemma 4 26B-A4B Q4_K_M on 32GB RAM. gpu_layers=0 is mandatory; with 4GB VRAM it crashed with even 1 layer offloaded. MoE expert layers are too large for consumer GPUs. CPU-only gives ~5 min responses, but quality is solid.
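For reference, a CPU-only launch with llama-server might look like this (flag names per llama.cpp's CLI docs; the model filename here is a hypothetical placeholder for whatever GGUF you downloaded):

```shell
# -ngl 0 / --n-gpu-layers 0 keeps every layer on the CPU, avoiding the
# VRAM crash seen when even one MoE expert layer lands on a 4GB GPU.
llama-server \
  -m ./gemma-4-26b-a4b-q4_k_m.gguf \
  --n-gpu-layers 0 \
  --ctx-size 8192 \
  --threads 8
```

Trading the GPU for system RAM is what makes responses slow, but it sidesteps the expert-layer size problem entirely.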
benevbright@reddit
Same here. Tested with a coding agent yesterday on the latest LM Studio and the result was very, very disappointing. Still, qwen3-coder-next is the best... (on my 64GB Mac Studio)
Voxandr@reddit (OP)
This was merged today, let's see how it fares: https://github.com/ggml-org/llama.cpp/pull/21534
benevbright@reddit
thx for the info!
Deep_Ad1959@reddit
agentic coding is one of the hardest benchmarks for any model because it requires sustained tool-use over many turns without losing context. i've been working on desktop automation agents and the gap between models that can reliably chain 10+ tool calls vs ones that fall apart after 3 is massive. it's not just about raw intelligence, it's about how well the model was trained on the tool-use loop specifically.
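The loop described above can be sketched as a toy harness (this is an illustration, not any specific framework; `call_model` and the tool registry are hypothetical stand-ins):

```python
import json

# Hypothetical tool registry: the agent only progresses if the model
# emits well-formed tool calls turn after turn.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_tests": lambda: "3 passed, 0 failed",
}

def run_agent(call_model, task, max_turns=10):
    """Drive a model through repeated tool calls until it answers.

    call_model(history) returns a dict like {"tool": name, "args": {...}}
    or {"answer": text}. Each turn's result is appended to history, so a
    single malformed or misdirected call pollutes every later turn --
    which is why chains of 10+ calls are so much harder than 3.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(history)
        if "answer" in reply:
            return reply["answer"]
        tool = TOOLS.get(reply.get("tool"))
        if tool is None:
            history.append({"role": "tool", "content": "error: unknown tool"})
            continue
        result = tool(**reply.get("args", {}))
        history.append({"role": "tool", "content": json.dumps({"result": result})})
    return None  # model lost the thread before finishing
```

The point of the sketch: there is no recovery path except the model reading its own error feedback, so sustained tool-use discipline matters more than single-turn smarts.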
Neo2066@reddit
Hey man, sincerely thank you for this. I was going in circles trying to automate a browser, thank you for sharing.
Voxandr@reddit (OP)
Looks like that's why the Coder models shine.
NNN_Throwaway2@reddit
Pretty sure llama.cpp is still broken. There was just a new release, so maybe it finally works.
neverbyte@reddit
llama.cpp is still broken. I'm not sure why more people aren't talking about it. Doesn't matter if you download a release or git pull the latest; it still has some kind of tokenizer problem, IMO. Your agent will have an existential crisis trying to make sense of what is wrong and will fail tool calls. I am 100% confident Gemma 4 will be an amazing agent once proper fixes merge into llama.cpp.
Voxandr@reddit (OP)
version: 8665 (b8635075f)
Voxandr@reddit (OP)
Let me check what llama.cpp version I am using (using the latest docker pull).
llama-impersonator@reddit
I use the interleaved chat template (models/templates/google-gemma-4-31B-it-interleaved.jinja) and the 31B is working quite well after b8665's updated parser.
Voxandr@reddit (OP)
Gonna check, but 31B is too slow on Strix Halo.
llama-impersonator@reddit
it's pretty slow even on 3090s.
JohnMason6504@reddit
MoE models need different prompting for agentic workloads. The routing layer decides which experts activate per token, and tool-call JSON can land on suboptimal expert paths if your system prompt is not structured right. Try explicit XML-style tool schemas instead of free-form JSON. Qwen3 dense models avoid this because every param sees every token. Not a model quality issue, it is a routing architecture issue.
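To illustrate the suggestion above, here is a minimal side-by-side of the two prompt styles (the schema text is hypothetical, and whether XML framing actually helps expert routing is this commenter's claim, not an established fact):

```python
import json
import xml.etree.ElementTree as ET

# Free-form JSON schema dropped into the system prompt.
json_style = json.dumps({
    "name": "read_file",
    "parameters": {"path": {"type": "string"}},
})

# Explicit XML-style schema: rigid open/close tags give the model
# fixed structural anchors to reproduce in its tool calls.
xml_style = """\
<tool name="read_file">
  <param name="path" type="string" required="true"/>
</tool>"""

# Both describe the same tool; only the framing the model sees differs.
assert json.loads(json_style)["name"] == "read_file"
assert ET.fromstring(xml_style).get("name") == "read_file"
```

Either way, the cheap experiment is to swap only the schema formatting in the system prompt and compare tool-call failure rates on the same task.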
Voxandr@reddit (OP)
Any pointers on it?
RedParaglider@reddit
Nobody is beating Qwen 3 Coder Next 80B on the desktop for what it does. And if I'm honest, I can't believe Qwen released it at all. Coding is one thing these companies don't want people doing on their own; they want that sweet enterprise cash.
Voxandr@reddit (OP)
So they are really keeping it gated?? Any news source?
JohnMason6504@reddit
MOE routing is the bottleneck for agentic tasks. The model needs to pick the right expert on every token, and tool-use prompts are out of distribution for most training mixes. Total params matter less than how well the router was trained on structured output.
Finanzamt_Endgegner@reddit
Qwen 3 Coder Next is 80B and this is 26B, lol. Also, it's probably still broken in your inference engine.
Voxandr@reddit (OP)
Both are MoE.
Finanzamt_Endgegner@reddit
Sure, but it has 3x the total parameters; that's gonna help a LOT.
Simple-Worldliness33@reddit
What quant are you using? I didn't have this kind of issue much with llama.cpp (after fixing the template and VRAM). Sometimes it also happens with qwen3.5. I'm using mostly Q4 or Q6 depending on the context.
Voxandr@reddit (OP)
Bartowski Q8.
StardockEngineer@reddit
Never in Next? You just got used to it later in its existence, because it was brutal for quite a while.
Voxandr@reddit (OP)
i see , i started using it recently ( 3 weeks ago)
StardockEngineer@reddit
Yeah, you skipped all the pain and complaints. It used to miserably fail at tool calls until big patches were pushed to llama.cpp.