Is Gemma 4 26B-A4B worse than Qwen 3.5 35B-A3B with tool calls, even after all the fixes?
Posted by Borkato@reddit | LocalLLaMA | View on Reddit | 29 comments
I’m trying it on my home grown tool call setup with llama.cpp and it’s just NOT working. Like it makes the DUMBEST mistakes.
I got the official template from google, I updated cuda to 13.1 (NOT 13.2 which apparently has issues), I’m not quantizing the cache, I’m running it with Q4, I tried bartowski, unsloth, and a heretic version… like what the hell.
It does things like call tools that don’t exist even though my wrapper clearly tells it what tools exist.
I’m super disappointed because I love its personality so much more than qwen’s. Please someone help!
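The "calls tools that don't exist" failure described above can be caught in the wrapper itself: validate every tool call against the declared tool list before executing anything, and feed an error back to the model instead of crashing. A minimal sketch (tool names and schemas here are illustrative, not from the OP's setup; the `tool_calls` shape follows the OpenAI-compatible format llama.cpp's server emits):

```python
import json

# Declared tool schemas in OpenAI-compatible format (names are illustrative).
TOOLS = [
    {"type": "function",
     "function": {"name": "read_file",
                  "description": "Read a text file from disk",
                  "parameters": {"type": "object",
                                 "properties": {"path": {"type": "string"}},
                                 "required": ["path"]}}},
]
VALID_NAMES = {t["function"]["name"] for t in TOOLS}

def validate_tool_call(call: dict) -> tuple[bool, str]:
    """Return (ok, message) for one tool_call entry from a chat response."""
    name = call.get("function", {}).get("name", "")
    if name not in VALID_NAMES:
        # Return this as the tool result so the model can self-correct.
        return False, f"Unknown tool '{name}'. Available: {sorted(VALID_NAMES)}"
    try:
        json.loads(call["function"].get("arguments", "{}"))
    except json.JSONDecodeError:
        return False, f"Arguments for '{name}' are not valid JSON."
    return True, "ok"

# A hallucinated tool name, as described in the post:
ok, msg = validate_tool_call(
    {"function": {"name": "browse_web", "arguments": "{}"}})
print(ok, msg)
```

Sending the rejection message back as the tool result often lets the model recover on the next turn rather than looping.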
indigos661@reddit
The same. Gemma 4 26B-A4B at Q6 is at Qwen3.5 8B Q4 level for tool calling and multimodal in my use cases.
Don't know why people on X say it has better generalization than Qwen when it can't even follow a tool's JSON schema.
Sadman782@reddit
It is better, unfortunately the launch has so many bugs. I don't know how you're running it, but with the latest llama.cpp build it works flawlessly; even an IQ4 quant is better than full-precision Qwen for any coding task. Multimodal has a separate issue: you have to manually set more tokens for vision, otherwise it will perform worse. And if you're using LM Studio you need a custom chat template until they fix it. Try this template for LM Studio: https://pastebin.com/raw/qc1FTAcG
Borkato@reddit (OP)
Gemma never gets into any tool calling loops for you? Even after upgrading the template, it does for me :/
Sadman782@reddit
I use llama.cpp and also use it as an agent. I made a custom template for llama.cpp, since the default one had some issues.
Some tips from my experience for you: https://www.reddit.com/r/LocalLLaMA/s/hJxaIb2Ha5
Borkato@reddit (OP)
Thank you! Will try it. Do you use lm studio as the front end only?
mr_Owner@reddit
Gemma 4 is too new imho, give it more time i guess?
However, I find Qwen3.5 9B is also very good at codebase searching and getting accurate data. Could you compare that to the 35B and 27B as well?
In my experience the (many) Qwen3.5 35B-A3B variants somehow had more tool-call failures than the 9B, which surprised me a lot: the 9B had 0 errors after 10+ tool calls as a subagent via the Cline and Kilo Code VS Code extensions (even at a lower quant; I've been using bartowski IQ4_XS for efficiency).
I found this surprising for my usage; perhaps your testing could provide some insights 😁
Borkato@reddit (OP)
Interesting, but the 9B is also slower than the 35B :(
Fluffywings@reddit
9B is a dense model where all 9B are active while the 35B only has 3B active. The 9B is considered smarter but less knowledgeable.
mr_Owner@reddit
Depends on your llama.cpp params: ubatch 1024, 100k ctx, KV cache at q4, with the q5_k_s bartowski or unsloth quant, is about 9+ GB VRAM.
Ubatch 512 and ctx 51200 is below 8 GB VRAM.
Both on my RTX 4070S 12GB give 55-65 t/s generation and 3500-3750 t/s prompt processing.
That's really fast imho.
With MoE offloading you're limited by your PCIe and RAM bandwidth.
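The VRAM figures above are driven mostly by KV-cache size, which you can ballpark from context length and cache quantization. A rough estimator, using hypothetical model dimensions (NOT Gemma 4's real config) and ~0.5 bytes/element as an approximation for a q4-style cache:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: float) -> float:
    """Rough KV-cache size: keys + values for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

# Hypothetical dimensions for illustration only -- check the model's
# config.json for the real layer count, KV heads, and head dim.
layers, kv_heads, head_dim = 48, 8, 128
for ctx, label in [(102_400, "100k ctx"), (51_200, "50k ctx")]:
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx, 0.5) / 2**30
    print(f"{label}: ~{gib:.1f} GiB KV cache")
```

Halving the context (or the cache precision) halves this term, which is why dropping from 100k to ~50k ctx frees over a gigabyte; the model weights themselves are a fixed cost on top.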
cviperr33@reddit
temperature and the other sampler settings matter a lot, try playing around with them and the system prompt
Borkato@reddit (OP)
Good idea!!
nicksterling@reddit
I’ve been playing with a pi.dev harness with some custom tools and Gemma 4 26B and 31B have been doing great with them. Tool calling has been incredibly stable. What harness are you running?
AppealSame4367@reddit
I tried out local e4b in pi yesterday and was surprised how well and fast it worked at 30-60k context. 26b and 31b should be fine.
Borkato@reddit (OP)
Hmm, how does qwen 3.5 35B compare for you? I’m doing just python functions, so maybe I messed something up somewhere
nicksterling@reddit
It’s mediocre honestly. People here claim to have a ton of success with Qwen, but for my use cases it doesn’t do a great job. It doesn’t matter how I run it (llama.cpp, MLX, vLLM), even at q8 quants it has issues. I have some specific summarization pipelines, and Qwen routinely gets details wrong or adds incorrect elements that were not in the original data.
Gemma 4 isn’t a great coding model, but it’s doing great at the summarization tasks I’m throwing at it.
Borkato@reddit (OP)
Interesting! I love Qwen 35B. I think you’re right, it’s like… coding: Qwen; summarization, natural language, and non-coding tasks: Gemma; RP: other models. Personally anyway haha
Euphoric_Emotion5397@reddit
yup. Even for the dense models, it's the same. I used my app to test the output from the two models (settings tuned to those recommended by Qwen and Gemma).
Then I pasted the output into Claude for analysis.
This is the result. It happens for the MoE models as well. I agree with the analysis, especially on tool calling and instruction following. I am using Gemma 4 after the template fixes.
Interesting benchmark result. So Option A is Qwen 3.5 and Option B is Gemma 4?
If that's the case, a few observations on what this test reveals:
Where Qwen 3.5 demonstrably outperformed:
What makes this a meaningful finding:
Dense model comparisons at similar parameter counts usually show Gemma 4 competitive or ahead on reasoning benchmarks. But benchmarks rarely test sustained instruction following across a complex multi-requirement system prompt with live tool calls. This is a more realistic production test.
The implication: For agentic workflows with rich system prompts, Qwen 3.5 may be the stronger practical choice over Gemma 4 despite comparable raw capability scores.
Are you running this across other tasks to see if the gap is consistent?
Borkato@reddit (OP)
This is exactly my experience. :( I just want it to work so badly lol. I keep thinking that if I just tweak it a bit, it’ll be amazing. Ah man.
pedronasser_@reddit
For me, until now, yes. But that's my experience using it on LM Studio.
NewAmphibian3488@reddit
What harness are you using? Did you build your own? My simple coding agent performs tool calling flawlessly with Gemma 4 26B-A4B UD_Q4_K_M (latest). It's just vibe-coded in Go using an OpenAI-compatible API via the llama.cpp server. Did you test it with a simple Python script or just curl?
Borkato@reddit (OP)
Yes! I’m doing the same thing, with Python and an OpenAI-compatible endpoint! I’m also trying editing stuff in opencode just to be sure… it might be working. I think the previous week of messed-up templates just made me super wary.
How smart is it? Does it make mistakes, or like what?
NewAmphibian3488@reddit
I've done some quick tests with multiple complex tool calls in a single session three times now and no mistakes. I also wrote a simple Python tool-calling checker using just urllib and the OpenAI API, and it works fine too. In my daily workflow (exploring codebases and editing simple scripts), I haven't seen any tool-calling issues. The latest interleaved-template fix seems to be working well.
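A checker along the lines described above can be done in stdlib Python only. This is a sketch, not the commenter's actual script: the endpoint URL and the `get_time` tool are assumptions, and the request/response shapes follow the OpenAI-compatible chat-completions format that llama.cpp's server exposes:

```python
import json
import urllib.request

# Assumed local endpoint; adjust to wherever your llama.cpp server listens.
URL = "http://127.0.0.1:8080/v1/chat/completions"

def build_request(prompt: str) -> dict:
    """OpenAI-style chat request declaring one illustrative tool."""
    return {
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{"type": "function",
                   "function": {"name": "get_time",
                                "description": "Return the current time",
                                "parameters": {"type": "object",
                                               "properties": {}}}}],
    }

def has_valid_tool_call(resp: dict) -> bool:
    """True if the model produced a tool call with the declared name."""
    msg = resp["choices"][0]["message"]
    calls = msg.get("tool_calls") or []
    return any(c["function"]["name"] == "get_time" for c in calls)

def run_check() -> bool:
    """Fire one request at the server (requires it to be running)."""
    body = json.dumps(build_request("What time is it? Use a tool.")).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return has_valid_tool_call(json.load(r))
```

Running `run_check()` a handful of times gives a quick pass/fail signal on whether a given quant and template emit well-formed tool calls at all, before debugging a full agent harness.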
Borkato@reddit (OP)
Interesting, thank you. Have you tried qwen 35B by chance?
ttkciar@reddit
That is concerning. Does the 31B dense exhibit the same trouble?
Borkato@reddit (OP)
It’s too slow to even tell 💀
Borkato@reddit (OP)
I’ll try it right now. I’ve been ignoring it because I don’t like its slowness lol
Betadoggo_@reddit
I think q4 might just be too small for this model. I've found the q5_k_m-level quants to be more stable. Also make sure that your parameters match, or are at least close to, the recommendations: a top-k of 64 and temp 1 are quite different from what a lot of other models use.
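The two values cited above (top-k 64, temperature 1) can be attached per request on an OpenAI-compatible endpoint; a small sketch, noting that `top_k` is a llama.cpp server extension rather than part of the official OpenAI spec:

```python
def with_gemma_samplers(payload: dict) -> dict:
    """Return a copy of an OpenAI-style request body with the sampler
    settings recommended in the comment above (top-k 64, temperature 1.0).
    `top_k` is a llama.cpp server extension, not standard OpenAI."""
    out = dict(payload)  # shallow copy so the caller's dict isn't mutated
    out.update({"temperature": 1.0, "top_k": 64})
    return out

base = {"model": "gemma", "messages": [{"role": "user", "content": "hi"}]}
print(with_gemma_samplers(base))
```

Setting these in the request body overrides whatever defaults the server (or a frontend like LM Studio) applies, which is easy to forget when switching between models with very different recommended samplers.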
Borkato@reddit (OP)
Oh I just checked, I’m actually running the 26B-A4B at Q5! Let me check the parameters