Comparing Qwen3.5 27B vs Gemma 4 31B for agentic stuff
Posted by takoulseum@reddit | LocalLLaMA | 27 comments
Models compared:
- Qwen3.5-27B-UD-Q5_K_XL
- gemma-4-31B-it-UD-Q5_K_XL
Main flags for both:
--flash-attn on \
--n-gpu-layers 99 \
--no-mmap \
-c 150000 \
--temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \
--ctx-checkpoints 1 \
--jinja \
-np 1 \
--reasoning on \
--mmproj 'mmproj-BF16.gguf' \
--image-min-tokens 300 --image-max-tokens 512
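For reference, the flags above assemble into a single `llama-server` invocation along these lines (a sketch; the model path is a placeholder inferred from the model names in the post):

```shell
# Hypothetical assembled command from the flags listed above.
# Swap -m for the Gemma GGUF to run the second model with identical settings.
llama-server \
  -m Qwen3.5-27B-UD-Q5_K_XL.gguf \
  --mmproj 'mmproj-BF16.gguf' \
  --flash-attn on \
  --n-gpu-layers 99 \
  --no-mmap \
  -c 150000 \
  --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \
  --ctx-checkpoints 1 \
  --jinja \
  -np 1 \
  --reasoning on \
  --image-min-tokens 300 --image-max-tokens 512
```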
I know these settings may not be the best and I still need more experiments (thank you u/Sadman782), but I find these tests fun and interesting.
| Model | Observations |
|---|---|
| Qwen3.5-27B-UD-Q5_K_XL | More steps: checks env vars and corrects its failures to fully address the request, so the final result is good (in the example, the Telegram message is perfect). Sometimes creates a Python script instead of bash only. |
| gemma-4-31B-it-UD-Q5_K_XL | More direct (smarter at finding URLs) but may miss the final goal (in this example the Telegram message was truncated). |
Please let me know if you need more tests.
dsartori@reddit
In limited testing I find the Google model really falls apart on long context in a way that Qwen3.5 doesn't. Still good, still worth using for its speed, but I need to switch to a larger, smarter model to clean up its mess from time to time. Which doesn't happen with Qwen3.5-35B.
BigYoSpeck@reddit
Which quant, are you quantising KV cache, and how large a context are we talking?
I'm running Q6_K_L with no quantisation on the KV cache. I've found it still performs well up to 120k context so far, iterating through minor fixes and improvements on a task.
dsartori@reddit
Q8, no quantization of the KV cache. Tends to fall apart around 180-200k.
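Since KV-cache quantisation comes up here: in llama.cpp it's controlled with the `--cache-type-k` / `--cache-type-v` flags (a sketch; both commenters above are running the default f16 cache, and `model.gguf` is a placeholder):

```shell
# Example of quantising the KV cache to roughly halve its memory footprint.
# q8_0 is near-lossless; quantising the V cache requires flash attention.
llama-server \
  -m model.gguf \
  -c 120000 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Whether a quantised cache degrades long-context coherence is exactly the kind of thing these tests could surface.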
d4nger_n00dle@reddit
Do you mean 27B?
dsartori@reddit
Yes that's right.
Kodix@reddit
Make sure you're using the up-to-date Gemma download with the fixed jinja template.
If you do run more tests, try Gemma with 1.5 temperature, folk-knowledge seems to imply it performs better there.
I'm really quite curious on whether that's *actually* the case.
(Also also: the unsloth recommended top-k for gemma is 60. For what that's worth).
seamonn@reddit
Gemma 4 31b support in llama.cpp still feels weak. I think they are still working out the kinks.
Meanwhile, it's much more stable on Ollama for some reason.
Usually, it's the opposite.
Give it a go: right now, Gemma 4 31b works better on Ollama.
I have a use case where it makes a huge call: think 50-60 function calls plus 128k+ tokens in a single chat. Ollama's Gemma 4 31b handled it like a champ; llama.cpp gave up after a few calls.
rm-rf-rm@reddit
I won't touch Ollama with a 10ft pole again.
AvidCyclist250@reddit
I agree. llama.cpp plus Gemma web MCP is a clusterfuck.
takoulseum@reddit (OP)
That’s interesting. Even though I won't install Ollama, it may be worth finding out why there's such a difference. FYI, I see the same results with vLLM and AWQ quants.
AvidCyclist250@reddit
Gemma is smarter for me, but web MCP is a total no-go. So unfortunately, I must resort to Qwen.
Adorable_Arrival_666@reddit
Been using Qwen3.5:27b (and trying Qwopus3.5:27b) and having good results
adel_b@reddit
In my tests (Gemma vs Qwen, though different models) for agentic work, Gemma does better just because of its tool-calling capability.
I made it use Chrome MCP to perform specific tasks online. In one case, the website's search bar only appeared after clicking a search icon, and nothing indicated it was a search button; Gemma figured it out, Qwen did not.
misha1350@reddit
Gemma 4 does not, in fact, work as well with tool calling. Its use cases are translating moon runes and some coding, but Qwen3.5 shines in most other cases.
misha1350@reddit
Consider throwing Qwen3.5 9B into the mix with the same quantisation as well. It would be able to fit basically anywhere and run on a base model M4 Mac Mini, so I wonder how well it would stack up against both of these models and if the final result would be up to par.
vocAiInc@reddit
The self-correction behavior in Qwen3.5 is actually really useful for agentic tasks where you can afford the extra steps -- a model that catches its own mistake before finishing beats a faster model that misses the final goal, especially when you're not there to review mid-execution. The Gemma truncation issue sounds like a context budget problem more than a reasoning one, though. What context length are you running these at, and does the truncation happen consistently or only on longer chains?
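The self-correction pattern being discussed can be sketched as a generic verify-then-retry loop. This is a hypothetical illustration, not the models' actual agent scaffolding: `run_step` and `verify` are stand-in helpers (a real setup would call an OpenAI-compatible endpoint and a tool/output checker):

```python
def self_correcting_run(task, run_step, verify, max_attempts=3):
    """Run a task, check the result, and retry with error feedback.

    run_step(prompt) -> result string (e.g. a model or tool call)
    verify(result)   -> (ok: bool, feedback: str)
    """
    prompt = task
    result = None
    for attempt in range(1, max_attempts + 1):
        result = run_step(prompt)
        ok, feedback = verify(result)
        if ok:
            return result
        # Feed the failure back so the next attempt can correct itself,
        # mirroring the "check env vars, fix, retry" behavior described above.
        prompt = f"{task}\nPrevious attempt failed: {feedback}\nFix and retry."
    return result  # best effort after max_attempts
```

The trade-off is exactly what the thread observes: more steps and more tokens, but the final artifact (the Telegram message) actually matches the goal.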
Speedz007@reddit
You're comparing MoE to Dense. Apples to oranges.
TimWardle@reddit
Both dense
Speedz007@reddit
My bad! Should hit the sack.
Aggressive-Permit317@reddit
Solid test man! Agentic stuff is where these models actually show their personality. Qwen going the extra steps, checking env vars, and self-correcting to actually finish the job (that Telegram message being perfect) is exactly why I’ve been leaning on it lately too. Gemma feels quicker on the initial “find the URL” part but yeah, it sometimes just… stops before the goal line.
Have you thrown either one at a longer multi-step chain yet (like full research → script → execute → report back)? Or is the 150k context + those flags already pushing the limit?
takoulseum@reddit (OP)
Thanks! I'll take your suggestion and do some longer tests, adding complexity gradually. As for the context, I just picked an arbitrary window for now.
Aggressive-Permit317@reddit
Sweet, thanks for the update! Starting gradual with the random window is definitely the smart move. Curious how it holds up once you layer in the full research → script → execute chain. I’ve had Qwen pull through on some longer 80k+ context stuff before but Gemma 4 felt a bit more consistent on the shorter agentic runs. Let me know how the longer tests go, always hunting for real-world data points on this stuff.
GroundbreakingMall54@reddit
Honestly, in my experience they trade blows depending on the task. Qwen3.5 is better at following complex multi-step instructions, but Gemma 4 handles tool calling more reliably. Neither one is a clear winner, which is kind of the point lol.
takoulseum@reddit (OP)
Yes, that's my general feeling right now: there is no clear winner, and it all depends on the task. But I'll need more complex tasks to really judge.
Overall-Somewhere760@reddit
Wondering about `--ctx-checkpoints 1`. What's that doing specifically? I saw something like "checkpoint 1 of 32 ... 5 of 32" in the journalctl logs, but I can't figure out what it's for.
Just_Maintenance@reddit
Without that it uses hundreds of gigabytes of system memory in checkpoints, and will get OOM killed if you don’t have enough memory.
takoulseum@reddit (OP)
Answer from u/Sadman782
https://www.reddit.com/r/LocalLLaMA/comments/1sieiqu/comment/ofjjdh8/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button