Comparing Qwen3.5 27B vs Gemma 4 31B for agentic stuff
Posted by takoulseum@reddit | LocalLLaMA | 27 comments
Models compared:
- Qwen3.5-27B-UD-Q5_K_XL
- gemma-4-31B-it-UD-Q5_K_XL
Main flags for both:
--flash-attn on \
--n-gpu-layers 99 \
--no-mmap \
-c 150000 \
--temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \
--ctx-checkpoints 1 \
--jinja \
-np 1 \
--reasoning on \
--mmproj 'mmproj-BF16.gguf' \
--image-min-tokens 300 --image-max-tokens 512
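For reference, the flags above assemble into a single `llama-server` invocation along these lines (a sketch; the model path is a placeholder inferred from the model names in the post):

```shell
# Hypothetical assembled command from the flags listed above.
# Swap -m for the Gemma GGUF to run the second model with identical settings.
llama-server \
  -m Qwen3.5-27B-UD-Q5_K_XL.gguf \
  --mmproj 'mmproj-BF16.gguf' \
  --flash-attn on \
  --n-gpu-layers 99 \
  --no-mmap \
  -c 150000 \
  --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \
  --ctx-checkpoints 1 \
  --jinja \
  -np 1 \
  --reasoning on \
  --image-min-tokens 300 --image-max-tokens 512
```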
I know these settings may not be the best and I still need more experiments (thank you u/Sadman782), but I find these tests fun and interesting.
| Model | Observations |
|---|---|
| Qwen3.5-27B-UD-Q5_K_XL | More steps: checks env vars and corrects its failures to fully address the request, so the final result is good (in the example, the Telegram message is perfect). Sometimes creates a Python script instead of bash only. |
| gemma-4-31B-it-UD-Q5_K_XL | More direct (smarter at finding URLs) but may miss the final goal (in this example the Telegram message was truncated). |
Please let me know if you need more tests.
dsartori@reddit
In limited testing I find the Google model really falls apart on long context in a way that Qwen3.5 doesn't. Still good, still worth using for its speed, but I need to switch to a larger, smarter model to clean up its mess from time to time. Which doesn't happen with Qwen3.5-35B.
BigYoSpeck@reddit
Which quant, are you quantising KV cache, and how large a context are we talking?
I'm running Q6_K_L with no quantisation on the KV cache. I've found it still performs well up to 120k context so far, iterating through minor fixes and improvements on a task.
dsartori@reddit
Q8, no quantization of the KV cache. Tends to fall apart around 180-200k.
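Since KV-cache quantisation comes up here: in llama.cpp it's controlled with the `--cache-type-k` / `--cache-type-v` flags (a sketch; both commenters above are running the default f16 cache, and `model.gguf` is a placeholder):

```shell
# Example of quantising the KV cache to roughly halve its memory footprint.
# q8_0 is near-lossless; quantising the V cache requires flash attention.
llama-server \
  -m model.gguf \
  -c 120000 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Whether a quantised cache degrades long-context coherence is exactly the kind of thing these tests could surface.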
d4nger_n00dle@reddit
Do you mean 27B?
dsartori@reddit
Yes that's right.
Kodix@reddit
Make sure you're using the up-to-date Gemma download with the fixed jinja template.
If you do run more tests, try Gemma with 1.5 temperature, folk-knowledge seems to imply it performs better there.
I'm really quite curious on whether that's *actually* the case.
(Also also: the unsloth recommended top-k for gemma is 60. For what that's worth).
seamonn@reddit
Gemma 4 31b support in llama.cpp still feels weak. I think they are still working out the kinks.
Meanwhile, it's much more stable on Ollama for some reason.
Usually, it's the opposite.
Give it a go: right now, Gemma 4 31b works better on Ollama.
I have a use case where it makes a huge call: think 50-60 function calls plus 128k+ tokens in a single chat. Ollama's Gemma 4 31b handled it like a champ; llama.cpp gave up after a few calls.
rm-rf-rm@reddit
I won't touch Ollama with a 10ft pole again.
AvidCyclist250@reddit
I agree. llama.cpp plus Gemma web MCP is a clusterfuck.
takoulseum@reddit (OP)
That’s interesting. Even though I won't install Ollama, it may be worth finding out why there's such a difference. FYI, I see the same results with vLLM and AWQ quants.
AvidCyclist250@reddit
Gemma is smarter for me, but web MCP is a total no-go. So unfortunately, I must resort to Qwen.
Adorable_Arrival_666@reddit
Been using Qwen3.5:27b (and trying Qwopus3.5:27b) and having good results
adel_b@reddit
In my tests (Gemma vs Qwen, though different models) for agentic work, Gemma does better just because of its tool-calling capability.
I made it use Chrome MCP to perform specific tasks online. In one case, the website's search bar only appeared after clicking a search icon, and nothing indicated it was a search button; Gemma figured it out, Qwen did not.
misha1350@reddit
Gemma 4 does not, in fact, work as well with tool calling. Its use cases are translating moon runes and some coding, but Qwen3.5 shines in most other cases.
misha1350@reddit
Consider throwing Qwen3.5 9B into the mix with the same quantisation as well. It would be able to fit basically anywhere and run on a base model M4 Mac Mini, so I wonder how well it would stack up against both of these models and if the final result would be up to par.
vocAiInc@reddit
The self-correction behavior in Qwen3.5 is actually really useful for agentic tasks where you can afford the extra steps -- a model that catches its own mistake before finishing beats a faster model that misses the final goal, especially when you're not there to review mid-execution. The Gemma truncation issue sounds like a context budget problem more than a reasoning one, though. What context length are you running these at, and does the truncation happen consistently or only on longer chains?
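The self-correction pattern being discussed can be sketched as a generic verify-then-retry loop. This is a hypothetical illustration, not the models' actual agent scaffolding: `run_step` and `verify` are stand-in helpers (a real setup would call an OpenAI-compatible endpoint and a tool/output checker):

```python
def self_correcting_run(task, run_step, verify, max_attempts=3):
    """Run a task, check the result, and retry with error feedback.

    run_step(prompt) -> result string (e.g. a model or tool call)
    verify(result)   -> (ok: bool, feedback: str)
    """
    prompt = task
    result = None
    for attempt in range(1, max_attempts + 1):
        result = run_step(prompt)
        ok, feedback = verify(result)
        if ok:
            return result
        # Feed the failure back so the next attempt can correct itself,
        # mirroring the "check env vars, fix, retry" behavior described above.
        prompt = f"{task}\nPrevious attempt failed: {feedback}\nFix and retry."
    return result  # best effort after max_attempts
```

The trade-off is exactly what the thread observes: more steps and more tokens, but the final artifact (the Telegram message) actually matches the goal.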
Speedz007@reddit
You're comparing MoE to Dense. Apples to oranges.
TimWardle@reddit
Both dense
Speedz007@reddit
My bad! Should hit the sack.
Aggressive-Permit317@reddit
Solid test man! Agentic stuff is where these models actually show their personality. Qwen going the extra steps, checking env vars, and self-correcting to actually finish the job (that Telegram message being perfect) is exactly why I’ve been leaning on it lately too. Gemma feels quicker on the initial “find the URL” part but yeah, it sometimes just… stops before the goal line.
Have you thrown either one at a longer multi-step chain yet (like full research → script → execute → report back)? Or is the 150k context + those flags already pushing the limit?
takoulseum@reddit (OP)
Thanks! I'll take your suggestion and do some longer tests, adding complexity gradually. As for the context, I just picked an arbitrary window for now.
Aggressive-Permit317@reddit
Sweet, thanks for the update! Starting gradual with the random window is definitely the smart move. Curious how it holds up once you layer in the full research → script → execute chain. I’ve had Qwen pull through on some longer 80k+ context stuff before but Gemma 4 felt a bit more consistent on the shorter agentic runs. Let me know how the longer tests go, always hunting for real-world data points on this stuff.
GroundbreakingMall54@reddit
Honestly, in my experience they trade blows depending on the task. Qwen3.5 is better at following complex multi-step instructions, but Gemma 4 handles tool calling more reliably. Neither one is a clear winner, which is kind of the point lol.
takoulseum@reddit (OP)
Yes, that's my general feeling right now: there is no clear winner, and it all depends on the task. But I'll need more complex tasks to really judge.
Overall-Somewhere760@reddit
Wondering about `--ctx-checkpoints 1`. What's that doing specifically? I saw something like "checkpoint 1 of 32 ... 5 of 32" in the journalctl logs, but I can't figure out what it's for.
Just_Maintenance@reddit
Without that it uses hundreds of gigabytes of system memory in checkpoints, and will get OOM killed if you don’t have enough memory.
takoulseum@reddit (OP)
Answer from u/Sadman782
https://www.reddit.com/r/LocalLLaMA/comments/1sieiqu/comment/ofjjdh8/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button