Gemma 4 coding performance: do different harnesses give wildly different results?
Posted by jazir55@reddit | LocalLLaMA | View on Reddit | 7 comments
So the question I've seen posed many times in /r/singularity is whether the Gemini models are actually as bad at coding as people claim relative to their benchmarks, or whether the harness used makes an absolutely gigantic difference in model performance.
Given that Gemma 4 is also from Google, I'm wondering if anyone has benchmarked Gemma 4's coding performance across harnesses, with the harness being the only variable between tests.
I have to assume, based on just logic here, that Gemma 4 is going to have massive swings in performance depending on which harness is used (e.g. KiloCode vs. RooCode vs. OpenCode vs. Claude Code, etc.).
So my question to /r/localllama is: has that held up for you? Are there really wild variations in performance based purely on the harness wrapped around Gemma? If so, in your own tests, which harness has given the best results?
Further, assuming any of you have done those tests, how does Gemma 4 in the best harness compare to Qwen 3.6 in your evaluations?
JamesEvoAI@reddit
It's also worth mentioning that the Gemma 4 release involved a lot of patches to llama.cpp and the popular GGUFs (like those from unsloth), which were skewing people's initial perception of the generation quality.
That said, the harness definitely makes a difference, not just for Gemma but for every model. Check out Wolfbench for clear data on how much difference the harness can make:
https://wolfbench.ai/
o5mfiHTNsH748KVq@reddit
Regardless of model, your agent is only as good as your harness.
ttkciar@reddit
On one hand, Gemma 4 31B has been really good at codegen tasks (better than Qwen3.5-27B, but I haven't evaluated Qwen3.6 yet at all, so dunno about that).
On the other hand, Gemma 4 still has some tool-using problems, where it looks like it's about to infer a tool-call and then inference stops prematurely instead.
This is a lot better than it used to be; both Google and llama.cpp have issued bugfixes which mostly fix it, but it wouldn't surprise me if some applications trigger the failure mode more frequently than others.
When tool-using works correctly, Gemma 4 performs codegen tasks slightly better than Gemini 3.1 Pro, though it is hindered somewhat by its lower context limit (256K tokens vs 1M tokens). When tool-using does not work correctly, it can be pretty horrible.
Hopefully a more comprehensive Gemma 4 tool-using fix will arrive soon. Until then, it's a bit of a gamble.
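The truncated-tool-call failure mode described above is at least easy to detect from the harness side. A minimal sketch, assuming the harness expects JSON tool calls (the exact wire format varies by harness; `is_complete_tool_call` is a hypothetical helper, not any real harness's API):

```python
import json

def is_complete_tool_call(text: str) -> bool:
    """Check whether a model's tool-call output parses as complete JSON.

    Assumes a tool-call object shaped like
    {"name": "...", "arguments": {...}} -- formats differ per harness.
    """
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        # Inference stopped mid-call: the JSON never closed.
        return False
    return isinstance(call, dict) and "name" in call and "arguments" in call

# A complete call parses; a prematurely stopped one does not.
print(is_complete_tool_call('{"name": "read_file", "arguments": {"path": "main.py"}}'))  # True
print(is_complete_tool_call('{"name": "read_file", "arguments": {"pa'))  # False
```

A harness that spots the incomplete call can retry the generation instead of silently handing the user a broken turn, which may be part of why some harnesses hit this failure mode harder than others.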
tome571@reddit
This has been my experience as well. I think Gemini's harness/scaffolding makes it worse for certain tasks. Gemma 4 31B is super smart, and it's been genuinely impressive with some homegrown workflow around it.
ttkciar@reddit
The failure mode kind of acts like it's trying to infer a token which Google pruned from the model vocabulary, but that's speculation.
I've been wondering if fine-tuning the model on well-formed tool-use might help avoid the failure mode, but I can't personally prioritize trying it right now.
false79@reddit
Gemma 4 gives pretty decent results. 26B is fast but from time to time, I can catch it getting loopy.
Then I load 31B. It's slower but it gets it right.
These benchmarks, I honestly don't have any respect for them. They are not representative of the tasks that I do on a daily basis.
You really don't know the value of an LLM until you put it through the paces of your own workflow.
Badger-Purple@reddit
I mean, this makes sense to an extent: different harnesses have different tools and system prompts, which change the context the model decodes against, and that can make a model's response better or worse. But I would only say "to an extent," meaning the models can also be tested without a harness, and the results should scale accordingly.
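The point about system prompts and tools shaping what the model decodes against can be made concrete. A hypothetical sketch (the `build_prompt` function and tool schemas are illustrative only; real harnesses like Claude Code or RooCode each use their own prompt assembly and formatting):

```python
def build_prompt(system_prompt: str, tools: list[dict], user_task: str) -> str:
    """Assemble the context a harness actually sends to the model.

    Two harnesses given the same task and the same model still produce
    different contexts, which is one reason results diverge between them.
    """
    tool_lines = "\n".join(f"- {t['name']}: {t['description']}" for t in tools)
    return f"{system_prompt}\n\nAvailable tools:\n{tool_lines}\n\nTask: {user_task}"

# Same model, same task -- two harnesses, two different decode contexts.
harness_a = build_prompt(
    "You are a careful coding agent. Plan before editing.",
    [{"name": "read_file", "description": "Read a file from disk"}],
    "Fix the failing test",
)
harness_b = build_prompt(
    "Answer fast and concisely.",
    [{"name": "shell", "description": "Run a shell command"}],
    "Fix the failing test",
)
assert harness_a != harness_b
```

Testing the bare model (no harness) gives a baseline, but the divergence above is why per-harness numbers can still swing around that baseline.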