Gemma 4 coding performance, do different harnesses give wildly different results?

Posted by jazir55@reddit | LocalLLaMA | View on Reddit | 7 comments

So a question I've seen posed many times in /r/singularity is whether the Gemini models are actually as bad at coding as people say relative to their benchmarks, or whether the harness used makes an absolutely gigantic difference in model performance.

Given that Gemma 4 is also from Google, I'm wondering if anyone has benchmarked its coding performance across harnesses, with the harness being the only variable between tests.

I have to assume, based on logic alone, that Gemma 4 will show massive swings in performance depending on which harness is used (e.g., KiloCode vs. RooCode vs. OpenCode vs. Claude Code).
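For anyone wanting to run this kind of controlled comparison themselves, here's a minimal sketch of the idea: same model, same fixed task set, and the harness is the only thing that changes. Everything here is hypothetical scaffolding — `run_task` is a placeholder for however you'd actually drive each harness with Gemma 4 as the backend and check the task's tests.

```python
# Hypothetical sketch of a controlled harness comparison: same model,
# same task set, only the harness varies between runs.

HARNESSES = ["KiloCode", "RooCode", "OpenCode", "Claude Code"]
TASKS = ["fix-off-by-one", "add-unit-test", "refactor-io"]  # fixed task set


def run_task(harness: str, task: str) -> bool:
    """Placeholder: a real version would launch the harness with Gemma 4
    as the backend, run the task, and verify it against the task's tests."""
    # Deterministic dummy result so the sketch runs end to end.
    return (len(harness) + len(task)) % 2 == 0


def compare(harnesses: list[str], tasks: list[str]) -> dict[str, float]:
    """Return the pass rate per harness over the shared task set."""
    scores = {}
    for harness in harnesses:
        passed = sum(run_task(harness, t) for t in tasks)
        scores[harness] = passed / len(tasks)
    return scores


if __name__ == "__main__":
    for harness, rate in compare(HARNESSES, TASKS).items():
        print(f"{harness}: {rate:.0%}")
```

The point of the structure is just that any score difference between rows can only come from the harness, since the model and tasks are held constant.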

So my question to /r/localllama is: has that held up for you? Are there really wild variations in performance based purely on the structure the harness gives Gemma? If so, in your own tests, which harness has produced the best results?

Further, assuming any of you have run those tests, how does Gemma 4 in its best harness compare to Qwen 3.6 in your evaluations?