Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result: Slower Is Faster.
Posted by MiaBchDave@reddit | LocalLLaMA | 40 comments
Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article... and it did not disappoint. Lots of free info in the link. The juicy part is behind a paywall. I'll respect that, but the short of it is:
It shows that the Qwens are more benchmaxxed and that Gemma 4 31B is far more efficient with token use. So even though Gemma is a little slower at inference because of its size, you're getting things done much faster overall. This confirms my own experience, so I'm now really looking forward to DFlash in Gemma, MTP, and any other optimizations arriving soon.
ambient_temp_xeno@reddit
lol this was not surprising.
Sadman782@reddit
Yeah, the Qwen team optimizes for benchmarks. Other than a better default frontend (they are RLmaxxed for this), which Gemma 4 can achieve with a simple system prompt, I find them worse than Gemma at literally everything else: raw coding, translation, general knowledge, etc.
Johnwascn@reddit
My feeling is exactly the opposite: Gemma 4's instruction compliance is much worse than Qwen3.6's, especially for code writing.
CharacterAnimator490@reddit
Maybe I'm biased, but I feel the same.
rpkarma@reddit
My problem is that the 31B is notably slower than the 27B 3.6 on my Spark. Hoping the new MTP drafter helps.
BoobooSmash31337@reddit
With how Chinese culture can be, I really do wonder. I don't mean this in a racist way at all; I'm sure a ton of our best ML engineers are Chinese Americans.
illusionmist@reddit
Well, leaderboards are simply too easy to game, especially when you have lots of like-minded people/bots: LLM rankings, App Store rankings, Product Hunt rankings, and X/Reddit voting.
I've seen it happen too many times. I've also seen how many tutorials/groups/services they offer to 打榜 (boost rankings) for you. When called out, just repeat the words "racist", "Sinophobia", etc., and Westerners will scatter. Easy.
slower-is-faster@reddit
I knew it
rhythmdev@reddit
less-is-more
fatboy93@reddit
can confirm
ptear@reddit
Oh
MiaBchDave@reddit (OP)
Username confirms 😂
LORD_CMDR_INTERNET@reddit
Anecdotally, for coding, I find Qwen3.6 27B and Gemma4 31B trade blows. I will swap Plan/Act roles if either gets stuck and that seems to work quite well.
mortenmoulder@reddit
Did you try Qwen3 Next Coder? I find it pretty good, but haven't compared it to Gemma4 yet.
LORD_CMDR_INTERNET@reddit
I dunno, I could never get it to run faster than 5-10 tokens/sec on my 5090 using llama.cpp
ResidentPositive4122@reddit
Funnily enough, this was one finding from HF about a year ago: swapping randomly between Opus and GPT-5 in the same session led to better results than either of them separately.
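Roughly this setup, as a sketch (the endpoint and model names are placeholders, not what HF actually ran):

```python
import random
from openai import OpenAI

# Hypothetical OpenAI-compatible gateway that can serve both models.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODELS = ["opus", "gpt5"]  # placeholder names

messages = [{"role": "user", "content": "Refactor this function to be iterative."}]
for _ in range(4):  # a few turns in the same session
    model = random.choice(MODELS)  # swap models randomly per turn
    reply = client.chat.completions.create(model=model, messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    messages.append({"role": "user", "content": "Keep going."})
```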
Dany0@reddit
I keep thinking about that research showing that having a small LLM "optimise" typical prompts improved user feedback, except on prompts that were crafted by professionals.
But IIRC it wasn't coding.
Badger-Purple@reddit
And now Hermes and other agents have a skill where they do exactly this, and it improves with more input.
Dany0@reddit
link?
Newtonip@reddit
https://github.com/nousresearch/hermes-agent
Dany0@reddit
I keep thinking about it because there was that other paper showing that typos, small inconsistencies, and "misdirections" (basically gaslighting the model) reduced quality. Meanwhile, another paper showed that prompting in multiple languages improved quality, and that repeating the same prompt twice improved adherence.
So if we could inject a small finetuned LLM to fix up typos, translate random parts, and duplicate random parts before sending the prompt off... that would be a fun experiment to do.
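Something like this, as a rough sketch (the rewriter model name and instruction are made up, not from any of those papers):

```python
from openai import OpenAI

# Hypothetical local endpoint serving a small finetuned rewriter model.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

REWRITE_INSTRUCTION = (
    "Fix typos and small inconsistencies in the user's prompt. "
    "Do not change its meaning. Return only the cleaned prompt."
)

def preprocess(prompt: str) -> str:
    # The small LLM cleans the prompt before the big model ever sees it.
    cleaned = client.chat.completions.create(
        model="small-rewriter",  # placeholder model name
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
    # Duplicate the prompt, per the adherence finding mentioned above.
    return f"{cleaned}\n\nTo repeat the request: {cleaned}"
```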
SkyFeistyLlama8@reddit
The same thing happens with the MoE models too. Qwen 3.6 35B overthinks like crazy, spewing double or triple the thinking token count compared to Gemma 4 26B. When the results are close to each other, I'll stick with Gemma for a faster total reply time.
Teslaaforever@reddit
Qwen 35B works very well for me on Strix Halo; I used some parameters recommended for coding.
suprjami@reddit
This guy's conclusions from his own data are cooked.
He shows results where Qwen 3.5 does marginally better on THREE benchmarks out of twelve, one of them by the tiniest amount.
"Qwen3.6 performs behind Qwen3.5"
No it doesn't. Learn to read a fucking bar graph. Winning nine out of twelve comparisons is not "performing behind". It's the opposite.
He has made similar statements on earlier posts too. As far as I'm concerned, this whole blog is junk not worth reading.
DinoAmino@reddit
I also see he leads with "On benchmarks, it (Qwen 3.6) appears to be significantly stronger than Qwen3.5." But actually reading the bar graph shows only 3 benchmarks where it is significantly stronger, and 3.6 losing 5 of them. Geez, the more I look, the more I realize the author can't fucking read a bar graph himself.
iplaythisgame2@reddit
I have felt that Gemma 4 worked "wmoother" at tool calls and the various tasks I've used it for. I just have a hell of a time keeping it loaded; it crashes a lot on me. 2x 3090 and a 3060. If anyone wants to share a llama.cpp config that's solid, please do.
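For reference, here's roughly what I'm running now (split values and context size are guesses, not a known-good config):

```sh
# Offload all layers and split weights across the GPUs roughly by VRAM
# (3090, 3090, 3060). These values are examples, not a tested setup.
llama-server -m gemma-4-31B.i1-IQ4_XS.gguf \
  --n-gpu-layers 999 \
  --tensor-split 24,24,12 \
  --ctx-size 16384
```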
CharlesStross@reddit
"wmoother"
I know it was a typo but the quotation marks made me laugh lol
GrungeWerX@reddit
I use Qwen 3.6 27B with no thinking. Works great so far; I've never even needed to turn it on.
BigYoSpeck@reddit
They definitely have different strengths and weaknesses depending on the scope of the task.
Qwen waffles for sure with its thinking, and it genuinely needs its context-size efficiency, because it will happily reach 200k of context working on something that Gemma stays under 100k for.
But I find Qwen sticks to doing what it needs, viewing only the files relevant to the task. Gemma is currently on its 2nd pass through my entire codebase, because I'm fairly sure it forgot it had already read everything.
Paerrin@reddit
Agreed. Qwen 3.6 with thinking off has been more consistent with tool calling and getting the usage right. Gemma4 kills it on non-structured text and natural language text processing and summarization on my system.
Running both through Hermes and llama-swap as the frontend for llama.cpp.
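For anyone curious, the llama-swap side looks roughly like this (paths and flags are illustrative, not my exact config):

```yaml
# llama-swap config sketch; llama-swap substitutes ${PORT} into each cmd.
models:
  "qwen3.6-27b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen3.6-27B-IQ4_XS.gguf
      --n-gpu-layers 999
  "gemma4-31b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/gemma-4-31B-IQ4_XS.gguf
      --n-gpu-layers 999
```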
Raredisarray@reddit
You've been using Qwen 27B without thinking? How's your experience?
GrungeWerX@reddit
It's great, very solid. I've never even turned it on because I haven't needed to. Which is encouraging, since I'm sure thinking would make it even smarter. But right now it handles everything I throw at it.
Pristine-Woodpecker@reddit
Claims that Qwen is benchmaxxed don't hold up to real-world testing or SWE-Rebench.
Interpause@reddit
It can be both benchmaxxed and good to use; as they say, AGI is an LLM trained on every benchmark that could theoretically exist.
GovernmentTechnical@reddit
They have different types of attention, so they work well for different use cases.
BoobooSmash31337@reddit
Genuinely interesting point.
ea_man@reddit
* gemma-4-31B.i1-IQ4_XS.gguf is 16.7 GB
* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf is 14.7 GB
Also, Qwen takes less VRAM for the KV cache, so I'd say Gemma is not really a competitor in the dense space for those with 16 GB.
Chromix_@reddit
I'm using Qwen 3.6 27B over Gemma 4 31B for local coding. It might simply work better for me because Gemma 4 is way more sensitive to quantization than Qwen 3.6. So for Qwen I can use a smaller quant and a Q8 KV cache to get more context without much degradation.
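For reference, the relevant llama.cpp flags look roughly like this (context size is just an example, and flag syntax can vary between builds):

```sh
# Q8 KV cache to fit more context in VRAM; quantizing the V cache
# requires flash attention (bare --flash-attn on older builds).
llama-server -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 65536
```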
WouterC@reddit
Same here.
jacek2023@reddit
That confirms my experience. Once again, real usage beats benchmarks.