Dense Model Shoot-Off: Gemma 4 31B vs Qwen3.6/5 27B... Result: Slower Is Faster.
Posted by MiaBchDave@reddit | LocalLLaMA | 40 comments
Not affiliated with Kaitchup, but a fan of their testing. I was looking forward to this article... and it did not disappoint. Lots of free info in the link. The juicy part is behind a paywall. I'll respect that, but the short of it is:
It shows that the Qwens are more benchmaxxed and that Gemma 4 31B is far more efficient with token use. So even though Gemma is a little slower at inference because of its size, you're getting things done much faster overall. This confirms my own experience, so I'm now really looking forward to DFlash in Gemma, MTP, and any other optimizations arriving soon.
ambient_temp_xeno@reddit
lol this was not surprising.
Sadman782@reddit
Yeah, the Qwen team optimizes for benchmarks. Other than a better default frontend (they are RLmaxxed for this), which Gemma 4 can achieve with a simple system prompt, I find them worse than Gemma at literally everything else: raw coding, translation, general knowledge, etc.
Johnwascn@reddit
My feeling is exactly the opposite: Gemma 4's instruction compliance is much worse than Qwen3.6's, especially for code writing.
CharacterAnimator490@reddit
Maybe I'm biased, but I feel the same.
rpkarma@reddit
My problem is that the 31B is notably slower than the 27B 3.6 on my Spark. Hoping the new MTP drafter helps.
BoobooSmash31337@reddit
With how Chinese culture can be, I really do wonder. I don't mean this in a racist way at all; I'm sure a ton of our best ML engineers are Chinese Americans.
illusionmist@reddit
Well, leaderboards are simply too easy to game, especially when you have lots of like-minded people/bots: LLM rankings, App Store rankings, Product Hunt rankings, and X/Reddit voting.
I've seen it happen too many times. I've also seen how many tutorials/groups/services they offer to 打榜 (boost rankings) for you. When called out, just repeat the words "racist", "Sinophobia", etc., and Westerners will scatter. Easy.
slower-is-faster@reddit
I knew it
rhythmdev@reddit
less-is-more
fatboy93@reddit
can confirm
ptear@reddit
Oh
MiaBchDave@reddit (OP)
Username confirms 😂
LORD_CMDR_INTERNET@reddit
Anecdotally, for coding, I find Qwen3.6 27B and Gemma4 31B trade blows. I will swap Plan/Act roles if either gets stuck and that seems to work quite well.
mortenmoulder@reddit
Did you try Qwen3 Next Coder? I find it pretty good, but haven't compared it to Gemma4 yet.
LORD_CMDR_INTERNET@reddit
I dunno, I could never get it to run faster than 5-10 tokens/sec on my 5090 using llama.cpp
ResidentPositive4122@reddit
Funnily enough, this was one finding from HF about a year ago: swapping randomly between Opus and GPT-5 in the same session led to better results than either of them separately.
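Roughly this setup, as a sketch (the endpoint and model names are placeholders, not what HF actually ran):

```python
import random
from openai import OpenAI

# Hypothetical OpenAI-compatible gateway that can serve both models.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
MODELS = ["opus", "gpt5"]  # placeholder names

messages = [{"role": "user", "content": "Refactor this function to be iterative."}]
for _ in range(4):  # a few turns in the same session
    model = random.choice(MODELS)  # swap models randomly per turn
    reply = client.chat.completions.create(model=model, messages=messages)
    messages.append({"role": "assistant", "content": reply.choices[0].message.content})
    messages.append({"role": "user", "content": "Keep going."})
```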
Dany0@reddit
I keep thinking about that research showing that having a small LLM "optimise" typical prompts improved user feedback, except on prompts that were crafted by professionals.
But IIRC it wasn't coding.
Badger-Purple@reddit
And now Hermes and other agents have a skill where they do exactly this, and it improves with more input.
Dany0@reddit
link?
Newtonip@reddit
https://github.com/nousresearch/hermes-agent
Dany0@reddit
I keep thinking about it because there was that other paper showing that typos, small inconsistencies, and "misdirections" (basically gaslighting the model) reduced quality. Meanwhile, another paper showed that prompting in multiple languages improved quality, and that repeating the same prompt twice improved adherence.
So if we could inject a small finetuned LLM to fix up typos, translate random parts, and duplicate random parts before sending the prompt off... that would be a fun experiment to do.
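Something like this, as a rough sketch (the rewriter model name and instruction are made up, not from any of those papers):

```python
from openai import OpenAI

# Hypothetical local endpoint serving a small finetuned rewriter model.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

REWRITE_INSTRUCTION = (
    "Fix typos and small inconsistencies in the user's prompt. "
    "Do not change its meaning. Return only the cleaned prompt."
)

def preprocess(prompt: str) -> str:
    # The small LLM cleans the prompt before the big model ever sees it.
    cleaned = client.chat.completions.create(
        model="small-rewriter",  # placeholder model name
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTION},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
    # Duplicate the prompt, per the adherence finding mentioned above.
    return f"{cleaned}\n\nTo repeat the request: {cleaned}"
```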
SkyFeistyLlama8@reddit
The same thing happens with the MoE models too. Qwen 3.6 35B overthinks like crazy, spewing double or triple the thinking token count compared to Gemma 4 26B. When the results are close to each other, I'll stick with Gemma for a faster total reply time.
Teslaaforever@reddit
Qwen 35B works very well for me on Strix Halo; I used some parameters recommended for coding.
suprjami@reddit
This guy's conclusions from his own data are cooked.
He shows results where Qwen 3.5 does marginally better on THREE benchmarks out of twelve, one of them by the tiniest amount.
"Qwen3.6 performs behind Qwen3.5"
No it doesn't. Learn to read a fucking bar graph. Winning nine out of twelve comparisons is not "performing behind". It's the opposite.
He has made similar statements on earlier posts too. As far as I'm concerned, this whole blog is junk not worth reading.
DinoAmino@reddit
I also see he leads with "On benchmarks, it (Qwen 3.6) appears to be significantly stronger than Qwen3.5." But actually reading the bar graph shows only 3 benchmarks where it is significantly stronger, and 3.6 losing 5 of them. Geez, the more I look, the more I realize the author can't fucking read a bar graph himself.
iplaythisgame2@reddit
I have felt that Gemma 4 worked "wmoother" at tool calls and the various tasks I've used it for. I just have a hell of a time keeping it loaded; it crashes a lot on me. 2x 3090 and a 3060. If anyone wants to share a llama.cpp config that's solid, please do.
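For reference, here's roughly what I'm running now (split values and context size are guesses, not a known-good config):

```sh
# Offload all layers and split weights across the GPUs roughly by VRAM
# (3090, 3090, 3060). These values are examples, not a tested setup.
llama-server -m gemma-4-31B.i1-IQ4_XS.gguf \
  --n-gpu-layers 999 \
  --tensor-split 24,24,12 \
  --ctx-size 16384
```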
CharlesStross@reddit
"wmoother"
I know it was a typo but the quotation marks made me laugh lol
GrungeWerX@reddit
I use Qwen 3.6 27B with no thinking. Works great so far; I've never even needed to turn it on.
BigYoSpeck@reddit
They definitely have different strengths and weaknesses depending on the scope of the task.
Qwen waffles for sure with its thinking, and it genuinely needs its context-size efficiency, because it will happily reach 200k of context working on something that Gemma stays under 100k for.
But I find Qwen sticks to doing what it needs, viewing only the files relevant to the task. Gemma is currently on its 2nd pass through my entire codebase, because I'm fairly sure it forgot it had already read everything.
Paerrin@reddit
Agreed. Qwen 3.6 with thinking off has been more consistent with tool calling and getting the usage right. Gemma4 kills it on non-structured text and natural language text processing and summarization on my system.
Running both through Hermes and llama-swap as the frontend for llama.cpp.
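For anyone curious, the llama-swap side looks roughly like this (paths and flags are illustrative, not my exact config):

```yaml
# llama-swap config sketch; llama-swap substitutes ${PORT} into each cmd.
models:
  "qwen3.6-27b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/Qwen3.6-27B-IQ4_XS.gguf
      --n-gpu-layers 999
  "gemma4-31b":
    cmd: >
      llama-server --port ${PORT}
      -m /models/gemma-4-31B-IQ4_XS.gguf
      --n-gpu-layers 999
```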
Raredisarray@reddit
You've been using Qwen 27B without thinking? How's your experience?
GrungeWerX@reddit
It's great, very solid. I've never even turned it on because I haven't needed to. Which is encouraging, since I'm sure thinking would make it even smarter. But right now it handles everything I throw at it.
Pristine-Woodpecker@reddit
Claims that Qwen is benchmaxxed don't hold up to real-world testing or SWE-Rebench.
Interpause@reddit
It can be both benchmaxxed and good to use; as they say, AGI is an LLM trained on every benchmark that could theoretically exist.
GovernmentTechnical@reddit
They have different types of attention, so they work well for different use cases.
BoobooSmash31337@reddit
Genuinely interesting point.
ea_man@reddit
* gemma-4-31B.i1-IQ4_XS.gguf is 16.7 GB
* Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf is 14.7 GB
Also, Qwen takes less VRAM for the KV cache, so I'd say Gemma is not really a competitor in the dense space for those with 16 GB.
Chromix_@reddit
I'm using Qwen 3.6 27B over Gemma 4 31B for local coding. It might simply work better for me because Gemma 4 is way more sensitive to quantization than Qwen 3.6. So for Qwen I can use a smaller quant and a Q8 KV cache to get more context without much degradation.
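For reference, the relevant llama.cpp flags look roughly like this (context size is just an example, and flag syntax can vary between builds):

```sh
# Q8 KV cache to fit more context in VRAM; quantizing the V cache
# requires flash attention (bare --flash-attn on older builds).
llama-server -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 65536
```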
WouterC@reddit
Same here.
jacek2023@reddit
That confirms my experience. Once again, real usage beats benchmarks.