Gemma 4 31B > Kimi K2.5 > Grok 4.20 on DuelLab's highest reasoning leaderboard
Posted by Goa_@reddit | LocalLLaMA | 6 comments
Gemma 4 31B: 53.9 score
Kimi K2.5: 50.5 score
Grok 4.20: 46.8 score
Funny to see the open Gemma 4 31B ahead of both.
Note: these scores are about writing competitive code...
Awkward-Boat1922@reddit
Anyone used this locally yet?
I've been putting off the download, but surely it isn't better than the 122B?
gamblingapocalypse@reddit
I’m using Gemma 4 31B right now, and for my use cases I actually prefer it over Qwen 3.5 122B.
What stands out to me is that Gemma 4 31B tends to get what I’m asking for right on the first try, whereas Qwen will often get there eventually but usually needs more steering.
As for prompt processing, I haven’t measured it yet, but they seem fairly similar to me, maybe with a slight edge to Qwen. In the decoding phase, Qwen also feels faster. But honestly, Gemma’s outputs have been good enough that I still prefer it overall.
Gemma can be pretty aggressive with RAM usage. It may start off lighter, but once you actually work with it for a while, you can watch the RAM disappear. My understanding is that this is tied to how Gemma handles KV cache. I’ve heard turbo quant is strongly recommended to help reduce KV-cache memory usage, but I haven’t implemented it yet. Even so, I’m convinced enough by Gemma 4 that it’s an optimization path I’m willing to pursue.
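KV-cache growth with context length is a general transformer property: the cache stores one key and one value vector per layer, per KV head, per token. A minimal sketch of estimating that footprint, using purely illustrative shape parameters (not Gemma 4's actual config), shows why quantizing the cache cuts RAM roughly in proportion to the bytes per element:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int) -> int:
    # Factor of 2: one key vector and one value vector per position.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative numbers only, not any real model's config:
fp16 = kv_cache_bytes(48, 8, 128, 32_768, 2)  # fp16/bf16 cache
int8 = kv_cache_bytes(48, 8, 128, 32_768, 1)  # 8-bit quantized cache

print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 6.0 GiB
print(f"int8 KV cache: {int8 / 2**30:.1f} GiB")  # 3.0 GiB
```

The cache scales linearly with context length, which matches the "RAM disappears the longer you work with it" behavior described above.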
Qwoctopussy@reddit
The 122B has only 10B active parameters, which degrades its ability to model interactions between distant tokens (this is what I've gathered, anyway). So although it has more knowledge, it has less intelligence.
Randomdotmath@reddit
I don't trust DuelLab's comparisons, since they rank GPT-5.4 Nano > GPT-5.4 Mini > Claude Sonnet 4.6. That makes no sense.
Goa_@reddit (OP)
It's a small sample, and what you see in the Overall leaderboard has its nuances. But 5.4 Mini and Nano are way more capable than anyone has yet realized...
Robert__Sinclair@reddit
I tested gemma-4 26B and gemma-4 31B. The first one can't solve a simple logic problem that isn't in any dataset; the second one solved it.
As a comparison: Gemini 1.5 Flash couldn't solve it, and Gemini 2.5 Flash solved it only one time out of four...
Gemini 3 (and 3.1) solve it too.
Too bad that the 31B is quite big.
Can't wait to see a 14B or even an 8B model solve it.