I'm shocked (Gemma 4 results)
Posted by Potential-Gold5298@reddit | LocalLLaMA | 41 comments

12. Gemma 4 31B (think), Q4_K_M local - 78.7%
16. Gemini 3 Flash (think) - 76.5%
19. Claude Sonnet 4 (think) - 74.7%
22. Claude Sonnet 4.5 (no think) - 73.8%
24. Gemma 4 31B (no think), Q4_K_M local - 73.5%
29. GPT-5.4 (think) - 72.8%
vulcan4d@reddit
It may score high on the benchmark, but it's still a 31B multimodal LLM and doesn't have much knowledge. This means you need to have it fetch data online a lot. The cloud services excel at this; local models rely on tools that honestly aren't very good IMO.
rm-rf-rm@reddit
I love this - personal benchmarks are the way to go, and OP has systematized his and shared it with the world (which is like 3 steps ahead of everyone else; at most, people make a post on this sub with sparse info).
OP, consider open-sourcing the code for the table - hopefully it will encourage others to publish their rankings.
Kaljuuntuva_Teppo@reddit
Uhh, what is this benchmark supposed to be? Both Opus 4.5 and 4.6 are marked as #8.
Opus 4 is beating both of those models... Yeah no.
SteppenAxolotl@reddit
Use whichever model is best for your personal workload.
llama-impersonator@reddit
learn about variance, bruh
Stetto@reddit
learn about repetition and the law of large numbers, bruh
llama-impersonator@reddit
when you see 4 entries clustering in a benchmark with very similar scores like this, you should not immediately interpret it as "Opus 4.1 is better"
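quick sketch of why (Python; every number below is made up, including the 300-task benchmark size - it's just to show the scale of the noise):

```python
import math

# Hypothetical: four models within ~1.2 points of each other on an
# N-task pass/fail benchmark. N = 300 is an assumed size.
N = 300
scores = {"model A": 0.747, "model B": 0.742, "model C": 0.738, "model D": 0.735}

for name, p in scores.items():
    se = math.sqrt(p * (1 - p) / N)  # standard error of a binomial proportion
    print(f"{name}: {p:.1%} +/- {1.96 * se:.1%} (95% CI)")
```

with ~300 tasks, every interval comes out around +/- 5 points, so a 1-point gap between clustered entries tells you nothing about which model is actually better.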
Aggressive_Special25@reddit
How do you run this with OpenClaw? It doesn't seem to work.
the_mighty_skeetadon@reddit
Works seamlessly with ollama if you don't want to tinker with setup
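e.g., a minimal sketch with the ollama Python client (the model tag here is a guess on my part - check `ollama list` for whatever you actually pulled):

```python
# pip install ollama
import ollama

# "gemma4:31b" is an assumed tag; substitute whatever `ollama list` reports.
response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": "Explain MoE models in one sentence."}],
)
print(response["message"]["content"])
```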
Aggressive_Special25@reddit
LM Studio?
Ardalok@reddit
It’s funny that Gemma is outperforming Flash. I wonder how many parameters Flash actually has. Maybe it’s like Gemma 26B-A4B - an MoE with fewer than 5B active parameters?
petuman@reddit
Gemma 26B-A4B is surely in the Flash-Lite/nano range.
Flash/mini models seem to have around 8-16B active parameters and 100-200B total. Indirectly "proven" by it being a highly contested format: there are lots of open releases with that formula.
dryadofelysium@reddit
Google said that the next Gemini Nano is going to be based on Gemma 4 E4B.
You are correct otherwise though; if I had to guess, I'd say Flash Lite is around 20-30B.
petuman@reddit
I meant in OpenAI tier naming: Google's Flash-Lite = OpenAI nano; Google's Flash = OpenAI mini.
Potential-Gold5298@reddit (OP)
I roughly estimate the number of model parameters based on the ratio of knowledge to hallucinations. Typically, models of similar size either have a high number of correct answers but also a high number of hallucinations (Qwen3.5 397BA17B) or fewer correct answers and fewer hallucinations (MiMo-V2-Flash). Judging by this data, Gemini 3 Flash is very close to Gemini 3 Pro.
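Roughly, the heuristic looks like this (a toy sketch - all the counts below are invented for illustration, the real ones come from tallying a model's answers):

```python
# Toy illustration of the knowledge-vs-hallucination heuristic.
# All counts are invented; real values would come from per-model answer tallies.

def knowledge_ratio(correct: int, hallucinated: int) -> float:
    """Correct answers per hallucination: similar-sized models tend to trade
    these off (many right + many hallucinated, or fewer of both)."""
    return correct / max(hallucinated, 1)

tallies = {
    "confident-large-style": {"correct": 180, "hallucinated": 60},
    "cautious-small-style":  {"correct": 120, "hallucinated": 20},
}

for name, t in tallies.items():
    print(f"{name}: {knowledge_ratio(**t):.1f} correct per hallucination")
```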
Ardalok@reddit
Perhaps the only difference is the parameter activation? It could be a memory-saving technique where it’s the same model, just with a different count of active parameters.
Cool-Chemical-5629@reddit
I am not shocked at all. Gemma has a word "Gem" in its name, after all.
_derpiii_@reddit
Can someone explain why Sonnet 4.6 is scoring so much higher than Opus 4.6?
Right_Weird9850@reddit
Just chess?
Right_Weird9850@reddit
One day I'm really going to find out what this benchmark measures.
sandman_br@reddit
Lame benchmarking. Can’t be taken seriously
ZealousidealTurn218@reddit
GPT-4T over 5.4?
Lorian0x7@reddit
Censorship level is pretty accurate
Technical-Earth-3254@reddit
Is there a way to exclude reasoning from the calculation? Or is it measuring more than just censorship? Interesting find for sure.
Lorian0x7@reddit
there's a non-thinking version in the list
deejeycris@reddit
Fake news
Mart-McUH@reddit
That is a strange benchmark indeed; probably not enough tasks to actually give relevant results.
I can only speak for local, but Qwen 3.5 27B with reasoning looks smarter (it picks up a lot of small details/instructions that Gemma 4 31B misses). Gemma 4 writes nicer though (more natural/pleasant language, though still a lot of slop).
Also, with Qwen 3.5 I felt like reasoning only worked well at Q8 and perhaps Q6; below that it started to get visibly worse. With Gemma 4, after Q8 and Q6 I am also trying Q4KL from bartowski, and so far it seems to perform reasonably well.
Potential-Gold5298@reddit (OP)
There's no single benchmark that demonstrates a model's intelligence across all tasks. Some models are better at some tasks than others. There are specific benchmarks for coding, agents, and so on. The author of this leaderboard tests models on his own personal tasks, not official benchmarks. This isn't definitive (like any benchmark); it's just food for thought. You should use the model that performs best on your specific tasks, not the one that performs best on someone else's tests.
zeitplan@reddit
Which quant did you choose?
Potential-Gold5298@reddit (OP)
I usually download the mradermacher quants (regular, not i1). Gemma 4 specifically, I downloaded from DevQuasar. However, I am not the author of the leaderboard, if that's what you mean.
Uninterested_Viewer@reddit
This is why they decided not to give us the >100b model.
P-S-E-D@reddit
Why do you think older models are scoring higher in STEM than SOTA models?
TonyPace@reddit
I am just spitballing here, but I imagine modern models are using something similar to abliteration during or just after training, where they run a lot of queries through it, and stuff that almost never comes up gets zapped. That is a lot of science stuff - and likely dumb hobby and culture stuff. A year and a half ago, they just tried to shovel in more data.
Potential-Gold5298@reddit (OP)
If it's within the same model line (like GPT-5/5.1/5.2/5.4), then I'd assume it's due to fine-tuning in favor of coding, agentic/tool usage, safety, or something similar.
edeltoaster@reddit
My experience was not that good in practice, even after the first round of fixes. I used the unsloth q4_k_xl variant (also the second iteration) in LM Studio. I still had strange bugs, and tool calling was still error-prone in Roo Code etc., probably due to llama.cpp. Any reason to change the quant?
Potential-Gold5298@reddit (OP)
I don't use iMatrix quants because I work with models in languages other than English, and for me, iMatrix would only degrade their quality. It's safe to assume that iMatrix could degrade other aspects of the model that weren't accounted for in the importance matrix. You could try Q5_K_M without iMatrix (if possible) – its quality should be on par with or better than Q4_K_M with iMatrix, but more predictable. Or even Q4_K_M without iMatrix.
ComprehensiveBed5368@reddit
I agree with you. imatrix degrades the model's capabilities in other languages. Which developers besides mradermacher release non-imatrix GGUF quants?
Potential-Gold5298@reddit (OP)
I'm not 100% sure, but as far as I can tell, lmstudio-community and ggml-org do. I downloaded Gemma 4 26B-A3B from DevQuasar, and I didn't find any mention of iMatrix in their description – please correct me if I'm wrong. Also, official quants from developers and fine-tuners usually don't include iMatrix. As far as I understand, importance matrices are favored by "professional quantizers" because it's a personal touch and a way to stand out, while a standard quant is the same for everyone.
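If you want to pin a specific non-iMatrix file, the huggingface_hub client works; a minimal sketch (the repo ID and filename here are placeholders - check the repo's file list for the real names):

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Placeholder repo/filename for illustration; browse the repo's "Files" tab
# for the exact GGUF name before downloading.
path = hf_hub_download(
    repo_id="lmstudio-community/SomeModel-GGUF",
    filename="SomeModel-Q5_K_M.gguf",
)
print("Downloaded to:", path)
```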
Dunkle_Geburt@reddit
I've just had time to play a little with it but boy, this thing is crazy good 😲
Direct_Technician812@reddit
Qwen 27B 💀💀💀
Warm-Attempt7773@reddit
I'm seeing that level of performance as well. The benchmark is pretty on point.