I'm shocked (Gemma 4 results)
Posted by Potential-Gold5298@reddit | LocalLLaMA | 41 comments

12. Gemma 4 31B (think), Q4_K_M local - 78.7%
16. Gemini 3 Flash (think) - 76.5%
19. Claude Sonnet 4 (think) - 74.7%
22. Claude Sonnet 4.5 (no think) - 73.8%
24. Gemma 4 31B (no think), Q4_K_M local - 73.5%
29. GPT-5.4 (think) - 72.8%
vulcan4d@reddit
It may score high on the benchmark, but it's still a 31B multimodal LLM and doesn't have much knowledge. This means you need to have it fetch data online a lot. The cloud services excel at this; local models rely on tools that honestly aren't very good IMO.
rm-rf-rm@reddit
I love this - personal benchmarks are the way to go, and OP has systematized his and shared it with the world (which is like 3 steps ahead of everyone else; at most, people make a post on this sub with sparse info).
OP, consider open-sourcing the code for the table - hopefully it will encourage others to publish their rankings.
Kaljuuntuva_Teppo@reddit
Uhh, what is this benchmark supposed to be? Both Opus 4.5 and 4.6 are marked as #8.
Opus 4 is beating both of those models... Yeah no.
SteppenAxolotl@reddit
Use whichever model is best for your personal workload.
llama-impersonator@reddit
learn about variance, bruh
Stetto@reddit
learn about repetition and the law of large numbers, bruh
llama-impersonator@reddit
when you see 4 entries clustering in a benchmark with very similar scores like this, you should not immediately interpret it as "Opus 4.1 is better"
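quick sketch of why (Python; every number below is made up, including the 300-task benchmark size - it's just to show the scale of the noise):

```python
import math

# Hypothetical: four models within ~1.2 points of each other on an
# N-task pass/fail benchmark. N = 300 is an assumed size.
N = 300
scores = {"model A": 0.747, "model B": 0.742, "model C": 0.738, "model D": 0.735}

for name, p in scores.items():
    se = math.sqrt(p * (1 - p) / N)  # standard error of a binomial proportion
    print(f"{name}: {p:.1%} +/- {1.96 * se:.1%} (95% CI)")
```

with ~300 tasks, every interval comes out around +/- 5 points, so a 1-point gap between clustered entries tells you nothing about which model is actually better.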
Aggressive_Special25@reddit
How do you run this with OpenClaw? It doesn't seem to work.
the_mighty_skeetadon@reddit
Works seamlessly with ollama if you don't want to tinker with setup
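e.g., a minimal sketch with the ollama Python client (the model tag here is a guess on my part - check `ollama list` for whatever you actually pulled):

```python
# pip install ollama
import ollama

# "gemma4:31b" is an assumed tag; substitute whatever `ollama list` reports.
response = ollama.chat(
    model="gemma4:31b",
    messages=[{"role": "user", "content": "Explain MoE models in one sentence."}],
)
print(response["message"]["content"])
```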
Aggressive_Special25@reddit
LM Studio?
Ardalok@reddit
It’s funny that Gemma is outperforming Flash. I wonder how many parameters Flash actually has. Maybe it’s like Gemma 26B-A4B - an MoE with fewer than 5B active parameters?
petuman@reddit
Gemma 26B-A4B is surely in the Flash-Lite/nano range.
Flash/mini models seem to have around 8-16B active parameters and 100-200B total. Indirectly "proven" by it being a highly contested format: there are lots of open releases with that formula.
dryadofelysium@reddit
Google said that the next Gemini Nano is going to be based on Gemma 4 E4B.
You are correct otherwise though; if I had to guess, I'd say Flash Lite is around 20-30B.
petuman@reddit
I meant in OpenAI tier naming: Google's Flash-Lite = OpenAI nano; Google's Flash = OpenAI mini.
Potential-Gold5298@reddit (OP)
I roughly estimate the number of model parameters based on the ratio of knowledge to hallucinations. Typically, models of similar size either have a high number of correct answers but also a high number of hallucinations (Qwen3.5 397BA17B) or fewer correct answers and fewer hallucinations (MiMo-V2-Flash). Judging by this data, Gemini 3 Flash is very close to Gemini 3 Pro.
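Roughly, the heuristic looks like this (a toy sketch - all the counts below are invented for illustration, the real ones come from tallying a model's answers):

```python
# Toy illustration of the knowledge-vs-hallucination heuristic.
# All counts are invented; real values would come from per-model answer tallies.

def knowledge_ratio(correct: int, hallucinated: int) -> float:
    """Correct answers per hallucination: similar-sized models tend to trade
    these off (many right + many hallucinated, or fewer of both)."""
    return correct / max(hallucinated, 1)

tallies = {
    "confident-large-style": {"correct": 180, "hallucinated": 60},
    "cautious-small-style":  {"correct": 120, "hallucinated": 20},
}

for name, t in tallies.items():
    print(f"{name}: {knowledge_ratio(**t):.1f} correct per hallucination")
```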
Ardalok@reddit
Perhaps the only difference is the parameter activation? It could be a memory-saving technique where it’s the same model, just with a different count of active parameters.
Cool-Chemical-5629@reddit
I am not shocked at all. Gemma has a word "Gem" in its name, after all.
_derpiii_@reddit
Can someone explain why Sonnet 4.6 is scoring so much higher than Opus 4.6?
Right_Weird9850@reddit
Just chess?
Right_Weird9850@reddit
One day I'm really going to find out what this benchmark measures.
sandman_br@reddit
Lame benchmarking. Can’t be taken seriously
ZealousidealTurn218@reddit
GPT-4T over 5.4?
Lorian0x7@reddit
Censorship level is pretty accurate
Technical-Earth-3254@reddit
Is there a way to exclude reasoning from the calculation? Or is it measuring more than just censorship? Interesting find for sure.
Lorian0x7@reddit
there's a non-thinking version in the list
deejeycris@reddit
Fake news
Mart-McUH@reddit
That is a strange benchmark indeed; probably not enough tasks to actually give relevant results.
I can only speak for local, but Qwen 3.5 27B with reasoning looks smarter (it picks up a lot of small details/instructions that Gemma 4 31B misses). Gemma 4 writes nicer though (more natural/pleasant language, though still a lot of slop).
Also, with Qwen 3.5 I felt like reasoning only worked well at Q8 and perhaps Q6; below that it started to get visibly worse. With Gemma 4, after Q8 and Q6 I am also trying Q4KL from bartowski, and so far it seems to perform reasonably well.
Potential-Gold5298@reddit (OP)
There's no single benchmark that demonstrates a model's intelligence across all tasks. Some models are better at some tasks than others. There are specific benchmarks for coding, agents, and so on. The author of this leaderboard tests models on his own personal tasks, not official benchmarks. This isn't definitive (like any benchmark); it's just food for thought. You should use the model that performs best on your specific tasks, not the one that performs best on someone else's tests.
zeitplan@reddit
Which quant did you choose?
Potential-Gold5298@reddit (OP)
I usually download the mradermacher quants (regular, not i1). Gemma 4 specifically, I downloaded from DevQuasar. However, I am not the author of the leaderboard, if that's what you mean.
Uninterested_Viewer@reddit
This is why they decided not to give us the >100b model.
P-S-E-D@reddit
Why do you think older models are scoring higher in STEM than SOTA models?
TonyPace@reddit
I am just spitballing here, but I imagine modern models are using something similar to abliteration during or just after training, where they run a lot of queries through it, and stuff that almost never comes up gets zapped. That is a lot of science stuff - and likely dumb hobby and culture stuff. A year and a half ago, they just tried to shovel in more data.
Potential-Gold5298@reddit (OP)
If it's within the same model line (like GPT-5/5.1/5.2/5.4), then I'd assume it's due to fine-tuning in favor of coding, agentic/tool usage, safety, or something similar.
edeltoaster@reddit
My experience was not that good in practice, even after the first round of fixes. I used the unsloth q4_k_xl variant (also the second iteration) in LM Studio. I still had strange bugs, and tool calling was still error-prone in Roo Code etc., probably due to llama.cpp. Any reason to change the quant?
Potential-Gold5298@reddit (OP)
I don't use iMatrix quants because I work with models in languages other than English, and for me, iMatrix would only degrade their quality. It's safe to assume that iMatrix could degrade other aspects of the model that weren't accounted for in the importance matrix. You could try Q5_K_M without iMatrix (if possible) – its quality should be on par with or better than Q4_K_M with iMatrix, but more predictable. Or even Q4_K_M without iMatrix.
ComprehensiveBed5368@reddit
I agree with you. imatrix degrades the model's capabilities in other languages. Which developers besides mradermacher release non-imatrix GGUF quants?
Potential-Gold5298@reddit (OP)
I'm not 100% sure, but as far as I can tell, lmstudio-community and ggml-org do. I downloaded Gemma 4 26B-A3B from DevQuasar, and I didn't find any mention of iMatrix in their description – please correct me if I'm wrong. Also, official quants from developers and fine-tuners usually don't include iMatrix. As far as I understand, importance matrices are favored by "professional quantizers" because it's a personal touch and a way to stand out, while a standard quant is the same for everyone.
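If you want to pin a specific non-iMatrix file, the huggingface_hub client works; a minimal sketch (the repo ID and filename here are placeholders - check the repo's file list for the real names):

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Placeholder repo/filename for illustration; browse the repo's "Files" tab
# for the exact GGUF name before downloading.
path = hf_hub_download(
    repo_id="lmstudio-community/SomeModel-GGUF",
    filename="SomeModel-Q5_K_M.gguf",
)
print("Downloaded to:", path)
```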
Dunkle_Geburt@reddit
I've just had time to play a little with it but boy, this thing is crazy good 😲
Direct_Technician812@reddit
Qwen 27B 💀💀💀
Warm-Attempt7773@reddit
I'm seeing that level of performance as well. The benchmark is pretty on point.