Gemma 4 31B — 4bit is all you need
Posted by tolitius@reddit | LocalLLaMA | View on Reddit | 72 comments
Gemma quant comparison on an M5 Max MacBook Pro 128GB (subjective of course, but across a variety of categories):
[gemma 4 leaderboard]()
the surprising bit: Gemma 4 31B 4bit scored higher than 8bit. 91.3% vs 88.4%. not sure why: could be the template, could be quantization, could be my prompts. but it was consistent across runs
[accuracy vs. tokens per second]()
[category accuracy]()
"Gemma 4 26B-A4B would get a higher score but for two questions it went into the regression loop and never came back, all the quants as well as full precision (bf16):
[26B-A4B failing some tests due to regression loops]()
I configured 16,384 max response tokens and it hit that max while looping:
$ grep WARN ~/.cupel/cupel.log
2026-04-13 19:00:25 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=215.0s tokens=16384
2026-04-13 19:04:52 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=214.5s tokens=16384
2026-04-13 19:21:42 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.1s tokens=16384
2026-04-13 19:26:02 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.5s tokens=16384
2026-04-13 19:45:52 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-bf16 elapsed=349.2s tokens=16384
2026-04-13 19:51:40 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-bf16 elapsed=348.0s tokens=16384
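For reference, counting these truncation hits per model can be scripted instead of eyeballed; a minimal sketch that assumes the log format shown above (adjust the regex if your log differs):

```python
import re

def count_truncations(log_text):
    """Count max_tokens truncation warnings per model in a cupel-style log.

    The line format is assumed from the WARN lines above; the regex
    captures the model=... field of each truncation warning.
    """
    pattern = re.compile(r"WARNING llm response truncated .*model=(\S+)")
    counts = {}
    for line in log_text.splitlines():
        m = pattern.search(line)
        if m:
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

# two sample lines copied from the log above
log = """\
2026-04-13 19:00:25 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=215.0s tokens=16384
2026-04-13 19:21:42 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.1s tokens=16384
"""
print(count_truncations(log))
```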
"Gemma 4 31B 4 bit" is really good. it is a little on a slow side (21 tokens / second). But, as I mentioned before, preforms much better (for me) than "Gemma 4 31B 8 bit". I might however need better tests to see where 4bit starts losing to the full precision "Gemma 4 31B bf16", because as it stand right now they are peers.
I tested all of them yesterday, before these template updates were made by Hugging Face, and they did perform slightly worse. the above is a retest with the template updates included, so the updates did help.
I think it would make sense to hold on to "Gemma 4 31B 4 bit" for overnight complex tasks that do not require quick responses, and 21 tokens / second might be enough speed to churn through a few such tasks, but for "day time" it might be a little slow on a MacBook and "Qwen 122B A10B 4 bit" is still the local king. Maybe once M5 Ultra comes out + a few months to get it :), it may change.
context: this was prompted by the feedback in the reddit discussion, where I created a list to work on to address the feedback
Last_Mastod0n@reddit
I actually prefer Gemma 4 26b a4b because although it performs slightly worse (about 10% fewer observations in my project), it generates about 3x as fast. I am using the unsloth 6-bit quants for both models, which have almost the same vram reqs.
So the 3x generation speed allows me to iterate a lot faster than I would if I was using the 31b model.
tolitius@reddit (OP)
interesting!
what tokens per second are you getting with this setup? and does it perform noticeably better than its Qwen counterpart, "Qwen 3.5 35B A3B"? I did not test them one to one as a coding agent.
also I am a little worried about 26B A4B going into these regression loops: have you experienced it? because if not, I might need to try GGUFs with "llama.cpp" to see whether that may be the solution (currently I am running the MLX community quants).
Last_Mastod0n@reddit
So the answer to this question is a bit more complicated than I thought. If I just give it a difficult coding question as a single prompt then I get about 45 tokens / sec. This is with unsloth/gemma-4-26b-a4b-it with 8 expert layers offloaded to the CPU.
However in my personal project I am running 3 concurrent predictions so the actual speed is quite different. I made a script to use all of the timestamps and total tokens generated from all of the responses generated in a pipeline pass and got 173 tokens / sec with the 3 concurrent predictions.
The big takeaway for me was that doing concurrency actually barely affects individual response generation speed. I thought that with more concurrent predictions the GPU would pull more power since the total tok/s is several times faster, but nope its pulling almost exactly the same amount of power. So it might be something worth checking out on your Mac.
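The throughput script described above can be sketched roughly like this (field names and numbers are hypothetical, not the commenter's actual script): with overlapping generations, aggregate tok/s is total tokens divided by the wall-clock span from the earliest start to the latest end.

```python
def aggregate_tps(responses):
    """Aggregate tokens/sec across overlapping concurrent generations.

    Each response dict is assumed (hypothetical field names) to carry
    start/end timestamps in seconds and a generated-token count.
    Wall time is the span from the earliest start to the latest end.
    """
    wall = max(r["end"] for r in responses) - min(r["start"] for r in responses)
    total_tokens = sum(r["tokens"] for r in responses)
    return total_tokens / wall

# three overlapping generations, each ~45 tok/s on its own
runs = [
    {"start": 0.0, "end": 10.0, "tokens": 450},
    {"start": 0.5, "end": 10.5, "tokens": 450},
    {"start": 1.0, "end": 11.0, "tokens": 450},
]
print(round(aggregate_tps(runs), 1))  # 1350 tokens over an 11s wall clock
```

This is why concurrency multiplies aggregate throughput even when each individual stream barely slows down: decode is bandwidth-bound, and the streams share the same weight reads.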
Also regarding your question about looking into llama.cpp I would definitely stick to the MLX optimized quants. The people working on the MLX stuff are literal wizards so you should be fine.
tolitius@reddit (OP)
agree, that they are
that's interesting, Qwen 3.5 35B A3B runs a lot faster for me (on a Mac). I thought it was partly because of Gemma's internal architecture/layering differences, but mainly because Qwen 3.5 35B A3B activates 3B parameters per token while Gemma 26B A4B activates 4B, and hence steals more memory bandwidth: is that math different on an RTX 4090?
k_means_clusterfuck@reddit
I'm also using the unsloth Q6_K_XL for gemma4 31b it. wonder how it performs
Last_Mastod0n@reddit
Very nice, I think it's about as good as it gets. That model should only be 1-3% weaker than the q8 version.
Herr_Drosselmeyer@reddit
Seems odd that the Q8 would perform worse than the Q4. Can you link the exact quants you tested?
Chromix_@reddit
The reason is something else. This test used 23 prompts, and each prompt can apparently score between 0 and 3 points. That low number of prompts is nowhere near enough to benchmark a probabilistic LLM accurately.
tolitius@reddit (OP)
fair point on 23 prompts. but these are not MMLU multiple choice — each one is a multi-step task, some are multimodal, with tool calling, messy real world data, structured output, etc. a single prompt can take 2,000+ tokens to respond to.
that said, 23 is still 23. will need to work on expanding the set. am interested in some feedback here that Q8 would perform much better for visual reasoning, so need to add more of those.
RipperFox@reddit
Do a little experiment: use a fixed seed, run the 23 tests, and note the results. Now change only the seed (but keep it fixed): how much deviation in test results would you expect from changing the seed alone? If the variation is high, you don't have enough data points, right?
tolitius@reddit (OP)
actually I just checked oMLX code, and it seems that cupel override (which is "temp 0" by default) takes precedence
but would still need to add the seed
Noxusequal@reddit
Do you do this in batches or one after another?
If you want deterministic results, set a seed, set temp 0, and run with batch 1.
Then run this at least 3 times with 3 different seeds, keeping the same seed locked for all 23 prompts each time (the more the better). Now you can start looking at trends and calculating errors. Something like a paired t-test can then tell you whether the differences are statistically meaningful.
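The paired comparison suggested here can be sketched with the stdlib alone (no scipy needed); the per-prompt scores below are made up for illustration, not real results:

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic for per-prompt scores of two quants.

    t = mean(d) / (stdev(d) / sqrt(n)), where d_i = x_i - y_i.
    Compare |t| against the t distribution with n-1 degrees of
    freedom (roughly 2.07 for n=23 at the 5% level, two-sided).
    """
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n))

# hypothetical per-prompt scores (0-3) for 4bit vs 8bit on one seed
q4 = [3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3]
q8 = [3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3]
print(paired_t(q4, q8))
```

Pairing by prompt matters: it removes the per-prompt difficulty variance, so small quant differences become visible with far fewer prompts than an unpaired comparison would need.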
tolitius@reddit (OP)
yep, thanks
this aligns well with other recommendations in this discussion
I don't do these in batches as I want to make sure the model has all resources it needs when working on a single task
but I do need to add a seed for sure to see whether my tasks are as good as I think they are
temperature 0 + a seed might not work though, because temp 0 always returns the highest-probability token, so the seed changes nothing. temperature 1 (what Google recommends) + a fixed seed to keep runs consistent should work: then repeat with another seed, and another.
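The temp-0-vs-seed point can be shown in a toy sampler sketch (logits made up, not real model output): at temperature 0 the seed is irrelevant, while at temperature 1 a fixed seed makes the draw reproducible.

```python
import math
import random

def sample_token(logits, temperature, seed=None):
    """Pick a token id from raw logits.

    temperature=0 degenerates to argmax (the seed is irrelevant);
    temperature>0 samples from softmax(logits / temperature), and a
    fixed seed makes the draw reproducible.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # stable softmax numerators
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.5, 0.1]
assert sample_token(logits, 0, seed=1) == sample_token(logits, 0, seed=2) == 0
print(sample_token(logits, 1.0, seed=42))  # same value every run for seed=42
```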
tolitius@reddit (OP)
yea, that's a good experiment. will do that along with "temp=0" to see how much of the 4bit vs. 8bit gap is true vs. sampling noise
akumaburn@reddit
It's possible that an over fitted model generalizes better when quantized.
tolitius@reddit (OP)
ah.. I should have specified: I use "mlx-community/gemma-4-*" quants, such as mlx-community/gemma-4-31b-it-8bit
templates were updated yesterday, and I suspect there are more fixes coming
it could also be the way MLX converts them. maybe I should try GGUFs. have you tried all 3: 31B 4 / 8 / 16 bit in GGUF / on different hardware, and do you not see 8 bit underperform?
styles01@reddit
I’m having really good success with 26B-A4B q4: it’s fast AF on an M4 Max (70 t/s out)
tolitius@reddit (OP)
yea, from my limited use it's a really good model
the only problem I saw with 26B A4B is regression loops, where it starts repeating the same tokens over and over. I would have expected it to be quantization-caused, but I see the same behavior in bf16 as well
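As an aside, regression loops like this can be flagged mid-generation with a cheap n-gram check instead of waiting for max_tokens; a toy sketch (thresholds are arbitrary, tune them for your workload):

```python
def looks_like_loop(token_ids, ngram=8, repeats=4):
    """Heuristic loop detector: True if the last `ngram` tokens are
    repeated `repeats` times back-to-back at the end of the sequence.
    Cheap enough to run every few decoded tokens."""
    window = ngram * repeats
    if len(token_ids) < window:
        return False
    tail = token_ids[-window:]
    unit = tail[:ngram]
    return all(tail[i:i + ngram] == unit for i in range(0, window, ngram))

healthy = list(range(100))                       # no repetition
looping = list(range(50)) + [7, 8, 9, 10] * 12   # degenerates into a cycle
print(looks_like_loop(healthy), looks_like_loop(looping, ngram=4))
```

A check like this lets the harness abort a looping run after a few hundred wasted tokens rather than the full 16,384.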
Far-Low-4705@reddit
these models are not deterministic, they are statistical models... did you run multiple runs to gather statistics on the mean score and the variance?
tolitius@reddit (OP)
I did run it a few times, with "temperature": 0 (to remove sampling variance) and "temperature": 1 (what Google recommends). I did not record the variance; I should do that next time I run these. but in both cases 4 bit outperforms 8 bit
Far-Low-4705@reddit
that is not the same thing.
ideally you should run the model with the recommended settings, and if you can, run it several times and collect data on the means and standard deviation/variance. that is most likely what explains your 4bit doing better than 8bit
GeorgeSKG_@reddit
What is the "it", and what is the difference from the plain version?
tolitius@reddit (OP)
"
it" in "gemma-4-26B-A4B-it" and other models usually means "instruction tuned""
gemma-4-26B-A4B" (without the prefix) is a base, pretrained model that is able to predict the next token, but it won't be able to follow instructions"
gemma-4-26B-A4B-it" it is a "gemma-4-26B-A4B" model that were further instruction fine tuned (supervised + RLHF), after which it is able to follow instructions, do coding, reasoning, assume roles ("you are a helpful assistant"), do multi-turn conversation, tool calling, etc.tavirabon@reddit
4-bit getting the same score as 16-bit (21 out of 23) while 8-bit is lower (20 out of 23) is a pretty good indicator of 1) problems with the quantization process or 2) problems with the test. My gut says it's the second one.
tolitius@reddit (OP)
sure, hence:
what I would like to understand more is why 8 bit would lag behind the 4 bit version. all 31B quants went through the test well: did not regress, thought well, etc. but I suspect there is something different about the 8-bit one. it could be my tests of course, so I'll try harder. however it could also be the quant itself, the MLX conversion, templates, etc. which is also interesting to track
a_beautiful_rhind@reddit
Did you use greedy sampling and the same seed? How many times did you run the test? Without repeatability, this ish is kinda worthless. Random chance could be bigger than the "difference" between quants.
tolitius@reddit (OP)
while I did run it 8 times (e.g. just a single 31B bf16 run takes about 76 minutes) I used temp=1.0 with top_p=0.95, not greedy.
fair criticism: will re-run with temp=0 to remove sampling variance and see if the 4bit/8bit gap holds
audioen@reddit
No, that is not appropriate. It is fine to keep temperature -- what you probably need to do is vary the random seed of the LLM sampler, so that as you do repeated runs, the LLM explores different answers to the same questions as driven by the pseudorandom generator that picks tokens.
Worse quants predict worse completions and show confused reasoning and similar problems that prevent recovery, while good quants can be expected to better tolerate random sampling.
Putting temperature to zero just means that every run should result in the same completion, but that completion is itself to some degree the result of a random process. The variations in model reply in that case are due to the quantization error perturbing the model in billions of different ways.
tolitius@reddit (OP)
makes sense. Google does recommend temperature 1.0. I actually ran with both "1.0" and "0" (same 8 bit vs. 4 bit results)
and yes, will implement different seeds; it is a good idea to understand whether my test set is too small to produce stable results
tolitius@reddit (OP)
checked oMLX code, and it seems that cupel override (which is "temp 0" by default) takes precedence
but would still need to introduce the seed and run with different ones
audioen@reddit
It is just random chance. You should look into some statistical theory and derive the error bars for tests like MMLU, but basically depending on size of effect you plan to measure, you have to make the sample size larger to drive the probability that your result is due to random chance below a certain % point. Science often uses 5 % as the rule for statistically significant results.
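A rough sketch of those error bars, treating the 23 prompts x 3 points as independent pass/fail trials (a simplification: rubric points within one prompt are correlated, so the real interval is even wider):

```python
import math

def score_ci(p, n_trials, z=1.96):
    """Normal-approximation 95% CI half-width for a benchmark accuracy.

    p is the observed score fraction, n_trials the number of
    independent scoring opportunities. z=1.96 gives the two-sided
    95% level.
    """
    return z * math.sqrt(p * (1 - p) / n_trials)

# 23 prompts x 3 points each ~= 69 opportunities at the observed 91.3%
hw = score_ci(0.913, 23 * 3)
print(f"91.3% +/- {hw * 100:.1f} points")
```

The resulting half-width is well over the 2.9-point gap between the 4bit (91.3%) and 8bit (88.4%) scores, which is exactly the "random chance" point being made above.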
Erwindegier@reddit
Have you tried qwen3.5 with these tests? Gemma4 still doesn’t work great for coding, whereas qwen3.5 35b a3b q8 is now workable at about 50 tokens/second.
tavirabon@reddit
Gemma 31B is the best local coding model I've been able to run. It listens extremely well and the code is highly readable. Even Qwen 27B is better than the 35B, which is not pleasant to use at all.
Erwindegier@reddit
Trying qwen 27b q6 now; it’s half the speed of 35b. Any tips? Is the output worth the speed penalty? 35b is the first local model I found that worked well enough for me (M2 Max 64GB).
audioen@reddit
I believe that the output quality is worth the speed penalty, but you still might want to run the MoE model for quick replies and do 27B when you're away from the computer and don't care that it takes a while.
tylerrobb@reddit
From my understanding, the Qwen-3.5-27B is a dense model vs. Qwen-3.5-35B which is a MoE (mixture of experts) model.
The 35B MoE model will only activate 3B parameters at a time. It's always going to be faster than the 27B dense model that activates all 27B parameters on each request.
If you need faster tokens per second, go with MoE. If you want better logic/accuracy, stick with the slower dense model.
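The dense-vs-MoE speed gap above can be roughed out with a bandwidth-bound estimate: each decoded token reads every active weight once, so tok/s scales with effective bandwidth over active bytes. The bandwidth, efficiency factor, and byte counts below are assumptions for illustration, not measurements (real MoE numbers come in lower due to routing overhead and shared layers):

```python
def est_decode_tps(active_params_b, bytes_per_param, bandwidth_gbs, efficiency=0.6):
    """Bandwidth-bound decode estimate.

    active_params_b: active parameters per token, in billions
    bytes_per_param: ~0.5 for 4-bit, ~1 for 8-bit, 2 for bf16
    bandwidth_gbs:   memory bandwidth in GB/s (assumed, not measured)
    efficiency:      fraction of peak bandwidth actually achieved (a guess)
    """
    active_bytes = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 * efficiency / active_bytes

# assumed ~546 GB/s for an M5 Max-class machine, 4-bit weights
dense_31b = est_decode_tps(31, 0.5, 546)   # all 31B active per token
moe_a3b = est_decode_tps(3, 0.5, 546)      # only 3B active per token
print(round(dense_31b, 1), round(moe_a3b, 1))
```

Under these assumptions the dense 31B estimate lands near the ~21 tok/s reported in the post, while the MoE ceiling is several times higher, which is the whole speed/quality trade-off in one formula.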
tolitius@reddit (OP)
it would also depend on the size of the model. for example, "Qwen 3.5 122B A10B" can definitely be compared with dense models such as "Qwen 3.5 27B" and "Gemma 4 31B". this comment by u/Expensive-Paint-9490 puts it best:
because while it has only 10B parameters activated per token, it just stores a lot more knowledge in the overall 122B parameters (and 256 experts)
tolitius@reddit (OP)
what hardware are you running it on?
I would love to run it as a coding model, but, at least on an M5 Max 128GB, it is 7 tokens per second, degrading to 4 tokens per second on larger context
tavirabon@reddit
3090 and a 3060 for the spillover. I get roughly 25-30t/s at 0 context depending on config, it starts dropping under 20 as the context fills up.
tolitius@reddit (OP)
that's decent. are you able to be productive at 25 t/s? maybe I need to single out flows where it is ok
the current coding agents I use run at 40 / 50+ t/s (sometimes more, depending on the task)
MiaBchDave@reddit
You guys are mixing quants and t/s generation speeds in this comparison. The 128GB M5 Max runs BF16 Gemma 4 31B at 7-9 t/s. The guy with the 3090 and 3060 spillover is not running BF16 Gemma 4 31B (dense) at 25-30 t/s; it would be something like 0.85 t/s with CPU offload 😂
This is basic stuff, but if people are new and reading this, they may not know that Quants are faster.
tolitius@reddit (OP)
yes, my 7 tokens per second is 31B bf16, as dense as they come.
my question was more about whether you can feel productive at 25 tokens per second (with full precision or the best quant on a given hardware)
MiaBchDave@reddit
This is a good question for getting Gemma 4 31B working at high enough quality and fast enough on an M5 Max 128GB. My guess is that some combination of draft model and maybe an 8bit Quant of the 31B would work. Gemma 4 E2B can be used as a draft model on llama.cpp for Gemma 4 31B for speculative decoding right now and the speedup is pretty good. oMLX should support Gemma 4 speculative decoding (now/soon?) as well for best performance.
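The draft-model idea can be sanity-checked with the standard speculative-decoding expected-speedup arithmetic; the acceptance rate and relative draft cost below are guesses for an E2B-class draft against the 31B target, not measurements:

```python
def spec_decode_speedup(alpha, k, draft_cost):
    """Back-of-envelope speculative-decoding speedup.

    alpha:      per-token probability the target accepts a draft token
    k:          draft tokens proposed per verification step
    draft_cost: cost of one draft token relative to one target forward pass
    Expected tokens emitted per step is the geometric sum
    (1 - alpha**(k+1)) / (1 - alpha); each step costs one target
    pass plus k draft passes.
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    step_cost = 1 + k * draft_cost
    return expected_tokens / step_cost

# guessed: 80% acceptance, 4 drafted tokens, draft ~7% of target cost
print(round(spec_decode_speedup(alpha=0.8, k=4, draft_cost=0.07), 2))
```

With these guesses the model predicts a 2-3x decode speedup, which is in the ballpark of what people report for small-draft speculative decoding; a low acceptance rate (below ~0.5) quickly eats the gain.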
tavirabon@reddit
It's ample for using it as an assistant. The bigger bottleneck is me reviewing.
tolitius@reddit (OP)
yes, Qwen is awesome. I have some Qwen results (same tests) here: https://www.reddit.com/r/LocalLLaMA/comments/1sfr6u4/m5_max_128gb_17_models_23_prompts_qwen_35_122b_is/
I can make a similar comparison for Qwen specifically if there is interest
Pleasant-Shallot-707@reddit
I think the slowness is due to needing better optimization on the server side.
tolitius@reddit (OP)
I'd like to try it. what would you recommend? (hardware: M5 Max MacBook Pro 128GB / 40 GPU cores)
Pleasant-Shallot-707@reddit
What I’m saying is that I don’t think Gemma 4 has been optimized on many of the servers yet.
tolitius@reddit (OP)
ah.. fair.
for this test I used oMLX and mlx-community/gemma-4-* quants.
given the questions here it might make sense to retest with GGUFs via "llama.cpp". any particular GGUFs you would recommend?
Pleasant-Shallot-707@reddit
I don’t have any. Native MLX should perform best compared to GGUF.
TassioNoronha_@reddit
A bit too slow for me on a 48GB M4 Max, but thanks for posting these benchmarks, it really helps me visualize whether the upgrade to a 128GB M5 Max is worth it
tolitius@reddit (OP)
I think it is still very much worth it
you can run Qwen 397B
as well as enjoy the current best (subjective / for me): Qwen 122B A10B
Maximum-Wishbone5616@reddit
Nope, that is not true. Your benchmark is flawed.
tolitius@reddit (OP)
yes, you might be right. what would be a set of prompts where 8-bit does better than 4-bit? can you share them if possible?
it could indeed be my tests, but I did try a few different sets and the behavior was consistent, so looking a little outside of my bubble could help determine whether the problem is in the tests
CooperDK@reddit
Do yourself a favor: ditch Apple for AI generation, get CUDA, preferably Blackwell, and use NVFP4, which is more precise and faster. Apple has a nice architecture but it is really not compatible, just like AMD; the entire AI ecosystem was built around CUDA. and NVFP4 only works with the 50xx and 6000 Pro series Nvidia cards
putrasherni@reddit
I don’t think that’s the case anymore
Apple ai stack will be on par with cuda within a couple of years
Pace of software development is incredible right now
Signor_Garibaldi@reddit
look at AMD: they tried for years to catch up with CUDA, cuDNN, etc., and they still aren't there. Apple is much earlier and the focus of their business is elsewhere. it's not very probable, but we keep our fingers crossed for them to break the CUDA monopoly
putrasherni@reddit
I disagree on this too
AMD software stack on Linux right now, for the physical memory bandwidth of their R9700, is really impressive compared to Cuda
Think of it this way: both the AMD R9700 and the NVIDIA 5090 are 32GB GPUs; the former is 640 GB/s and the 5090 is 1,792 GB/s
Comparing their performance on AI inference workloads, the 5090 is not 3 times faster than the R9700 on dense models, probably only 1.75 times
And on MoE models, the difference is less than 20%
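The spec-sheet ratio in this comment is easy to check with straight arithmetic (bandwidth numbers and the "observed" dense speedup are taken from the comment, not independently verified):

```python
# spec-sheet bandwidths from the comment, in GB/s
r9700_bw, rtx5090_bw = 640, 1792
bw_ratio = rtx5090_bw / r9700_bw            # raw bandwidth advantage on paper

dense_observed = 1.75                        # commenter's claimed dense speedup
implied_efficiency = dense_observed / bw_ratio  # fraction of the paper gap realized
print(f"{bw_ratio:.1f}x on paper, {implied_efficiency:.0%} realized on dense models")
```

If the claimed numbers hold, the 5090 converts only a bit over half of its 2.8x bandwidth edge into dense-model throughput, which is the commenter's point.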
Signor_Garibaldi@reddit
I'm eyeing AMD, and while the R9700 is a step in a good direction for homelab use, the breadth of their ecosystem is smaller and the production-ready capabilities are not there. They are closing the gap in genai, and for folks focused on LLMs only in their homelabs it might be enough, but I would argue that for production-grade complex systems, robotics, simulations, and real-time multimedia processing frameworks, nvidia's Omniverse has wonderful tooling not replicated in the AMD ecosystem.
tolitius@reddit (OP)
definitely a trade-off, and I agree the RTX PRO 6000's 1,792 GB/s memory bandwidth might just be worth it (no NVLink on the PRO 6000 though)
I was (and still am) contemplating building a rig at some point, but getting it to 128GB, or even larger, of well-performing CUDA is pricey, and it is also power hungry (1000W++ under load), noisy, etc.
might still be worth it though.
but.. I do love the Mac architecture, it just keeps getting better, and MoE + imatrix + adaptive layer quantization + "more ideas" bring those big models (Qwen 397B, etc.) closer and closer to being really great on a Mac.
Geritas@reddit
That 96 gb gpu alone will set you back 10k lol.
Temporary-Mix8022@reddit
I've been looking at RTX 6000s lately, and at a full system build.
I just cannot reconcile your "the same price" claim.
The starting price of the RTX 6000 is more than any of the Apple machines with 128GB of VRAM.
While I don't dispute that it's faster... I mean, your whole thing on price isn't true.
Most of the time, for a 6000 build you're looking at $10k+
segmond@reddit
The advantage of 8bit over 4bit doesn't show in these benchmaxxed benchmarks. It shows in precise work. It shows when you use the vision capabilities. Go as high a quant as you can, and only downgrade if you don't have the VRAM or you are doing work that doesn't require precision.
sasquatch3277@reddit
Well, it seems like a trade-off still. In my maybe naive understanding, more parameters get you genuinely different, better output, whereas going Q4 to Q8 drastically increases size with relatively minimal gain. Or at least the gain is of a different kind.
In my testing with heavily quantized models, the model will output very slightly adjacent tokens that are incorrect/hallucinated: think "horse" instead of "house".
So I guess if a hallucination is very undesirable, especially over a longer context, then unquantized is the way to go. And obviously once that one token is just wrong, it can derail the entire response.
But if you're just looking to lower entropy as much as possible, I feel like more weights is better: the most weights you can fit at Q4 seems like a good default.
I really don't know if there's a proper mathematical way to describe this (bpw?)
tolitius@reddit (OP)
good idea. will experiment more with multimodality. I do have multimodal tests in the current "benchmaxxed benchmarks", where 8 bit actually did worse than 4 bit.
but I'd like to change the way I test to figure out where 8 bit shines. if possible, can you share a few visuals and prompts I can try these quants on to assess the quality of the 8 bit one?
anotherwanderingdev@reddit
Have you gotten Gemma4 or any other model that runs locally on your 128GB MacBook Pro to work well as a coding agent, like Claude Code with Sonnet or Opus?
I was able to get Gemma4 working well as a chat ai on a similar Mac, but performance dropped horribly when I tried to use it for coding.
I know fixes keep coming out. I last tried with Ollama 0.20.5.
Sixstringsickness@reddit
I am on Strix Halo and found similar behaviors. When using speculative decoding I was able to achieve 20 tps for non-coding tasks, and that dropped to 14 tps for coding tasks.
Additionally, it seemed to struggle with reliable tool calling compared to the qwen models; I had to add some programmatic reinforcement to ensure the calls didn't fail.
Reaper0n3@reddit
Really curious what backend you're running on strix halo and how big a context. I'd love to try it also.
Sixstringsickness@reddit
Using llama.cpp via the fedora toolboxes. For Gemma 4 I seemed to need ROCm for spec decode, and RADV was about half the speed. MoE models were much faster on RADV (saw PP drop on Qwen 122b, but TG was up more than enough to offset).
tolitius@reddit (OP)
I have tried it. 31B is not very practical on my hardware (it could be really good on CUDA though), but 26B-A4B is actually quite decent speed/quality wise.
but for bf16 (to retain quality) I also noticed a significant drop in tokens per second: 47 t/s down to 26 t/s on a larger context
I did not find a model I can run locally that is close to Opus, but I have hopes
Long_comment_san@reddit
"the surprising bit: "
Right...
boutell@reddit
Sounds like four surprising bits to me