Gemma 4 31B — 4bit is all you need
Posted by tolitius@reddit | LocalLLaMA | View on Reddit | 72 comments
Gemma quant comparison on an M5 Max MacBook Pro 128GB (subjective of course, but across a variety of categories):
[gemma 4 leaderboard]()
the surprising bit: Gemma 4 31B 4bit scored higher than 8bit. 91.3% vs 88.4%. not sure why: could be the template, could be quantization, could be my prompts. but it was consistent across runs
[accuracy vs. tokens per second]()
[category accuracy]()
"Gemma 4 26B-A4B would get a higher score but for two questions it went into the regression loop and never came back, all the quants as well as full precision (bf16):
[26B-A4B failing some tests due to regression loops]()
I configured 16,384 max response tokens and it hit that max while looping:
$ grep WARN ~/.cupel/cupel.log
2026-04-13 19:00:25 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=215.0s tokens=16384
2026-04-13 19:04:52 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=214.5s tokens=16384
2026-04-13 19:21:42 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.1s tokens=16384
2026-04-13 19:26:02 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.5s tokens=16384
2026-04-13 19:45:52 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-bf16 elapsed=349.2s tokens=16384
2026-04-13 19:51:40 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-bf16 elapsed=348.0s tokens=16384
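For reference, counting these truncation hits per model can be scripted instead of eyeballed; a minimal sketch that assumes the log format shown above (adjust the regex if your log differs):

```python
import re

def count_truncations(log_text):
    """Count max_tokens truncation warnings per model in a cupel-style log.

    The line format is assumed from the WARN lines above; the regex
    captures the model=... field of each truncation warning.
    """
    pattern = re.compile(r"WARNING llm response truncated .*model=(\S+)")
    counts = {}
    for line in log_text.splitlines():
        m = pattern.search(line)
        if m:
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return counts

# two sample lines copied from the log above
log = """\
2026-04-13 19:00:25 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-4bit elapsed=215.0s tokens=16384
2026-04-13 19:21:42 WARNING llm response truncated (hit max_tokens=16384) model=gemma-4-26b-a4b-it-8bit elapsed=260.1s tokens=16384
"""
print(count_truncations(log))
```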
"Gemma 4 31B 4 bit" is really good. it is a little on a slow side (21 tokens / second). But, as I mentioned before, preforms much better (for me) than "Gemma 4 31B 8 bit". I might however need better tests to see where 4bit starts losing to the full precision "Gemma 4 31B bf16", because as it stand right now they are peers.
I tested all of them yesterday, before these template updates were made by Hugging Face, and they did perform slightly worse. the above is a retest with the template updates included, so the updates did help.
I think it would make sense to hold on to "Gemma 4 31B 4 bit" for overnight complex tasks that do not require quick responses, and 21 tokens / second might be enough speed to churn through a few such tasks, but for "day time" it might be a little slow on a MacBook and "Qwen 122B A10B 4 bit" is still the local king. Maybe once M5 Ultra comes out + a few months to get it :), it may change.
context: this was prompted by the feedback in the reddit discussion, where I created a list to work on to address the feedback
Last_Mastod0n@reddit
I actually prefer Gemma 4 26b a4b because although it performs slightly worse (about 10% fewer observations in my project), it generates about 3x as fast. I am using the unsloth 6-bit quants for both models, which have almost the same vram reqs.
So the 3x generation speed allows me to iterate a lot faster than I would if I was using the 31b model.
tolitius@reddit (OP)
interesting!
what tokens per second are you getting with this setup? and does it perform noticeably better than its Qwen counterpart, "Qwen 3.5 35B A3B"? I did not test them one to one as a coding agent.
also I am a little worried about 26B A4B going into these regression loops: have you experienced it? because if not, I might need to try GGUFs with "llama.cpp" to see whether that may be the solution (currently I am running the MLX community quants).
Last_Mastod0n@reddit
So the answer to this question is a bit more complicated than I thought. If I just give it a difficult coding question as a single prompt then I get about 45 tokens / sec. This is with unsloth/gemma-4-26b-a4b-it with 8 expert layers offloaded to the CPU.
However in my personal project I am running 3 concurrent predictions so the actual speed is quite different. I made a script to use all of the timestamps and total tokens generated from all of the responses generated in a pipeline pass and got 173 tokens / sec with the 3 concurrent predictions.
The big takeaway for me was that doing concurrency actually barely affects individual response generation speed. I thought that with more concurrent predictions the GPU would pull more power since the total tok/s is several times faster, but nope its pulling almost exactly the same amount of power. So it might be something worth checking out on your Mac.
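The throughput script described above can be sketched roughly like this (field names and numbers are hypothetical, not the commenter's actual script): with overlapping generations, aggregate tok/s is total tokens divided by the wall-clock span from the earliest start to the latest end.

```python
def aggregate_tps(responses):
    """Aggregate tokens/sec across overlapping concurrent generations.

    Each response dict is assumed (hypothetical field names) to carry
    start/end timestamps in seconds and a generated-token count.
    Wall time is the span from the earliest start to the latest end.
    """
    wall = max(r["end"] for r in responses) - min(r["start"] for r in responses)
    total_tokens = sum(r["tokens"] for r in responses)
    return total_tokens / wall

# three overlapping generations, each ~45 tok/s on its own
runs = [
    {"start": 0.0, "end": 10.0, "tokens": 450},
    {"start": 0.5, "end": 10.5, "tokens": 450},
    {"start": 1.0, "end": 11.0, "tokens": 450},
]
print(round(aggregate_tps(runs), 1))  # 1350 tokens over an 11s wall clock
```

This is why concurrency multiplies aggregate throughput even when each individual stream barely slows down: decode is bandwidth-bound, and the streams share the same weight reads.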
Also regarding your question about looking into llama.cpp I would definitely stick to the MLX optimized quants. The people working on the MLX stuff are literal wizards so you should be fine.
tolitius@reddit (OP)
agree, that they are
that's interesting, Qwen 3.5 35B A3B runs a lot faster for me (on a Mac). I thought it was partly because of Gemma's internal architecture/layering differences, but mainly because Qwen 3.5 35B A3B activates 3B parameters per token while Gemma 26B A4B activates 4B, and hence steals more memory bandwidth: is that math different on an RTX 4090?
k_means_clusterfuck@reddit
I'm also using the unsloth Q6_K_XL for gemma4 31b it. wonder how it performs
Last_Mastod0n@reddit
Very nice, I think it's about as good as it gets. That model should only be 1-3% weaker than the q8 version.
Herr_Drosselmeyer@reddit
Seems odd that the Q8 would perform worse than the Q4. Can you link the exact quants you tested?
Chromix_@reddit
The reason is something else. This test used 23 prompts, and each prompt can apparently score between 0 and 3 points. That low number of prompts is nowhere near enough to benchmark a probabilistic LLM accurately.
tolitius@reddit (OP)
fair point on 23 prompts. but these are not MMLU multiple choice — each one is a multi-step task, some are multimodal, with tool calling, messy real world data, structured output, etc. a single prompt can take 2,000+ tokens to respond to.
that said, 23 is still 23. will need to work on expanding the set. am interested in some feedback here that Q8 would perform much better for visual reasoning, so need to add more of those.
RipperFox@reddit
Do a little experiment: use a fixed seed, run the 23 tests, and note the results. Now change only the seed (but keep it fixed): how much deviation in test results would you expect from changing the seed alone? If the variation is high, you don't have enough data points, right?
tolitius@reddit (OP)
actually I just checked oMLX code, and it seems that cupel override (which is "temp 0" by default) takes precedence
but would still need to add the seed
Noxusequal@reddit
Do you do this in batches or one after another?
If you want deterministic results, set a seed, set temp 0, and run with batch 1.
Then run this at least 3 times with 3 different seeds, keeping the same seed locked for all 23 prompts each time (the more the better). Now you can start looking at trends and calculating errors. Something like a paired t-test can then tell you whether the differences are statistically meaningful.
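The paired comparison suggested here can be sketched with the stdlib alone (no scipy needed); the per-prompt scores below are made up for illustration, not real results:

```python
import math
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired t statistic for per-prompt scores of two quants.

    t = mean(d) / (stdev(d) / sqrt(n)), where d_i = x_i - y_i.
    Compare |t| against the t distribution with n-1 degrees of
    freedom (roughly 2.07 for n=23 at the 5% level, two-sided).
    """
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n))

# hypothetical per-prompt scores (0-3) for 4bit vs 8bit on one seed
q4 = [3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3]
q8 = [3, 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3]
print(paired_t(q4, q8))
```

Pairing by prompt matters: it removes the per-prompt difficulty variance, so small quant differences become visible with far fewer prompts than an unpaired comparison would need.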
tolitius@reddit (OP)
yep, thanks
this aligns well with other recommendations in this discussion
I don't do these in batches as I want to make sure the model has all resources it needs when working on a single task
but I do need to add a seed for sure to see whether my tasks are as good as I think they are
temperature 0 + a seed might not work though, because temp 0 always returns the highest-probability token, so the seed changes nothing. temperature 1 (what Google recommends) + a fixed seed to keep runs consistent should work: then repeat with another seed, and another.
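The temp-0-vs-seed point can be shown in a toy sampler sketch (logits made up, not real model output): at temperature 0 the seed is irrelevant, while at temperature 1 a fixed seed makes the draw reproducible.

```python
import math
import random

def sample_token(logits, temperature, seed=None):
    """Pick a token id from raw logits.

    temperature=0 degenerates to argmax (the seed is irrelevant);
    temperature>0 samples from softmax(logits / temperature), and a
    fixed seed makes the draw reproducible.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # stable softmax numerators
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.5, 0.1]
assert sample_token(logits, 0, seed=1) == sample_token(logits, 0, seed=2) == 0
print(sample_token(logits, 1.0, seed=42))  # same value every run for seed=42
```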
tolitius@reddit (OP)
yea, that's a good experiment. will do that along with "temp=0" to see how much of the 4bit vs. 8bit gap is true vs. sampling noise
akumaburn@reddit
It's possible that an over fitted model generalizes better when quantized.
tolitius@reddit (OP)
ah.. I should have specified: I use "mlx-community/gemma-4-*" quants, such as mlx-community/gemma-4-31b-it-8bit
templates were updated yesterday, and I suspect there are more fixes coming
it could also be the way MLX converts them. maybe I should try GGUFs. have you tried all 3: 31B 4 / 8 / 16 bit in GGUF / on different hardware, and do you not see 8 bit underperform?
styles01@reddit
I’m having really good success with 26B-A4B q4: it’s fast AF on an M4 Max (70 t/s out)
tolitius@reddit (OP)
yea, from my limited use it's a really good model
the only problem I saw with 26B A4B is regression loops, where it starts repeating the same tokens over and over. I would have expected it to be quantization-caused, but I see the same behavior in bf16 as well
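As an aside, regression loops like this can be flagged mid-generation with a cheap n-gram check instead of waiting for max_tokens; a toy sketch (thresholds are arbitrary, tune them for your workload):

```python
def looks_like_loop(token_ids, ngram=8, repeats=4):
    """Heuristic loop detector: True if the last `ngram` tokens are
    repeated `repeats` times back-to-back at the end of the sequence.
    Cheap enough to run every few decoded tokens."""
    window = ngram * repeats
    if len(token_ids) < window:
        return False
    tail = token_ids[-window:]
    unit = tail[:ngram]
    return all(tail[i:i + ngram] == unit for i in range(0, window, ngram))

healthy = list(range(100))                       # no repetition
looping = list(range(50)) + [7, 8, 9, 10] * 12   # degenerates into a cycle
print(looks_like_loop(healthy), looks_like_loop(looping, ngram=4))
```

A check like this lets the harness abort a looping run after a few hundred wasted tokens rather than the full 16,384.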
Far-Low-4705@reddit
these models are not deterministic, they are statistical models... did you run multiple runs to gather statistics on the mean score and the variance?
tolitius@reddit (OP)
I did run it a few times, with "temperature": 0 (to remove sampling variance) and "temperature": 1 (what Google recommends). I did not record the variance; I should do that next time I run these. but in both cases 4 bit outperforms 8 bit
Far-Low-4705@reddit
that is not the same thing.
ideally you should run the model with the recommended settings, and if you can, run it several times and collect data on the means and standard deviation/variance. that is most likely what explains your 4bit doing better than 8bit
GeorgeSKG_@reddit
What is the "it", and what is the difference from the plain version?
tolitius@reddit (OP)
"
it" in "gemma-4-26B-A4B-it" and other models usually means "instruction tuned""
gemma-4-26B-A4B" (without the prefix) is a base, pretrained model that is able to predict the next token, but it won't be able to follow instructions"
gemma-4-26B-A4B-it" it is a "gemma-4-26B-A4B" model that were further instruction fine tuned (supervised + RLHF), after which it is able to follow instructions, do coding, reasoning, assume roles ("you are a helpful assistant"), do multi-turn conversation, tool calling, etc.tavirabon@reddit
4-bit getting the same score as 16-bit (21 out of 23) while 8-bit is lower (20 out of 23) is a pretty good indicator of 1) problems with the quantization process or 2) problems with the test. My gut says it's the second one.
tolitius@reddit (OP)
sure, hence:
what I would like to understand more is why 8 bit would lag behind the 4 bit version. all 31B quants went through the test well: did not regress, thought well, etc. but I suspect there is something different about the 8-bit one. it could be my tests of course, so I'll try harder. however it could also be the quant itself, the MLX conversion, templates, etc. which is also interesting to track
a_beautiful_rhind@reddit
Did you use greedy sampling and the same seed? How many times did you run the test? Without repeatability, this ish is kinda worthless. Random chance could be bigger than the "difference" between quants.
tolitius@reddit (OP)
while I did run it 8 times (e.g. just a single 31B bf16 run takes about 76 minutes) I used temp=1.0 with top_p=0.95, not greedy.
fair criticism: will re-run with temp=0 to remove sampling variance and see if the 4bit/8bit gap holds
audioen@reddit
No, that is not appropriate. It is fine to keep temperature -- what you probably need to do is vary the random seed of the LLM sampler, so that as you do repeated runs, the LLM explores different answers to the same questions as driven by the pseudorandom generator that picks tokens.
Worse quants predict worse completions and show confused reasoning and similar problems that prevent recovery, while good quants can be expected to better tolerate random sampling.
Putting temperature to zero just means that every run should result in the same completion, but that completion is itself to some degree the result of a random process. The variations in model reply in that case are due to the quantization error perturbing the model in billions of different ways.
tolitius@reddit (OP)
makes sense. Google does recommend temperature 1.0. I actually ran with both "1.0" and "0" (same 8 bit vs. 4 bit results)
and yes, will implement different seeds; it is a good idea to understand whether my test set is too small to produce stable results
tolitius@reddit (OP)
checked oMLX code, and it seems that cupel override (which is "temp 0" by default) takes precedence
but would still need to introduce the seed and run with different ones
audioen@reddit
It is just random chance. You should look into some statistical theory and derive the error bars for tests like MMLU, but basically depending on size of effect you plan to measure, you have to make the sample size larger to drive the probability that your result is due to random chance below a certain % point. Science often uses 5 % as the rule for statistically significant results.
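A rough sketch of those error bars, treating the 23 prompts x 3 points as independent pass/fail trials (a simplification: rubric points within one prompt are correlated, so the real interval is even wider):

```python
import math

def score_ci(p, n_trials, z=1.96):
    """Normal-approximation 95% CI half-width for a benchmark accuracy.

    p is the observed score fraction, n_trials the number of
    independent scoring opportunities. z=1.96 gives the two-sided
    95% level.
    """
    return z * math.sqrt(p * (1 - p) / n_trials)

# 23 prompts x 3 points each ~= 69 opportunities at the observed 91.3%
hw = score_ci(0.913, 23 * 3)
print(f"91.3% +/- {hw * 100:.1f} points")
```

The resulting half-width is well over the 2.9-point gap between the 4bit (91.3%) and 8bit (88.4%) scores, which is exactly the "random chance" point being made above.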
Erwindegier@reddit
Have you tried qwen3.5 with these tests? Gemma4 still doesn’t work great for coding, whereas qwen3.5 35b a3b q8 is now workable at about 50 tokens/second.
tavirabon@reddit
Gemma 31B is the best local coding model I've been able to run. It listens extremely well and the code is highly readable. Even Qwen 27B is better than the 35B, which is not pleasant to use at all.
Erwindegier@reddit
Trying qwen 27b q6 now; it’s half the speed of 35b. Any tips? Is the output worth the speed penalty? 35b is the first local model I found that worked well enough for me (M2 Max 64GB).
audioen@reddit
I believe that the output quality is worth the speed penalty, but you still might want to run the MoE model for quick replies and do 27B when you're away from the computer and don't care that it takes a while.
tylerrobb@reddit
From my understanding, the Qwen-3.5-27B is a dense model vs. Qwen-3.5-35B which is a MoE (mixture of experts) model.
The 35B MoE model will only activate 3B parameters at a time. It's always going to be faster than the 27B dense model that activates all 27B parameters on each request.
If you need faster tokens per second, go with MoE. If you want better logic/accuracy, stick with the slower dense model.
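The dense-vs-MoE speed gap above can be roughed out with a bandwidth-bound estimate: each decoded token reads every active weight once, so tok/s scales with effective bandwidth over active bytes. The bandwidth, efficiency factor, and byte counts below are assumptions for illustration, not measurements (real MoE numbers come in lower due to routing overhead and shared layers):

```python
def est_decode_tps(active_params_b, bytes_per_param, bandwidth_gbs, efficiency=0.6):
    """Bandwidth-bound decode estimate.

    active_params_b: active parameters per token, in billions
    bytes_per_param: ~0.5 for 4-bit, ~1 for 8-bit, 2 for bf16
    bandwidth_gbs:   memory bandwidth in GB/s (assumed, not measured)
    efficiency:      fraction of peak bandwidth actually achieved (a guess)
    """
    active_bytes = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 * efficiency / active_bytes

# assumed ~546 GB/s for an M5 Max-class machine, 4-bit weights
dense_31b = est_decode_tps(31, 0.5, 546)   # all 31B active per token
moe_a3b = est_decode_tps(3, 0.5, 546)      # only 3B active per token
print(round(dense_31b, 1), round(moe_a3b, 1))
```

Under these assumptions the dense 31B estimate lands near the ~21 tok/s reported in the post, while the MoE ceiling is several times higher, which is the whole speed/quality trade-off in one formula.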
tolitius@reddit (OP)
it would also depend on the size of the model. for example, "Qwen 3.5 122B A10B" can definitely be compared with dense models such as "Qwen 3.5 27B" and "Gemma 4 31B". this comment by u/Expensive-Paint-9490 puts it best:
because while it has only 10B parameters activated per token, it just stores a lot more knowledge in the overall 122B parameters (and 256 experts)
tolitius@reddit (OP)
what hardware are you running it on?
I would love to run it as a coding model, but, at least on an M5 Max 128GB, it is 7 tokens per second, degrading to 4 tokens per second on larger context
tavirabon@reddit
3090 and a 3060 for the spillover. I get roughly 25-30t/s at 0 context depending on config, it starts dropping under 20 as the context fills up.
tolitius@reddit (OP)
that's decent. are you able to be productive at 25 t/s? maybe I need to single out flows where it is ok
the current coding agents I use run at 40 / 50+ t/s (sometimes more, depending on the task)
MiaBchDave@reddit
You guys are mixing quants and t/s generation speeds in this comparison. The 128GB M5 Max runs BF16 Gemma 4 31B at 7-9 t/s. The guy with the 3090 and 3060 spillover is not running BF16 Gemma 4 31B (dense) at 25-30 t/s; it would be something like 0.85 t/s with CPU offload 😂
This is basic stuff, but if people are new and reading this, they may not know that Quants are faster.
tolitius@reddit (OP)
yes, my 7 tokens per second is 31B bf16, as dense as they come.
my question was more about whether you can feel productive at 25 tokens per second (with full precision or the best quant on a given hardware)
MiaBchDave@reddit
This is a good question for getting Gemma 4 31B working at high enough quality and fast enough on an M5 Max 128GB. My guess is that some combination of draft model and maybe an 8bit Quant of the 31B would work. Gemma 4 E2B can be used as a draft model on llama.cpp for Gemma 4 31B for speculative decoding right now and the speedup is pretty good. oMLX should support Gemma 4 speculative decoding (now/soon?) as well for best performance.
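The draft-model idea can be sanity-checked with the standard speculative-decoding expected-speedup arithmetic; the acceptance rate and relative draft cost below are guesses for an E2B-class draft against the 31B target, not measurements:

```python
def spec_decode_speedup(alpha, k, draft_cost):
    """Back-of-envelope speculative-decoding speedup.

    alpha:      per-token probability the target accepts a draft token
    k:          draft tokens proposed per verification step
    draft_cost: cost of one draft token relative to one target forward pass
    Expected tokens emitted per step is the geometric sum
    (1 - alpha**(k+1)) / (1 - alpha); each step costs one target
    pass plus k draft passes.
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    step_cost = 1 + k * draft_cost
    return expected_tokens / step_cost

# guessed: 80% acceptance, 4 drafted tokens, draft ~7% of target cost
print(round(spec_decode_speedup(alpha=0.8, k=4, draft_cost=0.07), 2))
```

With these guesses the model predicts a 2-3x decode speedup, which is in the ballpark of what people report for small-draft speculative decoding; a low acceptance rate (below ~0.5) quickly eats the gain.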
tavirabon@reddit
It's ample for using it as an assistant. The bigger bottleneck is me reviewing.
tolitius@reddit (OP)
yes, Qwen is awesome. I have some Qwen results (same tests) here: https://www.reddit.com/r/LocalLLaMA/comments/1sfr6u4/m5_max_128gb_17_models_23_prompts_qwen_35_122b_is/
I can make a similar comparison for Qwen specifically if there is interest
Pleasant-Shallot-707@reddit
I think the slowness is due to needing better optimization on the server side.
tolitius@reddit (OP)
I'd like to try it. what would you recommend? (hardware: M5 Max MacBook Pro 128GB / 40 GPU cores)
Pleasant-Shallot-707@reddit
What I’m saying is that I don’t think Gemma 4 has been optimized on many of the servers yet.
tolitius@reddit (OP)
ah.. fair.
for this test I used oMLX and mlx-community/gemma-4-* quants.
given the questions here it might make sense to retest with GGUFs via "llama.cpp". any particular GGUFs you would recommend?
Pleasant-Shallot-707@reddit
I don’t have any. Native MLX should perform best compared to GGUF.
TassioNoronha_@reddit
A bit too slow for me on a 48GB M4 Max, but thanks for posting these benchmarks, it really helps me visualize whether the upgrade to a 128GB M5 Max is worth it
tolitius@reddit (OP)
I think it is still very much worth it
you can run Qwen 397B
as well as enjoy the current best (subjective / for me): Qwen 122B A10B
Maximum-Wishbone5616@reddit
Nope, that is not true. Your benchmark is flawed.
tolitius@reddit (OP)
yes, you might be right. what would be a set of prompts where 8-bit does better than 4-bit? can you share them if possible?
it could indeed be my tests, but I did try a few different sets and the behavior was consistent, so looking a little outside of my bubble could help determine whether the problem is in the tests
CooperDK@reddit
Do yourself a favor: ditch Apple for AI generation, get CUDA, preferably Blackwell, and use NVFP4, which is more precise and faster. Apple has a nice architecture but it is really not compatible, just like AMD; the entire AI ecosystem was built around CUDA. and NVFP4 only works with the 50xx and 6000 Pro series Nvidia cards
putrasherni@reddit
I don’t think that’s the case anymore
Apple ai stack will be on par with cuda within a couple of years
Pace of software development is incredible right now
Signor_Garibaldi@reddit
look at AMD: they tried for years to catch up with CUDA, cuDNN, etc., and they still aren't there. Apple is much earlier and the focus of their business is elsewhere. it's not very probable, but we keep our fingers crossed for them to break the CUDA monopoly
putrasherni@reddit
I disagree on this too
AMD software stack on Linux right now, for the physical memory bandwidth of their R9700, is really impressive compared to Cuda
Think of it this way: both the AMD R9700 and the NVIDIA 5090 are 32GB GPUs; the former is 640 GB/s and the 5090 is 1,792 GB/s
Comparing their performance on AI inference workloads, the 5090 is not 3 times faster than the R9700 on dense models, probably only 1.75 times
And on MoE models, the difference is less than 20%
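The spec-sheet ratio in this comment is easy to check with straight arithmetic (bandwidth numbers and the "observed" dense speedup are taken from the comment, not independently verified):

```python
# spec-sheet bandwidths from the comment, in GB/s
r9700_bw, rtx5090_bw = 640, 1792
bw_ratio = rtx5090_bw / r9700_bw            # raw bandwidth advantage on paper

dense_observed = 1.75                        # commenter's claimed dense speedup
implied_efficiency = dense_observed / bw_ratio  # fraction of the paper gap realized
print(f"{bw_ratio:.1f}x on paper, {implied_efficiency:.0%} realized on dense models")
```

If the claimed numbers hold, the 5090 converts only a bit over half of its 2.8x bandwidth edge into dense-model throughput, which is the commenter's point.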
Signor_Garibaldi@reddit
I'm eyeing AMD, and while the R9700 is a step in a good direction for homelab use, the breadth of their ecosystem is smaller and the production-ready capabilities are not there. They are closing the gap in genai, and for folks focused on LLMs only in their homelabs it might be enough, but I would argue that for production-grade complex systems, robotics, simulations, and real-time multimedia processing frameworks, nvidia's Omniverse has wonderful tooling not replicated in the AMD ecosystem.
tolitius@reddit (OP)
definitely a trade-off, and I agree the RTX PRO 6000's 1,792 GB/s memory bandwidth might just be worth it (no NVLink on the PRO 6000 though)
I was (and still am) contemplating building a rig at some point, but getting it to 128GB, or even larger, of well-performing CUDA is pricey, and it is also power hungry (1000W++ under load), noisy, etc.
might still be worth it though.
but.. I do love the Mac architecture, it just keeps getting better, and MoE + imatrix + adaptive layer quantization + "more ideas" bring those big models (Qwen 397B, etc.) closer and closer to being really great on a Mac.
Geritas@reddit
That 96 gb gpu alone will set you back 10k lol.
Temporary-Mix8022@reddit
I've been looking at RTX 6000s lately, and at a full system build.
I just cannot reconcile your "the same price" claim.
The starting price of the RTX 6000 is more than any of the Apple machines with 128GB of VRAM.
While I don't dispute that it's faster... I mean, your whole thing on price isn't true.
Most of the time, for a 6000 build you're looking at $10k+
segmond@reddit
The advantage of 8bit over 4bit doesn't show in these benchmaxxed benchmarks. It shows in precise work. It shows when you use the vision capabilities. Go as high a quant as you can, and only downgrade if you don't have the VRAM or you are doing work that doesn't require precision.
sasquatch3277@reddit
Well, it seems like a trade-off still. In my maybe naive understanding, more parameters get you genuinely different, better output, whereas going Q4 to Q8 drastically increases size with relatively minimal gain. Or at least the gain is of a different kind.
In my testing with heavily quantized models, the model will output very slightly adjacent tokens that are incorrect/hallucinated: think "horse" instead of "house".
So I guess if a hallucination is very undesirable, especially over a longer context, then unquantized is the way to go. And obviously once that one token is just wrong, it can derail the entire response.
But if you're just looking to lower entropy as much as possible, I feel like more weights is better: the most weights you can fit at Q4 seems like a good default.
I really don't know if there's a proper mathematical way to describe this (bpw?)
tolitius@reddit (OP)
good idea. will experiment more with multimodality. I do have multimodal tests in the current "benchmaxxed benchmarks", where 8 bit actually did worse than 4 bit.
but I'd like to change the way I test to figure out where 8 bit shines. if possible, can you share a few visuals and prompts I can try these quants on to assess the quality of the 8 bit one?
anotherwanderingdev@reddit
Have you gotten Gemma4 or any other model that runs locally on your 128GB MacBook Pro to work well as a coding agent, like Claude Code with Sonnet or Opus?
I was able to get Gemma4 working well as a chat ai on a similar Mac, but performance dropped horribly when I tried to use it for coding.
I know fixes keep coming out. I last tried with Ollama 0.20.5.
Sixstringsickness@reddit
I am on Strix Halo and found similar behaviors. When using speculative decoding I was able to achieve 20 tps for non-coding tasks, and that dropped to 14 tps for coding tasks.
Additionally, it seemed to struggle with reliable tool calling compared to the qwen models; I had to add some programmatic reinforcement to ensure the calls didn't fail.
Reaper0n3@reddit
Really curious what backend you're running on strix halo and how big a context. I'd love to try it also.
Sixstringsickness@reddit
Using llama.cpp via the fedora toolboxes. For Gemma 4 I seemed to need ROCm for spec decode, and RADV was about half the speed. MoE models were much faster on RADV (saw PP drop on Qwen 122b, but TG was up more than enough to offset).
tolitius@reddit (OP)
I have tried it. 31B is not very practical on my hardware (it could be really good on CUDA though), but 26B-A4B is actually quite decent speed/quality wise.
but for bf16 (to retain quality) I also noticed a significant drop in tokens per second: 47 t/s down to 26 t/s on a larger context
I did not find a model I can run locally that is close to Opus, but I have hopes
Long_comment_san@reddit
"the surprising bit: "
Right...
boutell@reddit
Sounds like four surprising bits to me