Gemma 4 26B-A4B GGUF Benchmarks
Posted by danielhanchen@reddit | LocalLLaMA | View on Reddit | 106 comments
Hey r/LocalLLaMA we conducted KL Divergence benchmarks for Gemma 4 26B-A4B GGUFs across providers to help you pick the best quant.
- Mean KL Divergence puts nearly all Unsloth GGUFs on the Pareto frontier
- KLD shows how well a quantized model matches the original BF16 output distribution, indicating retained accuracy (see the sketch after this list for how it's computed).
- This makes Unsloth the top-performing in 21 of 22 sizes. The same trend holds for 99.9% KLD and other metrics.
- We also updated our Q6_K quants to be more dynamic. The previous quants were already optimized and perfectly fine - no need to re-download - but the new ones are slightly better (and slightly bigger), so it's up to you if you want them. The same was done for Qwen3.6.
- We're also introducing a new UD-IQ4_NL_XL quant that fits in 16GB VRAM. UD-IQ4_NL_XL (14.6GB) sits between UD-IQ4_XS (13.4GB) and UD-Q4_K_S (16.4GB). The same was done for Qwen3.6.
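For anyone unfamiliar with the metric, here is a rough sketch of how mean and 99.9% KLD can be computed from per-token logits. The arrays and numbers below are placeholders for illustration, not our actual benchmarking pipeline:

```python
import numpy as np

def kld_stats(bf16_logits, quant_logits):
    """Per-token KL divergence of the quantized model from the BF16 reference.

    Both inputs are (num_tokens, vocab_size) arrays of raw logits collected
    at the same positions on the same evaluation text.
    """
    def log_softmax(x):
        # Numerically stable log-softmax.
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(bf16_logits)   # reference distribution P
    log_q = log_softmax(quant_logits)  # quantized distribution Q

    # KL(P || Q) per token: sum over the vocab of P * (log P - log Q)
    kld = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return kld.mean(), np.percentile(kld, 99.9)

# Toy run with placeholder logits; a faithful quant stays close to the reference.
rng = np.random.default_rng(0)
p = rng.normal(size=(2048, 1000))
q = p + rng.normal(scale=0.05, size=p.shape)
mean_kld, kld_999 = kld_stats(p, q)
print(f"mean KLD: {mean_kld:.5f}, 99.9% KLD: {kld_999:.5f}")
```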
For HQ versions of the graphs (Reddit mobile compresses them), see: Gemma 4 Benchmarks and Qwen3.6 Benchmarks
We also updated our MLX quants to be more dynamic with better layer selection (there are limitations due to MLX): See here
| MLX Metrics | UD-4bit (Old) | UD-4bit (New) | MLX 4.4bit MSQ |
|---|---|---|---|
| Perplexity | 4.772 | 4.766 | 4.864 |
| Mean KLD | 0.0177 | 0.0163 | 0.0878 |
| 99.9% KLD | 0.8901 | 0.8398 | 2.9597 |
| Disk Size | 21.4 GB | 21.6 GB | 21.2 GB |
Gemma 4 GGUFs: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF
Qwen3.6 GGUFs: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
hailnobra@reddit
While the benchmarks provide a pretty good guide, and I agree Gemma 4 is quite smart and puts out great answers, there is one small caveat... while it works (at least for me). For reference, I was using the Q8 version from Unsloth and the Quality version from mudler while trying to tame it.
I found that this model was highly defiant, liked to break the system prompt rules, hated calling tools, and constantly had issues with memory corruption from past conversations. I updated every time llama.cpp-server or openwebUI came out with an update, I constantly updated my GGUF files when new versions were available and tried both mudler and unsloth versions in an attempt to get this model to play nice on the home server. Every time I thought I had it working it would find a new way to break out and just cause chaos. It would either eat the reply in the think tag (hiding it in the reasoning), just decide to quit after a tool call, not call a tool at all even when I told it that there was no other choice, and loved hallucinating that I was in the future (even when the system prompt gave it the current date and time and explained that it has historical data).
In the end, the smart answers (when I could get one) could not get me to stay with it on my current setup. I switched over to Qwen 3.6 and that model has been a dream to work with. Yes it is more analytical in its answers and not as creative, but DANG does that model listen to orders. It loves liberal tool calling and will scour the web to a fault for information to try to provide the right answer. I haven't had it tell me I was making up a fictional future or defy its prompt outlining tool use once since loading it. That model has been a dream to work with in day to day use compared to Gemma 4.
WhoRoger@reddit
Gemma (E4B but also the other versions from what I've seen people say here) seems to think it's a bigger model than it is, so it tends to overreach.
But I'm finding it has a lot of insight and self-correction. E.g. if it doesn't follow its system prompt how you'd expect, ask it why, and it'll probably tell you how to fix it. Some models are better at this than others, but Gemma seems to be able to reason really well about such stuff. So it seems more suited as a partner in crime than just a tool-calling robot hah. It needs a different approach.
hailnobra@reddit
I may give it another shot in my sandbox environment, but I have gone through multiple adjustments and system prompt rewrites both with Gemma's recommendations and further tuning with Gemini Pro and every time it makes strange decisions that lead it to ignore the system prompt or it gets confused and starts messing with the chat template as it goes deeper into conversations. I will then ask it what it was doing and if it understands its system prompt. It will acknowledge that it did not follow the prompt properly, tell me it can clearly see what it was told and that it violated the rule, give me advice to change it, and then find a new and fun way to violate it again. This really happens in longer prompts or when it actually uses a tool, so I feel like this is related to it not holding its system prompt in memory and then just doing what it wants.
And this was all with the Gemma 4 26B MoE model; I have no idea what E4B would be like in this scenario. I am also wondering if the Q8 quant is making it overthink and decide that it knows more than it really does. May try stepping the model down to Q6 or even Q5 Unsloth based on what this chart shows, to see if adherence is better because it stops overthinking and talking itself out of listening.
WhoRoger@reddit
Maybe try someone else's quant completely, I'm finding different Gemmas behaving quite differently, it may be super sensitive in strange ways. I've seen something similar with Phi.
But maybe it really just isn't suited to your workflow, since it is quite opinionated and people do report issues with tool calling. But it works for some, so... Idk.
ZBoblq@reddit
It's funny Gemini talks about a fictional future ALL the time, must be a google thing.
-Ellary-@reddit
It is funny how all those tests favor only Unsloth Qs, showing them dominating the charts,
but my own tests show that Bartowski Qs perform just the same and are usually way more stable.
Top-Rub-4670@reddit
Unsloth optimizes for KLD so naturally a chart showing that will make them look good, especially if you adjust the log to exaggerate the difference around the key quants (Q4).
But it's debatable how much impact a lower KLD has in practice, and I've also had better luck with bartowski. His quants are more predictable in terms of speed and output quality.
That being said I am very impressed how much quality unsloth manages to keep at lower quants so they'd be the first one I try if I intended to run a Q2 or Q3.
-Ellary-@reddit
Charts nowadays, eh.
WhoRoger@reddit
Lol nailed it. These charts...
DistanceSolar1449@reddit
Bartowski is like 30% faster for 5% less performance. For small models where I want speed, he makes more sense. If I'm using a big model for max intelligence like Qwen 3.5 27b, then I go with Unsloth.
Glittering-Call8746@reddit
So for Gemma 4 models you prefer Bartowski?
DistanceSolar1449@reddit
Bartowski's always done better with Gemma.
He had the best Gemma 3 quant as well: https://www.reddit.com/r/LocalLLaMA/comments/1k6nrl1/i_benchmarked_the_gemma_3_27b_qat_models/
yoracale@reddit
Everyone optimizes their quants for KLD though (as it is the community standard), not just us, so I don't see how it could only make our quants look good. Also, everyone adjusts the log scale like we do, so I don't really see how we exaggerate it. E.g. oobabooga also did benchmarks for Gemma 4 26B and this is how it's usually represented / done: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
yoracale@reddit
That's awesome to hear, and glad you like Bartowski's quants - his are also top tier. But in general, mean KLD is the community standard for quants and what nearly all uploaders/providers optimize for. If you want another measurement, oobabooga did benchmarks on real-world use-case datasets including conversation, coding, etc: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
IrisColt@reddit
Interesting, thanks, I tend to use Bartowski's too.
WhoRoger@reddit
What is the Y axis? I'm confused. The bottom can't be 0 if that's log scale, or we'd be talking in KLD in the millions lol. I'm guessing the differences are much smaller than the graph implies.
Far-Low-4705@reddit
UD-IQ2_XXS at 9 GB is a better quant than Q4_K_M from ggml-org at 16 GB.
This is crazy stuff.
I remember the days when a Q3 or even some Q4 quants would produce completely garbled outputs.
po_stulate@reddit
They never disclosed what dataset is used to calibrate their quants and which dataset is used to measure KLD. If both use the same dataset, then it really just means that their quants are better for the content in that specific dataset - it is essentially benchmaxxing.
yoracale@reddit
oobabooga did benchmarks for Gemma 4 26B on a KLD dataset built from real-world use cases, including conversation and coding, and Unsloth still performs vastly better. See: https://localbench.substack.com/p/gemma-4-26b-a4b-gguf-quality-benchmark
We tested on wikitext etc., which actually puts our quants at a disadvantage, as our imatrix calibration datasets don't contain any of it.
po_stulate@reddit
Yes, the point is that Unsloth never disclosed what datasets they used, so we can't really know how to interpret a graph like this - not how good a benchmark score any individual conducted is.
Far-Low-4705@reddit
The second you publish your data set, others will include it in their quants, then when you run your benchmarks it will look like yours is severely underperforming when it’s actually not.
Keeping a data set private is one of the only ways to ensure high fidelity data without data leakage.
WhoRoger@reddit
Kinda wondering if imatrix is even worth it at this point, or if everyone should cook their own based on what data is important to them.
altomek@reddit
This just shows how useless KLD is for quant comparison.
I think old models at Q3 will still produce completely garbled outputs even if you quantize them today.
Far-Low-4705@reddit
It’s not perfect, primarily because it’s only tested at a depth of 256 or 2k tokens. But it is the best we have and saying it’s not useful is completely wrong.
IMO, it should be tested at MUCH deeper depths, but it is quite expensive to run so that’s not really practical
Hobbster@reddit
I'd like to point out that "performance" and "KLD performance" are not the same thing. So while I appreciate all the work and all the contributions, this comes across as a marketing statement that's a little too bold.
yoracale@reddit
According to Google: "Performance is often evaluated by efficiency, quality, and results"
Performance can mean a lot of things: execution of the quantization, or how fast, optimized, or accurate the quant is relative to others. In this case it is KLD performance, which we explained in the descriptions.
I don't see how it can be a marketing statement that is too bold. Especially when you could use the same performance word if you're comparing speed or accuracy which mean very different things, yet the word makes sense in both contexts.
I think it would be misleading to write 'accuracy' or 'intelligence' benchmarks but in this case, performance makes total sense.
Hobbster@reddit
I am referring to the sentence "This makes Unsloth the top-performing in x of y sizes", which is simply not true as stated - you have to be specific. You have pointed out, with your own quote of the definition of the word performance, that it is evaluated by different values. This sentence generalizes KLD performance to overall performance, which is not correct. I myself have experienced tradeoffs in processing time in UD quants vs. quants of Bartowski, for example - which is a performance measure. I have not claimed that my performance view is more important than others (another is reliability and stability). I pointed out that the generalization to "top performing" is less valid than written. That is a technique usually used in marketing - especially hiding the specifics in descriptions that don't get read as intensely as bullet points, instead of writing it plainly in sight as a direct adjective.
As you should know as this is the centerpiece of everything AI: context is important.
yoracale@reddit
The two sentences above it state KLD, and the sentence after the phrase you quoted states it's related to KLD. I think we're being a bit too nitpicky here. But nevertheless, we'll be more specific next time, so thanks for your feedback.
Hobbster@reddit
Sorry, it wasn't meant to offend or attack, it's just that I read it that way with a certain experience of many recent tests (and a lot of years in companies nitpicking such phrases extensively). I tried to minimize the confrontational part with "little bit" etc. And giving feedback is the only way we learn and improve. Again, I really appreciate everything you do!
yoracale@reddit
No it's totally fine not offended at all (just wish you could've worded your first comment differently but it's fine). Appreciate the feedback we'll take it into consideration next time! 🙏
uti24@reddit
This chart is useful, can you build one with best Gemma curve and best Qwen curve?
BitGreen1270@reddit
Noob here. I somehow got 26B working yesterday with ggml. Sorry, but what are the key takeaways here? Unsloth 26B is better? In speed, reasoning, or both? Thank you!
yoracale@reddit
Well according to mean KLD, lower is better.
BitGreen1270@reddit
Thank you. Strangely this one doesn't work for me. I have an onboard 780M with shared 32GB memory and I get ~19 t/s on ggml but unsloth takes forever to load and when it does, doesn't respond to prompts. Commands I'm using to run both:
./llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF --fit on
./llama-cli -hf ggml-org/gemma-4-26B-A4B-it-GGUF --fit on
yoracale@reddit
Fit on isn't needed anymore btw. It could be because the model is like 2GB bigger than ggml's one. Have you tried a smaller version?
BitGreen1270@reddit
Thanks - yea, actually even the ggml version keeps dying after a while. I switched to `bartowski/google_gemma-4-26B-A4B-it-GGUF:IQ4_XS` and that ran much much better, but eventually died as well. Is there a way for me to prevent llama.cpp from restricting max memory usage?
Educational_Rent1059@reddit
Awesome work and good insight, thanks for your efforts
danielhanchen@reddit (OP)
Thank you!
ArtyfacialIntelagent@reddit
Did the Q6_K and Q6_K_XL points get mislabeled? The graph shows that Q6_K > Q6_K_XL in terms of file size, but the opposite holds when checking the repo: https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/tree/main
Also, however they're labeled, the larger quant has a worse KLD and is the only point off the Pareto frontier. Do you have any explanation for this?
So long and thanks for all the quants! :)
yoracale@reddit
Yes, you're correct, it was a hallucination from GPT, it's supposed to be swapped
KingGongzilla@reddit
goat
Hipponomics@reddit
Nice results! It would be interesting to see how they compare to the quants in ik_llama.cpp
LeonTheTaken@reddit
This is mean KLD?
yoracale@reddit
Yes that's correct.
LeonTheTaken@reddit
Why is it so big compared to Qwen3.6?
Complete_Instance_18@reddit
This is super useful, thanks for putting in the work!
yoracale@reddit
Thanks for the support!
jadbox@reddit
Q5_K_S looks pretty solid for its size and Mean KLD
nickm_27@reddit
I noticed a slowdown running Q4_K_XL a little over a week ago, went from 110 to 90 tok/s, haven’t done much direct testing to see if it was the updated GGUF or something in llama.cpp itself, just curious if this is a known thing.
danielhanchen@reddit (OP)
Oh ok was going to check but if you found the issue that's great!
annodomini@reddit
Would it be possible to better label the vertical axis? It has exactly one label, 10^0 (more commonly known as 1), so it's unclear what the other lines mean. Of course 0 would be the original bf16 model, but it's hard to say exactly where that falls on this chart.
It's not super important in a single chart as it's really the relative placement that matters, but it would be helpful to have the axis labeled if comparing between different charts, and I'm curious exactly where 0 is.
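For reference, something along these lines would put readable labels on the minor ticks of a log axis. This is a rough matplotlib sketch with made-up (size, KLD) values, on the assumption that the plot is generated with matplotlib:

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker

# Made-up (disk size, mean KLD) points purely to demonstrate axis labelling.
sizes = [9.0, 13.4, 14.6, 16.4, 21.0]
kld = [0.25, 0.03, 0.02, 0.015, 0.005]

fig, ax = plt.subplots()
ax.plot(sizes, kld, marker="o")
ax.set_yscale("log")

# Label both major and minor ticks so the reader sees actual values
# instead of a single 10^0 label.
ax.yaxis.set_major_formatter(mticker.FormatStrFormatter("%g"))
ax.yaxis.set_minor_formatter(mticker.FormatStrFormatter("%g"))

ax.set_xlabel("Disk size (GB)")
ax.set_ylabel("Mean KLD (log scale)")
plt.show()
```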
danielhanchen@reddit (OP)
Oh yes haha my script broke
Technical-Earth-3254@reddit
IQ4 XL? Sounds perfect!
danielhanchen@reddit (OP)
Yes a new one!
StupidScaredSquirrel@reddit
Thank you for everything you do!
danielhanchen@reddit (OP)
Thanks!
false79@reddit
I like Unsloth models. I really do.
But Gemma4 ones I find unstable.
Been using bartoski's releases and i find I have not had to restart llama.cpp.
yoracale@reddit
Hello, may I ask what issues you faced with Gemma 4? We showcase how Gemma 4 E4B, even at 4-bit quantization, performs tool calling very well.
false79@reddit
My setup:
-gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
-gemma-4-31B-it-Q4_K_M.gguf as well
-llama.cpp (b8728) 2026-04-16
-AMD 7900XTX 24GB
After extensive code generation with Cline, the LLM would crash and I would get a dialog asking if I would like to send the report to AMD. I am only able to repro this with this specific Unsloth build.
I've used a number of other Unsloth models, absolutely flawless Qwen everything.
This is one of my bat files that I use to launch:
"%\~dp0%SERVER%" \^
--model "%MODEL%" \^
--mmproj "%MMPROJ%" \^
--host 0.0.0.0 \^
--port %PORT% \^
--alias "%ALIAS%" \^
--temp 1.0 \^
--top-p 0.95 \^
--top-k 64 \^
--ctx-size 128000 \^
--reasoning off \^
--cache-ram 4096 \^
--jinja \^
--verbose
yoracale@reddit
Mmm, very interesting, thanks for sharing. Could it be that you tried the Unsloth quants first, before all the llama.cpp stability bug fixes, which caused the many crashes, and then tried Bartowski's after everything was fixed?
Do the Unsloth quants still crash? It might also be because ours is slightly larger. We don't do anything different from Bartowski apart from upcasting a few different layers etc., so it really shouldn't affect the tooling you're using at all.
false79@reddit
I know that when Gemma 4 got released, Google had to do another follow-up release within a few weeks. And yeah, llama.cpp had similar issues too; they had to do follow-up re-releases.
With each announcement, I always kept the latest of the two up to date.
Bartowski doesn't offer any UD quants. I'm guessing maybe that might be a differentiator.
yoracale@reddit
There's no difference between UD and standard quants other than the layer selection and imatrix dataset. The instability issues may very well be because you first used the Unsloth quants before all the bugs etc. got fixed.
AvidCyclist250@reddit
Same, first time though. But I find Gemma 4 to be borderline unusable; it can't use web MCP.
yoracale@reddit
It's most likely the tooling you're using, as we showcase how well Gemma 4 at 4-bit (or even Qwen3.6 at 2-bit) can utilize tools.
AvidCyclist250@reddit
Isn't llama.cpp the backend here as well?
yoracale@reddit
Yes, that's correct, but with many extra features like self-healing and optimized tool calling, etc.
Turbulent_Pin7635@reddit
Unsloth and lm-community are love
coder543@reddit
I wish we could see a chart of some actual benchmark (or a composite benchmark composed of several benchmarks) versus model size across all of these quantizations.
yoracale@reddit
Everyone does KLD benchmarks; it is the standard and seems to be what the r/LocalLLaMA audience wants. Doing more benchmarks can be an issue because it is very time-consuming and expensive.
coder543@reddit
Everyone does them because they are easy, yes. But what is the correlation between KLD and SWE-Bench? How much KLD is too much?
entsnack@reddit
I love Unsloth, but when I read a research paper by the authors of X claiming X is SOTA, I always pay attention to how close the second-best method (which the authors would not have optimized) comes.
Top-Rub-4670@reddit
A purveyor of fine quants since October 2023.
entsnack@reddit
gotta check em out, Unsloth was the only one I knew of and used.
Prize_Negotiation66@reddit
Bruh, why is mradermacher so low... Is it because of MoE? Didn't you test his static quants?
In oobabooga's 31B comparison there wasn't much difference.
Top-Rub-4670@reddit
mradermacher produces so many fucking quants it's hard to keep track.
They have 62,607 at the time of writing this comment. If I search for 26-A4B they have 23. And I count 4 (?) of those that are based on the base model from google. So, which one do you want unsloth to test?
Velocita84@reddit
Well well well APEX quants aren't so apex are they?
danielhanchen@reddit (OP)
Oh it's a combination of all!
Velocita84@reddit
I looked into your documentation a little bit and, if I understand correctly, your model-specific datasets are instruct-formatted, while Bartowski and presumably most everyone else still use the raw calibration data v5.
I had messed around with Gemma 4 26B on llama-perplexity about a week ago and noticed the model completely freaks out when processing non-instruct text, like with perplexity in the thousands. Instruct sequences turn out fine instead, though ones not generated by Gemma itself still get pretty high PPL, in the 70s. I haven't observed behavior this extreme with other models that I had tested, so I'm pretty confident it's specifically a Gemma problem.
Could it be that using a non-instruct calibration dataset is much more detrimental to Gemma than to other models?
audioen@reddit
The model only inferences correctly if it is in a valid state, e.g. the input follows its expected chat template. Doing anything else can, at least in part, optimize the model for a state it is not normally in. And yes, I saw the 1000+ perplexity values as well, which indicate the model is not being fed correctly formatted input. This is getting more and more important with KL divergence evaluations, imatrix, and perplexity. People have to sit up, take notice, and do this right.
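As a rough sketch of what "correctly formatted" means here: wrap the plain calibration/eval lines in chat turns before feeding them to perplexity or imatrix tooling. The turn markers below are an assumption based on earlier Gemma releases, and the file names are placeholders; the real template should come from the model's own tokenizer/chat-template metadata rather than being hard-coded.

```python
# Wrap each plain calibration line in (assumed) Gemma-style chat turns so the
# instruct model is evaluated in the state it actually expects.
def to_chat_format(user_text: str, model_text: str) -> str:
    return (
        f"<start_of_turn>user\n{user_text}<end_of_turn>\n"
        f"<start_of_turn>model\n{model_text}<end_of_turn>\n"
    )

# Placeholder file names; adapt to whatever text you feed llama-perplexity/imatrix.
with open("calibration_raw.txt") as f_in, open("calibration_chat.txt", "w") as f_out:
    for line in f_in:
        line = line.strip()
        if line:
            f_out.write(to_chat_format("Continue this text.", line))
```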
hdmcndog@reddit
Is the data available as a table, too?
That would make it a bit easier to analyse for me.
Also for Qwen3.6?
ikkiho@reddit
kld is the right metric here — perplexity deltas undersell how much quantized outputs actually drift on longer generations. the ud-iq4_nl_xl at 14.6gb is a really pragmatic sweet spot for 16gb cards running a4b architectures without swapping, that's the exact niche most 4070/4080 users are stuck in.
mr_Owner@reddit
Amazing!
At what ctx sizes were these benchmarks done?
ea_man@reddit
> We're also introducing a new UD-IQ4_NL_XL quant that fits in 16GB VRAM. UD-IQ4_NL_XL (14.6GB) sits between UD-IQ4_XS (13.4GB) and UD-Q4_K_S (16.4GB). The same was done for Qwen3.6.
That's a nice thing to do, providing specific quants so that people can get the best for their GPU.
Please also target 12GB with a little less headroom, e.g. ~11.1GB.
yoracale@reddit
Thanks, we'll probably create one for the smaller ones too!
putrasherni@reddit
Come on now, KLD isn't the only metric out there.
yoracale@reddit
No it's not, but it's what the community wants and it's a relatively decent metric!! Also it's much cheaper + faster to run than other benchmarks.
AuspiciousApple@reddit
That's a very nice result. However, how expensive would it be to run benchmarks and compare benchmark performance? KLD is a technical proxy for faithfulness, but I don't really care if the model phrases a sentence slightly differently if it writes correct code.
ikmalsaid@reddit
Could you create a single graph comparing the Unsloth quants for Gemma 4 and Qwen 3.6? Thanks!
PaceZealousideal6091@reddit
Why? What's the point of comparing kld of quants of 2 different models?
Extra-Organization-6@reddit
good benchmarks. the MoE approach at 26B with only 4B active params means this should run comfortably on consumer hardware. curious how it compares to qwen 35B-A3B in practice since they are targeting similar use cases with similar active param counts.
Chromix_@reddit
It could be interesting to see if it'd be possible to ~~tune~~ quantmaxx a quant solely on "same top 1 token". Such a thing would perform horribly with the recommended settings, but probably better at temperature 0. After all that's where BF16 and quantized models start to deviate rather quickly - and more noticeably - when running like that.
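A rough sketch of what such a "same top-1 token" metric could look like, given per-token logits from both models. The arrays are placeholders; this is not an existing llama.cpp feature:

```python
import numpy as np

def top1_agreement(bf16_logits, quant_logits):
    """Fraction of positions where the quantized model's greedy (argmax) token
    matches the BF16 model's - a temperature-0 view of quant fidelity."""
    return float((bf16_logits.argmax(axis=-1) == quant_logits.argmax(axis=-1)).mean())

# Toy run with placeholder logits.
rng = np.random.default_rng(0)
p = rng.normal(size=(2048, 1000))
q = p + rng.normal(scale=0.05, size=p.shape)
print(f"top-1 agreement: {top1_agreement(p, q):.2%}")
```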
danielhanchen@reddit (OP)
That is generally in line with KLD 99.9% :)
qfox337@reddit
Would it make sense to include inference speed benchmarks (I realize there's a big question of "on which hardware"), or is there usually little difference / performance impact of kernels for different compression schemes?
danielhanchen@reddit (OP)
Yes, that is a future vector we'll optimize on, but for now the IQ quants are generally slower than comparable Q ones.
a_slay_nub@reddit
Why does the chart only have 1 label on the y axis?
danielhanchen@reddit (OP)
Oop I need to fix that
reto-wyss@reddit
Have you performed an analysis on how KLD plays out in quants (across the spectrum) of newer models vs quants of older models?
Models get better and better at the same parameter size, so a reasonable hypothesis is that divergence is higher in newer models than in older models at the same quant level.
korino11@reddit
Where is APEX? I think APEX will beat unsloth))
danielhanchen@reddit (OP)
APEX quants are Mudler's - I already benchmarked all the APEX ones - they're worse than Q8_0 and sit above Bartowski's Pareto frontier.
korino11@reddit
It gives better output than all the others, that's the point!
grumd@reddit
Provide benchmarks and proof. Unsloth does actual benchmarks; going just by how you feel about a model is not constructive.
korino11@reddit
So you are too lazy to check all the benchmarks that already EXIST on its pages? Maybe you need to check them before writing such questions?
danielhanchen@reddit (OP)
To add extra issues - LLM evaluations use temperature, so this will definitely change results.
One has to either set temp = 0 or use 10 set seeds for a fairer comparison.
danielhanchen@reddit (OP)
No - KLD 99.9% and KLD mean are gold standard quantization metrics - being "better" on singular runs doesn't mean it's better - one has to run it over 10 times.
KLD mean and 99.9% KLD, on the other hand, are deterministic metrics comparing the BF16 logits and the quantized logits.
If the distance between the quantized model and the BF16 logits is quite far (i.e. higher KLD), then you're not getting a faithful quantization but something else.
FrostyDwarf24@reddit
been loving Gemma 4 so far
Long_comment_san@reddit
This is pretty close to a breakthrough unless I'm reading it diagonally. Not so much with Gemma 4, but it should be a massive deal on larger models in the 300B+ range that are commonly crushed to something like Q1.