It looks like we’ll need to download the new Gemma 4 GGUFs
Posted by jacek2023@reddit | LocalLLaMA | View on Reddit | 110 comments
https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF
by u/danielhanchen:
We just updated them again in response to:
- kv-cache : support attention rotation for heterogeneous iSWA https://github.com/ggml-org/llama.cpp/pull/21513
- CUDA: check for buffer overlap before fusing - CRITICAL fixes <unused24> tokens https://github.com/ggml-org/llama.cpp/pull/21566
- vocab : add byte token handling to BPE detokenizer for Gemma4 https://github.com/ggml-org/llama.cpp/pull/21488
- convert : set "add bos" == True for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21500
- common : add gemma 4 specialized parser https://github.com/ggml-org/llama.cpp/pull/21418
- llama-model: read final_logit_softcapping for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21390
- llama: add custom newline split for Gemma 4 https://github.com/ggml-org/llama.cpp/pull/21406
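For anyone re-pulling the refreshed files, a minimal sketch using huggingface-cli (the repo name comes from the links above; the filename pattern is only an example, match whatever quant you actually use). Several of the listed PRs are runtime-side, so a llama.cpp rebuild is needed as well:
# fetch only the updated quant from the 26B repo (pattern is illustrative)
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF --include "*Q4_K_M*" --local-dir ./gemma-4-26B-A4B-it-GGUF
# rebuild llama.cpp so the runtime-side fixes are picked up
git pull && cmake -B build && cmake --build build --config Release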
Skyline34rGt@reddit
Better question, do we need new heretic versions too + quant of them
Corosus@reddit
Trickle down quantinomics.
Anjz@reddit
Quantflation
Maleficent-Ad5999@reddit
Heretic abliterated Claude opus distilled would be fine
Dwansumfauk@reddit
Probably just the quants since training doesn't use llama.cpp
-p-e-w-@reddit
Heretic uses Transformers, so the safetensors files made with Heretic should be fine. They only need to be re-quantized to GGUF because the llama.cpp implementation of Gemma 4 had a few bugs initially.
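Re-quantizing from those safetensors is just the standard llama.cpp conversion path; a rough sketch, assuming the fixed llama.cpp build, with paths and quant type as placeholders:
# convert the Heretic safetensors checkpoint to an f16 GGUF
python convert_hf_to_gguf.py ./gemma-4-heretic --outfile gemma-4-heretic-f16.gguf --outtype f16
# quantize with the patched build
llama-quantize gemma-4-heretic-f16.gguf gemma-4-heretic-Q4_K_M.gguf Q4_K_M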
Curious-Still@reddit
Do the bartowski versions need updating too or just unsloth?
noneabove1182@reddit
I don't think mine desperately need updates since my most recent upload; many people told me mine were the only ones behaving normally after I uploaded the post-tokenizer fix
I'll be investigating with latest changes, but I'll take my time with it so that people don't have to redownload too many times when it's already in a decent state
IrisColt@reddit
Get my insta-upvote, king.
Vytennis@reddit
you are the GOAT! kudos from Lithuania
fettpl@reddit
I'm just taking the opportunity to say you are doing an amazing piece of work. Thank you.
jld1532@reddit
Good man
danielhanchen@reddit
Yes everyone needs to reconvert if they used imatrix since the activation patterns are now different
ecompanda@reddit
the note about activation patterns changing with imatrix is the part that actually matters here, not the BOS token flag. BOS is runtime configurable but a stale imatrix means the quantization itself is optimized for the wrong patterns.
for people running the 31B: is the ppl difference large enough to justify the redownload, or is this mostly a correctness fix that only shows up in edge cases
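For context, regenerating an imatrix-based quant looks roughly like this (calibration file, filenames and quant type are placeholders); the key point is that the importance matrix has to be recomputed on the fixed runtime before quantizing:
# recompute the importance matrix with the patched llama.cpp
llama-imatrix -m gemma-4-26B-A4B-f16.gguf -f calibration.txt -o imatrix.dat
# quantize using the fresh imatrix
llama-quantize --imatrix imatrix.dat gemma-4-26B-A4B-f16.gguf gemma-4-26B-A4B-IQ4_NL.gguf IQ4_NL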
ArtArtArt123456@reddit
i didn't download the previous ones but i plugged this into one of my qwen vision workflows as-is and it worked right out of the box and was much better at the task too. pretty pleasantly surprised here.
chickN00dle@reddit
Personally I found that Qwen3.5 was better at vision tasks / visual understanding compared to Gemma 4 by quite a margin. Both MOE versions at q5_k_m.
Top-Rub-4670@reddit
I found the same on my side. Both Qwen 3.5 9B and 35B-A3B do better than Gemma 4 26B-A4B in my vision tests.
When I give hints to Gemma 4 it seems to understand as well as Qwen 3.5, but it's not clear if it's just being a sycophant or if vision is good but hindered by bad initial communication.
nickm_27@reddit
Unfortunately it seems Gemma4 was not trained on video, so it's quite bad at temporal tasks
WhoRoger@reddit
The smaller models should know video, afaik
nickm_27@reddit
Sorry, not sure what you mean; my comment was about what the vision was trained on, not whether it functions at all
WhoRoger@reddit
Oops my bad
RanklesTheOtter@reddit
I literally just finished a fine tune, time to start over. 🤬
jacek2023@reddit (OP)
you finetune gguf?
RanklesTheOtter@reddit
No haha, I just can't read English. I thought Google had broken the original model and uploaded a fix. 😅
I guess I'm ok then.
signal_overdose@reddit
Just use bartowski's GGUFs instead so you don't have to update your models every week...
Most popular does not mean best quality...
Iory1998@reddit
I see that even the 31B was updated
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main
the-orange-joe@reddit
Is the Q8 quant not affected? It has not been updated.
FrodeHaltli@reddit
The Q8 quant was the only one that wouldn't spit out gibberish for me.
grumd@reddit
You using cuda 13.2?
FrodeHaltli@reddit
I am on 12.4
Dany0@reddit
Can't we get the MTP from LiteRL?
segmond@reddit
No biggie, I now expect to download any new model 3x-5x before it becomes stable. If it's a big one, I usually wait about a week. For example, I'm waiting till the weekend before I begin downloading GLM5.1
Dany0@reddit
Wish LM studio would automatically notice this
Final-Rush759@reddit
GLM5.1 is 5.0 with more post training.
shockwaverc13@reddit
this is the llama 3 tokenizer issue all over again
----Val----@reddit
When you live on the bleeding edge, expect to get cut.
Borkato@reddit
Caffdy@reddit
wtf, this meme is too deep-fried for my fried-mind. Would love some help here explaining it hahaha
Borkato@reddit
Hahaha it’s a reaction image basically saying “this must be the worst comment ever” but phrased as “is this the worst comment ever?”. And then it erases worst and puts best lol
MoffKalast@reddit
This is why I wait at least 2 weeks before trying out models.
AlwaysLateToThaParty@reddit
Without knowing the architecture before it's released, this will always be an issue. Anyone can always use the models directly from source. These inference handlers, like llama.cpp, exist to present a common interface, but how that interface connects to the models depends on the architecture of those models. If you want to use a common interface for different models, the interface will need to be updated when a model architecture adds new features.
SnooPaintings8639@reddit
I feel like there is "something" that needs fixing in GGUFs a couple of days after each new batch of models drops. It seems like the early tests and opinions of any model are rarely objective. It's just wise to wait a week and let all the technical evaluations finish.
Borkato@reddit
And the qwen 3.5 issue all over again
jinnyjuice@reddit
What was it, exactly?
sersoniko@reddit
And the GPT-OSS issue all over again
some_user_2021@reddit
The user is criticizing OpenAI's policies, this is against my guidelines. I'm sorry, I can't help you with that. I cannot tolerate hate speech and violence.
Borkato@reddit
Lol violence
xXprayerwarrior69Xx@reddit
a_beautiful_rhind@reddit
So I should reconvert the 31b as well?
danielhanchen@reddit
It's ongoing!
Flashy_Management962@reddit
Thank you for your work btw! One little question though: is it normal that I get this perplexity with your IQ4_NL Gemma 26B: "Final estimate: PPL over 576 chunks for n_ctx=512 = 26296.2393 +/- 532.75059"? With the bartowski one I get around 200.
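For anyone wanting to reproduce the comparison, numbers like that come from llama-perplexity; a sketch with a placeholder model filename and whatever test text you prefer (e.g. wikitext-2):
llama-perplexity -m gemma-4-26B-A4B-it-IQ4_NL.gguf -f wiki.test.raw -c 512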
jacek2023@reddit (OP)
I am trying to understand why the listed changes require new GGUFs and why the 31B is not updated, but I see the GGUFs for 26B and E2B are new
mtomas7@reddit
I noticed that with the last llama.cpp that was shipped with LMStudio 0.4.9.1, the <|think|> token stopped working per your manual.
Polaris_debi5@reddit
All the models have just been updated, from E2B to 31B. They are now available on Hugging Face.
Kitchen-Year-8434@reddit
These all looked server-side to me too.
a_beautiful_rhind@reddit
Well I'm reconverting anyway.. the only difference so far is that it sets the BOS token to true. Hopefully there are other differences.
I was smart to just d/l the full model.
Borkato@reddit
Does this mean I need to download the heretic versions again in like 3 days once they finally come out
a_beautiful_rhind@reddit
For the 31b set the BOS token to true from the command line or inside the metadata.
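The command-line route is the --override-kv flag, which patches the value at load time without touching the file; a sketch with a placeholder filename (the same flag works on llama-cli):
llama-server -m gemma-4-31B-it-Q4_K_M.gguf --override-kv tokenizer.ggml.add_bos_token=bool:true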
dampflokfreund@reddit
That will affect it only if you are not using the chat template since that adds the bos already.
a_beautiful_rhind@reddit
I think on mainline llama.cpp it added it to text completions but in IK_llama it didn't.
Borkato@reddit
👀 cool, thank you! What about the other versions
a_beautiful_rhind@reddit
No idea. Only used the 31b.
Borkato@reddit
No worries! Thank you!
koygocuren@reddit
What about qwen 3.5 moe models?
Existing_Director_48@reddit
Here it shows there are two versions on unsloth studio for the 26B, one in all caps and the other in lowercase. I don't know which one to pick.
Quozul@reddit
I've set up a crontab to re-download models every day, just in case.
Ambitious_Ad4397@reddit
Does it support Turboquant?
jacek2023@reddit (OP)
TurboQuant, OpenClaw, BitCoin and Taylor Swift
Ambitious_Ad4397@reddit
What's wrong with that question?
dinerburgeryum@reddit
Lack of detail on what you're hoping to accomplish.
Ambitious_Ad4397@reddit
Use TurboQuant with Gemma to reduce the memory used by the context.
dinerburgeryum@reddit
These are llama.cpp specific changes, and llama.cpp does not support TurboQuant as an option for KV cache compression. So no, TurboQuant is not relevant to this discussion.
nickm_27@reddit
TurboQuant support isn't a property of a particular model; all models support TurboQuant if your inference provider supports it. Currently I don't believe any of them have an actual TurboQuant implementation, so asking whether a particular model supports TurboQuant is generally the wrong question.
Ambitious_Ad4397@reddit
So, for example, if ollama supported TurboQuant, then all models that ollama can work with would support TurboQuant?
nickm_27@reddit
yes, just like Q4_0, Q8_0 etc
erazortt@reddit
TurboQuant is used for the KV cache. The KV cache quant has no relation to the model being used; whether a given KV cache quant is supported depends only on the inference software's implementation.
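For comparison, the cache-type flags llama.cpp already exposes are set at run time and are model-independent; a sketch with a placeholder filename (a quantized V cache may additionally require flash attention depending on the build):
# quantize the KV cache at run time, independent of which model is loaded
llama-server -m gemma-4-26B-A4B-it-Q4_K_M.gguf --cache-type-k q8_0 --cache-type-v q8_0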
PunnyPandora@reddit
they're just redditing
learn_and_learn@reddit
I even heard that it's got electrolytes
Cool-Chemical-5629@reddit
But can it run Crysis?
MaxKruse96@reddit
in that order
a_beautiful_rhind@reddit
Hannah Montana linux!
PunnyPandora@reddit
no. wait until the schizos figure it out https://github.com/ggml-org/llama.cpp/discussions/20969
Long_War8748@reddit
This is why I wait at least 2 weeks before touching new models 😅....
popoppypoppylovelove@reddit
I downloaded gemma-4-26B-A4B-it-Q8_0.gguf yesterday, but this file wasn't updated? were Q8_0 and UD-Q8_K_XL not affected by the changes?
beneath_steel_sky@reddit
Uh oh. Waiting for /u/noneabove1182 :-)
noneabove1182@reddit
I may update them, but in practice my quants seem to have already been working very well for people, and most of these fixes seem to be runtime-related rather than conversion-time.
The only conversion-time one is the BOS token, and even that one is confusing, with people reporting results in both directions (improvement and detriment).
mark_haas@reddit
daniel said the activation patterns have changed so he had to redo the imatrix ones, does this affect your quants?
noneabove1182@reddit
That was mainly the tokenizer issues that I already have fixed in my latest uploads
I'll definitely be giving it a look to double check, but for now I've received numerous reports that mine were the only ones working. By now maybe others work too, but I'm pretty confident mine are fine for now, and I can take a bit more time to double check everything before spamming uploads.
mark_haas@reddit
thanks
WhoRoger@reddit
Is this why my E4B keeps invalidating the KV cache and reloading context? The console output said something about SWA. I've no idea what's going on :P
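If it is the iSWA cache being dropped on context shifts, one hedged workaround is llama.cpp's full-size SWA cache flag, which trades extra memory for not reprocessing context; a sketch with a placeholder filename, assuming a recent build:
# keep a full-size cache for the sliding-window-attention layers
llama-server -m gemma-4-E4B-it-Q4_K_M.gguf --swa-full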
c64z86@reddit
Sorry if this is answered somewhere and I just don't understand it, but do the new versions fix the bug where Gemma 4 just outright stops responding after a while?
Mashic@reddit
Do these changes come with any speed, memory, or accuracy improvements?
Ok_Mammoth589@reddit
Nope, they're just twiddling bits for no reason
__Captain_Autismo__@reddit
Not sure what's been changing but at first they were unusable in my harness and now these models are #1 for web dev work out of what I tried locally. Just did an internal shootout yesterday.
This was before the newest update.
Awesome to have tools that level up without having to do anything.
Surprisingly, the 31B at Q8 did better than bf16.
dampflokfreund@reddit
No, we do not need to download new GGUFs. These PRs are just on the inference side; even number 4 is still compatible with old GGUFs.
danielhanchen@reddit
I wish that were the case, but I had to redo the imatrix ones since the activation patterns have changed.
FluoroquinolonesKill@reddit
What about the non-imatrix ones? Do those need to be re-downloaded? E.g., UD-Q8_K_XL and UD-Q4_K_XL.
gigaflops_@reddit
I feel like in the age of multi-gigabit internet connections, it's usually easier for most people to redownload the whole thing
Velocita84@reddit
For the sake of being thorough the BOS thing can also be solved by a simple metadata edit
uvx --from gguf gguf-set-metadata [path to model] tokenizer.ggml.add_bos_token true
segmond@reddit
yeah, we need to download new GGUFs, they regenerated them and improved them.
Borkato@reddit
Oh, this is lovely if true
Corosus@reddit
Latest llama.cpp
Latest fresh downloaded gemma-4-26B-A4B-it-UD-Q4_K_M.gguf
Latest opencode launched in PowerShell
← Edit watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.
→ Read watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java [offset=357]
← Edit watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.
← Edit watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.
→ Read watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java [offset=350]
I'd love to get this thing to work, not sure what's wrong.
Interesting_Key3421@reddit
I don't think I have issues.. did re-downloading the weights improve the situation for you?
fyvehell@reddit
Ah shit, here we go again.
nickm_27@reddit
I believe the only change requiring a recreation of the GGUFs is the convert PR for the BOS.
simracerman@reddit
It’s nice to see our Frigate-NVR guru also interested in LLMs!
BillDStrong@reddit
There were some UD quants that were up-converting to BF16 when they weren't supposed to, so they uploaded some new ones fixing that. You might need to download those if you were affected. The new quants are now close to the same speed as the Bart quants.
That is a separate issue from the llama.cpp changes.
jmprog@reddit
Any chance you could add speculative decoding?
FrozenFishEnjoyer@reddit
I need the heretic uncensored version of this now. Anyone got the updated version?
ML-Future@reddit
Thanks!