It looks like we’ll need to download the new Gemma 4 GGUFs

[-]

Skyline34rGt@reddit

Better question, do we need new heretic versions too + quant of them

[-]

Maleficent-Ad5999@reddit

Heretic abliterated Claude opus distilled would be fine

[-]

Dwansumfauk@reddit

Probably just the quants since training doesn't use llama.cpp

[-]

Heretic uses Transformers, so the safetensors files made with Heretic should be fine. They only need to be re-quantized to GGUF because the llama.cpp implementation of Gemma 4 had a few bugs initially.

[-]

Curious-Still@reddit

Do the bartowski versions need updating too or just unsloth?

[-]

noneabove1182@reddit

I don't think mine need updates desperately since my most recent upload, many people told me mine were the only ones behaving normally after I uploaded post-tokenizer fix

I'll be investigating with latest changes, but I'll take my time with it so that people don't have to redownload too many times when it's already in a decent state

[-]

IrisColt@reddit

Get my insta-upvote, king.

[-]

Vytennis@reddit

you are the GOAT! kudos from Lithuania

[-]

fettpl@reddit

I'm just taking the opportunity to say you are doing a solid piece of an amazing work. Thank you.

[-]

jld1532@reddit

Good man

[-]

danielhanchen@reddit

Yes everyone needs to reconvert if they used imatrix since the activation patterns are now different

[-]

ecompanda@reddit

the note about activation patterns changing with imatrix is the part that actually matters here, not the BOS token flag. BOS is runtime configurable but a stale imatrix means the quantization itself is optimized for the wrong patterns.

for people running the 31B: is the ppl difference large enough to justify the redownload, or is this mostly a correctness fix that only shows up in edge cases

[-]

ArtArtArt123456@reddit

i didn't download the previous ones but i plugged this into one of my qwen vision workflows as-is and it worked right out of the box and was much better at the task too. pretty pleasantly surprised here.

[-]

chickN00dle@reddit

Personally I found that Qwen3.5 was better at vision tasks / visual understanding compared to Gemma 4 by quite a margin. Both MOE versions at q5_k_m.

[-]

Top-Rub-4670@reddit

I found the same on my side. Both Qwen 3.5 9B and 35B-A3B do better than Gemma 4 26B-A4B in my vision tests.

When I give hints to Gemma 4 it seems to understand as well as Qwen 3.5, but it's not clear if it's just being a sycophant or if vision is good but hindered by bad initial communication.

[-]

nickm_27@reddit

Unfortunately it seems Gemma4 was not trained on video, so temporal tasks Gemma4 is quite bad at unfortunately

[-]

WhoRoger@reddit

The smaller models should know video, afaik

[-]

nickm_27@reddit

sorry, not sure what you mean, my comment was talking about what the vision was trained on not if it functions at all

[-]

WhoRoger@reddit

Oops my bad

[-]

RanklesTheOtter@reddit

I literally just finished a fine tune, time to start over. 🤬

[-]

jacek2023@reddit (OP)

you finetune gguf?

[-]

RanklesTheOtter@reddit

No haha I just can't read English. I thought Google was broke it the original model and uploaded a fix. 😅

I guess I'm ok then.

[-]

signal_overdose@reddit

Just use bartowski's GGUFs instead so you don't have to update your models every week...

Most popular does not mean best quality...

[-]

Iory1998@reddit

I see that even the 31B was updated
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/tree/main

[-]

the-orange-joe@reddit

Is the Q8 quant not affected? It has not been updated.

[-]

FrodeHaltli@reddit

The Q8 quant was the only one that wouldn't spit out gibberish for me.

[-]

grumd@reddit

You using cuda 13.2?

[-]

FrodeHaltli@reddit

I am on 12.4

[-]

Dany0@reddit

Can't we get the MTP from LiteRL?

[-]

segmond@reddit

No biggie, I now expect to download any new model 3x-5x before it becomes stable. if it's a big one, I usually wait for about a week. For example, I'm waiting till the weekend before I begin downloading GLM5.1

[-]

Dany0@reddit

Wish LM studio would automatically notice this

[-]

Final-Rush759@reddit

GLM5.1 is 5.0 with more post training.

[-]

shockwaverc13@reddit

this is the llama 3 tokenizer issue all over again

[-]

----Val----@reddit

When you live on the bleeding edge, expect to get cut.

[-]

Borkato@reddit

[-]

Caffdy@reddit

wtf, this meme is too deep-fried for my fried-mind. Would love some help here explaining it hahaha

[-]

Borkato@reddit

Hahaha it’s a reaction image basically saying “this must be the worst comment ever” but phrased as “is this the worst comment ever?”. And then it erases worse and puts best lol

[-]

MoffKalast@reddit

This is why I wait at least 2 weeks before trying out models.

[-]

AlwaysLateToThaParty@reddit

Without knowing the architecture before it's released, this will always be an issue. Anyone can always use the models directly using source. These inference handlers, like llama.cpp, exist to present a common interface. But the connections of that interface to the models will depend upon the achitecture of those models. If you want to use a common interface for different models, the interface will need to be updated when the model architecture adds new features.

[-]

SnooPaintings8639@reddit

I feel like there is "something" that needs fixing in GGUFs couple of days after each of new models' batch drop. It seems like the early-tests-and-opinions of any model are rarely objective. It's just wise to wait a week and let it finish all the technical evaluations.

[-]

Borkato@reddit

And the qwen 3.5 issue all over again

[-]

jinnyjuice@reddit

What was it, exactly?

[-]

sersoniko@reddit

And the GPT-OSS issue all over again

[-]

some_user_2021@reddit

The user is criticizing OpenAI's policies, this is against my guidelines. I'm sorry, I can't help you with that. I cannot tolerate hate speech and violence.

[-]

Borkato@reddit

Lol violence

[-]

xXprayerwarrior69Xx@reddit

[-]

a_beautiful_rhind@reddit

So I should reconvert the 31b as well?

[-]

danielhanchen@reddit

It's ongoing!

[-]

Flashy_Management962@reddit

thank you for your work btw! One little question though, is it normal that I get with your iq4nl gemma 26b this perplexity "Final estimate: PPL over 576 chunks for n_ctx=512 = 26296.2393 +/- 532.75059" - with the bartowski i get around 200.

[-]

jacek2023@reddit (OP)

I am trying to understand why listed changes require new GGUF and why 31b is not updated, but I see GGUFs for 26B and E2B are new

[-]

mtomas7@reddit

I noticed that with the last llama.cpp that was shipped with LMStudio 0.4.9.1 the <|think|> token stopped working per your manual.

[-]

Polaris_debi5@reddit

All models have just been updated, from e2b to 31B. They are now available at huggingface.

[-]

Kitchen-Year-8434@reddit

These all looked server-side to me too.

[-]

a_beautiful_rhind@reddit

Well I'm reconverting anyway.. the only difference so far is that it sets the BOS token to true. Hopefully there are other differences.

I was smart to just d/l the full model.

[-]

Borkato@reddit

Does this mean I need to download the heretic versions again in like 3 days once they finally come out

[-]

a_beautiful_rhind@reddit

For the 31b set the BOS token to true from the command line or inside the metadata.

    --override-kv tokenizer.ggml.add_bos_token=bool:true

[-]

dampflokfreund@reddit

That will affect it only if you are not using the chat template since that adds the bos already.

[-]

a_beautiful_rhind@reddit

I think on mainline llama.cpp it added it to text completions but in IK_llama it didn't.

[-]

Borkato@reddit

👀 cool, thank you! What about the other versions

[-]

a_beautiful_rhind@reddit

No idea. Only used the 31b.

[-]

Borkato@reddit

No worries! Thank you!

[-]

koygocuren@reddit

What about qwen 3.5 moe models?

[-]

Existing_Director_48@reddit

Here shows there are two versions on unsloth studio for the 26b, one all caps other small caps. I dont know what to pick.

[-]

Quozul@reddit

I've set up a crontab to re-download models every day, just in case.

[-]

Ambitious_Ad4397@reddit

Does it support Turboquant?

[-]

jacek2023@reddit (OP)

TurboQuant, OpenClaw, BitCoin and Taylor Swift

[-]

Ambitious_Ad4397@reddit

What's wrong with that question?

[-]

dinerburgeryum@reddit

Lack of detail on what you're hoping to accomplish.

[-]

Ambitious_Ad4397@reddit

Use turboquant with Gemma, to reduce memory usage by context

[-]

dinerburgeryum@reddit

These are llama.cpp specific changes, and llama.cpp does not support TurboQuant as an option for KV cache compression. So no, TurboQuant is not relevant to this discussion.

[-]

nickm_27@reddit

A particular model does not have support for TurboQuant, all models support TurboQuant if your inference provider supports it. Currently I don't believe any of them have an actual TurboQuant implementation. So asking if a particular model supports TurboQuant is a generally incorrect question to ask.

[-]

Ambitious_Ad4397@reddit

So, for example, If ollama would support TurboQuant, then all models that ollama can work with would support TurboQuant?

[-]

nickm_27@reddit

yes, just like Q4_0, Q8_0 etc

[-]

erazortt@reddit

TurboQuant is used for the KV cache. There is not relation of the KV cache quant to the model used. If KV chache is supported or not is only dependent by the implementation of inference software.

[-]

PunnyPandora@reddit

they're just redditing

[-]

learn_and_learn@reddit

I even heard that it's got electrolytes

[-]

Cool-Chemical-5629@reddit

But can it run Crysis?

[-]

MaxKruse96@reddit

in that order

[-]

a_beautiful_rhind@reddit

Hannah Montana linux!

[-]

PunnyPandora@reddit

no. wait until the schizos figure it out https://github.com/ggml-org/llama.cpp/discussions/20969

[-]

Long_War8748@reddit

This is why I wait at least 2 weeks before touching new models 😅....

[-]

popoppypoppylovelove@reddit

I downloaded gemma-4-26B-A4B-it-Q8_0.gguf yesterday, but this file wasn't updated? were Q8_0 and UD-Q8_K_XL not affected by the changes?

[-]

beneath_steel_sky@reddit

Uh oh. Waiting for /u/noneabove1182 :-)

[-]

noneabove1182@reddit

I may update them but in practice my quants have seemed to already been working very well for people, and most of these fixes seem to be run-time related rather than conversion

only one that's conversion-time is the BOS token and even that one is confusing with people reporting results in both directions (improve and detriment)

[-]

mark_haas@reddit

daniel said the activation patterns have changed so he had to redo the imatrix ones, does this affect your quants?

[-]

noneabove1182@reddit

That was mainly the tokenizer issues that I already have fixed in my latest uploads

I'll definitely be giving it a look to double check but for now I've received numerous reports that mine were the only ones working, by now maybe others work too but I'm pretty confident mine are fine for now and I can take a bit more time to double check everything before spamming uploads

[-]

mark_haas@reddit

thanks

[-]

WhoRoger@reddit

Is this why my E4B keeps invalidating the KV cache and reloading context? The console output said something about SWA. I've no idea what's going on :P

[-]

c64z86@reddit

Sorry if this is answered somewhere and I just don't understand it, but do the new versions fix the bug where Gemma 4.just outright stops responding after a while?

[-]

Mashic@reddit

Do these changes come with any speed, memory, or accuracy improvements?

[-]

Ok_Mammoth589@reddit

Nope, they're just twiddling bits for no reason

[-]

__Captain_Autismo__@reddit

Not sure what's been changing but at first they were unusable in my harness and now these models are #1 for web dev work out of what I tried locally. Just did an internal shootout yesterday.

This was before the newest update.

Awesome to have tools that level up without having to do anything.

Surprisingly 31b at q8 did better than b16

[-]

dampflokfreund@reddit

No we do not need to download new GGUFs. These PRs are just on the inference side, even number 4 is still compatible with old ggufs.

[-]

danielhanchen@reddit

I would have wished if this was the case, but for imatrix ones I had to redo them since the activation patterns have changed

[-]

FluoroquinolonesKill@reddit

What about the non-imatrix ones? Do those need to be re-downloaded? E.g., UD-Q8_K_XL and UD-Q4_K_XL.

[-]

gigaflops_@reddit

I feel like in the age of multi-gigabit internet connections, it's usually easier for most people to redownload the whole thing

[-]

Velocita84@reddit

For the sake of being thorough the BOS thing can also be solved by a simple metadata edit uvx --from gguf gguf-set-metadata [path to model] tokenizer.ggml.add_bos_token true

[-]

segmond@reddit

yeah, we need to download new GGUFs, they regenerated them and improved them.

[-]

Borkato@reddit

Oh, this is lovely if true

[-]

Corosus@reddit

Latest llama.cpp Latest fresh downloaded gemma-4-26B-A4B-it-UD-Q4_K_M.gguf Latest opencode launched in powershell

← Edit watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.

→ Read watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java [offset=357]

← Edit watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java Could not find oldString in the file. It must match exactly, including whitespace, indentation, and line endings.

→ Read watut_1.21.6 - Copy\src\main\java\com\corosus\watut\client\screen\ScreenParticleRenderer.java [offset=350]

Id love to get this thing to work, not sure whats wrong.

[-]

Interesting_Key3421@reddit

I don't think i have issues.. did you improve the situation re-downloading the weights?

[-]

fyvehell@reddit

Ah shit, here we go again.

[-]

nickm_27@reddit

I believe the only change requiring a recreation of GGUF is the convert PR for the bos

[-]

simracerman@reddit

It’s nice to see your Frigate-nvr Guru also interested in LLMs!

[-]

BillDStrong@reddit

There were some UD quants that were up converting to BF16 that weren't superposed to, so they uploaded some new ones fixing that. You might need to download those, if you were affected. The new models are now close to the same speed as the Bart quants.

That is a separate issue from the llama.cpp changes.

[-]